My research interests lie in data mining and machine learning using big data, with an emphasis on
health-related data. My research focuses on the design, analysis,
and application of learning algorithms for both health and medical data.
The primary goal of my research is to develop both principled methodologies and innovative applications
with strong practical performance that can be used to understand the overwhelmingly large and complex
health data collected in our daily lives.
Exploring health data, including but not limited to electronic health records (EHR),
public health communities, mobile and sensor data, and medical knowledge bases,
has clearly shown the potential to significantly improve people's health and provide better
With the immense accumulation of available EHR data,
the analysis of such data enables researchers and healthcare providers
to get closer to the goal of personalized medicine.
However, it is hard to mine knowledge from raw EHR data because the data
usually has high dimensionality, temporality, sparsity, irregularity and bias.
These challenges dramatically increase the difficulty of directly applying
traditional machine learning or statistical models to predict patients'
potential diseases, which is an extremely important task in the medical domain.
To tackle these challenges, our research mainly focuses on exploring characteristics
of EHR data per se [KDD17, SDM20, KDD20] and
incorporating external information [KDD18, CIKM18, BIBM18].
KDD17: Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks
KDD18: Risk Prediction on Electronic Health Records with Prior Medical Knowledge
CIKM18: KAME: Knowledge-based Attention Model for Diagnosis Prediction in Healthcare
BIBM18: A General Framework for Diagnosis Prediction via Incorporating Medical Code Descriptions
SDM20: Rare Disease Prediction by Generating Quality-Assured Electronic Health Records
KDD20: HiTANet: Hierarchical Time-Aware Attention Networks for Risk Prediction on Electronic Health Records
Besides EHR data, there are many crowdsourced question answering websites
devoted to healthcare. For example, on websites such as healthboards.com,
users contribute their answers to medical questions.
However, the "true" information usually hides in a massive amount of noisy or
even conflicting crowdsourced data.
To automatically extract medical knowledge (i.e., true facts) from these noisy
crowd-provided answers, we first need to address a key challenge:
different users have different reliability levels,
and there is usually neither prior knowledge nor training data for
deriving user reliability. In light of this challenge,
we developed unsupervised approaches that jointly estimate user
reliability and infer true facts (i.e., truths) from crowdsourced data
[KDD15, KDD16, KDD17, KDD18, KDD19].
KDD15: FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation
KDD16: Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach
KDD17: Unsupervised Discovery of Drug Side-Effects From Heterogeneous Data Sources
KDD18: TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data
KDD19: Optimize the Wisdom of the Crowd: Inference, Learning, and Teaching
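The joint estimation of user reliability and truths described above can be illustrated with a generic iterative scheme. The following is a minimal sketch, not the method of any paper listed here: it assumes categorical answers, and the function name `discover_truths`, the smoothed error rate, and the negative-log weighting are illustrative choices. It alternates between a reliability-weighted vote per question and a re-estimation of each user's reliability from their agreement with the current truths.

```python
import math
from collections import defaultdict


def discover_truths(claims, n_iters=10):
    """Generic truth discovery sketch (illustrative, hypothetical helper).

    claims: list of (user, question, answer) triples with categorical answers.
    Returns (truths, weights): the estimated answer per question and a
    reliability weight per user. No labels or prior reliabilities are used.
    """
    users = {u for u, _, _ in claims}
    weights = {u: 1.0 for u in users}  # start with equally reliable users
    truths = {}
    for _ in range(n_iters):
        # Step 1: infer truths by a reliability-weighted vote per question.
        votes = defaultdict(lambda: defaultdict(float))
        for u, q, a in claims:
            votes[q][a] += weights[u]
        # sorted() makes tie-breaking deterministic.
        truths = {q: max(sorted(c), key=c.get) for q, c in votes.items()}

        # Step 2: re-estimate reliability from disagreement with the truths.
        errors = defaultdict(float)
        counts = defaultdict(int)
        for u, q, a in claims:
            errors[u] += 0.0 if a == truths[q] else 1.0
            counts[u] += 1
        # Smoothed error rate stays strictly inside (0, 1), so the
        # negative log is finite and positive (assumed weighting scheme).
        weights = {
            u: -math.log((errors[u] + 0.5) / (counts[u] + 1.0)) for u in users
        }
    return truths, weights


if __name__ == "__main__":
    # Toy data: alice and bob agree on q1 and q2; carol and dave are noisy
    # there but jointly outvote alice on q3.
    claims = [
        ("alice", "q1", "A"), ("bob", "q1", "A"),
        ("carol", "q1", "B"), ("dave", "q1", "C"),
        ("alice", "q2", "X"), ("bob", "q2", "X"),
        ("carol", "q2", "Y"), ("dave", "q2", "Z"),
        ("alice", "q3", "M"), ("carol", "q3", "N"), ("dave", "q3", "N"),
    ]
    truths, weights = discover_truths(claims)
    print(truths)
    print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

On this toy data, a plain majority vote would pick "N" for q3, whereas the sketch down-weights carol and dave after the first round and settles on alice's answer "M".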
Medical knowledge is valuable for various tasks in the medical domain.
Medical text data carries invaluable information about a patient's current and past medical history,
current symptoms and severity of condition, as well as physicians' clinical judgment.
How to extract correct medical knowledge from large-scale, unstructured medical text data is a major challenge.
Since it is extremely difficult to directly extract medical knowledge from raw data,
we focus on the following tasks: medical relation extraction [WWW19],
medical fact generation [SML17, US Patent18], and multi-grained named entity recognition [ACL19].
WWW19: MCVAE: Margin-based Conditional Variational Autoencoder for Relation Classification and Pattern Generation
SML17: Long-Term Memory Networks for Question Answering
US Patent18: Long-Term Memory Networks for Knowledge Extraction from Text and Publications
ACL19: Multi-grained Named Entity Recognition