My Focus


My research interests lie in data mining and machine learning on big data, with an emphasis on health-related data. I focus on the design, analysis, and application of learning algorithms for health and medical data. The primary goal of my research is to develop both principled methodologies and innovative applications, with strong practical performance, that help make sense of the overwhelmingly large and complex health data collected in daily life. Mining health data, including but not limited to electronic health records (EHR), public health communities, mobile and sensor data, and medical knowledge bases, has shown clear potential to significantly improve people's health and healthcare delivery.

Machine Learning for Mining Electronic Health Records

The rapid accumulation of EHR data brings researchers and healthcare providers closer to the goal of personalized medicine. However, mining knowledge from raw EHR data is hard because the data is typically high-dimensional, temporal, sparse, irregular, and biased. These properties make it much harder to directly apply traditional machine learning or statistical models to predict patients' potential diseases, an extremely important task in the medical domain. To tackle these challenges, our research mainly focuses on exploiting the characteristics of EHR data per se [KDD17, SDM20, KDD20] and incorporating external information [KDD18, CIKM18, BIBM18].

Key References:

  • KDD17: Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks
  • KDD18: Risk Prediction on Electronic Health Records with Prior Medical Knowledge
  • CIKM18: KAME: Knowledge-based Attention Model for Diagnosis Prediction in Healthcare
  • BIBM18: A General Framework for Diagnosis Prediction via Incorporating Medical Code Descriptions
  • SDM20: Rare Disease Prediction by Generating Quality-Assured Electronic Health Records
  • KDD20: HiTANet: Hierarchical Time-Aware Attention Networks for Risk Prediction on Electronic Health Records

Reliable Medical Diagnosis from Crowdsourced Data

Besides EHR data, many crowdsourced question-answering websites serve healthcare applications. On such websites, users contribute their own answers to medical questions. However, the "true" information is usually buried in a massive amount of noisy or even conflicting crowdsourced data. To automatically extract medical knowledge (i.e., true facts) from these noisy crowd-provided answers, we first need to address one challenge: different users have different reliability levels, and there is usually neither prior knowledge nor training data from which to derive user reliability. In light of this challenge, we developed unsupervised approaches that jointly estimate user reliability and infer true facts (i.e., truths) from crowdsourced data [KDD15, KDD16, KDD17, KDD18, KDD19].
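To make the idea of jointly estimating user reliability and truths concrete, here is a minimal sketch of a generic iterative weighted-voting scheme. It is not the algorithm of any of the cited papers: the function name `discover_truths`, the smoothed error rate, and the negative-log weight update are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def discover_truths(claims, iterations=10):
    """Illustrative truth discovery over (user, question, answer) tuples.

    Alternates between two steps, with no labeled data:
      1. infer each question's truth by reliability-weighted voting;
      2. re-estimate each user's reliability from agreement with the truths.
    Returns (truths, weights).
    """
    users = {user for user, _, _ in claims}
    weights = {user: 1.0 for user in users}  # start with equal reliability
    truths = {}

    for _ in range(iterations):
        # Step 1: weighted vote per question.
        votes = defaultdict(lambda: defaultdict(float))
        for user, question, answer in claims:
            votes[question][answer] += weights[user]
        # sorted() makes tie-breaking deterministic (alphabetical order).
        truths = {q: max(sorted(opts), key=opts.get) for q, opts in votes.items()}

        # Step 2: smoothed error rate per user (assumed smoothing constants).
        errors = defaultdict(float)
        counts = defaultdict(int)
        for user, question, answer in claims:
            counts[user] += 1
            if truths[question] != answer:
                errors[user] += 1.0
        rates = {u: (errors[u] + 0.5) / (counts[u] + 1.0) for u in users}

        # Users with a smaller share of the total error get a larger weight.
        total = sum(rates.values())
        weights = {u: -math.log(rates[u] / total) for u in users}

    return truths, weights


# Hypothetical toy data: three users answering three medical questions.
claims = [
    ("alice", "q1", "flu"), ("bob", "q1", "flu"), ("carol", "q1", "cold"),
    ("alice", "q2", "rest"), ("bob", "q2", "rest"), ("carol", "q2", "antibiotics"),
    ("alice", "q3", "yes"), ("carol", "q3", "no"),
]
truths, weights = discover_truths(claims)
```

In this toy run, question `q3` starts as a tie between two users. After the first round, the user who disagreed with the majority on `q1` and `q2` receives a lower weight, so the tie is resolved in favor of the more reliable user.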

Key References:

  • KDD15: FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation
  • KDD16: Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach
  • KDD17: Unsupervised Discovery of Drug Side-Effects From Heterogeneous Data Sources
  • KDD18: TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data
  • KDD19: Optimize the Wisdom of the Crowd: Inference, Learning, and Teaching

Medical Knowledge Extraction

Medical knowledge is valuable for many tasks in the medical domain. Medical text carries invaluable information about patients' current and past medical history, their current symptoms and the severity of their condition, as well as physicians' clinical judgment. Extracting correct medical knowledge from large-scale, unstructured medical text is a major challenge. Since it is extremely difficult to extract medical knowledge directly from raw data, we focus on the following tasks: medical relation extraction [WWW19], medical fact generation [SML17, US Patent18], and multi-grained named entity recognition [ACL19].

Key References:

  • WWW19: MCVAE: Margin-based Conditional Variational Autoencoder for Relation Classification and Pattern Generation
  • SML17: Long-Term Memory Networks for Question Answering
  • US Patent18: Long-Term Memory Networks for Knowledge Extraction from Text and Publications
  • ACL19: Multi-grained Named Entity Recognition