My current research focuses on a range of related problems in Bayesian mixture model development for classification, variable selection, design and model selection, structured and hierarchical nonparametric Bayesian methods, rare event detection, and statistical computation involving simulation and optimization. Much of this is linked to problems involving large data sets, with originating motivations in the analysis of high-throughput biological data in immunology and vaccine research.

Data sets of increasing scale and complexity pose challenges to standard statistical methods, and these are exemplified in areas where one goal is classification and discrimination of subpopulations. A key general question we have explored is to identify subsets of variables that play roles in discrimination of subpopulations in the context of multivariate mixture modeling. We introduced a new discriminative information measure utilizing the concordance between multivariate mixture component densities, and have developed and applied it to these general variable selection/design questions in flow cytometry applications (and others). The method is both effective and computationally attractive for routine use in assessing and prioritizing subsets of variables according to their roles in discriminating subpopulation structure.
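
As a rough illustration of the concordance idea, the sketch below scores variable subsets by how much two Gaussian mixture components overlap when restricted to those variables. It is a minimal sketch under stated assumptions, not the published measure: the components are assumed to have diagonal covariances, the overlap is normalized to lie in (0, 1], and all function names are hypothetical. It relies on the standard identity that the integral of the product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means.

```python
import math
from itertools import combinations


def gaussian_overlap(mu_a, var_a, mu_b, var_b):
    """Integral of the product of two diagonal-covariance Gaussian densities.

    Per dimension this equals N(mu_a | mu_b, var_a + var_b).
    """
    total = 1.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        s = va + vb
        total *= math.exp(-0.5 * (ma - mb) ** 2 / s) / math.sqrt(2 * math.pi * s)
    return total


def normalized_concordance(mu_a, var_a, mu_b, var_b, subset):
    """Overlap of two components on the chosen variables, scaled to (0, 1].

    Values near 1 mean the components are nearly indistinguishable on
    this subset; values near 0 mean the subset separates them well.
    """
    def pick(values):
        return [values[j] for j in subset]

    ma, va, mb, vb = pick(mu_a), pick(var_a), pick(mu_b), pick(var_b)
    cross = gaussian_overlap(ma, va, mb, vb)
    self_a = gaussian_overlap(ma, va, ma, va)
    self_b = gaussian_overlap(mb, vb, mb, vb)
    return cross / math.sqrt(self_a * self_b)


def rank_subsets(mu_a, var_a, mu_b, var_b, size):
    """Rank all variable subsets of a given size, most discriminative first."""
    n_vars = len(mu_a)
    scored = [
        (normalized_concordance(mu_a, var_a, mu_b, var_b, subset), subset)
        for subset in combinations(range(n_vars), size)
    ]
    return sorted(scored)
```

Subsets with low concordance are the ones on which the two subpopulations are well separated, so they would be prioritized. Enumerating all subsets is only practical for small panels; a greedy or stepwise search would be the natural replacement for larger ones.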

As the number of measured variables grows, there is an increasing need to consider structured, hierarchical models to enable sensitive inference on subpopulation structure. Moreover, as sample sizes increase we often face problems of masking of subtler substructure; model fitting can fail to identify “rare events” because the bulk of the data dominates the fit. Our work has introduced novel, hierarchical nonparametric Bayesian mixture models that address both problems. The key idea is to first partition the outcome variables into several subsets, typically guided by substantive contextual information. We then apply Bayesian nonparametric mixture models to the reduced-dimensional distribution of one selected subset of variables; this delivers classification/clustering in that marginal space. The marginal classification in turn induces a partition of the data, and a second level of mixture modeling is applied, in parallel, to a second subset of variables within each of the partitioned data sets. This can be repeated, hierarchically defining an overall product-mixture model within which each modeling exercise is developed in lower dimensions and with smaller data subsets. These latter features enable more sensitive isolation of fine substructure and, in particular, a focus on rare subpopulations.
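
The two-level partitioning logic can be sketched as follows. This is an illustration of the control flow only: a simple deterministic 2-means routine stands in for the Bayesian nonparametric mixture model, and the names `two_means` and `hierarchical_labels` are hypothetical. Any clustering routine that maps a list of points to a list of labels could be passed in as `cluster_fn`.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iters=20):
    """Stand-in clusterer (NOT a Bayesian mixture): plain 2-means.

    Centers start at the lexicographic min and max of the points so
    that the result is deterministic.
    """
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda k: sq_dist(p, centers[k])) for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def hierarchical_labels(data, first_vars, second_vars, cluster_fn=two_means):
    """Cluster on first_vars, then re-cluster each group on second_vars.

    Returns one (top_label, sub_label) pair per row of data.
    """
    def project(row, idx):
        return tuple(row[j] for j in idx)

    top = cluster_fn([project(row, first_vars) for row in data])
    labels = [None] * len(data)
    for k in sorted(set(top)):
        rows = [i for i, t in enumerate(top) if t == k]
        # Second-level fit sees only this group's rows and only second_vars.
        sub = cluster_fn([project(data[i], second_vars) for i in rows])
        for i, s in zip(rows, sub):
            labels[i] = (k, s)
    return labels
```

Because each second-level fit sees only the rows in one first-level group, a small cluster that would be swamped in a joint fit over all rows and variables has a better chance of being isolated.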

Data in high dimensions are often difficult to understand and visualize. Graphical models are frequently used to address these problems, taking advantage of the (conditional) independencies between subsets of variables that a graph representation makes explicit. Unlike previous graphical model approaches, we propose a new Bayesian mixture model using binary trees, constructed with the goal of modeling the data structure from each individual dimension. The dependencies among the dimensions are captured by the tree structure, with Kingman's coalescent serving as the prior over trees. We have developed an efficient MCMC algorithm for posterior inference on model parameters and tree structures.
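
To make the tree prior concrete, the following sketch draws a binary tree from Kingman's coalescent by forward simulation. It illustrates the prior only, not the MCMC algorithm or the mixture likelihood. The function name and the merge-list representation are hypothetical choices; reading the leaves as the measured dimensions is an interpretation of the text above.

```python
import random


def sample_kingman_tree(n_leaves, rng):
    """Draw a binary tree over n_leaves leaves from Kingman's coalescent.

    With k lineages remaining, the waiting time to the next merge is
    exponential with rate k * (k - 1) / 2, and the pair that merges is
    chosen uniformly at random.

    Returns a list of (time, left_child, right_child, parent) tuples.
    Leaves are numbered 0 .. n_leaves - 1; internal nodes are numbered
    from n_leaves upward in order of creation.
    """
    active = list(range(n_leaves))
    next_id = n_leaves
    time = 0.0
    merges = []
    while len(active) > 1:
        k = len(active)
        time += rng.expovariate(k * (k - 1) / 2)
        left, right = rng.sample(active, 2)
        active.remove(left)
        active.remove(right)
        active.append(next_id)
        merges.append((time, left, right, next_id))
        next_id += 1
    return merges


if __name__ == "__main__":
    for event in sample_kingman_tree(5, random.Random(0)):
        print(event)
```

A tree on n leaves always has n - 1 merge events, and the last parent created is the root.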

Cell populations in blood and tissue are not homogeneous; even clonotypes of individual cells can exist in different biochemical states that define measurable functional differences between them. This single-cell heterogeneity is informative, but it is lost in assays that measure cell mixtures. Our work has introduced a state-of-the-art Bayesian hierarchical method for multivariate modeling of single-cell assays that improves the detection of biologically relevant, and potentially small, changes across one or more cell subsets, directly addressing major hurdles in the analysis of high-dimensional single-cell data. Our approach jointly models cell subsets, taking into account their dependence structure, and combines information across subjects to detect subject-specific treatment effects (e.g., response to vaccination), which significantly increases power and avoids multiple comparisons across subsets. Our model can automatically select differentially expressed cell subsets using a sparse variable selection prior. We also explore models that combine the number of expressed cells with their level of expression to detect biological changes.
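
The role of a sparse selection prior can be illustrated with a deliberately simplified, single-subset calculation. The sketch below is not the hierarchical model described above: it assumes Beta-Binomial marginal likelihoods for the positive-cell counts in an unstimulated and a stimulated sample, and it compares a "no change" hypothesis (one shared proportion) with a "change" hypothesis (two separate proportions). The function names, the Beta(1, 1) prior, and the prior inclusion probability of 0.1 are all illustrative assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_shared(k_u, n_u, k_s, n_s, a=1.0, b=1.0):
    """Log marginal likelihood when both samples share one proportion.

    Binomial coefficients are omitted because they are identical under
    both hypotheses and cancel in the posterior.
    """
    k, n = k_u + k_s, n_u + n_s
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def log_marginal_separate(k_u, n_u, k_s, n_s, a=1.0, b=1.0):
    """Log marginal likelihood when each sample has its own proportion."""
    return (
        log_beta(a + k_u, b + n_u - k_u)
        + log_beta(a + k_s, b + n_s - k_s)
        - 2 * log_beta(a, b)
    )


def inclusion_probability(k_u, n_u, k_s, n_s, prior_inclusion=0.1):
    """Posterior probability that the subset's proportion changed.

    A small prior_inclusion encodes the sparsity assumption that most
    cell subsets do not respond.
    """
    log_alt = math.log(prior_inclusion) + log_marginal_separate(k_u, n_u, k_s, n_s)
    log_null = math.log(1 - prior_inclusion) + log_marginal_shared(k_u, n_u, k_s, n_s)
    m = max(log_alt, log_null)  # subtract the max for numerical stability
    w_alt, w_null = math.exp(log_alt - m), math.exp(log_null - m)
    return w_alt / (w_alt + w_null)
```

With identical counts in both samples the inclusion probability falls below its prior value, while a large increase in positive cells pushes it toward 1. A joint model would share the prior parameters across subjects and subsets rather than fixing them as is done here.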