My statistical research focuses on the development and evaluation of risk prediction models and on the analysis of high-dimensional data. During my doctoral studies, I developed novel statistical methodology that uses machine-learning techniques to quantify a subject’s risk of disease based on a large number of genetic markers as well as environmental and clinical predictors. As a post-doctoral research fellow at the Fred Hutchinson Cancer Research Center, I collaborated with geneticists and epidemiologists on the analysis of large volumes of genomic and epidemiological data collected by multiple institutes to study colorectal cancer pathogenesis. I have also collaborated with researchers at Harvard Medical School and Brigham and Women’s Hospital in Boston, MA, on analyses of diverse types of high-dimensional data, such as electronic medical records, and of multiple large genome-wide association studies. My connections and collaborations with clinicians and scientists in the Knight Cardiovascular Institute and Knight Cancer Institute have fostered my interest in factors contributing to cardiovascular disease, stroke, and cancer, as well as in disease prevention and treatment strategies.
Statistical Methodology for Genetic Risk Prediction and High-Dimensional Data
Analysis of high-dimensional data often seeks to identify a subset of important features and to assess the effects of these features on outcomes. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high-dimensional data; however, statistical inference for these models remains challenging. In my graduate studies, I developed a method for estimating the distribution of regularized regression coefficients. This method, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions, thereby providing accurate inference for these commonly used models. In addition to inference problems, I have worked on methods for risk prediction and classification based on a large number of predictors. One setting in which these models are useful is the burgeoning field of genomics. Genetic studies of complex traits have uncovered only a small number of risk markers, which explain a small fraction of heritability and add little improvement to disease risk prediction. Standard single-marker methods may lack power in selecting informative markers or estimating their effects. Most existing methods also do not account for non-linearity, which can result in a loss of prediction accuracy when the underlying effects are non-linear. With my collaborators, I have developed risk prediction models that bridge high-dimensional statistical methodology with powerful and flexible machine-learning models. Our models relate genetic markers to disease risk by taking advantage of known gene-set structures. We provide a prediction modeling framework that is flexible enough for many types of genomic studies.
Prediction and Classification with Electronic Medical Records
Electronic Medical Record (EMR) data marts provide a rich and vast set of data from which to characterize population and individual disease progression. These data are challenging to harness in a meaningful way, and building accurate prediction models from them is therefore difficult. My collaborators at Harvard Medical School and Brigham and Women’s Hospital in Boston, MA, have developed natural language processing (NLP) methods to glean informative features from EMR data. Together with these collaborators and my PhD advisor, Professor Tianxi Cai, I developed statistical methods that combine NLP terms with codified and clinical data to accurately predict disease risk. We used regularized regression and resampling techniques to build the models. These models were successfully validated in larger data marts in order to identify eligible cohorts for future epidemiological studies.
Statistical Genetics and Genetic Epidemiology
The fields of genetics and genomics are growing rapidly. To answer scientific questions of interest, it is imperative to form interdisciplinary teams that include geneticists, epidemiologists, and biostatisticians. As a member of such teams, I have studied genetic phenomena such as genetic instability and gene-environment interactions. I have collaborated with research teams at Harvard and the Fred Hutchinson Cancer Research Center to study these topics in various large genomic data sets. My main contributions were providing statistical guidance in study design, performing data analyses, and conducting relevant simulation studies to answer the research questions of interest.