Hosted by the Department of Biostatistics
Abstract: In many real-world applications, obtaining labeled data is a significant challenge due to high costs and technical limitations. This scarcity of labeled outcomes presents a major obstacle for traditional statistical inference. To address this, we introduce a model-free approach for constructing prediction regions for new target outcomes. Our method leverages a labeled source distribution, which differs from the target but is related to it through a distributional shift, to overcome the lack of target labels. When target data are fully unlabeled, our predictions rely entirely on the rich source data; when some labels are available, we integrate them to boost efficiency. A key innovation of this approach lies in how we handle the differences between the source and target distributions. We tackle non-exchangeability and non-identifiability by estimating the likelihood ratio through a novel technique: matching the covariate distributions of the source and target domains using a B-spline basis. To accommodate complex error structures, including asymmetry and multimodality, we construct the highest predictive density sets using a new weight-adjusted conditional density estimator. This estimator models the source conditional density and then transforms it through a weighting scheme to approximate the target conditional density. We will discuss the theoretical guarantees of our method and demonstrate its performance through comprehensive simulation studies and a real-world application using the MIMIC-III clinical database. This is joint work with Menghan Yi and Yanlin Tang.
Bio: Huixia Wang received her Ph.D. in Statistics from the University of Illinois at Urbana-Champaign in 2006 and joined North Carolina State University as an assistant professor. She served in that role until 2012, when she moved to The George Washington University to serve as a professor in the Statistics department. She served as its department chair for three years. From 2018 to 2022, she also served as a program director in the Division of Mathematical Sciences at the National Science Foundation, where she co-managed several large multi-disciplinary and cross-agency programs. She also managed many national initiatives in mathematics, data science, and artificial intelligence, and spearheaded several workforce development programs in these areas.
Her research focuses on building mathematical and statistical models to solve complex biomedical and environmental problems. One of her current projects involves developing statistical and scalable computing methods to analyze complex datasets, with the goal of identifying socio-economic factors that contribute to chronic health conditions.
She is also laying the mathematical groundwork for building human digital twins—dynamic, data-driven, virtual models that mirror an individual’s brain physiology and pathology. These digital twins are designed to uncover root causes and support personalized treatments for autism spectrum disorder and other neurodevelopmental disorders. The platform will allow physicians and parents to track how children with autism respond to different stimuli and interventions in real time.
Additionally, she is collaborating with environmental researchers and engineers to develop statistical models that predict the vulnerability of specific geographical areas to flooding and assess the impact of hurricanes on water quality.