Statistics Seminar

Abhishek ChakraborttyUniversity of Pennsylvania

Semi-Supervised Inference with Large and High Dimensional Data: A Semi-Parametric Perspective

Friday, December 14, 2018 - 4:15pm

Biotech G01

Abstract: The abundance of large and complex datasets in the current big data era has also created a host of novel statistical challenges for properly harnessing such rich (but often incomplete) information. One such challenge includes statistical inference in semi-supervised (SS) settings, where apart from a moderate sized supervised data (L), one also has a much larger sized unsupervised data (U) available. Such datasets arise naturally when the response, unlike the covariates, is difficult and/or expensive to obtain, a frequent scenario in modern studies involving large databases, including biomedical data like electronic health records (EHR). It is natural to investigate whether and how the information from U can be exploited to improve efficiency over a given supervised approach.

In this talk, I will consider SS inference for a class of standard Z-estimation problems. I will discuss first the subtleties and associated challenges that necessitate a semi-parametric perspective. I will then demonstrate a family of SS Z-estimators that are robust and adaptive, thus ensuring that they are always as efficient as the supervised estimator and more efficient (optimal in some cases) when the information from U actually relates to the parameter of interest. These properties are crucial for advocating ‘safe’ use of unlabeled data and are often unaddressed. Our framework provides a much needed unified understanding of these problems. Multiple EHR data applications are also presented to exhibit the practical benefits of our estimator. In the later part of the talk, I consider SS inference in high dimensional settings, and demonstrate the remarkable benefits the unlabeled data provides in seamlessly obtaining a family of SS estimators with asymptotic linear expansions, without directly requiring any sparsity conditions or debiasing needed in supervised settings. This, in particular, facilitates high dimensional inference under minimal assumptions.