Yang Feng

Associate Professor

Professional overview

Yang Feng is an Associate Professor of Biostatistics. He received his B.S. in mathematics from the University of Science and Technology of China and his Ph.D. in Operations Research from Princeton University.

Dr. Feng's research interests include machine learning with applications to public health, high-dimensional statistics, network models, nonparametric statistics, and bioinformatics. He has published in The Annals of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, Journal of Machine Learning Research, International Journal of Epidemiology, and Science Advances. Feng serves on the editorial boards of the Journal of Business & Economic Statistics, Statistica Sinica, Stat, and Statistical Analysis and Data Mining: The ASA Data Science Journal.

Prior to joining NYU, Feng was an Associate Professor of Statistics and an affiliated member in the Data Science Institute at Columbia University. He is an elected member of the International Statistical Institute and a recipient of the NSF CAREER award.

Please visit Dr. Yang Feng's website and Google Scholar page for more information.

Education

B.S. in Mathematics, University of Science and Technology of China, Hefei, China
Ph.D. in Operations Research, Princeton University, Princeton, NJ

Areas of research and study

Bioinformatics
Biostatistics
High-dimensional data analysis/integration
Machine learning
Modeling social and behavioral dynamics
Nonparametric statistics

Publications

Analytical performance of lateral flow immunoassay for SARS-CoV-2 exposure screening on venous and capillary blood samples

Black, M. A., Shen, G., Feng, X., Garcia Beltran, W. F., Feng, Y., Vasudevaraja, V., Allison, D., Lin, L. H., Gindin, T., Astudillo, M., Yang, D., Murali, M., Iafrate, A. J., Jour, G., Cotzia, P., & Snuderl, M.

Publication year

2021

Journal title

Journal of Immunological Methods

Volume

489
Abstract
Objectives: We validate the use of a lateral flow immunoassay (LFI) intended for rapid screening and qualitative detection of anti-SARS-CoV-2 IgM and IgG in serum, plasma, and whole blood, and compare results with ELISA. We also seek to establish the value of LFI testing on blood obtained from a capillary blood sample. Methods: Samples collected by venous blood draw and finger stick were obtained from patients with SARS-CoV-2 detected by RT-qPCR and control patients. Samples were tested with Biolidics 2019-nCoV IgG/IgM Detection Kit lateral flow immunoassay, and antibody calls were compared with ELISA. Results: Biolidics LFI showed clinical sensitivity of 92% with venous blood at 7 days after PCR diagnosis of SARS-CoV-2. Test specificity was 92% for IgM and 100% for IgG. There was no significant difference in detecting IgM and IgG with Biolidics LFI and ELISA at D0 and D7 (p = 1.00), except for detection of IgM at D7 (p = 0.04). Capillary blood of SARS-CoV-2 patients showed 93% sensitivity for antibody detection. Conclusions: Clinical performance of Biolidics 2019-nCoV IgG/IgM Detection Kit is comparable to ELISA and was consistent across sample types. This provides an opportunity for decentralized rapid testing and may allow point-of-care and longitudinal self-testing for the presence of anti-SARS-CoV-2 antibodies.

Comparison of solid tissue sequencing and liquid biopsy accuracy in identification of clinically relevant gene mutations and rearrangements in lung adenocarcinomas

Lin, L. H., Allison, D. H., Feng, Y., Jour, G., Park, K., Zhou, F., Moreira, A. L., Shen, G., Feng, X., Sabari, J., Velcheti, V., Snuderl, M., & Cotzia, P.

Publication year

2021

Journal title

Modern Pathology

Volume

34

Issue

12

Page(s)

2168-2174
Abstract
Screening for therapeutic targets is standard of care in the management of advanced non-small cell lung cancer. However, most molecular assays utilize tumor tissue, which may not always be available. “Liquid biopsies” are plasma-based next generation sequencing (NGS) assays that use circulating tumor DNA to identify relevant targets. To compare the sensitivity, specificity, and accuracy of a plasma-based NGS assay to solid-tumor-based NGS, we retrospectively analyzed sequencing results of 100 sequential patients with lung adenocarcinoma at our institution who had received concurrent testing with both a solid-tissue-based NGS assay and a commercially available plasma-based NGS assay. Patients represented both new diagnoses (79%) and disease progression on treatment (21%); the majority (83%) had stage IV disease. Tissue-NGS identified 74 clinically relevant mutations, including 52 therapeutic targets, for a sensitivity of 94.8%, while plasma-NGS identified 41 clinically relevant mutations, for a sensitivity of 52.6% (p < 0.001). Tissue-NGS showed significantly higher sensitivity and accuracy across multiple patient subgroups, both in newly diagnosed and treated patients, as well as in metastatic and nonmetastatic disease. Discrepant cases involved hotspot mutations and actionable fusions, including those in EGFR, ALK, and NTRK1. In summary, tissue-NGS detects significantly more clinically relevant alterations and therapeutic targets compared to plasma-NGS, suggesting that tissue-NGS should be the preferred method for molecular testing of lung adenocarcinoma when tissue is available. Plasma-NGS can still play an important role when tissue testing is not possible. However, given its low sensitivity, a negative result should be confirmed with a tissue-based assay.

Imbalanced classification: A paradigm-based review

Feng, Y., Zhou, M., & Tong, X.

Publication year

2021

Journal title

Statistical Analysis and Data Mining

Volume

14

Issue

5

Page(s)

383-406
Abstract
A common issue for classification in scientific research and industry is the existence of imbalanced classes. When sample sizes of different classes are imbalanced in training data, naively implementing a classification method often leads to unsatisfactory prediction results on test data. Multiple resampling techniques have been proposed to address the class imbalance issues. Yet, there is no general guidance on when to use each technique. In this article, we provide a paradigm-based review of the common resampling techniques for binary classification under imbalanced class sizes. The paradigms we consider include the classical paradigm that minimizes the overall classification error, the cost-sensitive learning paradigm that minimizes a cost-adjusted weighted type I and type II errors, and the Neyman–Pearson paradigm that minimizes the type II error subject to a type I error constraint. Under each paradigm, we investigate the combination of the resampling techniques and a few state-of-the-art classification methods. For each pair of resampling techniques and classification methods, we use simulation studies and a real dataset on credit card fraud to study the performance under different evaluation metrics. From these extensive numerical experiments, we demonstrate, under each classification paradigm, the complex dynamics among resampling techniques, base classification methods, evaluation metrics, and imbalance ratios. We also summarize a few takeaway messages regarding the choices of resampling techniques and base classification methods, which could be helpful for practitioners.
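The simplest of the resampling techniques surveyed in this review is random oversampling. The following minimal sketch (ours, for illustration only; the function name and toy data are not from the paper) duplicates minority-class rows at random until the two classes are balanced:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes
    have equal size (illustrative sketch of random oversampling)."""
    rng = random.Random(seed)
    idx0 = [i for i, lab in enumerate(y) if lab == 0]
    idx1 = [i for i, lab in enumerate(y) if lab == 1]
    minority, majority = (idx0, idx1) if len(idx0) < len(idx1) else (idx1, idx0)
    # sample extra minority rows (with replacement) to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]

# 9:1 imbalanced toy data: nine class-1 rows, one class-0 row
X = [[float(i)] for i in range(10)]
y = [1] * 9 + [0]
Xb, yb = random_oversample(X, y)
print(len(yb), yb.count(0), yb.count(1))  # 18 9 9
```

In the article's experiments, a resampler like this is paired with a base classifier and evaluated under the chosen paradigm's metric; the same interface applies to undersampling or hybrid schemes.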

Mediation effect selection in high-dimensional and compositional microbiome data

Zhang, H., Chen, J., Feng, Y., Wang, C., Li, H., & Liu, L.

Publication year

2021

Journal title

Statistics in Medicine

Volume

40

Issue

4

Page(s)

885-896
Abstract
The microbiome plays an important role in human health by mediating the path from environmental exposures to health outcomes. The relative abundances of the high-dimensional microbiome data have a unit-sum restriction, rendering standard statistical methods in the Euclidean space invalid. To address this problem, we use the isometric log-ratio transformations of the relative abundances as the mediator variables. To select significant mediators, we consider a closed testing-based selection procedure with desirable confidence. Simulations are provided to verify the effectiveness of our method. As an illustrative example, we apply the proposed method to study the mediation effects of the murine gut microbiome between subtherapeutic antibiotic treatment and body weight gain, and identify Coprobacillus and Adlercreutzia as two significant mediators.

Model Averaging for Nonlinear Regression Models

Feng, Y., Liu, Q., Yao, Q., & Zhao, G.

Publication year

2021

Journal title

Journal of Business and Economic Statistics
Abstract
This article considers the problem of model averaging for regression models that can be nonlinear in their parameters and variables. We consider a nonlinear model averaging (NMA) framework and propose a weight-choosing criterion, the nonlinear information criterion (NIC). We show that up to a constant, NIC is an asymptotically unbiased estimator of the risk function under nonlinear settings with some mild assumptions. We also prove the optimality of NIC and show the convergence of the model averaging weights. Monte Carlo experiments reveal that NMA leads to relatively lower risks compared with alternative model selection and model averaging methods in most situations. Finally, we apply the NMA method to predicting the individual wage, where our approach leads to the lowest prediction errors in most cases.

RaSE: A Variable Screening Framework via Random Subspace Ensembles

Tian, Y., & Feng, Y.

Publication year

2021

Journal title

Journal of the American Statistical Association
Abstract
Variable screening methods have been shown to be effective in dimension reduction under the ultra-high dimensional setting. Most existing screening methods are designed to rank the predictors according to their individual contributions to the response. As a result, variables that are marginally independent but jointly dependent with the response could be missed. In this work, we propose a new framework for variable screening, random subspace ensemble (RaSE), which works by evaluating the quality of random subspaces that may cover multiple predictors. This new screening framework can be naturally combined with any subspace evaluation criterion, which leads to an array of screening methods. The framework is capable of identifying signals with no marginal effect or with high-order interaction effects. It is shown to enjoy the sure screening property and rank consistency. We also develop an iterative version of RaSE screening with theoretical support. Extensive simulation studies and real-data analysis show the effectiveness of the new screening framework.

RaSE: Random subspace ensemble classification

Tian, Y., & Feng, Y.

Publication year

2021

Journal title

Journal of Machine Learning Research

Volume

22
Abstract
We propose a flexible ensemble classification framework, Random Subspace Ensemble (RaSE), for sparse classification. In the RaSE algorithm, we aggregate many weak learners, where each weak learner is a base classifier trained in a subspace optimally selected from a collection of random subspaces. To conduct subspace selection, we propose a new criterion, ratio information criterion (RIC), based on weighted Kullback-Leibler divergence. The theoretical analysis includes the risk and Monte-Carlo variance of the RaSE classifier, establishing the screening consistency and weak consistency of RIC, and providing an upper bound for the misclassification rate of the RaSE classifier. In addition, we show that in a high-dimensional framework, the number of random subspaces needs to be very large to guarantee that a subspace covering signals is selected. Therefore, we propose an iterative version of the RaSE algorithm and prove that under some specific conditions, a smaller number of generated random subspaces are needed to find a desirable subspace through iteration. An array of simulations under various models and real-data applications demonstrate the effectiveness and robustness of the RaSE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking. The RaSE algorithm is implemented in the R package RaSEn on CRAN.
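The core loop described above can be sketched in a few lines. This is a toy reconstruction, not the RaSEn package: it uses a nearest-centroid base classifier and training accuracy as a stand-in for the RIC criterion, both of which are our simplifying assumptions.

```python
import random
from statistics import mean

def centroid_classifier(X, y, feats):
    """Nearest-centroid base learner restricted to the feature subset `feats`."""
    cents = {}
    for lab in (0, 1):
        rows = [x for x, t in zip(X, y) if t == lab]
        cents[lab] = [mean(r[j] for r in rows) for j in feats]
    def predict(x):
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(feats, cents[lab]))
                for lab in (0, 1)}
        return min(dist, key=dist.get)
    return predict

def rase_sketch(X, y, B1=10, B2=20, dmax=2, seed=0):
    """For each of B1 weak learners, draw B2 random subspaces, keep the
    best by training accuracy (stand-in for RIC), then majority-vote."""
    rng = random.Random(seed)
    p = len(X[0])
    learners = []
    for _ in range(B1):
        best_acc, best_clf = -1.0, None
        for _ in range(B2):
            feats = rng.sample(range(p), rng.randint(1, dmax))
            clf = centroid_classifier(X, y, feats)
            acc = mean(clf(x) == t for x, t in zip(X, y))
            if acc > best_acc:
                best_acc, best_clf = acc, clf
        learners.append(best_clf)
    return lambda x: 1 if sum(clf(x) for clf in learners) > B1 / 2 else 0

# Toy data: only feature 0 of 4 carries signal
rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
y = [1 if x[0] > 0 else 0 for x in X]
clf = rase_sketch(X, y)
acc = mean(clf(x) == t for x, t in zip(X, y))
print(round(acc, 2))  # training accuracy; high because feature 0 separates the classes
```

Because subspaces containing the signal feature win the per-learner selection, the aggregated vote recovers sparse signals even though each candidate subspace is tiny, which is the intuition behind the framework.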

Targeting Predictors Via Partial Distance Correlation With Applications to Financial Forecasting

Yousuf, K., & Feng, Y.

Publication year

2021

Journal title

Journal of Business and Economic Statistics
Abstract
High-dimensional time series datasets are becoming increasingly common in various fields of economics and finance. Given the ubiquity of time series data, it is crucial to develop efficient variable screening methods that use the unique features of time series. This article introduces several model-free screening methods based on partial distance correlation and developed specifically to deal with time-dependent data. Methods are developed both for univariate models, such as nonlinear autoregressive models with exogenous predictors (NARX), and multivariate models such as linear or nonlinear VAR models. Sure screening properties are proved for our methods, which depend on the moment conditions, and the strength of dependence in the response and covariate processes, amongst other factors. We show the effectiveness of our methods via extensive simulation studies and an application on forecasting U.S. market returns.

The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases

Tang, F., Feng, Y., Chiheb, H., & Fan, J.

Publication year

2021

Journal title

Journal of the American Statistical Association

Volume

116

Issue

534

Page(s)

492-506
Abstract
With the severity of the COVID-19 outbreak, we characterize the nature of the growth trajectories of counties in the United States using a novel combination of spectral clustering and the correlation matrix. As the United States and the rest of the world are still suffering from the effects of the virus, the importance of assigning growth membership to counties and understanding the determinants of the growth is increasingly evident. For the two communities (faster versus slower growth trajectories) we cluster the counties into, the average between-group correlation is 88.4% whereas the average within-group correlations are 95.0% and 93.8%. The average growth rate for one group is 0.1589 and 0.1704 for the other, further suggesting that our methodology captures meaningful differences between the nature of the growth across various counties. Subsequently, we select the demographic features that are most statistically significant in distinguishing the communities: number of grocery stores, number of bars, Asian population, White population, median household income, number of people with bachelor's degrees, and population density. Lastly, we effectively predict the future growth of a given county with a long short-term memory (LSTM) recurrent neural network using three social distancing scores. The best-performing model achieves a median out-of-sample R^2 of 0.6251 for a four-day ahead prediction, and we find that the number of communities and social distancing features play an important role in producing more accurate forecasts. This comprehensive study captures the nature of the counties' growth in cases at a very micro-level using growth communities, demographic factors, and social distancing performance, to help government agencies make informed decisions about which counties to target with resources and funding.
Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

Visceral adipose tissue in patients with COVID-19: risk stratification for severity

Chandarana, H., Dane, B., Mikheev, A., Taffel, M. T., Feng, Y., & Rusinek, H.

Publication year

2021

Journal title

Abdominal Radiology

Volume

46

Issue

2

Page(s)

818-825
Abstract
Purpose: To assess visceral (VAT), subcutaneous (SAT), and total adipose tissue (TAT) estimates at abdominopelvic CT in COVID-19 patients with different severity, and analyze Body Mass Index (BMI) and CT estimates of fat content in patients requiring hospitalization. Methods: In this retrospective IRB-approved HIPAA-compliant study, 51 patients with SARS-CoV-2 infection who underwent abdominopelvic CT were included. Patients were stratified by disease severity as outpatients (no hospital admission) or hospitalized patients. A subset of hospitalized patients required mechanical ventilation (MV). A radiologist blinded to the clinical outcome evaluated a single axial slice on CT at the L3 vertebral body for VATL3, SATL3, TATL3, and VAT/TATL3. These measures, along with age, gender, and BMI, were compared. A clinical model that included age, sex, and BMI was compared to a clinical + CT model that also included VATL3 to discriminate hospitalized patients from outpatients. Results: There were ten outpatients and 41 hospitalized patients. Eleven hospitalized patients required MV. There were no significant differences in age and BMI between the hospitalized patients and outpatients (all p > 0.05). VATL3 and VAT/TATL3 were significantly higher in hospitalized patients compared to the outpatients (all p < 0.05). The area under the curve (AUC) of the clinical + CT model was higher than that of the clinical model (AUC 0.847 versus 0.750) for identifying patients requiring hospitalization. Conclusion: Higher VATL3 was observed in COVID-19 patients who required hospitalization compared to the outpatients, and the addition of VATL3 to the clinical model improved the AUC in discriminating hospitalized patients from outpatients in this preliminary study.

A projection-based conditional dependence measure with applications to high-dimensional undirected graphical models

Fan, J., Feng, Y., & Xia, L.

Publication year

2020

Journal title

Journal of Econometrics

Volume

218

Issue

1

Page(s)

119-139
Abstract
Measuring conditional dependence is an important topic in econometrics with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure based on projection is proposed. The corresponding conditional independence test is developed with the asymptotic null distribution unveiled where the number of factors could be high-dimensional. It is also shown that the new test has control over the asymptotic type I error and can be calculated efficiently. A generic method for building dependency graphs without Gaussian assumption using the new test is elaborated. We show the superiority of the new method, implemented in the R package pgraph, through simulation and real data studies.

Accounting for incomplete testing in the estimation of epidemic parameters

Betensky, R. A., & Feng, Y.

Publication year

2020

Journal title

International Journal of Epidemiology

Volume

49

Issue

5

Page(s)

1419-1426

Nested model averaging on solution path for high-dimensional linear regression

Feng, Y., & Liu, Q.

Publication year

2020

Journal title

Stat

Volume

9

Issue

1
Abstract
We study the nested model averaging method on the solution path for a high-dimensional linear regression problem. In particular, we propose to combine model averaging with regularized estimators (e.g., lasso, elastic net, and Sorted L-One Penalized Estimation [SLOPE]) on the solution path for high-dimensional linear regression. In simulation studies, we first conduct a systematic investigation on the impact of predictor ordering on the behaviour of nested model averaging, and then show that nested model averaging with lasso, elastic net and SLOPE compares favourably with other competing methods, including the infeasible lasso, elastic net, and SLOPE with the tuning parameter optimally selected. A real data analysis on predicting the per capita violent crime in the United States shows outstanding performance of the nested model averaging with lasso.

Neyman-Pearson classification: Parametrics and sample size requirement

Tong, X., Xia, L., Wang, J., & Feng, Y.

Publication year

2020

Journal title

Journal of Machine Learning Research

Volume

21
Abstract
The Neyman-Pearson (NP) paradigm in binary classification seeks classifiers that achieve a minimal type II error while enforcing the prioritized type I error controlled under some user-specified level α. This paradigm serves naturally in applications such as severe disease diagnosis and spam detection, where people have clear priorities among the two error types. Recently, Tong et al. (2018) proposed a nonparametric umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression, support vector machines, random forest) to respect the given type I error (i.e., conditional probability of classifying a class 0 observation as class 1 under the 0-1 coding) upper bound α with high probability, without specific distributional assumptions on the features and the responses. Universal as the umbrella algorithm is, it demands an explicit minimum sample size requirement on class 0, which is often the more scarce class, such as in rare disease diagnosis applications. In this work, we employ the parametric linear discriminant analysis (LDA) model and propose a new parametric thresholding algorithm, which does not need the minimum sample size requirements on class 0 observations and thus is suitable for small sample applications such as rare disease diagnosis. Leveraging both the existing nonparametric and the newly proposed parametric thresholding rules, we propose four LDA-based NP classifiers, for both low- and high-dimensional settings. On the theoretical front, we prove NP oracle inequalities for one proposed classifier, where the rate for excess type II error benefits from the explicit parametric model assumption. Furthermore, as NP classifiers involve a sample splitting step of class 0 observations, we construct a new adaptive sample splitting scheme that can be applied universally to NP classifiers, and this adaptive strategy reduces the type II error of these classifiers. The proposed NP classifiers are implemented in the R package nproc.

On the estimation of correlation in a binary sequence model

Weng, H., & Feng, Y.

Publication year

2020

Journal title

Journal of Statistical Planning and Inference

Volume

207

Page(s)

123-137
Abstract
We consider a binary sequence generated by thresholding a hidden continuous sequence. The hidden variables are assumed to have a compound symmetry covariance structure with a single parameter characterizing the common correlation. We study the parameter estimation problem under such one-parameter models. We demonstrate that maximizing the likelihood function does not yield consistent estimates for the correlation. We then formally prove the nonestimability of the parameter by deriving a non-vanishing minimax lower bound. This counter-intuitive phenomenon provides an interesting insight that one-bit information of each latent variable is not sufficient to consistently recover their common correlation. On the other hand, we further show that trinary data generated from the hidden variables can consistently estimate the correlation with parametric convergence rate. Thus we reveal a phase transition phenomenon regarding the discretization of latent continuous variables while preserving the estimability of the correlation. Numerical experiments are performed to validate the conclusions.

On the sparsity of Mallows model averaging estimator

Feng, Y., Liu, Q., & Okui, R.

Publication year

2020

Journal title

Economics Letters

Volume

187
Abstract
We show that the Mallows model averaging estimator proposed by Hansen (2007) can be written as a least squares estimation with a weighted L1 penalty and additional constraints. By exploiting this representation, we demonstrate that the weight vector obtained by this model averaging procedure has a sparsity property in the sense that a subset of models receives exactly zero weights. Moreover, this representation allows us to adapt algorithms developed to efficiently solve minimization problems with many parameters and weighted L1 penalty. In particular, we develop a new coordinate-wise descent algorithm for model averaging. Simulation studies show that the new algorithm computes the model averaging estimator much faster and requires less memory than conventional methods when there are many models.

A Kronecker Product Model for Repeated Pattern Detection on 2D Urban Images

Liu, J., Psarakis, E. Z., Feng, Y., & Stamos, I.

Publication year

2019

Journal title

IEEE Transactions on Pattern Analysis and Machine Intelligence

Volume

41

Issue

9

Page(s)

2266-2272
Abstract
Repeated patterns (such as windows, balconies, and doors) are prominent and significant features in urban scenes. Therefore, detection of these repeated patterns becomes very important for city scene analysis. This paper attacks the problem of repeated pattern detection in a precise, efficient and automatic way, by combining traditional feature extraction with a Kronecker product based low-rank model. We introduce novel algorithms, with solid theoretical support, that extract repeated patterns from rectified images. Our method is tailored for 2D images of building façades and tested on a large set of façade images.

Likelihood adaptively modified penalties

Feng, Y., Li, T., & Ying, Z.

Publication year

2019

Journal title

Applied Stochastic Models in Business and Industry

Volume

35

Issue

2

Page(s)

330-353
Abstract
A new family of penalty functions, i.e., adaptive to the likelihood, is introduced for model selection in general regression models. It arises naturally through assuming certain types of prior distribution on the regression parameters. To study the stability properties of the penalized maximum-likelihood estimator, two types of asymptotic stability are defined. Theoretical properties, including parameter estimation consistency, model selection consistency, and asymptotic stability, are established under suitable regularity conditions. An efficient coordinate-descent algorithm is proposed. Simulation results and real data analysis show that the proposed approach has competitive performance in comparison with existing methods.

Regularization after retention in ultrahigh dimensional linear regression models

Weng, H., Feng, Y., & Qiao, X.

Publication year

2019

Journal title

Statistica Sinica

Volume

29

Issue

1

Page(s)

387-407
Abstract
In the ultrahigh dimensional setting, independence screening has been shown, both theoretically and empirically, to be a useful variable selection framework with low computation cost. In this work, we propose a two-step framework that uses marginal information in a different fashion than independence screening. In particular, we retain significant variables rather than screening out irrelevant ones. The method is shown to be model selection consistent in the ultrahigh dimensional linear regression model. To improve the finite sample performance, we then introduce a three-step version and characterize its asymptotic behavior. Simulations and data analysis show the advantages of our method over independence screening and its iterative variants in certain regimes.

The restricted consistency property of leave-n_v-out cross-validation for high-dimensional variable selection

Feng, Y., & Yu, Y.

Publication year

2019

Journal title

Statistica Sinica

Volume

29

Issue

3

Page(s)

1607-1630
Abstract
Cross-validation (CV) methods are popular for selecting the tuning parameter in high-dimensional variable selection problems. We show that a misalignment of the CV is one possible reason for its over-selection behavior. To fix this issue, we propose using a version of leave-n_v-out CV (CV(n_v)) to select the optimal model from a restricted candidate model set for high-dimensional generalized linear models. By using the same candidate model sequence and a proper order for the construction sample size n_c in each CV split, CV(n_v) avoids potential problems when developing theoretical properties. CV(n_v) is shown to exhibit the restricted model-selection consistency property under mild conditions. Extensive simulations and a real-data analysis support the theoretical results and demonstrate the performance of CV(n_v) in terms of both model selection and prediction.

A crowdsourced analysis to identify ab initio molecular signatures predictive of susceptibility to viral infection

Publication year

2018

Journal title

Nature communications

Volume

9

Issue

1
Abstract
The response to respiratory viruses varies substantially between individuals, and there are currently no known molecular predictors from the early stages of infection. Here we conduct a community-based analysis to determine whether pre- or early post-exposure molecular factors could predict physiologic responses to viral exposure. Using peripheral blood gene expression profiles collected from healthy subjects prior to exposure to one of four respiratory viruses (H1N1, H3N2, Rhinovirus, and RSV), as well as up to 24 h following exposure, we find that it is possible to construct models predictive of symptomatic response using profiles even prior to viral exposure. Analysis of predictive gene features reveal little overlap among models; however, in aggregate, these genes are enriched for common pathways. Heme metabolism, the most significantly enriched pathway, is associated with a higher risk of developing symptoms following viral exposure. This study demonstrates that pre-exposure molecular predictors can be identified and improves our understanding of the mechanisms of response to respiratory viruses.

Model Selection for High-Dimensional Quadratic Regression via Regularization

Hao, N., Feng, Y., & Zhang, H. H.

Publication year

2018

Journal title

Journal of the American Statistical Association

Volume

113

Issue

522

Page(s)

615-625
Abstract
Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demands high computational cost and hence are not feasible for high-dimensional data. This article focuses on scalable regularization methods for model selection in high-dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then, a new regularization method, called regularization algorithm under marginality principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results are also shown to demonstrate performance of the methods. Supplementary materials for this article are available online.

Neyman-Pearson classification algorithms and NP receiver operating characteristics

Tong, X., Feng, Y., & Li, J. J.

Publication year

2018

Journal title

Science Advances

Volume

4

Issue

2
Abstract
In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit type I error (that is, the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (that is, the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, α, on the type I error. Despite its century-long history in hypothesis testing, the NP paradigm has not been well recognized and implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than α do not satisfy the type I error control objective because the resulting classifiers are likely to have type I errors much larger than α, and the NP paradigm has not been properly implemented in practice. We develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, such as logistic regression, support vector machines, and random forests. Powered by this algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular ROC curves. NP-ROC bands will help choose α in a data-adaptive way and compare different NP classifiers. We demonstrate the use and properties of the NP umbrella algorithm and NP-ROC bands, available in the R package nproc, through simulation and real data studies.
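The rank rule behind the umbrella algorithm can be illustrated numerically. The sketch below is ours, not the nproc package code: among n held-out class-0 scores sorted in ascending order, it picks the smallest order-statistic rank k whose probability of yielding a type I error above α is at most a tolerance δ (here α = δ = 0.05); the tail sum follows from the Beta distribution of uniform order statistics.

```python
from math import comb

def violation_prob(k, n, alpha):
    # P(type I error > alpha) when the threshold is the k-th smallest
    # of n held-out class-0 scores: a binomial tail probability
    return sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
               for j in range(k, n + 1))

def umbrella_rank(n, alpha=0.05, delta=0.05):
    """Smallest usable rank, or None when n is below the minimum
    class-0 sample size (for alpha = delta = 0.05, n must be >= 59)."""
    for k in range(1, n + 1):
        if violation_prob(k, n, alpha) <= delta:
            return k
    return None

print(umbrella_rank(100))  # 99: threshold at the 99th smallest score
print(umbrella_rank(30))   # None: too few class-0 observations
```

Any scoring-type classifier can supply the scores; one then classifies as class 1 whenever the score exceeds the k-th smallest held-out class-0 score. The None case reflects the minimum class-0 sample size requirement discussed in the abstracts above.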

Nonparametric independence screening via favored smoothing bandwidth

Feng, Y., Wu, Y., & Stefanski, L. A.

Publication year

2018

Journal title

Journal of Statistical Planning and Inference

Volume

197

Page(s)

1-14
Abstract
We propose a flexible nonparametric regression method for ultrahigh-dimensional data. As a first step, we propose a fast screening method based on the favored smoothing bandwidth of the marginal local constant regression. Then, an iterative procedure is developed to recover both the important covariates and the regression function. Theoretically, we prove that the favored smoothing bandwidth based screening possesses the model selection consistency property. Simulation studies as well as real data analysis show the competitive performance of the new procedure.

Penalized weighted least absolute deviation regression

Gao, X., & Feng, Y.

Publication year

2018

Journal title

Statistics and its Interface

Volume

11

Issue

1

Page(s)

79-89
Abstract
In a linear model where the data is contaminated or the random error is heavy-tailed, least absolute deviation (LAD) regression has been widely used as an alternative approach to least squares (LS) regression. However, it is well known that LAD regression is not robust to outliers in the explanatory variables. When the data includes some leverage points, LAD regression may perform even worse than LS regression. In this manuscript, we propose to improve LAD regression in a penalized weighted least absolute deviation (PWLAD) framework. The main idea is to associate each observation with a weight reflecting the degree of outlying and leverage effect and obtain both the weight and coefficient vector estimation simultaneously and adaptively. The proposed PWLAD is able to provide regression coefficients estimate with strong robustness, and perform outlier detection at the same time, even when the random error does not have finite variances. We provide sufficient conditions under which PWLAD is able to identify true outliers consistently. The performance of the proposed estimator is demonstrated via extensive simulation studies and real examples.

Contact

yang.feng@nyu.edu
708 Broadway, 7FL
New York, NY 10003