Yang Feng
Professor of Biostatistics
-
Professional overview
-
Yang Feng is a Professor and Ph.D. Program Director of Biostatistics in the School of Global Public Health and an affiliate faculty in the Center for Data Science at New York University. He obtained his Ph.D. in Operations Research at Princeton University in 2010.
Feng's research interests span the theoretical and methodological aspects of machine learning, high-dimensional statistics, social network models, and nonparametric statistics, with practical applications to Alzheimer's disease research, cancer classification, and electronic health records. His research has been funded by multiple grants from the National Institutes of Health (NIH) and the National Science Foundation (NSF), including the NSF CAREER Award.
He is currently an Associate Editor for the Journal of the American Statistical Association (JASA), the Journal of Business & Economic Statistics (JBES), Journal of Computational & Graphical Statistics (JCGS), and the Annals of Applied Statistics (AoAS). His professional recognitions include being named a fellow of the American Statistical Association (ASA) and the Institute of Mathematical Statistics (IMS), as well as an elected member of the International Statistical Institute (ISI).
Please visit Dr. Yang Feng's website and Google Scholar page for more information.
-
Education
-
B.S. in Mathematics, University of Science and Technology of China, Hefei, China
Ph.D. in Operations Research, Princeton University, Princeton, NJ
-
Areas of research and study
-
Bioinformatics
Biostatistics
High-dimensional data analysis/integration
Machine learning
Modeling Social and Behavioral Dynamics
Nonparametric statistics
-
Publications
Super RaSE: Super Random Subspace Ensemble Classification
Zhu, J., & Feng, Y. (2021). Journal of Risk and Financial Management, 14(12).
Abstract: We propose a new ensemble classification algorithm, named super random subspace ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the random subspace ensemble algorithm (RaSE). The RaSE method was shown to be a flexible framework that can be coupled with any existing base classifier. However, the success of RaSE largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively changes the base classifier distribution as well as the subspace distribution. We show that the Super RaSE algorithm and its iterative version perform competitively for a wide range of simulated data sets and two real data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.
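
To make the sampling idea concrete, here is a minimal Python sketch of an ensemble in which each member draws both a random feature subspace and a random base classifier before a majority vote. It illustrates the joint classifier-plus-subspace sampling only; it is not the RaSEn package's implementation, which additionally screens candidate subspaces with an information criterion and supports the iterative updates described above. Labels are assumed to be coded 0/1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def random_classifier_subspace_ensemble(X_train, y_train, X_test,
                                        B=200, max_dim=5, seed=0):
    """Toy ensemble: each member draws a random feature subspace AND a
    random base classifier; test labels are decided by majority vote."""
    rng = np.random.default_rng(seed)
    make_base = [
        lambda: LogisticRegression(max_iter=1000),
        lambda: DecisionTreeClassifier(max_depth=3),
        lambda: KNeighborsClassifier(n_neighbors=5),
    ]
    p = X_train.shape[1]
    votes = np.zeros(X_test.shape[0])
    for _ in range(B):
        d = rng.integers(1, max_dim + 1)                 # subspace size
        feats = rng.choice(p, size=d, replace=False)     # random subspace
        clf = make_base[rng.integers(len(make_base))]()  # random base learner
        clf.fit(X_train[:, feats], y_train)
        votes += clf.predict(X_test[:, feats])
    return (votes / B >= 0.5).astype(int)                # majority vote
```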

The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases
Tang, F., Feng, Y., Chiheb, H., & Fan, J. (2021). Journal of the American Statistical Association, 116(534), 492-506.
Abstract: With the severity of the COVID-19 outbreak, we characterize the nature of the growth trajectories of counties in the United States using a novel combination of spectral clustering and the correlation matrix. As the United States and the rest of the world are still suffering from the effects of the virus, the importance of assigning growth membership to counties and understanding the determinants of the growth is increasingly evident. For the two communities (faster versus slower growth trajectories) we cluster the counties into, the average between-group correlation is 88.4% whereas the average within-group correlations are 95.0% and 93.8%. The average growth rate for one group is 0.1589 and 0.1704 for the other, further suggesting that our methodology captures meaningful differences between the nature of the growth across various counties. Subsequently, we select the demographic features that are most statistically significant in distinguishing the communities: number of grocery stores, number of bars, Asian population, White population, median household income, number of people with bachelor's degrees, and population density. Lastly, we effectively predict the future growth of a given county with a long short-term memory (LSTM) recurrent neural network using three social distancing scores. The best-performing model achieves a median out-of-sample R² of 0.6251 for a four-day-ahead prediction, and we find that the number of communities and social distancing features play an important role in producing more accurate forecasting. This comprehensive study captures the nature of the counties' growth in cases at a very micro-level using growth communities, demographic factors, and social distancing performance to help government agencies utilize known information to make appropriate decisions regarding which potential counties to target resources and funding to. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
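
A hedged sketch of the clustering step: treat the county-by-county correlation matrix of growth rates as an affinity and split the counties into two communities with spectral clustering. The function name and the clipping of negative correlations to zero are illustrative choices, not the paper's exact construction of the affinity.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_growth_trajectories(growth_rates, n_clusters=2):
    """growth_rates: (n_counties, n_days) array of daily case growth rates.
    Groups counties whose growth trajectories move together."""
    corr = np.corrcoef(growth_rates)          # county-by-county correlation
    affinity = np.clip(corr, 0.0, None)       # affinity must be non-negative
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    return labels, corr
```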

Visceral adipose tissue in patients with COVID-19: risk stratification for severity
Chandarana, H., Dane, B., Mikheev, A., Taffel, M. T., Feng, Y., & Rusinek, H. (2021). Abdominal Radiology, 46(2), 818-825.
Abstract: PURPOSE: To assess visceral (VAT), subcutaneous (SAT), and total adipose tissue (TAT) estimates at abdominopelvic CT in COVID-19 patients with different severity, and analyze Body Mass Index (BMI) and CT estimates of fat content in patients requiring hospitalization. METHODS: In this retrospective, IRB-approved, HIPAA-compliant study, 51 patients with SARS-CoV-2 infection with abdominopelvic CT were included. Patients were stratified based on disease severity as outpatients (no hospital admission) and patients who were hospitalized. A subset of hospitalized patients required mechanical ventilation (MV). A radiologist blinded to the clinical outcome evaluated a single axial slice on CT at the L3 vertebral body for VAT L3, SAT L3, TAT L3, and VAT/TAT L3. These measures, along with age, gender, and BMI, were compared. A clinical model that included age, sex, and BMI was compared to a clinical + CT model that also included VAT L3 to discriminate hospitalized patients from outpatients. RESULTS: There were ten outpatients and 41 hospitalized patients. Eleven hospitalized patients required MV. There were no significant differences in age and BMI between the hospitalized patients and outpatients (all p > 0.05). There was significantly higher VAT L3 and VAT/TAT L3 in hospitalized patients compared to the outpatients (all p < 0.05). The area under the curve (AUC) of the clinical + CT model was higher compared to the clinical model (AUC 0.847 versus 0.750) for identifying patients requiring hospitalization. CONCLUSION: Higher VAT L3 was observed in COVID-19 patients who required hospitalization compared to the outpatients, and the addition of VAT L3 to the clinical model improved the AUC in discriminating hospitalized patients from outpatients in this preliminary study.

A projection-based conditional dependence measure with applications to high-dimensional undirected graphical models
Fan, J., Feng, Y., & Xia, L. (2020). Journal of Econometrics, 218(1), 119-139.
Abstract: Measuring conditional dependence is an important topic in econometrics with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure based on projection is proposed. The corresponding conditional independence test is developed with the asymptotic null distribution unveiled where the number of factors could be high-dimensional. It is also shown that the new test has control over the asymptotic type I error and can be calculated efficiently. A generic method for building dependency graphs without Gaussian assumption using the new test is elaborated. We show the superiority of the new method, implemented in the R package pgraph, through simulation and real data studies.

Accounting for incomplete testing in the estimation of epidemic parameters
Betensky, R. A., & Feng, Y. (2020). International Journal of Epidemiology, 49(5), 1419-1426.

Nested model averaging on solution path for high-dimensional linear regression
Feng, Y., & Liu, Q. (2020). Stat, 9(1).
Abstract: We study the nested model averaging method on the solution path for a high-dimensional linear regression problem. In particular, we propose to combine model averaging with regularized estimators (e.g., lasso, elastic net, and Sorted L-One Penalized Estimation [SLOPE]) on the solution path for high-dimensional linear regression. In simulation studies, we first conduct a systematic investigation on the impact of predictor ordering on the behaviour of nested model averaging, and then show that nested model averaging with lasso, elastic net, and SLOPE compares favourably with other competing methods, including the infeasible lasso, elastic net, and SLOPE with the tuning parameter optimally selected. A real data analysis on predicting the per capita violent crime in the United States shows outstanding performance of the nested model averaging with lasso.
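
A rough Python sketch of the general idea: the lasso path induces a sequence of nested candidate models, each nested model is refit, and the predictions are averaged. The inverse-validation-MSE weighting below is a simple stand-in for illustration; the weighting scheme studied in the paper differs.

```python
import numpy as np
from sklearn.linear_model import lasso_path, LinearRegression

def nested_model_average_predict(X_tr, y_tr, X_val, y_val, X_new):
    """Refit each nested model induced by the lasso path and average the
    predictions, weighting by inverse validation mean squared error."""
    _, coefs, _ = lasso_path(X_tr, y_tr)          # coefs: (n_features, n_alphas)
    preds, errs, seen = [], [], set()
    for j in range(coefs.shape[1]):
        support = tuple(np.flatnonzero(coefs[:, j]))
        if not support or support in seen:
            continue
        seen.add(support)
        cols = list(support)
        fit = LinearRegression().fit(X_tr[:, cols], y_tr)
        errs.append(np.mean((fit.predict(X_val[:, cols]) - y_val) ** 2))
        preds.append(fit.predict(X_new[:, cols]))
    w = 1.0 / np.asarray(errs)
    w /= w.sum()                                  # normalize the weights
    return np.asarray(preds).T @ w
```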

Neyman-Pearson classification: Parametrics and sample size requirement
Tong, X., Xia, L., Wang, J., & Feng, Y. (2020). Journal of Machine Learning Research, 21.
Abstract: The Neyman-Pearson (NP) paradigm in binary classification seeks classifiers that achieve a minimal type II error while enforcing the prioritized type I error controlled under some user-specified level α. This paradigm serves naturally in applications such as severe disease diagnosis and spam detection, where people have clear priorities among the two error types. Recently, Tong et al. (2018) proposed a nonparametric umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression, support vector machines, random forest) to respect the given type I error (i.e., conditional probability of classifying a class 0 observation as class 1 under the 0-1 coding) upper bound α with high probability, without specific distributional assumptions on the features and the responses. Universal as the umbrella algorithm is, it demands an explicit minimum sample size requirement on class 0, which is often the more scarce class, such as in rare disease diagnosis applications. In this work, we employ the parametric linear discriminant analysis (LDA) model and propose a new parametric thresholding algorithm, which does not need the minimum sample size requirements on class 0 observations and thus is suitable for small sample applications such as rare disease diagnosis. Leveraging both the existing nonparametric and the newly proposed parametric thresholding rules, we propose four LDA-based NP classifiers, for both low- and high-dimensional settings. On the theoretical front, we prove NP oracle inequalities for one proposed classifier, where the rate for excess type II error benefits from the explicit parametric model assumption. Furthermore, as NP classifiers involve a sample splitting step of class 0 observations, we construct a new adaptive sample splitting scheme that can be applied universally to NP classifiers, and this adaptive strategy reduces the type II error of these classifiers. The proposed NP classifiers are implemented in the R package nproc.

On the estimation of correlation in a binary sequence model
Weng, H., & Feng, Y. (2020). Journal of Statistical Planning and Inference, 207, 123-137.
Abstract: We consider a binary sequence generated by thresholding a hidden continuous sequence. The hidden variables are assumed to have a compound symmetry covariance structure with a single parameter characterizing the common correlation. We study the parameter estimation problem under such one-parameter models. We demonstrate that maximizing the likelihood function does not yield consistent estimates for the correlation. We then formally prove the nonestimability of the parameter by deriving a non-vanishing minimax lower bound. This counter-intuitive phenomenon provides an interesting insight that one-bit information of each latent variable is not sufficient to consistently recover their common correlation. On the other hand, we further show that trinary data generated from the hidden variables can consistently estimate the correlation with parametric convergence rate. Thus we reveal a phase transition phenomenon regarding the discretization of latent continuous variables while preserving the estimability of the correlation. Numerical experiments are performed to validate the conclusions.

On the sparsity of Mallows model averaging estimator
Feng, Y., Liu, Q., & Okui, R. (2020). Economics Letters, 187.
Abstract: We show that the Mallows model averaging estimator proposed by Hansen (2007) can be written as a least squares estimation with a weighted L1 penalty and additional constraints. By exploiting this representation, we demonstrate that the weight vector obtained by this model averaging procedure has a sparsity property in the sense that a subset of models receives exactly zero weights. Moreover, this representation allows us to adapt algorithms developed to efficiently solve minimization problems with many parameters and weighted L1 penalty. In particular, we develop a new coordinate-wise descent algorithm for model averaging. Simulation studies show that the new algorithm computes the model averaging estimator much faster and requires less memory than conventional methods when there are many models.
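
For reference, the Hansen (2007) Mallows criterion underlying this representation can be written as follows (notation ours): with candidate model m producing fitted values \hat{\mu}_m from k_m parameters,

    \hat{w} = \arg\min_{w \in \mathcal{H}_M} \Big\| y - \sum_{m=1}^{M} w_m \hat{\mu}_m \Big\|_2^2 + 2\sigma^2 \sum_{m=1}^{M} k_m w_m,
    \qquad \mathcal{H}_M = \Big\{ w : w_m \ge 0, \ \sum_{m=1}^{M} w_m = 1 \Big\}.

Because every w_m \ge 0, the term 2\sigma^2 \sum_m k_m w_m equals the weighted \ell_1 penalty 2\sigma^2 \sum_m k_m |w_m|, so the problem is a least squares criterion in w with a weighted \ell_1 penalty plus the simplex constraints, which is the representation the paper exploits for sparsity and coordinate-wise descent.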

A Kronecker Product Model for Repeated Pattern Detection on 2D Urban Images
Liu, J., Psarakis, E. Z., Feng, Y., & Stamos, I. (2019). IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), 2266-2272.
Abstract: Repeated patterns (such as windows, balconies, and doors) are prominent and significant features in urban scenes. Therefore, detection of these repeated patterns becomes very important for city scene analysis. This paper attacks the problem of repeated pattern detection in a precise, efficient and automatic way, by combining traditional feature extraction with a Kronecker product based low-rank model. We introduce novel algorithms that extract repeated patterns from rectified images with solid theoretical support. Our method is tailored for 2D images of building façades and tested on a large set of façade images.

Likelihood adaptively modified penalties
Feng, Y., Li, T., & Ying, Z. (2019). Applied Stochastic Models in Business and Industry, 35(2), 330-353.
Abstract: A new family of penalty functions, i.e., adaptive to likelihood, is introduced for model selection in general regression models. It arises naturally through assuming certain types of prior distribution on the regression parameters. To study the stability properties of the penalized maximum-likelihood estimator, two types of asymptotic stability are defined. Theoretical properties, including the parameter estimation consistency, model selection consistency, and asymptotic stability, are established under suitable regularity conditions. An efficient coordinate-descent algorithm is proposed. Simulation results and real data analysis show that the proposed approach has competitive performance in comparison with the existing methods.

Regularization after retention in ultrahigh dimensional linear regression models
Weng, H., Feng, Y., & Qiao, X. (2019). Statistica Sinica, 29(1), 387-407.
Abstract: In the ultrahigh dimensional setting, independence screening has been both theoretically and empirically proved a useful variable selection framework with low computation cost. In this work, we propose a two-step framework using marginal information in a different fashion than independence screening. In particular, we retain significant variables rather than screening out irrelevant ones. The method is shown to be model selection consistent in the ultrahigh dimensional linear regression model. To improve the finite sample performance, we then introduce a three-step version and characterize its asymptotic behavior. Simulations and data analysis show advantages of our method over independence screening and its iterative variants in certain regimes.
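
One plausible way to illustrate the retain-then-regularize idea in Python (the paper's actual two- and three-step procedures differ in their retention rule and second-step estimator): retain covariates with strongly significant marginal associations, then run a lasso over the remaining covariates on the residuals from the retained set. The threshold retain_alpha and the residual-based second step are illustrative assumptions, not the published algorithm.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression, LassoCV

def retain_then_regularize(X, y, retain_alpha=1e-4):
    """Step 1: retain covariates whose marginal correlation with y is strongly
    significant.  Step 2: lasso of the residuals on the remaining covariates."""
    n, p = X.shape
    pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
    retained = np.flatnonzero(pvals < retain_alpha)
    rest = np.setdiff1d(np.arange(p), retained)

    if retained.size:       # fit the retained set without any penalty
        resid = y - LinearRegression().fit(X[:, retained], y).predict(X[:, retained])
    else:
        resid = y - y.mean()
    lasso = LassoCV(cv=5).fit(X[:, rest], resid)
    return np.union1d(retained, rest[np.abs(lasso.coef_) > 0])
```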

The restricted consistency property of leave-nv-out cross-validation for high-dimensional variable selection
Feng, Y., & Yu, Y. (2019). Statistica Sinica, 29(3), 1607-1630.
Abstract: Cross-validation (CV) methods are popular for selecting the tuning parameter in high-dimensional variable selection problems. We show that a misalignment of the CV is one possible reason for its over-selection behavior. To fix this issue, we propose using a version of leave-nv-out CV (CV(nv)) to select the optimal model from a restricted candidate model set for high-dimensional generalized linear models. By using the same candidate model sequence and a proper order for the construction sample size nc in each CV split, CV(nv) avoids potential problems when developing theoretical properties. CV(nv) is shown to exhibit the restricted model-selection consistency property under mild conditions. Extensive simulations and a real-data analysis support the theoretical results and demonstrate the performance of CV(nv) in terms of both model selection and prediction.

A crowdsourced analysis to identify ab initio molecular signatures predictive of susceptibility to viral infection
(2018). Nature Communications, 9(1).
Abstract: The response to respiratory viruses varies substantially between individuals, and there are currently no known molecular predictors from the early stages of infection. Here we conduct a community-based analysis to determine whether pre- or early post-exposure molecular factors could predict physiologic responses to viral exposure. Using peripheral blood gene expression profiles collected from healthy subjects prior to exposure to one of four respiratory viruses (H1N1, H3N2, Rhinovirus, and RSV), as well as up to 24 h following exposure, we find that it is possible to construct models predictive of symptomatic response using profiles even prior to viral exposure. Analysis of predictive gene features reveals little overlap among models; however, in aggregate, these genes are enriched for common pathways. Heme metabolism, the most significantly enriched pathway, is associated with a higher risk of developing symptoms following viral exposure. This study demonstrates that pre-exposure molecular predictors can be identified and improves our understanding of the mechanisms of response to respiratory viruses.

Model Selection for High-Dimensional Quadratic Regression via Regularization
Hao, N., Feng, Y., & Zhang, H. H. (2018). Journal of the American Statistical Association, 113(522), 615-625.
Abstract: Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demand high computational cost and hence are not feasible for high-dimensional data. This article focuses on scalable regularization methods for model selection in high-dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then, a new regularization method, called regularization algorithm under marginality principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results are also shown to demonstrate performance of the methods. Supplementary materials for this article are available online.
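
As an illustration of the two-stage idea (not the RAMP path algorithm itself), a Python sketch: run a lasso on the main effects, then a second lasso on the selected main effects plus their pairwise interactions, so that every interaction kept has its parent main effects in the candidate set.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

def two_stage_interaction_lasso(X, y):
    """Stage 1: lasso on main effects.  Stage 2: lasso on the selected main
    effects plus their pairwise interactions (hierarchy is preserved)."""
    stage1 = LassoCV(cv=5).fit(X, y)
    main = np.flatnonzero(np.abs(stage1.coef_) > 0)

    # build interaction columns only among the stage-1 survivors
    interactions = [X[:, j] * X[:, k] for j, k in combinations(main, 2)]
    X2 = np.column_stack([X[:, main]] + interactions) if interactions else X[:, main]
    stage2 = LassoCV(cv=5).fit(X2, y)
    return main, stage2
```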

Neyman-Pearson classification algorithms and NP receiver operating characteristics
Tong, X., Feng, Y., & Li, J. J. (2018). Science Advances, 4(2).
Abstract: In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit type I error (that is, the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (that is, the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, α, on the type I error. Despite its century-long history in hypothesis testing, the NP paradigm has not been well recognized and implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than α do not satisfy the type I error control objective because the resulting classifiers are likely to have type I errors much larger than α, and the NP paradigm has not been properly implemented in practice. We develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, such as logistic regression, support vector machines, and random forests. Powered by this algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands motivated by the popular ROC curves. NP-ROC bands will help choose α in a data-adaptive way and compare different NP classifiers. We demonstrate the use and properties of the NP umbrella algorithm and NP-ROC bands, available in the R package nproc, through simulation and real data studies.
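
A hedged sketch of the thresholding step behind the umbrella idea: score held-out class-0 observations with any fitted scoring classifier, then pick the cutoff as an order statistic whose rank is chosen via a binomial tail bound so that the type I error exceeds α only with small probability. Treat the details (including the violation-probability formula) as an illustration rather than the nproc package's exact implementation.

```python
import numpy as np
from scipy.stats import binom

def np_umbrella_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Choose a cutoff from held-out class-0 scores so that the type I error
    exceeds alpha with probability at most delta (order-statistic argument)."""
    s = np.sort(class0_scores)
    n = len(s)
    # assumed bound: P(type I error > alpha) when the k-th smallest score is the cutoff
    violation = np.array([binom.sf(k - 1, n, 1 - alpha) for k in range(1, n + 1)])
    admissible = np.flatnonzero(violation <= delta)
    if admissible.size == 0:
        raise ValueError("too few class-0 observations for this (alpha, delta)")
    return s[admissible[0]]   # smallest admissible order statistic

# usage sketch, with any scoring classifier `clf` already fit on training data:
#   cutoff = np_umbrella_threshold(clf.predict_proba(X0_heldout)[:, 1])
#   y_hat = (clf.predict_proba(X_new)[:, 1] > cutoff).astype(int)
```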

Nonparametric independence screening via favored smoothing bandwidth
Feng, Y., Wu, Y., & Stefanski, L. A. (2018). Journal of Statistical Planning and Inference, 197, 1-14.
Abstract: We propose a flexible nonparametric regression method for ultrahigh-dimensional data. As a first step, we propose a fast screening method based on the favored smoothing bandwidth of the marginal local constant regression. Then, an iterative procedure is developed to recover both the important covariates and the regression function. Theoretically, we prove that the favored smoothing bandwidth based screening possesses the model selection consistency property. Simulation studies as well as real data analysis show the competitive performance of the new procedure.

Penalized weighted least absolute deviation regression
Gao, X., & Feng, Y. (2018). Statistics and its Interface, 11(1), 79-89.
Abstract: In a linear model where the data is contaminated or the random error is heavy-tailed, least absolute deviation (LAD) regression has been widely used as an alternative approach to least squares (LS) regression. However, it is well known that LAD regression is not robust to outliers in the explanatory variables. When the data includes some leverage points, LAD regression may perform even worse than LS regression. In this manuscript, we propose to improve LAD regression in a penalized weighted least absolute deviation (PWLAD) framework. The main idea is to associate each observation with a weight reflecting the degree of outlying and leverage effect and obtain both the weight and coefficient vector estimation simultaneously and adaptively. The proposed PWLAD is able to provide regression coefficient estimates with strong robustness, and perform outlier detection at the same time, even when the random error does not have finite variance. We provide sufficient conditions under which PWLAD is able to identify true outliers consistently. The performance of the proposed estimator is demonstrated via extensive simulation studies and real examples.

SIS: An R package for sure independence screening in ultrahigh-dimensional statistical models
Saldana, D. F., & Feng, Y. (2018). Journal of Statistical Software, 83.
Abstract: We revisit sure independence screening procedures for variable selection in generalized linear models and the Cox proportional hazards model. Through the publicly available R package SIS, we provide a unified environment to carry out variable selection using iterative sure independence screening (ISIS) and all of its variants. For the regularization steps in the ISIS recruiting process, available penalties include the LASSO, SCAD, and MCP, while the implemented variants for the screening steps are sample splitting, data-driven thresholding, and combinations thereof. Performance of these feature selection techniques is investigated by means of real and simulated data sets, where we find considerable improvements in terms of model selection and computational time between our algorithms and traditional penalized pseudo-likelihood methods applied directly to the full set of covariates.
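
For readers unfamiliar with the screening step, a minimal Python sketch of vanilla sure independence screening followed by a lasso refit; the SIS package itself works in R, also implements ISIS iterations, and supports SCAD/MCP penalties, none of which are shown here. The default screening size d = n/log(n) is a commonly used choice, assumed for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def sis_then_lasso(X, y, d=None):
    """Vanilla sure independence screening: keep the d features with the largest
    absolute marginal correlation with y, then run a lasso on the reduced set."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                   # common screening size
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    score = np.abs(Xc.T @ (y - y.mean())) / n    # proportional to |marginal correlation|
    keep = np.argsort(score)[::-1][:d]
    lasso = LassoCV(cv=5).fit(X[:, keep], y)
    return keep[np.abs(lasso.coef_) > 0], lasso
```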

Binary switch portfolio
Li, T., Chen, K., Feng, Y., & Ying, Z. (2017). Quantitative Finance, 17(5), 763-780.
Abstract: We propose herein a new portfolio selection method that switches between two distinct asset allocation strategies. An important component is a carefully designed adaptive switching rule, which is based on a machine learning algorithm. It is shown that using this adaptive switching strategy, the combined wealth of the new approach is a weighted average of that of the successive constant rebalanced portfolio and that of the 1/N portfolio. In particular, it is asymptotically superior to the 1/N portfolio under mild conditions in the long run. Applications to real data show that both the returns and the Sharpe ratios of the proposed binary switch portfolio are the best among several popular competing methods over varying time horizons and stock pools.

How Many Communities Are There?
Saldaña, D. F., Yu, Y., & Feng, Y. (2017). Journal of Computational and Graphical Statistics, 26(1), 171-181.
Abstract: Stochastic blockmodels and variants thereof are among the most widely used approaches to community detection for social networks and relational data. A stochastic blockmodel partitions the nodes of a network into disjoint sets, called communities. The approach is inherently related to clustering with mixture models, and raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution; however, for stochastic blockmodels, the conditional independence assumption given the communities of the endpoints among different edges is usually violated in practice. In this regard, we propose composite likelihood BIC (CL-BIC) to select the number of communities, and we show it is robust against possible misspecifications in the underlying stochastic blockmodel assumptions. We derive the requisite methodology and illustrate the approach using both simulated and real data. Supplementary materials containing the relevant computer code are available online.

JDINAC: Joint density-based non-parametric differential interaction network analysis and classification using high-dimensional sparse omics data
Ji, J., He, D., Feng, Y., He, Y., Xue, F., & Xie, L. (2017). Bioinformatics, 33(19), 3080-3087.
Abstract: Motivation: A complex disease is usually driven by a number of genes interwoven into networks, rather than a single gene product. Network comparison or differential network analysis has become an important means of revealing the underlying mechanism of pathogenesis and identifying clinical biomarkers for disease classification. Most studies, however, are limited to network correlations that mainly capture the linear relationship among genes, or rely on the assumption of a parametric probability distribution of gene measurements. They are restrictive in real application. Results: We propose a new Joint density-based non-parametric Differential Interaction Network Analysis and Classification (JDINAC) method to identify differential interaction patterns of network activation between two groups. At the same time, JDINAC uses the network biomarkers to build a classification model. The novelty of JDINAC lies in its potential to capture non-linear relations between molecular interactions using high-dimensional sparse data as well as to adjust confounding factors, without the need of the assumption of a parametric probability distribution of gene measurements. Simulation studies demonstrate that JDINAC provides more accurate differential network estimation and lower classification error than that achieved by other state-of-the-art methods. We apply JDINAC to a Breast Invasive Carcinoma dataset, which includes 114 patients who have both tumor and matched normal samples. The hub genes and differential interaction patterns identified were consistent with existing experimental studies. Furthermore, JDINAC discriminated the tumor and normal samples with high accuracy by virtue of the identified biomarkers. JDINAC provides a general framework for feature selection and classification using high-dimensional sparse omics data.

Post selection shrinkage estimation for high-dimensional data analysis
Gao, X., Ahmed, S. E., & Feng, Y. (2017). Applied Stochastic Models in Business and Industry, 33(2), 97-120.
Abstract: In high-dimensional data settings where p ≫ n, many penalized regularization approaches were studied for simultaneous variable selection and estimation. However, with the existence of covariates with weak effect, many existing variable selection methods, including Lasso and its generations, cannot distinguish covariates with weak and no contribution. Thus, prediction based on a subset model of selected covariates only can be inefficient. In this paper, we propose a post selection shrinkage estimation strategy to improve the prediction performance of a selected subset model. Such a post selection shrinkage estimator (PSE) is data adaptive and constructed by shrinking a post selection weighted ridge estimator in the direction of a selected candidate subset. Under an asymptotic distributional quadratic risk criterion, its prediction performance is explored analytically. We show that the proposed post selection PSE performs better than the post selection weighted ridge estimator. More importantly, it improves the prediction performance of any candidate subset model selected from most existing Lasso-type variable selection methods significantly. The relative performance of the post selection PSE is demonstrated by both simulation studies and real-data analysis.

Rejoinder to 'Post-selection shrinkage estimation for high-dimensional data analysis'
Gao, X., Ejaz Ahmed, S., & Feng, Y. (2017). Applied Stochastic Models in Business and Industry, 33(2), 131-135.
Abstract: This rejoinder to the paper entitled 'Post-selection shrinkage estimation for high-dimensional data analysis' discusses different aspects of the study. One fundamental ingredient of the work is to formally split the signals into strong and weak ones. The rationale is that the usual one-step method such as the least absolute shrinkage and selection operator (LASSO) may be very effective in detecting strong signals while failing to identify some weak ones, which in turn has a significant impact on the model fitting, as well as prediction. The discussions of both Fan and QYY contain very interesting comments on the separation of the three sets of variables.

A survey on Neyman-Pearson classification and suggestions for future research
Tong, X., Feng, Y., & Zhao, A. (2016). Wiley Interdisciplinary Reviews: Computational Statistics, 8(2), 64-81.
Abstract: In statistics and machine learning, classification studies how to automatically learn to make good qualitative predictions (i.e., assign class labels) based on past observations. Examples of classification problems include email spam filtering, fraud detection, and market segmentation. Binary classification, in which the potential class label is binary, has arguably the most widely used machine learning applications. Most existing binary classification methods target the minimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. The Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error subject to a type I error constraint under some user-specified level. Though NP classification has the potential to be an important subfield in the classification literature, it has not received much attention in the statistics and machine learning communities. This article is a survey on the current status of the NP classification literature. To stimulate readers' research interests, the authors also envision a few possible directions for future research in the NP paradigm and its applications.