Yang Feng
Professor of Biostatistics
-
Professional overview
-
Yang Feng is an Associate Professor of Biostatistics. He received his B.S. in mathematics from the University of Science and Technology of China and his Ph.D. in Operations Research from Princeton University.
Dr. Feng's research interests include machine learning with applications to public health, high-dimensional statistics, network models, nonparametric statistics, and bioinformatics. He has published in The Annals of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, Journal of Machine Learning Research, International Journal of Epidemiology, and Science Advances. Feng serves on the editorial boards of the Journal of Business & Economic Statistics, Statistica Sinica, Stat, and Statistical Analysis and Data Mining: The ASA Data Science Journal.
Prior to joining NYU, Feng was an Associate Professor of Statistics and an affiliated member in the Data Science Institute at Columbia University. He is an elected member of the International Statistical Institute and a recipient of the NSF CAREER award.
Please visit Dr. Yang Feng's website and Google Scholar page for more information.
-
Education
-
B.S. in Mathematics, University of Science and Technology of China, Hefei, China
Ph.D. in Operations Research, Princeton University, Princeton, NJ
-
Areas of research and study
-
Bioinformatics
Biostatistics
High-dimensional data analysis/integration
Machine learning
Modeling Social and Behavioral Dynamics
Nonparametric statistics
-
Publications
Tuning-parameter selection in regularized estimations of large covariance matrices
Fang, Y., Wang, B., & Feng, Y. (2016). Journal of Statistical Computation and Simulation, 86(3), 494-509.
Abstract: Recently many regularized estimators of large covariance matrices have been proposed, and the tuning parameters in these estimators are usually selected via cross-validation. However, there is a lack of consensus on the number of folds for conducting cross-validation. One round of cross-validation involves partitioning a sample of data into two complementary subsets, a training set and a validation set. In this manuscript, we demonstrate that if the estimation accuracy is measured in the Frobenius norm, the training set should consist of the majority of the data; whereas if the estimation accuracy is measured in the operator norm, the validation set should consist of the majority of the data. We also develop methods for selecting tuning parameters based on the bootstrap and compare them with their cross-validation counterparts. We demonstrate that the cross-validation methods with 'optimal' choices of folds are more appropriate than their bootstrap counterparts.

Variable selection and prediction with incomplete high-dimensional data
Liu, Y., Wang, Y., Feng, Y., & Wall, M. M. (2016). Annals of Applied Statistics, 10(1), 418-450.
Abstract: We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens. In this study, 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables and the missing proportion are high. MIRL shows improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.

Functional and Parametric Estimation in a Semi- and Nonparametric Model with Application to Mass-Spectrometry Data
Ma, W., Feng, Y., Chen, K., & Ying, Z. (2015). International Journal of Biostatistics, 11(2), 285-303.
Abstract: Motivated by the modeling and analysis of mass-spectrometry data, a semi- and nonparametric model is proposed that consists of linear parametric components for individual location and scale and a nonparametric regression function for the common shape. A multi-step approach is developed that simultaneously estimates the parametric components and the nonparametric function. Under certain regularity conditions, it is shown that the resulting estimators are consistent and asymptotically normal for the parametric part and achieve the optimal rate of convergence for the nonparametric part when the bandwidth is suitably chosen. Simulation results are presented to demonstrate the effectiveness and finite-sample performance of the method. The method is also applied to a SELDI-TOF mass spectrometry data set from a study of liver cancer patients.

APPLE: Approximate path for penalized likelihood estimators
Yu, Y., & Feng, Y. (2014). Statistics and Computing, 24(5), 803-819.
Abstract: In high-dimensional data analysis, penalized likelihood estimators are shown to provide superior results in both variable selection and parameter estimation. A new algorithm, APPLE, is proposed for calculating the Approximate Path for Penalized Likelihood Estimators. Both convex penalties (such as LASSO) and folded concave penalties (such as MCP) are considered. APPLE efficiently computes the solution path for the penalized likelihood estimator using a hybrid of the modified predictor-corrector method and the coordinate-descent algorithm. APPLE is compared with several well-known packages via simulation and analysis of two gene expression data sets.

Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models
Yu, Y., & Feng, Y. (2014). Journal of Computational and Graphical Statistics, 23(4), 1009-1027.
Abstract: In this article, for Lasso penalized linear regression models in high-dimensional settings, we propose a modified cross-validation (CV) method for selecting the penalty parameter. The methodology is extended to other penalties, such as Elastic Net. We conduct extensive simulation studies and real data analysis to compare the performance of the modified CV method with other methods. It is shown that the popular K-fold CV method includes many noise variables in the selected model, while the modified CV works well in a wide range of coefficient and correlation settings. Supplementary materials containing the computer code are available online.

Regularized principal components of heritability
Fang, Y., Feng, Y., & Yuan, M. (2014). Computational Statistics, 29(3), 455-465.
Abstract: In family studies with multiple continuous phenotypes, heritability can be conveniently evaluated through the so-called principal-component of heredity (PCH, for short; Ott and Rabinowitz in Hum Hered 49:106-111, 1999). Estimation of the PCH, however, is notoriously difficult when entertaining a large collection of phenotypes which naturally arises in dealing with modern genomic data such as those from expression QTL studies. In this paper, we propose a regularized PCH method to specifically address such challenges. We show through both theoretical studies and data examples that the proposed method can accurately assess the heritability of a large collection of phenotypes.

A road to classification in high dimensional space: The regularized optimal affine discriminant
Fan, J., Feng, Y., & Tong, X. (2012). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 74(4), 745-771.
Abstract: For high dimensional classification, it is well known that naively performing the Fisher discriminant rule leads to poor results due to diverging spectra and accumulation of noise. Therefore, researchers proposed independence rules to circumvent the diverging spectra, and sparse independence rules to mitigate the issue of accumulation of noise. However, in biological applications, often a group of correlated genes are responsible for clinical outcomes, and the use of the covariance information can significantly reduce misclassification rates. In theory the extent of such error rate reductions is unveiled by comparing the misclassification rates of the Fisher discriminant rule and the independence rule. To materialize the gain on the basis of finite samples, a regularized optimal affine discriminant (ROAD) is proposed. The ROAD selects an increasing number of features as the regularization relaxes. Further benefits can be achieved when a screening method is employed to narrow the feature pool before applying the ROAD method. An efficient constrained co-ordinate descent algorithm is also developed to solve the associated optimization problems. Sampling properties of oracle type are established. Simulation studies and real data analysis support our theoretical results and demonstrate the advantages of the new classification procedure under a variety of correlation structures. A delicate result on continuous piecewise linear solution paths for the ROAD optimization problem at the population level justifies the linear interpolation of the constrained co-ordinate descent algorithm.

Nonparametric independence screening in sparse ultra-high-dimensional additive models
Fan, J., Feng, Y., & Song, R. (2011). Journal of the American Statistical Association, 106(494), 544-557.
Abstract: A variable screening procedure via correlation learning was proposed by Fan and Lv (2008) to reduce dimensionality in sparse ultra-high-dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend the correlation learning to marginal nonparametric learning. Our nonparametric independence screening (NIS) is a specific type of sure independence screening. We propose several closely related variable screening procedures. We show that with general nonparametric models, under some mild technical conditions, the proposed independence screening methods have a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, we also propose a data-driven thresholding and an iterative nonparametric independence screening (INIS) method to enhance the finite-sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods.

Nonparametric estimation of genewise variance for microarray data
Fan, J., Feng, Y., & Niu, Y. S. (2010). Annals of Statistics, 38(5), 2723-2750.
Abstract: Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we propose two novel nonparametric estimators for the genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated by the data from the microarray quality control (MAQC) project.

The Microarray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models
Shi, L., Campbell, G., Jones, W. D., Campagne, F., Wen, Z., Walker, S. J., Su, Z., Chu, T. M., Goodsaid, F. M., Pusztai, L., Shaughnessy, J. D., Oberthuer, A., Thomas, R. S., Paules, R. S., Fielden, M., Barlogie, B., Chen, W., Du, P., Fischer, M., … Wolfinger, R. D. (2010). Nature Biotechnology, 28(8), 827-838.
Abstract: Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.

Local quasi-likelihood with a parametric guide
Fan, J., Wu, Y., & Feng, Y. (2009). Annals of Statistics, 37(6), 4153-4183.
Abstract: Generalized linear models and the quasi-likelihood method extend the ordinary regression models to accommodate more general conditional distributions of the response. Nonparametric methods need no explicit parametric specification, and the resulting model is completely determined by the data themselves. However, nonparametric estimation schemes generally have a slower convergence rate such as the local polynomial smoothing estimation of nonparametric generalized linear models studied in Fan, Heckman and Wand [J. Amer. Statist. Assoc. 90.

Network exploration via the adaptive LASSO and SCAD penalties
Fan, J., Feng, Y., & Wu, Y. (2009). Annals of Applied Statistics, 3(2), 521-541.
Abstract: Graphical models are frequently used to explore networks, such as genetic networks, among a set of variables. This is usually carried out via exploring the sparsity of the precision matrix of the variables under consideration. Penalized likelihood methods are often used in such explorations. Yet, positive-definiteness constraints of precision matrices make the optimization problem challenging. We introduce nonconcave penalties and the adaptive LASSO penalty to attenuate the bias problem in the network estimation. Through the local linear approximation to the nonconcave penalty functions, the problem of precision matrix estimation is recast as a sequence of penalized likelihood problems with a weighted L1 penalty and solved using the efficient algorithm of Friedman et al. [Biostatistics 9 (2008) 432-441]. Our estimation schemes are applied to two real datasets. Simulation experiments and asymptotic theory are used to justify our proposed methods.
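The network-exploration idea in the last abstract, reading absent graph edges off the zeros of a sparse precision matrix, can be illustrated with a plain L1-penalized (graphical lasso) fit. This is a minimal sketch using scikit-learn's standard estimator, not the adaptive LASSO or SCAD scheme of the paper; the simulated data and penalty value are assumptions for illustration only.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Simulate n observations of p variables with one true dependency:
# variables 0 and 1 are correlated, all others independent.
n, p = 200, 5
cov = np.eye(p)
cov[0, 1] = cov[1, 0] = 0.6
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# L1-penalized maximum likelihood estimate of the precision matrix.
model = GraphicalLasso(alpha=0.1).fit(X)
Theta = model.precision_

# Off-diagonal (near-)zeros in Theta correspond to absent edges,
# so the recovered graph is the set of nonzero off-diagonal pairs.
edges = [(i, j) for i in range(p) for j in range(i + 1, p)
         if abs(Theta[i, j]) > 1e-8]
print(edges)
```

With a larger penalty `alpha` more off-diagonal entries are driven to exactly zero, which is how sparsity of the estimated network is controlled.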