Yang Feng


Professor of Biostatistics

Professional overview

Yang Feng is an Associate Professor of Biostatistics. He received his B.S. in mathematics from the University of Science and Technology of China and his Ph.D. in Operations Research from Princeton University.

Dr. Feng's research interests include machine learning with applications to public health, high-dimensional statistics, network models, nonparametric statistics, and bioinformatics. He has published in The Annals of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, Journal of Machine Learning Research, International Journal of Epidemiology, and Science Advances. Feng serves on the editorial boards of the Journal of Business & Economic Statistics, Statistica Sinica, Stat, and Statistical Analysis and Data Mining: The ASA Data Science Journal.

Prior to joining NYU, Feng was an Associate Professor of Statistics and an affiliated member in the Data Science Institute at Columbia University. He is an elected member of the International Statistical Institute and a recipient of the NSF CAREER award.

Please visit Dr. Yang Feng's website and Google Scholar page for more information.

Education

B.S. in Mathematics, University of Science and Technology of China, Hefei, China
Ph.D. in Operations Research, Princeton University, Princeton, NJ

Areas of research and study

Bioinformatics
Biostatistics
High-dimensional data analysis/integration
Machine learning
Modeling Social and Behavioral Dynamics
Nonparametric statistics

Publications


Tuning-parameter selection in regularized estimations of large covariance matrices

Fang, Y., Wang, B., & Feng, Y. (n.d.).

Publication year

2016

Journal title

Journal of Statistical Computation and Simulation

Volume

86

Issue

3

Page(s)

494-509
Abstract
Recently many regularized estimators of large covariance matrices have been proposed, and the tuning parameters in these estimators are usually selected via cross-validation. However, there is a lack of consensus on the number of folds for conducting cross-validation. One round of cross-validation involves partitioning a sample of data into two complementary subsets, a training set and a validation set. In this manuscript, we demonstrate that if the estimation accuracy is measured in the Frobenius norm, the training set should consist of the majority of the data; whereas if the estimation accuracy is measured in the operator norm, the validation set should consist of the majority of the data. We also develop methods for selecting tuning parameters based on the bootstrap and compare them with their cross-validation counterparts. We demonstrate that the cross-validation methods with ‘optimal’ choices of folds are more appropriate than their bootstrap counterparts.

Variable selection and prediction with incomplete high-dimensional data

Liu, Y., Wang, Y., Feng, Y., & Wall, M. M. (n.d.).

Publication year

2016

Journal title

Annals of Applied Statistics

Volume

10

Issue

1

Page(s)

418-450
Abstract
We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens. In this study 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and it has greater advantage when the correlation among variables is high and missing proportion is high. MIRL is shown to have improved performance when comparing with other applicable methods when applied to the study of Eating and Activity in Teens for the boys and girls separately, and to a subgroup of low social economic status (SES) Asian boys who are at high risk of developing obesity.

Functional and Parametric Estimation in a Semi- and Nonparametric Model with Application to Mass-Spectrometry Data

Ma, W., Feng, Y., Chen, K., & Ying, Z. (n.d.).

Publication year

2015

Journal title

International Journal of Biostatistics

Volume

11

Issue

2

Page(s)

285-303
Abstract
Motivated by modeling and analysis of mass-spectrometry data, a semi- and nonparametric model is proposed that consists of linear parametric components for individual location and scale and a nonparametric regression function for the common shape. A multi-step approach is developed that simultaneously estimates the parametric components and the nonparametric function. Under certain regularity conditions, it is shown that the resulting estimators are consistent and asymptotically normal for the parametric part and achieve the optimal rate of convergence for the nonparametric part when the bandwidth is suitably chosen. Simulation results are presented to demonstrate the effectiveness and finite-sample performance of the method. The method is also applied to a SELDI-TOF mass spectrometry data set from a study of liver cancer patients.

APPLE: Approximate path for penalized likelihood estimators

Yu, Y., & Feng, Y. (n.d.).

Publication year

2014

Journal title

Statistics and Computing

Volume

24

Issue

5

Page(s)

803-819
Abstract
In high-dimensional data analysis, penalized likelihood estimators are shown to provide superior results in both variable selection and parameter estimation. A new algorithm, APPLE, is proposed for calculating the Approximate Path for Penalized Likelihood Estimators. Both convex penalties (such as LASSO) and folded concave penalties (such as MCP) are considered. APPLE efficiently computes the solution path for the penalized likelihood estimator using a hybrid of the modified predictor-corrector method and the coordinate-descent algorithm. APPLE is compared with several well-known packages via simulation and analysis of two gene expression data sets.

Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models

Yu, Y., & Feng, Y. (n.d.).

Publication year

2014

Journal title

Journal of Computational and Graphical Statistics

Volume

23

Issue

4

Page(s)

1009-1027
Abstract
In this article, for Lasso penalized linear regression models in high-dimensional settings, we propose a modified cross-validation (CV) method for selecting the penalty parameter. The methodology is extended to other penalties, such as Elastic Net. We conduct extensive simulation studies and real data analysis to compare the performance of the modified CV method with other methods. It is shown that the popular K-fold CV method includes many noise variables in the selected model, while the modified CV works well in a wide range of coefficient and correlation settings. Supplementary materials containing the computer code are available online.

Regularized principal components of heritability

Fang, Y., Feng, Y., & Yuan, M. (n.d.).

Publication year

2014

Journal title

Computational Statistics

Volume

29

Issue

3

Page(s)

455-465
Abstract
In family studies with multiple continuous phenotypes, heritability can be conveniently evaluated through the so-called principal-component of heredity (PCH, for short; Ott and Rabinowitz in Hum Hered 49:106-111, 1999). Estimation of the PCH, however, is notoriously difficult when entertaining a large collection of phenotypes which naturally arises in dealing with modern genomic data such as those from expression QTL studies. In this paper, we propose a regularized PCH method to specifically address such challenges. We show through both theoretical studies and data examples that the proposed method can accurately assess the heritability of a large collection of phenotypes.

A road to classification in high dimensional space: The regularized optimal affine discriminant

Fan, J., Feng, Y., & Tong, X. (n.d.).

Publication year

2012

Journal title

Journal of the Royal Statistical Society. Series B: Statistical Methodology

Volume

74

Issue

4

Page(s)

745-771
Abstract
For high dimensional classification, it is well known that naively performing the Fisher discriminant rule leads to poor results due to diverging spectra and accumulation of noise. Therefore, researchers proposed independence rules to circumvent the diverging spectra, and sparse independence rules to mitigate the issue of accumulation of noise. However, in biological applications, often a group of correlated genes are responsible for clinical outcomes, and the use of the covariance information can significantly reduce misclassification rates. In theory the extent of such error rate reductions is unveiled by comparing the misclassification rates of the Fisher discriminant rule and the independence rule. To materialize the gain on the basis of finite samples, a regularized optimal affine discriminant (ROAD) is proposed. The ROAD selects an increasing number of features as the regularization relaxes. Further benefits can be achieved when a screening method is employed to narrow the feature pool before applying the ROAD method. An efficient constrained co-ordinate descent algorithm is also developed to solve the associated optimization problems. Sampling properties of oracle type are established. Simulation studies and real data analysis support our theoretical results and demonstrate the advantages of the new classification procedure under a variety of correlation structures. A delicate result on continuous piecewise linear solution paths for the ROAD optimization problem at the population level justifies the linear interpolation of the constrained co-ordinate descent algorithm.

Nonparametric independence screening in sparse ultra-high-dimensional additive models

Fan, J., Feng, Y., & Song, R. (n.d.).

Publication year

2011

Journal title

Journal of the American Statistical Association

Volume

106

Issue

494

Page(s)

544-557
Abstract
A variable screening procedure via correlation learning was proposed by Fan and Lv (2008) to reduce dimensionality in sparse ultra-high-dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend the correlation learning to marginal nonparametric learning. Our nonparametric independence screening (NIS) is a specific type of sure independence screening. We propose several closely related variable screening procedures. We show that with general nonparametric models, under some mild technical conditions, the proposed independence screening methods have a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, we also propose a data-driven thresholding and an iterative nonparametric independence screening (INIS) method to enhance the finite-sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods.

Nonparametric estimation of genewise variance for microarray data

Fan, J., Feng, Y., & Niu, Y. S. (n.d.).

Publication year

2010

Journal title

Annals of Statistics

Volume

38

Issue

5

Page(s)

2723-2750
Abstract
Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we proposed two novel nonparametric estimators for genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated by the data from microarray quality control (MAQC) project.

The Microarray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models

Shi, L., Campbell, G., Jones, W. D., Campagne, F., Wen, Z., Walker, S. J., Su, Z., Chu, T. M., Goodsaid, F. M., Pusztai, L., Shaughnessy, J. D., Oberthuer, A., Thomas, R. S., Paules, R. S., Fielden, M., Barlogie, B., Chen, W., Du, P., Fischer, M., … Wolfinger, R. D. (n.d.).

Publication year

2010

Journal title

Nature Biotechnology

Volume

28

Issue

8

Page(s)

827-838
Abstract
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.

Local quasi-likelihood with a parametric guide

Fan, J., Wu, Y., & Feng, Y. (n.d.).

Publication year

2009

Journal title

Annals of Statistics

Volume

37

Issue

6

Page(s)

4153-4183
Abstract
Generalized linear models and the quasi-likelihood method extend the ordinary regression models to accommodate more general conditional distributions of the response. Nonparametric methods need no explicit parametric specification, and the resulting model is completely determined by the data themselves. However, nonparametric estimation schemes generally have a slower convergence rate such as the local polynomial smoothing estimation of nonparametric generalized linear models studied in Fan, Heckman and Wand [J. Amer. Statist. Assoc. 90.

Network exploration via the adaptive LASSO and SCAD penalties

Fan, J., Feng, Y., & Wu, Y. (n.d.).

Publication year

2009

Journal title

Annals of Applied Statistics

Volume

3

Issue

2

Page(s)

521-541
Abstract
Graphical models are frequently used to explore networks, such as genetic networks, among a set of variables. This is usually carried out via exploring the sparsity of the precision matrix of the variables under consideration. Penalized likelihood methods are often used in such explorations. Yet, positive-definiteness constraints of precision matrices make the optimization problem challenging. We introduce nonconcave penalties and the adaptive LASSO penalty to attenuate the bias problem in the network estimation. Through the local linear approximation to the nonconcave penalty functions, the problem of precision matrix estimation is recast as a sequence of penalized likelihood problems with a weighted L1 penalty and solved using the efficient algorithm of Friedman et al. [Biostatistics 9 (2008) 432-441]. Our estimation schemes are applied to two real datasets. Simulation experiments and asymptotic theory are used to justify our proposed methods.

Contact

yang.feng@nyu.edu
708 Broadway, New York, NY 10003