Yang Feng


Professor of Biostatistics

Yang Feng is a Professor and Ph.D. Program Director of Biostatistics in the School of Global Public Health and an affiliate faculty in the Center for Data Science at New York University. He obtained his Ph.D. in Operations Research at Princeton University in 2010.

Feng's research interests encompass the theoretical and methodological aspects of machine learning, high-dimensional statistics, social network models, and nonparametric statistics, leading to a wealth of practical applications, including Alzheimer's disease, cancer classification, and electronic health records. His research has been funded by multiple grants from the National Institutes of Health (NIH) and the National Science Foundation (NSF), notably the NSF CAREER Award.

He is currently an Associate Editor for the Journal of the American Statistical Association (JASA), the Journal of Business & Economic Statistics (JBES), the Journal of Computational & Graphical Statistics (JCGS), and the Annals of Applied Statistics (AoAS). His professional recognitions include being named a fellow of the American Statistical Association (ASA) and the Institute of Mathematical Statistics (IMS), as well as an elected member of the International Statistical Institute (ISI).

Please visit Dr. Yang Feng's website and Google Scholar page for more information.

Education

B.S. in Mathematics, University of Science and Technology of China, Hefei, China

Ph.D. in Operations Research, Princeton University, Princeton, NJ

Areas of research and study

Bioinformatics

Biostatistics

High-dimensional data analysis/integration

Machine learning

Modeling Social and Behavioral Dynamics

Nonparametric statistics

Publications

Consistent Estimation of the Number of Communities in Non-uniform Hypergraph Model

Shang, Z., Zhang, Z., & Feng, Y. (n.d.).

Publication year

2025

Journal title

Stat

10.1002/sta4.70066

Abstract

We propose an algorithm based on cross-validation to estimate the number of communities in a general non-uniform hypergraph model. The algorithm involves a three-step process. Initially, it randomly divides the set of hyperedges into a training set and a testing set. Subsequently, for each candidate number of communities, we construct a spectral estimation of community labels and least square estimation of the hyperedge probabilities based on the training set. The final step involves the computation of cross-validation scores using the testing set. The proposed algorithm is shown to be consistent when the number of vertices tends to infinity.

Machine collaboration

Liu, Q., & Feng, Y. (n.d.).

Publication year

2024

Journal title

Stat

10.1002/sta4.661

Abstract

We propose a new ensemble framework for supervised learning, called machine collaboration (MaC), using a collection of possibly heterogeneous base learning methods (hereafter, base machines) for prediction tasks. Unlike bagging/stacking (a parallel and independent framework) and boosting (a sequential and top-down framework), MaC is a type of circular and recursive learning framework. The circular and recursive nature helps the base machines to transfer information circularly and update their structures and parameters accordingly. The theoretical result on the risk bound of the estimator from MaC reveals that the circular and recursive feature can help MaC reduce risk via a parsimonious ensemble. We conduct extensive experiments on MaC using both simulated data and 119 benchmark real datasets. The results demonstrate that in most cases, MaC performs significantly better than several other state-of-the-art methods, including classification and regression trees, neural networks, stacking, and boosting.

Machine collaboration

Liu, Q., & Feng, Y. (n.d.).

Publication year

2024

Journal title

Stat

Page(s)

e661

Multi-label Random Subspace Ensemble Classification

Bi, F., Zhu, J., & Feng, Y. (n.d.).

Publication year

2024

Journal title

Journal of Computational and Graphical Statistics

10.1080/10618600.2024.2421248

Abstract

In this work, we develop a new ensemble learning framework, multi-label Random Subspace Ensemble (mRaSE), for multi-label classification. Given a base classifier (e.g., multinomial logistic regression, classification tree, K-nearest neighbors), mRaSE works by first randomly sampling a collection of subspaces, then choosing the best ones that achieve the minimum cross-validation errors and, finally, aggregating the chosen weak learners. In addition to its superior prediction performance, mRaSE also provides a model-free feature ranking depending on the given base classifier. An iterative version of mRaSE is also developed to further improve the performance. A model-free extension is pursued on the iterative version, leading to the so-called Super mRaSE, which accepts a collection of base classifiers as input to the algorithm. We show that the proposed algorithms compare favorably with state-of-the-art classification algorithms, including random forest and deep neural networks, via extensive simulation studies and two real data applications. The new algorithms are implemented in an updated version of the R package RaSEn.

Neyman-Pearson Multi-Class Classification via Cost-Sensitive Learning

Tian, Y., & Feng, Y. (n.d.).

Publication year

2024

Journal title

Journal of the American Statistical Association

10.1080/01621459.2024.2402567

Abstract

Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications such as loan default prediction, different types of errors can have varying consequences. To address this asymmetry issue, two popular paradigms have been developed: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Previous studies on the NP paradigm have primarily focused on the binary case, while the multi-class NP problem poses a greater challenge due to its unknown feasibility. In this work, we tackle the multi-class NP problem by establishing a connection with the CS problem via strong duality and propose two algorithms. We extend the concept of NP oracle inequalities, crucial in binary classifications, to NP oracle properties in the multi-class context. Our algorithms satisfy these NP oracle properties under certain conditions. Furthermore, we develop practical algorithms to assess the feasibility and strong duality in multi-class NP problems, which can offer practitioners the landscape of a multi-class NP problem with various target error levels. Simulations and real data studies validate the effectiveness of our algorithms. To our knowledge, this is the first study to address the multi-class NP problem with theoretical guarantees. The proposed algorithms have been implemented in the R package npcs, which is available on CRAN. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Omics feature selection with the extended SIS R package: identification of a body mass index epigenetic multimarker in the Strong Heart Study

Domingo-Relloso, A., Feng, Y., Rodriguez-Hernandez, Z., Haack, K., Cole, S. A., Navas-Acien, A., Tellez-Plaza, M., & Bermudez, J. D. (n.d.).

Publication year

2024

Journal title

American Journal of Epidemiology

Volume

193

Page(s)

1010-1018

10.1093/aje/kwae006

Abstract

The statistical analysis of omics data poses a great computational challenge given their ultra–high-dimensional nature and frequent between-features correlation. In this work, we extended the iterative sure independence screening (ISIS) algorithm by pairing ISIS with elastic-net (Enet) and 2 versions of adaptive elastic-net (adaptive elastic-net (AEnet) and multistep adaptive elastic-net (MSAEnet)) to efficiently improve feature selection and effect estimation in omics research. We subsequently used genome-wide human blood DNA methylation data from American Indian participants in the Strong Heart Study (n = 2235 participants; measured in 1989-1991) to compare the performance (predictive accuracy, coefficient estimation, and computational efficiency) of ISIS-paired regularization methods with that of a bayesian shrinkage and traditional linear regression to identify an epigenomic multimarker of body mass index (BMI). ISIS-AEnet outperformed the other methods in prediction. In biological pathway enrichment analysis of genes annotated to BMI-related differentially methylated positions, ISIS-AEnet captured most of the enriched pathways in common for at least 2 of all the evaluated methods. ISIS-AEnet can favor biological discovery because it identifies the most robust biological pathways while achieving an optimal balance between bias and efficient feature selection. In the extended SIS R package, we also implemented ISIS paired with Cox and logistic regression for time-to-event and binary endpoints, respectively, and a bootstrap approach for the estimation of regression coefficients.

Omics feature selection with the extended SIS R package: identification of a body mass index epigenetic multi-marker in the Strong Heart Study

Domingo-Relloso, A., Feng, Y., Rodriguez-Hernandez, Z., Haack, K., Cole, S. A., Navas-Acien, A., Tellez-Plaza, M., & Bermudez, J. D. (n.d.).

Publication year

2024

Journal title

American Journal of Epidemiology

Page(s)

kwae006


Prognostic value of DNA methylation subclassification, aneuploidy, and CDKN2A/B homozygous deletion in predicting clinical outcome of IDH mutant astrocytomas

Galbraith, K., Garcia, M., Wei, S., Chen, A., Schroff, C., Serrano, J., Pacione, D., Placantonakis, D. G., William, C. M., Faustin, A., Zagzag, D., Barbaro, M., Del Pilar Guillermo Prieto Eibl, M., Shirahata, M., Reuss, D., Tran, Q. T., Alom, Z., von Deimling, A., Orr, B. A., … Snuderl, M. (n.d.).

Publication year

2024

Journal title

Neuro-Oncology

Page(s)

1042-1051

10.1093/neuonc/noae009

Abstract

Background. Isocitrate dehydrogenase (IDH) mutant astrocytoma grading, until recently, has been entirely based on morphology. The 5th edition of the Central Nervous System World Health Organization (WHO) introduces CDKN2A/B homozygous deletion as a biomarker of grade 4. We sought to investigate the prognostic impact of DNA methylation-derived molecular biomarkers for IDH mutant astrocytoma. Methods. We analyzed 98 IDH mutant astrocytomas diagnosed at NYU Langone Health between 2014 and 2022. We reviewed DNA methylation subclass, CDKN2A/B homozygous deletion, and ploidy and correlated molecular biomarkers with histological grade, progression free (PFS), and overall (OS) survival. Findings were confirmed using 2 independent validation cohorts. Results. There was no significant difference in OS or PFS when stratified by histologic WHO grade alone, copy number complexity, or extent of resection. OS was significantly different when patients were stratified either by CDKN2A/B homozygous deletion or by DNA methylation subclass (P value = .0286 and .0016, respectively). None of the molecular biomarkers were associated with significantly better PFS, although DNA methylation classification showed a trend (P value = .0534). Conclusions. The current WHO recognized grading criteria for IDH mutant astrocytomas show limited prognostic value. Stratification based on DNA methylation shows superior prognostic value for OS.

Prognostic value of DNA methylation subclassification, aneuploidy, and CDKN2A/B homozygous deletion in predicting clinical outcome of IDH mutant astrocytomas

Galbraith, K., Garcia, M., Wei, S., Chen, A., Schroff, C., Serrano, J., Pacione, D., Placantonakis, D. G., William, C. M., Faustin, A., others, & Feng, Y. (n.d.).

Publication year

2024

Journal title

Neuro-Oncology

Page(s)

noae009


Racial distribution of molecularly classified brain tumors

Fang, C. S., Wang, W., Schroff, C., Movahed-Ezazi, M., Vasudevaraja, V., Serrano, J., Sulman, E. P., Golfinos, J. G., Orringer, D., Galbraith, K., Feng, Y., & Snuderl, M. (n.d.).

Publication year

2024

Journal title

Neuro-Oncology Advances

10.1093/noajnl/vdae135

Abstract

Background. In many cancers, specific subtypes are more prevalent in specific racial backgrounds. However, little is known about the racial distribution of specific molecular types of brain tumors. Public data repositories lack data on many brain tumor subtypes as well as diagnostic annotation using the current World Health Organization classification. A better understanding of the prevalence of brain tumors in different racial backgrounds may provide insight into tumor predisposition and development, and improve prevention. Methods. We retrospectively analyzed the racial distribution of 1709 primary brain tumors classified by their methylation profiles using clinically validated whole genome DNA methylation. Self-reported race was obtained from medical records. Our cohort included 82% White, 10% Black, and 8% Asian patients with 74% of patients reporting their race. Results. There was a significant difference in the racial distribution of specific types of brain tumors. Blacks were overrepresented in pituitary adenomas (35%, P < .001), with the largest proportion of FSH/LH subtype. Whites were underrepresented at 47% of all pituitary adenoma patients (P < .001). Glioblastoma (GBM) IDH wild-type showed an enrichment of Whites, at 90% (P < .001), and a significantly smaller percentage of Blacks, at 3% (P < .001). Conclusions. Molecularly classified brain tumor groups and subgroups show different distributions among the three main racial backgrounds suggesting the contribution of race to brain tumor development.

Towards the Theory of Unsupervised Federated Learning: Non-asymptotic Analysis of Federated EM Algorithms

Tian, Y., Weng, H., & Feng, Y. (n.d.).

Publication year

2024

Journal title

Proceedings of Machine Learning Research

Volume

235

Page(s)

48226-48279

Abstract

While supervised federated learning approaches have enjoyed significant success, the domain of unsupervised federated learning remains relatively underexplored. Several federated EM algorithms have gained popularity in practice; however, their theoretical foundations are often lacking. In this paper, we first introduce a federated gradient EM algorithm (FedGrEM) designed for the unsupervised learning of mixture models, which supplements the existing federated EM algorithms by considering task heterogeneity and potential adversarial attacks. We present a comprehensive finite-sample theory that holds for general mixture models, then apply this general theory on specific statistical models to characterize the explicit estimation error of model parameters and mixture proportions. Our theory elucidates when and how FedGrEM outperforms local single-task learning with insights extending to existing federated EM algorithms. This bridges the gap between their practical success and theoretical understanding. Our numerical results validate our theory, and demonstrate FedGrEM’s superiority over existing unsupervised federated learning benchmarks.

ℓ1-Penalized Multinomial Regression: Estimation, Inference, and Prediction, With an Application to Risk Factor Identification for Different Dementia Subtypes

Tian, Y., Rusinek, H., Masurkar, A. V., & Feng, Y. (n.d.).

Publication year

2024

Journal title

Statistics in Medicine

Page(s)

5711-5747

10.1002/sim.10263

Abstract

High-dimensional multinomial regression models are very useful in practice but have received less research attention than logistic regression models, especially from the perspective of statistical inference. In this work, we analyze the estimation and prediction error of the contrast-based ℓ1-penalized multinomial regression model and extend the debiasing method to the multinomial case, providing a valid confidence interval for each coefficient and p-value of the individual hypothesis test. We also examine cases of model misspecification and non-identically distributed data to demonstrate the robustness of our method when some assumptions are violated. We apply the debiasing method to identify important predictors in the progression into dementia of different subtypes. Results from extensive simulations show the superiority of the debiasing method compared to other inference methods.

A flexible quasi-likelihood model for microbiome abundance count data

Shi, Y., Li, H., Wang, C., Chen, J., Jiang, H., Shih, Y.-C. T., Zhang, H., Song, Y., Feng, Y., & Liu, L. (n.d.).

Publication year

2023

Journal title

Statistics in Medicine

Page(s)

4632-4643

Comments on: Statistical inference and large-scale multiple testing for high-dimensional regression models

Tian, Y., & Feng, Y. (n.d.).

Publication year

2023

Journal title

Test

Page(s)

1172-1176

DDAC-SpAM: A Distributed Algorithm for Fitting High-dimensional Sparse Additive Models with Feature Division and Decorrelation

He, Y., Wu, R., Zhou, Y., & Feng, Y. (n.d.).

Publication year

2023

Journal title

Journal of the American Statistical Association

Page(s)

1-12

Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment

Heng, S., Zhang, J., & Feng, Y. (n.d.).

Publication year

2023

Journal title

arXiv preprint arXiv:2310.18556


Learning from Similar Linear Representations: Adaptivity, Minimaxity, and Robustness

Tian, Y., Gu, Y., & Feng, Y. (n.d.).

Publication year

2023

Journal title

arXiv preprint arXiv:2303.17765


PCABM: Pairwise Covariates-Adjusted Block Model for Community Detection

Huang, S., Sun, J., & Feng, Y. (n.d.).

Publication year

2023

Journal title

Journal of the American Statistical Association

Page(s)

1-13

Semiparametric Modeling and Analysis for Longitudinal Network Data

He, Y., Sun, J., Tian, Y., Ying, Z., & Feng, Y. (n.d.).

Publication year

2023

Journal title

arXiv preprint arXiv:2308.12227


Simulation of New York City’s Ventilator Allocation Guideline During the Spring 2020 COVID-19 Surge

Walsh, B. C., Zhu, J., Feng, Y., Berkowitz, K. A., Betensky, R. A., Nunnally, M. E., & Pradhan, D. R. (n.d.).

Publication year

2023

Journal title

JAMA Network Open

Page(s)

e2336736

Transfer learning under high-dimensional generalized linear models

Tian, Y., & Feng, Y. (n.d.).

Publication year

2023

Journal title

Journal of the American Statistical Association

Volume

118

Issue

544

Page(s)

2684-2697

Unsupervised Federated Learning: A Federated Gradient EM Algorithm for Heterogeneous Mixture Models with Robustness against Adversarial Attacks

Tian, Y., Weng, H., & Feng, Y. (n.d.).

Publication year

2023

Journal title

arXiv preprint arXiv:2310.15330


Variable selection for high-dimensional generalized linear model with block-missing data

He, Y., Feng, Y., & Song, X. (n.d.).

Publication year

2023

Journal title

Scandinavian Journal of Statistics

Page(s)

1279-1297

A likelihood-ratio type test for stochastic block models with bounded degrees

Yuan, M., Feng, Y., & Shang, Z. (n.d.).

Publication year

2022

Journal title

Journal of Statistical Planning and Inference

Volume

219

Page(s)

98-119

Association of hyperglycemia and molecular subclass on survival in IDH-wildtype glioblastoma

Liu, E. K., Vasudevaraja, V., Sviderskiy, V. O., Feng, Y., Tran, I., Serrano, J., Cordova, C., Kurz, S. C., Golfinos, J. G., Sulman, E. P., & others. (n.d.).

Publication year

2022

Journal title

Neuro-Oncology Advances

Page(s)

vdac163
