Model-free feature screening based on Hellinger distance for ultrahigh dimensional data


Speaker: Professor Cui Hengjian

Topic: Model-free feature screening based on Hellinger distance for ultrahigh dimensional data

Date: March 29th, 2024 (Friday)

Time: 2:30 p.m.

Tencent Meeting ID: 243-343-042

Sponsors: School of Mathematics and Statistics, Institute of Mathematics, Institute of Science and Technology


Cui Hengjian is a professor and doctoral supervisor at Capital Normal University, a member of the 10th National Congress of the China Association for Science and Technology (CAST), and a former member of the Discipline Appraisal Group of the Academic Degrees Committee of the State Council. He received his PhD from the Institute of Systems Science, Chinese Academy of Sciences. His research covers big data statistical modeling, high-dimensional statistics, the theory and methods of robust statistics, statistical machine learning, financial statistics, and quality management. He has published more than 180 papers, including papers in leading international statistics and econometrics journals such as JASA, AoS, JRSS(B), Biometrika and JoE. Professor Cui has led key projects of the National Natural Science Foundation of China, Distinguished Youth (B) projects, and a number of general projects, and has been a main participant in major scientific research fund projects of the Ministry of Education. He serves on the editorial boards of both the Chinese and English series of Acta Mathematica Sinica and Acta Mathematicae Applicatae Sinica, as well as Statistical Theory and Related Fields. He is also vice chairman of the Chinese Association for Applied Statistics (CAAS), vice chairman of the National Industrial Statistics Teaching and Research Association, president of the Beijing Applied Statistics Society, and executive director of the Institute of Mathematical Statistics (China Branch). His awards include the second prize of the Science and Technology Award for Higher Education Institutions - Natural Science Award and the first prize of the National Statistical Science Research Outstanding Achievements Award.


With the explosive development of data acquisition and processing technology, the dimension of features increases exponentially with the sample size, which poses great challenges for data analysis. It is vital to accurately identify the useful features among thousands of candidates. In this paper, we develop an omnibus model-free feature screening procedure based on the Hellinger distance, which has several appealing merits. First, we define a Hellinger distance index for discrete response variables in discriminant analysis. Second, the procedure also works for continuous response variables, which are discretized by a slice-and-fuse technique. Third, it is robust to potential outliers and model misspecification. Theoretically, the procedure possesses the sure screening property and the ranking consistency property under mild conditions, for both discrete and continuous response variables. Numerical studies show that the procedure is strongly competitive on heavy-tailed and skewed data while remaining comparable to existing approaches on light-tailed data, indicating robust performance across a range of data types. Two real-data examples, one with a discrete response and one with a continuous response, illustrate the effectiveness of the proposed method.
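As a rough illustration of the idea of marginal screening with a Hellinger distance, the sketch below scores each feature by how far apart its class-conditional distributions are and keeps the top-ranked features. It is not the speaker's estimator: the abstract does not give the exact index, so the histogram-based estimate, the quantile binning, the "largest pairwise distance" aggregation, and the names `hellinger`, `screening_scores`, `top_features` and `n_bins` are all assumptions made for this example.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def screening_scores(X, y, n_bins=8):
    """Score each column of X by the largest pairwise Hellinger distance
    between its class-conditional histograms.

    Illustrative index only (an assumption, not the definition from the talk).
    Requires at least two distinct labels in y.
    """
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        # Shared quantile-based bin edges keep the histograms comparable
        # and limit the influence of extreme observations.
        edges = np.unique(np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)))
        if edges.size < 2:
            continue  # constant feature: carries no information, score stays 0
        hists = []
        for c in classes:
            counts, _ = np.histogram(X[y == c, j], bins=edges)
            hists.append(counts / counts.sum())
        scores[j] = max(
            hellinger(hists[a], hists[b])
            for a in range(len(classes))
            for b in range(a + 1, len(classes))
        )
    return scores

def top_features(scores, d):
    """Indices of the d highest-scoring features, best first."""
    return np.argsort(scores)[::-1][:d]

# Small synthetic demo: heavy-tailed noise, only feature 0 depends on the class.
rng = np.random.default_rng(0)
n, p = 400, 50
y = rng.integers(0, 2, size=n)
X = rng.standard_t(df=3, size=(n, p))
X[:, 0] += 2.0 * y
print(top_features(screening_scores(X, y), d=5))
```

A continuous response could be handled in the same spirit by first cutting it into quantile slices and using the slice labels as `y`; the fusion step mentioned in the abstract is not reproduced here.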