Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects


Speaker: Professor Wang Hansheng

Topic: Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects

Date: March 14th, 2024 (Thursday)

Time: 10:00 a.m.

Venue: Academic Lecture Hall 1506, Jingyuan Building

Sponsors: School of Mathematics and Statistics, Institute of Mathematics, Institute of Science and Technology


Wang Hansheng is a Professor of Business Statistics in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University. He is a recipient of the Outstanding Young Scholar Grant from NSFC and the founding president of the Chinese Statistical Association of Young Scholars. He is also a Fellow of the Institute of Mathematical Statistics (IMS), a Fellow of the American Statistical Association (ASA), and an Elected Member of the International Statistical Institute (ISI). Throughout his career, he has served as associate editor or editor for nine international academic journals. He has published over 100 articles in professional journals in China and abroad, co-authored one English monograph, and co-authored four Chinese textbooks. He has been recognized as a highly cited scholar by Elsevier in the fields of mathematics (2014-2019), applied economics (2020), and statistics (2021-2022).


Testing judicial impartiality is a problem of fundamental importance in empirical legal studies, where standard regression methods have been widely used to estimate extralegal factor effects. However, those methods cannot handle control variables with ultrahigh dimensionality, such as those found in judgment documents recorded in text format. To solve this problem, we develop a novel mixture conditional regression (MCR) approach, assuming that the whole sample can be classified into a number of latent classes. Within each latent class, a standard linear regression model is used to model the relationship between the response and a key feature vector, which is assumed to be of a fixed dimension. Meanwhile, the ultrahigh dimensional control variables are used to determine the latent class membership, where a naive Bayes type model is used to describe the relationship. Hence, the dimension of the control variables is allowed to be arbitrarily high. A novel expectation-maximization algorithm is developed for model estimation. As a result, we are able to estimate the key parameters of interest as efficiently as if the true class membership were known in advance. Simulation studies are presented to demonstrate the proposed MCR method. A real dataset of Chinese burglary offenses is analyzed for illustration purposes.
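
For readers who want a concrete picture of the model structure described in the abstract, the following is a minimal sketch of an EM fit for a mixture of linear regressions whose class membership is informed by a naive Bayes type model on text features. It is not the speaker's algorithm. It assumes binary word indicators with Bernoulli naive Bayes, Gaussian regression errors, a known number of classes K, and Laplace smoothing. The function name `fit_mcr` and all variable names are hypothetical, and the toy data are simulated.

```python
import numpy as np

def fit_mcr(y, X, W, K=2, n_iter=30, seed=0):
    """EM sketch (assumed setup): per-class Gaussian linear regression of y on X,
    with a Bernoulli naive Bayes model on the binary text features W."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    R = rng.dirichlet(np.ones(K), size=n)          # random soft class memberships
    for _ in range(n_iter):
        # M-step: update class proportions, word probabilities, regressions.
        Nk = R.sum(axis=0) + 1e-10                 # guard against an empty class
        pi = Nk / Nk.sum()
        theta = (R.T @ W + 1.0) / (Nk[:, None] + 2.0)   # Laplace-smoothed, shape (K, p)
        beta = np.empty((K, X.shape[1]))
        sigma2 = np.empty(K)
        for k in range(K):
            sw = np.sqrt(R[:, k])                  # weighted least squares via sqrt weights
            beta[k] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
            resid_k = y - X @ beta[k]
            sigma2[k] = max((R[:, k] * resid_k**2).sum() / Nk[k], 1e-8)
        # E-step: posterior class probabilities, computed on the log scale.
        resid = y[:, None] - X @ beta.T            # shape (n, K)
        log_reg = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
        log_nb = W @ np.log(theta).T + (1.0 - W) @ np.log1p(-theta).T
        log_r = np.log(pi) + log_reg + log_nb
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
        R = np.exp(log_r)
        R /= R.sum(axis=1, keepdims=True)
    return beta, sigma2, pi, theta, R

# Toy data (simulated, for illustration only): two classes with opposite slopes.
rng = np.random.default_rng(1)
n, p = 800, 200
z = rng.integers(0, 2, size=n)                     # true class, hidden from the fit
half = np.where(np.arange(p) < p // 2, 0.30, 0.05)
true_theta = np.vstack([half, half[::-1]])         # class-specific word probabilities
W = (rng.random((n, p)) < true_theta[z]).astype(float)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([[1.0, 2.0], [1.0, -2.0]])
y = (X * true_beta[z]).sum(axis=1) + 0.5 * rng.normal(size=n)

beta_hat, sigma2_hat, pi_hat, theta_hat, R = fit_mcr(y, X, W, K=2)
print(np.sort(beta_hat[:, 1]))                     # estimated slopes, sorted
```

Class labels in a mixture model are identified only up to permutation, so the sketch compares sorted slopes instead of matching class indices. The regression design X stays low dimensional, while the text features W enter only through the class membership term, which is why their dimension can grow without changing the regression step.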