The 2nd Joint Conference on Statistics
and Data Science in China
Abstract
Haigeng Convention Center
Kunming, Yunnan
July 12-14, 2024
Contents
July 12, 09:00 - 10:40
Plenary Talk 1: Neural Causal AI: Adversarial Invariance Learning from Heterogeneous Environments
Jianqing Fan, Room: Lecture Hall (1st Floor)
Plenary Talk 2: Genomic Testing in the Presence of Unmeasured Confounding and Missing Data
Kathryn Roeder, Room: Lecture Hall (1st Floor)
July 12, 11:00 - 11:50
Plenary Talk 3: Multi-Scale Spatial-Temporal Data in Brain Science: Data, Model and Theory
Jianfeng Feng, Room: Lecture Hall (1st Floor)
July 12, 14:00 - 15:40
Panel Discussion on Developing Statistics in China: History of Statistics in China
Wei Yuan, Room: Yulan Hall (1st Floor)
IS016: Exploring the Frontiers of Machine Learning: Algorithms, Theoretical Insights and Applications
Organizer: Weijie Su, Room: A201-202 (2nd Floor)
IS095: Modern Statistical Learning for High-Dimensional Data
Organizer: Jingyuan Liu, Room: A216-217 (2nd Floor)
IS005: AI and Machine Learning in Complex Biomedical Data
Organizer: Hongzhe Li, Room: A301-302 (3rd Floor)
IS026: Innovative Statistical Methods for Data with Complex Structures
Organizer: Lan Wang, Room: A320-321 (3rd Floor)
IS012: Data Science Methods for Complex Data with Endogeneity and Heterogeneity
Organizer: Linda Zhao, Room: A203 (2nd Floor)
IS004: Advancing Statistical Frontiers in Data Privacy Protection
Organizer: Weijie Su, Room: A218 (2nd Floor)
IS101: Statistical Inference and Computation for Complex Data
Organizer: Baoxue Zhang, Room: A303 (3rd Floor)
IS031: Machine Learning Methods: Theory and Applications
Organizer: Xinyuan Song, Room: A305 (3rd Floor)
IS097: Statistical Learning for Threshold Models and Applications
Organizer: Wei Zhong, Room: A306 (3rd Floor)
IS075: Statistical Learning for Large Foundation Models
Organizer: Shurong Zheng, Room: A307 (3rd Floor)
IS080: Theoretical Foundations for Machine Learning
Organizer: Tracy Ke, Room: A308 (3rd Floor)
IS015: Experimental Design and Big Data Subsampling
Organizer: Niansheng Tang, Room: A315 (3rd Floor)
SS2: Bernoulli Session on Stochastic Methods for Data Science
Organizer: Ajay Jasra, Room: A316 (3rd Floor)
IS090: Intersection Research of Statistics and Computer Science
Organizer: Xingdong Feng, Room: A317 (3rd Floor)
IS028: Interface Between Statistics and Neuro and Cognitive Science
Organizer: Song Xi Chen, Room: A318 (3rd Floor)
CS003: Recent Advances in Mixture Models
Room: A322 (3rd Floor)
CS004: Statistical Hypothesis Testing in Complex Data
Room: B601 (3rd Floor)
CS001: Recent Advances in Reinforcement Learning
Room: B603 (3rd Floor)
CS006: Interdisciplinary and Applied Research: Statistical Analysis on Medical Data and Models
Room: B606 (3rd Floor)
July 12, 16:00 - 18:05
IS072: Statistical Interdisciplinary Studies I
Organizer: Song Xi Chen, Room: Yulan Hall (1st Floor)
IS050: Recent Advances in Functional and Complex Data
Organizer: Fang Yao, Room: A201-202 (2nd Floor)
IS071: Statistical Inference for High-Dimensional Data
Organizer: Tracy Ke, Room: A216-217 (2nd Floor)
IS006: AI and Machine Learning in Single Cell Genomics
Organizer: Hongzhe Li, Room: A301-302 (3rd Floor)
IS054: Recent Advances in Statistical Machine Learning
Organizer: Zhenhua Lin, Room: A320-321 (3rd Floor)
IS060: Recent Developments in Complex Time Series Analysis
Organizer: Linda Zhao, Room: A203 (2nd Floor)
IS045: Recent Advancements in Large Network and Tensor Data Analysis
Organizer: Tracy Ke, Room: A218 (2nd Floor)
IS083: Limit Theory of Large Dimensional Random Matrices
Organizer: Shurong Zheng, Room: A303 (3rd Floor)
IS099: Economic Statistics and Research on High-Quality Development
Organizer: Hu Zhang, Room: A305 (3rd Floor)
IS091: Model Averaging and Related Topics
Organizer: Xinyu Zhang, Room: A306 (3rd Floor)
IS079: The Interplay Between Statistical Inference and Data-Driven Decision Making
Organizer: Zhimei Ren, Room: A307 (3rd Floor)
IS011: Data Science and Engineering
Organizer: Jian Shi, Room: A308 (3rd Floor)
IS051: Recent Advances in High-Dimensional and Heterogeneous Data Analysis
Organizer: Xinyuan Song, Room: A315 (3rd Floor)
IS062: Recent Developments in the Analysis of High-Dimensional and Complex Data
Organizer: Jinyuan Chang, Room: A316 (3rd Floor)
CS007: Precision Medicine and Survival Data
Room: A317 (3rd Floor)
CS008: Factor Models and Bayesian Analysis
Room: A318 (3rd Floor)
CS009: Statistical Machine Learning: Methodology and Applications
Room: A322 (3rd Floor)
CS010: Recent Advances in Statistical Learning Methods
Room: B601 (3rd Floor)
CS011: Causal Inference and Applications
Room: B603 (3rd Floor)
CS012: Recent Advances in Statistical Inference
Room: B606 (3rd Floor)
July 13, 08:30 - 10:10
Plenary Talk 4: Build an End-to-End Scalable and Interpretable Data Science Ecosystem by Integrating Statistics, ML, and Domain Sciences
Xihong Lin, Room: Lecture Hall (1st Floor)
Plenary Talk 5: Statistics and its Applications in Forensic Science and the Criminal Justice System
Alicia Carriquiry, Room: Lecture Hall (1st Floor)
July 13, 10:30 - 11:20
Plenary Talk 6: Generative Adversarial Learning with Optimal Input Dimension and Its Adaptive Generator Architecture
Huazhen Lin, Room: Lecture Hall (1st Floor)
July 13, 14:00 - 15:40
IS053: Recent Advances in Statistical Learning
Organizer: Tony Cai, Room: Yulan Hall (1st Floor)
IS022: Frontiers of Statistical Machine Learning
Organizer: Annie Qu, Room: A201-202 (2nd Floor)
IS036: New Advances in Complex Data Analyses
Organizer: Peter Song, Room: A216-217 (2nd Floor)
IS037: New Statistical Methods for Causal Inference and Hidden Factor Learning
Organizer: Fang Yao, Room: A301-302 (3rd Floor)
IS063: Recent Topics in Machine Learning
Organizer: Zijian Guo, Room: A320-321 (3rd Floor)
IS025: Independence Test and Association Analysis
Organizer: Liping Zhu, Room: A203 (2nd Floor)
IS066: Semiparametric Modeling for Complex Survival Data
Organizer: Xinyuan Song, Room: A218 (2nd Floor)
IS070: Statistical Inference for Biological and Medical Data
Organizer: Qizhai Li, Room: A303 (3rd Floor)
IS078: Statistics in Earth Science Applications
Organizer: Song Xi Chen, Room: A305 (3rd Floor)
IS065: Robust Inference in High-Dimensional Complex Data
Organizer: Zhanrui Cai, Room: A306 (3rd Floor)
IS052: Recent Advances in Sequencing and Imaging Data Analysis
Organizer: Anru Zhang, Room: A307 (3rd Floor)
IS085: Industrial Big Data and Intelligent Statistical Analysis
Organizer: Jianping Zhu, Room: A308 (3rd Floor)
IS076: Statistical Learning on Multi-Source and Complicated Data
Organizer: Niansheng Tang, Room: A315 (3rd Floor)
SS1: Bernoulli Session on Statistical Methodology & Theory
Organizer: Jeff Yao, Room: A316 (3rd Floor)
IS056: Recent Advances in Statistical Network Analysis - Methodology and Applications
Organizer: Ji Zhu, Room: A317 (3rd Floor)
IS102: Statistical Measurement, Evaluation and Decision
Organizer: Weihua Su, Room: A318 (3rd Floor)
CS019: Statistical Applications in Economics and Medicine
Room: A322 (3rd Floor)
CS016: Statistical Inference in Complex Data Analysis
Room: B601 (3rd Floor)
CS017: Statistical Modeling for Complex Networks
Room: B603 (3rd Floor)
CS018: Complex Data Analysis
Room: B606 (3rd Floor)
July 13, 16:00 - 18:05
IS048: Recent Advances in Deep Learning Theory
Organizer: Huazhen Lin, Room: Yulan Hall (1st Floor)
IS082: Trustworthy AI
Organizer: Annie Qu, Room: A201-202 (2nd Floor)
IS021: Foundation Models in Large-Scale Biomedical Studies
Organizer: Ting Li, Room: A216-217 (2nd Floor)
IS077: Statistical Theory and Learning
Organizer: Chinese Society for Probability and Statistics, Room: A301-302 (3rd Floor)
IS027: Innovative Statistical Methods for Heterogeneous Data
Organizer: Lan Wang, Room: A320-321 (3rd Floor)
IS040: Novel Applications in Biostatistics
Organizer: Annie Qu, Room: A203 (2nd Floor)
IS073: Statistical Interdisciplinary Studies II
Organizer: Song Xi Chen, Room: A218 (2nd Floor)
IS087: Data Science and Business Intelligence Statistical Analysis
Organizer: Jianping Zhu, Room: A303 (3rd Floor)
IS013: Design and Modeling for Computer Experiments
Organizer: Niansheng Tang, Room: A305 (3rd Floor)
IS001: Advanced Estimation Methods and Machine Learning
Organizer: Xingqiu Zhao, Room: A306 (3rd Floor)
IS003: Advancements in Statistical Inference of Point Processes and Their Applications
Organizer: Jiancang Zhuang, Room: A307 (3rd Floor)
IS007: Asymptotic Theory and High-Dimensional Statistics
Organizer: Takeru Matsuda, Room: A308 (3rd Floor)
IS092: Statistical Network Analysis and Its Application
Organizer: Jialiang Li, Room: A315 (3rd Floor)
IS009: Complex Data, Geometry and Related Fields
Organizer: Zhigang Yao, Room: A316 (3rd Floor)
IS067: Statistical Analysis with Complex Data
Organizer: Qihua Wang, Room: A317 (3rd Floor)
IS102: Statistical Measurement, Evaluation and Decision
Organizer: Weihua Su, Room: A318 (3rd Floor)
CS021: Matrix Theory and Sufficient Dimension Reduction
Room: A322 (3rd Floor)
CS022: Asymptotic Theory in Probability and Statistics
Room: B601 (3rd Floor)
CS023: Recent Advances in Graphical Models and Images
Room: B603 (3rd Floor)
CS024: Statistical Modeling for Complex Data
Room: B606 (3rd Floor)
July 14, 08:30 - 10:10
IS023: Deep Generative Models
Organizer: Jian Huang, Room: Yulan Hall (1st Floor)
IS096: New Statistical and Machine Learning Methods for Complex Data
Organizer: Depeng Jiang, Room: A201-202 (2nd Floor)
IS024: High-Dimensional Statistical Learning
Organizer: Chinese Society for Probability and Statistics, Room: A216-217 (2nd Floor)
IS019: Financial Machine Learning
Organizer: Xinghua Zheng, Room: A301-302 (3rd Floor)
IS008: Causal Inference in Observational Studies
Organizer: Chinese Society for Probability and Statistics, Room: A320-321 (3rd Floor)
IS002: Advancements in Integrative Statistical Inference
Organizer: Wenguang Sun, Room: A203 (2nd Floor)
IS047: Recent Advances in Data Integration in Survey Sampling
Organizer: Jae-Kwang Kim, Room: A218 (2nd Floor)
IS049: Recent Advances in Efficient and Fair Machine Learning
Organizer: Anru Zhang, Room: A303 (3rd Floor)
IS057: Recent Advances in Statistical Network Analysis - Theory and Methodology
Organizer: Ji Zhu, Room: A305 (3rd Floor)
IS017: Financial and Macroeconometrics
Organizer: Zhijie Xiao, Room: A306 (3rd Floor)
IS020: Foundation Models in Modern Industries
Organizer: Jinhan Xie, Room: A307 (3rd Floor)
IS029: Interface of Functional Data Analysis and Dynamic Models
Organizer: Jiguo Cao, Room: A308 (3rd Floor)
IS033: Modeling and Statistical Inference of Medical Big Data
Organizer: Chinese Society for Probability and Statistics, Room: A315 (3rd Floor)
CS002: Recent Advances in Deep Learning
Room: A316 (3rd Floor)
CS025: Complex Data Modeling
Room: A317 (3rd Floor)
CS026: Change-Point Detection
Room: A318 (3rd Floor)
CS027: Nonparametric Statistical Inference
Room: A322 (3rd Floor)
CS028: Recent Advances in Large-Scale Data
Room: B601 (3rd Floor)
CS029: Interdisciplinary and Applied Research: Statistical Analysis on Medical and Economic Data
Room: B603 (3rd Floor)
CS030: Statistical Modeling and Its Applications
Room: B606 (3rd Floor)
July 14, 10:30 - 12:00
IS038: New Statistical Methods for Complex Imaging and Genetics Data
Organizer: Fang Yao, Room: Yulan Hall (1st Floor)
IS035: Network Analysis and Cluster Analysis
Organizer: Anderson Zhang, Room: A201-202 (2nd Floor)
IS014: Dynamic and Reinforcement Learning
Organizer: Jialiang Li, Room: A216-217 (2nd Floor)
IS043: Panel Data and Microeconometrics
Organizer: Zhijie Xiao, Room: A301-302 (3rd Floor)
IS034: Modern Statistical Methods for Causal Inference
Organizer: Jae-Kwang Kim, Room: A320-321 (3rd Floor)
IS018: Financial Big Data
Organizer: Xinghua Zheng, Room: A203 (2nd Floor)
IS069: Statistical Inference Beyond Euclidean Spaces
Organizer: Xueqin Wang, Room: A218 (2nd Floor)
IS041: Omics and Big Data in Medical Research
Organizer: Feng Chen, Room: A303 (3rd Floor)
IS042: Optimality Consideration in Modern Statistical Inference
Organizer: Arlene Kim, Room: A305 (3rd Floor)
IS061: Recent Developments in Conformal Inference and Causal Inference
Organizer: Wenguang Sun, Room: A306 (3rd Floor)
IS039: Nonlinear Probability and Statistics for Machine Learning
Organizer: Zengjing Chen, Room: A307 (3rd Floor)
IS010: Data Privacy and Statistical Modeling
Organizer: Zhigang Li, Room: A308 (3rd Floor)
IS058: Recent Development in Complex Data Analysis
Organizer: Xingdong Feng, Room: A315 (3rd Floor)
CS015: High Dimensional Statistical Inference
Room: A316 (3rd Floor)
CS031: Advances in Statistical Methods for Large and Complex Data
Room: A317 (3rd Floor)
CS032: Complex Data Analysis
Room: A318 (3rd Floor)
CS033: Statistical Modeling and Application of Complex Data
Room: A322 (3rd Floor)
CS034: Statistical Applications in Interdisciplinary Research
Room: B601 (3rd Floor)
CS035: Model Averaging/Cross Disciplinary Research in Statistics
Room: B603 (3rd Floor)
CS036: Feature Screening and High Dimensional Data
Room: B606 (3rd Floor)
July 14, 14:00 - 15:40
IS032: Model-Agnostic Statistical Inference
Organizer: Changliang Zou, Room: Yulan Hall (1st Floor)
IS089: Statistical Research on the Digital Economy and Its Impact
Organizer: Xiuying Ma, Room: A201-202 (2nd Floor)
IS046: Recent Advances in Causal Inference
Organizer: Zijian Guo, Room: A216-217 (2nd Floor)
IS093: Mathematical Foundations in AI
Organizer: Qian Lin, Room: A301-302 (3rd Floor)
IS059: Recent Developments in Causal Learning
Organizer: Huazhen Lin, Room: A320-321 (3rd Floor)
IS030: Large-Scale Inference and Private Statistical Analysis
Organizer: Qihua Wang, Room: A203 (2nd Floor)
IS068: Statistical Applications in Behavioral Decision and Behavioral Experiments
Organizer: Lei Shi, Room: A218 (2nd Floor)
IS074: Statistical Learning for Complex and Challenging Data
Organizer: Jianxin Pan, Room: A303 (3rd Floor)
IS081: Theory, Method and Application for Major Problems in Statistical Modernization of China
Organizer: Yanyun Zhao, Room: A305 (3rd Floor)
IS084: Statistical Modeling and Inference of High-Dimensional Complex Data
Organizer: Xingdong Feng, Room: A306 (3rd Floor)
CS005: Recent Advances in Bayesian Analysis
Room: A307 (3rd Floor)
CS037: Recent Advances in Quantile Regression
Room: A308 (3rd Floor)
CS038: Advances in Statistical Methods for Complex Data
Room: A315 (3rd Floor)
CS039: Advances in Missing Data and Treatment Effects
Room: A316 (3rd Floor)
CS040: Bayesian and Machine Learning
Room: A317 (3rd Floor)
CS014: Ultrahigh Dimensional Statistical Inference
Room: A318 (3rd Floor)
CS042: Bayesian and Nonparametric Statistical Inferences
Room: A322 (3rd Floor)
CS043: Recent Advances in Differentially Private and Complex Data Model
Room: B603 (3rd Floor)
CS044: Complex Data Model
Room: B606 (3rd Floor)
July 14, 16:00 - 17:40
IS064: Research on the Statistics and Development of Networked Economic and Social Systems in the Context of Digital Intelligence Technology
Organizer: Yanyun Zhao, Room: Yulan Hall (1st Floor)
IS055: Recent Advances in Statistical Machine Learning: Theory and Applications
Organizer: Rong Ma, Room: A201-202 (2nd Floor)
IS044: Progress in Best Subset Selection
Organizer: Xueqin Wang, Room: A216-217 (2nd Floor)
IS098: Spatial and Network Econometrics
Organizer: Xingbai Xu, Room: A301-302 (3rd Floor)
IS100: Doctoral Dissertation in Statistical Machine Learning
Organizer: Zhihua Zhang, Room: A320-321 (3rd Floor)
IS088: Special Topic on Data Asset Accounting
Organizer: Haiqi Lv, Room: A203 (2nd Floor)
IS086: Business Big Data Analysis and Application
Organizer: Xiuying Ma, Room: A218 (2nd Floor)
CS013: Statistical Inference for Functional/Time Series Data
Room: A303 (3rd Floor)
CS041: Biostatistics and Industrial Statistics
Room: A305 (3rd Floor)
CS053: Complex Statistical Models and Their Applications
Room: A306 (3rd Floor)
CS045: Bayesian and Causal Inference
Room: A307 (3rd Floor)
CS046: Recent Advances in Large Models and Artificial Intelligence
Room: A308 (3rd Floor)
CS047: Statistical Inference in Complex Data
Room: A315 (3rd Floor)
CS048: Factor Model and Community Network
Room: A316 (3rd Floor)
CS049: Bayesian and Machine Learning
Room: A317 (3rd Floor)
CS020: Optimal Subsampling
Room: A318 (3rd Floor)
CS050: Mathematical Statistics and Industrial Statistics
Room: A322 (3rd Floor)
CS051: Quantile Regression and Dimension Reduction
Room: B603 (3rd Floor)
CS052: Statistical Models and Methods in Economics and Finance
Room: B606 (3rd Floor)
Poster
Poster 001-027, South Rest Area (1st Floor)
July 12, 9:00-11:50
Plenary Talk 1: Neural Causal AI: Adversarial
Invariance Learning from Heterogeneous Environments
Jianqing Fan
Princeton University
Abstract: This talk develops nonparametric invariance and causal learning from multiple-environment regression models, in which data from heterogeneous experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments; yet the conditional expectation of the outcome given the unknown set of important or quasi-causal variables is invariant across environments. Our idea of invariance and causal learning is to find a set of variables that is as exogenous as possible across multiple environments while minimizing the empirical loss. To realize this idea, we propose a Neural Adversarial Invariant Learning (NAIL) framework, in which the unknown regression is represented by a ReLU network and invariance across multiple environments is tested using adversarial neural networks. Leveraging the representation power of neural networks, we introduce neural causal networks based on a focused adversarial invariance regularization (FAIR) and its novel training algorithm. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables and that the resulting procedure is adaptive to low-dimensional composition structures. The combinatorial optimization problem is implemented by a Gumbel approximation with decreasing temperature and stochastic approximations. The procedures are convincingly demonstrated using simulated examples.
Joint work with Cong Fang, Yihong Gu, and Peter Bühlmann.
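The invariance idea can be summarized schematically. The display below uses our own illustrative notation under simplifying assumptions; it is not the talk's exact formulation. The quasi-causal set S is required to satisfy:

```latex
% Schematic invariance condition across environments e (illustrative notation):
\[
\mathbb{E}\big[\, Y \mid X_S = x,\; e \,\big] \;=\; m^{*}(x)
\quad \text{for all environments } e \in \mathcal{E},
\]
% while the joint law of (Y, X) may vary with e. An adversarial invariance
% regularization of the FAIR type can then be written schematically as
\[
\min_{g}\; \sum_{e \in \mathcal{E}} \widehat{R}_e(g)
\;+\; \gamma\, \max_{f \in \mathcal{F}} \widehat{J}(g, f),
\]
% where \widehat{R}_e is the empirical risk in environment e and the
% adversary f probes violations of the invariance condition above.
```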
Plenary Talk 2: Genomic Testing in the Presence of
Unmeasured Confounding and Missing Data
Kathryn Roeder
Carnegie Mellon University
Abstract: When aiming to identify differential genomic outcomes such as gene expression or protein abundance, thousands of simultaneous hypothesis tests are routinely performed. These tests can be biased by the presence of unmeasured confounders and missing data. Recent advances in scRNA-Seq and CRISPR technologies have allowed for the study of case vs. control and the characterization of experimental perturbations at single-cell resolution, further exacerbating these challenges. We develop a large-scale hypothesis testing solution for multivariate generalized linear models in the presence of confounding effects. Next, realizing that a number of advantages can be accrued by taking a causal inference approach, we expand this solution by exploring doubly robust and proximal inference options as well. As genomic studies progress from studying transcriptomic to proteomic readouts, new challenges have arisen, most notably large numbers of missing values. A common strategy to address this issue is to rely on an imputed dataset, which often introduces systematic bias into downstream analyses. By contrast, we develop a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework relies on powerful machine learning tools, such as variational autoencoders, to augment the imputation quality with high-dimensional peptide data.
Plenary Talk 3: Multi-Scale Spatial-Temporal Data
in Brain Science: Data, Model and Theory
Jianfeng Feng
Fudan University
Abstract: In brain science, we have accumulated many huge datasets, spanning subcellular, cellular, and tissue-level data with multi-spatial structures and evolving on multiple temporal scales. Developing statistical approaches to tackle these structured and often noncontinuous data (point processes or point fields) is a challenging issue. We will first review some of the existing methods to analyze the data. Many novel applications are included to explain the challenging issues we are facing at the moment. We then introduce the digital twin approach to model the whole human brain, with 86B neurons and 100T parameters being estimated. Finally, using first principles, a new type of neural network, the moment neuronal network approach, is covered to better approximate the biological neuron network and potentially lead to AGI. Our talk serves as a typical showcase of how we, as applied mathematicians, can contribute to and help the development of a data-rich area.
July 12, 14:00-15:40
Invited Session IS016: Exploring the Frontiers of
Machine Learning: Algorithms, Theoretical Insights and Applications
A Very Dutch Scandal: Did Overhyped Stats Ruin
Dutch Appetites?
Fengnan Gao
University College Dublin
Abstract: Applying simple linear regression models, an economist analyzed a published dataset from an influential annual ranking (2016 and 2017) of consumer outlets for Dutch New Herring and concluded that the ranking was manipulated. His finding was promoted by his university in national and international media, and this led to public outrage and the ensuing discontinuation of the survey. We reconstitute the dataset, correcting errors and exposing features already important in a descriptive analysis of the data. The economist has continued his investigations, and in a follow-up publication repeats the same accusations. We point out errors in his reasoning and show that the alleged evidence for deliberate manipulation of the ranking could easily be an artifact of specification errors. Temporal and spatial factors are both important and complex, and their effects cannot be captured using simple models, given the small sample sizes and the many factors determining the perceived taste of a food product. The talk is based on the journal version published in Scandinavian Journal of Statistics and the cover story published in the August 2023 issue of Significance.
A Statistical Framework of Watermarks for Large
Language Models: Pivot, Detection Efficiency and
Optimal Rules
Xiang Li
University of Pennsylvania
Abstract: Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key, provided by the LLM to the verifier, to enable control of the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression for the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks, one of which has been internally implemented at OpenAI, and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated, through numerical experiments, to be competitive and sometimes to enjoy higher power than existing detection approaches.
Joint work with Feng Ruan, Huiyuan Wang, Qi Long, and Weijie Su.
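To make the pivot idea concrete, here is a minimal sketch. The particular pivot scoring, the i.i.d. Uniform(0,1) null behavior, and the sum-based rule are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.stats import norm

def detect_watermark(pivots, alpha=0.01):
    """Sum-based watermark detection from per-token pivotal statistics.

    pivots: per-token pivot values; under H0 (human-written text, scored
    with the verifier's secret key) they are assumed i.i.d. Uniform(0, 1).
    Returns True if H0 is rejected at false positive rate alpha.
    """
    pivots = np.asarray(pivots, dtype=float)
    n = len(pivots)
    # Score each token; h(u) = -log(1 - u) is one natural choice: it is
    # Exp(1)-distributed under Uniform(0, 1), with mean 1 and variance 1.
    scores = -np.log1p(-np.clip(pivots, 0.0, 1.0 - 1e-12))
    # Normal approximation to the null distribution of the total score.
    z = (scores.sum() - n) / np.sqrt(n)
    return z > norm.ppf(1.0 - alpha)

# Illustration: uniform (human-like) pivots vs. pivots shifted toward 1
# (as a watermark would induce).
rng = np.random.default_rng(0)
print(detect_watermark(rng.uniform(size=300)))          # typically False
print(detect_watermark(rng.uniform(size=300) ** 0.5))   # typically True
```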
Understanding the Implicit Bias of Stochastic
Gradient Descent: A Dynamical Stability Perspective
Lei Wu
Peking University
Abstract: In deep learning, models are often over-parameterized, which leads to concerns about algorithms picking solutions that generalize poorly. Fortunately, stochastic gradient descent (SGD) always converges to solutions that generalize well even without needing any explicit regularization, suggesting certain "implicit regularization" at work. This talk will provide an explanation of this striking phenomenon from a stability perspective. Specifically, we show that a stable minimum of SGD must be flat, as measured by various norms of the local Hessian. Furthermore, these flat minima provably generalize well for two-layer neural networks and diagonal linear networks. As opposed to popular continuous-time analyses, our stability analysis respects the discrete nature of SGD and can explain the effect of finite learning rates and batch sizes, and why SGD often generalizes better than GD.
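For readers new to the dynamical-stability viewpoint, the classical quadratic case conveys the flavor; the condition below is a standard textbook fact, not the talk's full result.

```latex
% For a quadratic loss with Hessian H, gradient descent
% \theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) is linearly stable at a
% minimum iff every eigenvalue of I - \eta H lies in [-1, 1], i.e.
\[
\eta\,\lambda_{\max}(H) \le 2
\quad\Longleftrightarrow\quad
\lambda_{\max}(H) \le \frac{2}{\eta},
\]
% so minima that survive training with learning rate \eta are necessarily
% flat in the sense of bounded sharpness \lambda_{\max}(H). For SGD,
% gradient noise tightens this requirement in a batch-size-dependent way,
% which the talk's analysis makes precise via norms of the local Hessian.
```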
Algorithms and Incentives in Statistics
Haifeng Xu
University of Chicago
Abstract: A generic question in statistics is to design approaches that take data as input and output estimates of certain parameters or predictions of some quantities. The standard paradigm often assumes these data are objectively generated from distributions, without being affected by any human factors. However, this paradigm ceases to be true when our predictions or estimated parameters will in turn affect the data providers' welfare. In such situations, data providers have incentives to alter the data for their own benefit. Thus the design of any statistical method must account for potential data manipulations due to data providers' incentives. This talk will introduce a general "incentive-aware" framework for designing prediction methods. I will illustrate this design paradigm with two examples: (1) a very recent and timely application of eliciting authors' truthful private information for improving the peer review systems of today's massive-scale machine learning conferences; (2) a very classic problem of PAC-learning classifiers, but with strategic providers of data features. In both problems, I will illustrate how the presence of incentives can fundamentally change the problem's statistical efficiency and how algorithms can help to overcome some statistical barriers.
Invited Session IS095: Modern Statistical Learning
for High-Dimensional Data
Adaptive Shrinkage Estimation for High-Dimensional Change Point Detection
Yingxing Li
Xiamen University
Abstract: In this paper, we propose an adaptive sparse group LASSO estimator for high-dimensional change point detection. Our method can simultaneously estimate the change structure as well as the model parameters for different types of change point patterns and signal strengths. The penalty parameters are determined by data-driven algorithms. As a result, it is not necessary to know, or pretest, whether a change point is present or where it occurs. We establish the theoretical properties of the proposed estimator even when the magnitude of the change shrinks to zero. Simulation studies and an empirical application demonstrate the excellent performance of our approach.
Joint work with Xiangfu Luo.
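For reference, a generic sparse group LASSO objective has the following form. This is the textbook formulation; the paper's adaptive weights and change-point parameterization are more elaborate.

```latex
% Sparse group LASSO with groups g = 1, ..., G and group coefficients
% \beta_g: \lambda_1 promotes within-group sparsity, \lambda_2 promotes
% group-level sparsity (w_g are optional group weights).
\[
\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
\;+\; \lambda_1 \lVert \beta \rVert_1
\;+\; \lambda_2 \sum_{g=1}^{G} w_g\, \lVert \beta_g \rVert_2 .
\]
% In the change-point setting, a group can collect the parameter shifts
% attached to one candidate change location, so a zero group means no
% change at that location.
```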
A Unifying Dependent Combination Framework
with Applications to Association Tests
Xiufan Yu
University of Notre Dame
Abstract: We introduce a novel meta-analysis framework to combine dependent tests under a general setting, and utilize it to synthesize various association tests that are calculated from the same dataset. Our development builds upon the classical meta-analysis methods of aggregating p-values and also a more recent general method of combining confidence distributions, but makes generalizations to handle dependent tests. The proposed framework ensures rigorous statistical guarantees, and we provide a comprehensive study and compare it with various existing dependent combination methods. Notably, we demonstrate that the widely used Cauchy combination method for dependent tests, referred to as the vanilla Cauchy combination in this article, can be viewed as a special case within our framework. Moreover, the proposed framework provides a way to address the problem when the distributional assumptions underlying the vanilla Cauchy combination are violated. Our numerical results demonstrate that ignoring the dependence among the to-be-combined components may lead to a severe size distortion phenomenon. Compared to the existing p-value combination methods, including the vanilla Cauchy combination method, the proposed combination framework can handle the dependence accurately and utilizes the information efficiently to construct tests with accurate size and enhanced power.
Joint work with Linjun Zhang, Arun Srinivasan, Min-ge Xie, and Lingzhou Xue.
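For concreteness, here is a minimal sketch of the vanilla Cauchy combination discussed above; the transform is the standard published formula, while the weights and input p-values are illustrative.

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvals, weights=None):
    """Vanilla Cauchy combination of (possibly dependent) p-values.

    Each p-value is transformed to a standard Cauchy variable and the
    weighted sum is formed; its tail is approximately standard Cauchy
    even under dependence, yielding a combined p-value.
    """
    p = np.asarray(pvals, dtype=float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))  # Cauchy-transformed statistic
    return cauchy.sf(t)                         # combined p-value

print(cauchy_combination([0.01, 0.40, 0.20, 0.03]))
```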
Communication-Efficient and Distributed-Oracle
Estimation for High-Dimensional Quantile Regression
Songshan Yang
Renmin University of China
Abstract: In this article, we present a novel communication-efficient estimator for distributed high-dimensional quantile regression with folded-concave penalties. An iterative multi-step (IM) algorithm is employed to tackle the nonconvexity of the objective function, taking into account both statistical accuracy and communication constraints. We demonstrate that the proposed IM estimators share similar properties with those of the global folded-concave penalized estimator. To establish the theoretical results, we introduce a new concept called the distributed-oracle estimator. We prove that the proposed estimator converges to the distributed-oracle estimator with high probability. Compared to the L1-penalized method, the proposed estimator possesses a faster rate of convergence and requires milder conditions to achieve support recovery. Furthermore, we extend our framework to facilitate distributed inference for preconceived low-dimensional components within the high-dimensional model. We derive the limiting distribution of the corresponding test statistic under the null hypothesis and local alternatives. In addition, a new feature-splitting algorithm is devised to accommodate high-dimensional data within the distributed system. Extensive numerical studies demonstrate the effectiveness and validity of our proposed estimation and inference methods. A real example is also presented for illustration.
Joint work with Yifan Gu, Hanfang Yang and Xuming He.
Efficient Learning of Directed Acyclic Graphs in
Heavy-Tailed Data
Wei Zhou
Southwestern University of Finance and Economics
Abstract: Directed acyclic graph (DAG) models are widely used to discover causal relationships among random variables. However, most existing DAG learning algorithms are not directly applicable to heavy-tailed data, which are commonly observed in finance and other fields. In this article, we propose a two-step, topological-layers-based efficient algorithm to learn linear DAGs with heavy-tailed error distributions, which include the Pareto, Fréchet, log-normal, and Cauchy distributions, among others. First, we reconstruct the topological layers hierarchically in a top-down fashion, based on a new reconstruction criterion for heavy-tailed DAGs, without assuming the popularly employed faithfulness condition. Second, we recover the directed edges via modified conditional independence testing for heavy-tailed distributions. We theoretically demonstrate that the exact DAG structure is recovered consistently. Monte Carlo simulations validate the outstanding finite-sample performance of the proposed algorithm compared with competing methods. In the real data analysis, we analyze the exchange rates among 17 OECD countries and uncover the sources of financial contagion and its pathways, some parts of which may not be detected by existing methods in empirical finance. This helps to identify several currencies as good options for risk diversification and to reduce global systemic risk.
Joint work with Xueqian Kang, Wei Zhong, Junhui Wang.
Invited Session IS005: AI and Machine Learning in
Complex Biomedical Data
Tackling Biased, Incomplete Data in Electronic
Health Records
Qi Long
University of Pennsylvania
Abstract: Electronic health records (EHR), routinely collected as part of healthcare delivery, have great potential to be utilized to advance precision medicine. They contain multiple years of health information to be leveraged for risk prediction, disease detection, and treatment evaluation. However, they do not have a consistent, standardized format across institutions, particularly in the United States, and can present significant analytical challenges: they contain multi-scale data from heterogeneous domains and include both structured and unstructured data. Data for individual patients are collected at irregular time intervals and with varying frequencies. In addition, EHR can reflect inequity; for example, patients with less access to healthcare, often people of color or with lower socioeconomic status, tend to have more incomplete data in EHR. Many of these issues can contribute to biased data collection. In this talk, I will share our recent research on developing AI/ML models for addressing biased, incomplete data in EHR, including more accurate assessment of the harmful impact of incomplete EHR data on algorithmic fairness, the challenges associated with mitigating such bias, and potential strategies.
Bias Correction Models for Electronic Health
Records Data in the Presence of Non-random
Sampling
Judy Zhong
New York University
Abstract: Electronic health records (EHRs) contain rich clinical information for millions of patients and are increasingly used for public health research. However, non-random inclusion of subjects in EHRs can result in selection bias, with factors such as demographics, socioeconomic status, healthcare referral patterns, and underlying health status playing a role. While this issue has been well documented, little work has been done to develop or apply bias-correction methods, often because most of these factors are unavailable in EHRs. To address this gap, we propose a series of Heckman-type bias correction methods that incorporate social determinants of health as selection covariates to model the EHR non-random sampling probability. Through simulations under various settings, we demonstrate the effectiveness of our proposed method in correcting biases in both the association coefficient and the outcome mean. Our method augments the utility of EHRs for public health inference, as we show by estimating the prevalence of cardiovascular disease and its correlation with risk factors in the New York City network of EHRs.
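As background, the classical Heckman two-step correction on which this family of methods builds can be sketched as follows. This is a minimal illustration on synthetic data; the covariates, coefficients, and the single selection variable are our assumptions, whereas the paper's models bring in social determinants of health as selection covariates.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Synthetic data: the outcome y is observed only when s = 1, and the
# selection error u is correlated with the outcome error v (the bias source).
rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                       # selection covariate (SDOH proxy)
x = rng.normal(size=n)                       # outcome covariate
u, v = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n).T
s = (0.5 + 1.0 * z + u) > 0                  # non-random inclusion in the EHR
y = 1.0 + 2.0 * x + v                        # outcome model

# Step 1: probit selection model, then the inverse Mills ratio (IMR).
Z = sm.add_constant(z)
probit = sm.Probit(s.astype(float), Z).fit(disp=0)
xb = Z @ probit.params                       # linear predictor of selection
imr = norm.pdf(xb) / norm.cdf(xb)

# Step 2: outcome regression on the selected sample, with the IMR added as
# a bias-correction regressor.
X = sm.add_constant(np.column_stack([x[s], imr[s]]))
print(sm.OLS(y[s], X).fit().params)          # approx (1, 2, 0.6) after correction
```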
Federated Efficient Estimation of Average Treatment Effects
Rui Duan
Harvard University
Abstract: The expanding opportunities for multi-institutional collaborative research and data integration bring important opportunities for statistical learning and inference, but also present significant challenges. This talk addresses the issues of integrating heterogeneous data from multiple sources to estimate and infer treatment effects for a specific target population. We explore critical concerns such as data heterogeneity, model misspecification, and barriers to data sharing. To overcome these obstacles, we introduce methods that adapt to source-specific heterogeneity in conditional outcome distributions. Our decentralized approaches allow each site to share only summary statistics, achieving asymptotic efficiency equivalent to using combined individual-level data. This allows us to estimate the average treatment effect without compromising privacy. We present results from both theoretical and empirical investigations that assess the performance of our proposed methods across various settings. Additionally, we discuss the real-world implementation of these methods in large-scale nationwide clinical databases, highlighting the effectiveness of our approach in diverse and complex data environments.
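As a reference point for the efficiency claims, recall the classical augmented inverse probability weighting (AIPW) form of the efficient ATE estimator on pooled individual-level data (standard notation; the talk's federated estimators match this efficiency while sharing only summary statistics):

```latex
% AIPW estimator of the average treatment effect, with outcome models
% \widehat{m}_a(x) = \widehat{E}[Y \mid X = x, A = a] and propensity score
% \widehat{e}(x) = \widehat{P}(A = 1 \mid X = x):
\[
\widehat{\tau} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[
\widehat{m}_1(X_i) - \widehat{m}_0(X_i)
+ \frac{A_i\,\big(Y_i - \widehat{m}_1(X_i)\big)}{\widehat{e}(X_i)}
- \frac{(1 - A_i)\,\big(Y_i - \widehat{m}_0(X_i)\big)}{1 - \widehat{e}(X_i)}
\right].
\]
```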
Invited Session IS026: Innovative Statistical
Methods for Data with Complex Structures
A Stability Approach for Feature Selection with
False Discovery Rate Control
Wei Zhong
Xiamen University
Abstract: In this talk, we first give an overview of false discovery rate (FDR) controlling methods for multiple testing and variable selection in high-dimensional data analysis, including the BH method, data-splitting-based methods, knockoffs, etc. Although most of these methods are successfully and widely used in practice, the results of some methods are unstable due to inherent randomness. For example, different runs of model-X knockoffs on the same dataset result in different sets of selected variables due to the randomness of knockoff data generation. Ren and Barber (2023) introduced a derandomized knockoffs method to derandomize model-X knockoffs by leveraging e-values for false discovery rate control. But it has non-negligible drawbacks, such as the need to select two FDR parameters and the tendency to have low power. To make the statistical results stable and reproducible, we introduce a general stability approach for variable selection algorithms with FDR control. Our approach aggregates e-values generated from multiple runs of the base algorithm to construct a stabilized e-value, which leads to higher power without loss of stability. It is very general and can be applied to almost all FDR control methods, such as knockoffs and data-splitting methods. Theoretical properties of this stability method are also studied, such as an asymptotic FDR control guarantee. Extensive numerical experiments and real data applications demonstrate that the proposed method is generally more powerful and stable than existing competitors.
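A minimal sketch of the aggregation idea follows; the base algorithm and its e-values are placeholders (run_base_algorithm is hypothetical), and the paper's construction and theory are more involved. The two facts used are that an average of e-values is again an e-value, and that the e-BH procedure controls the FDR under arbitrary dependence.

```python
import numpy as np

def ebh(evalues, alpha=0.1):
    """e-BH: reject the hypotheses with the k largest e-values, where k is
    the largest i such that e_(i) >= m / (alpha * i); this controls FDR at
    level alpha for any dependence among valid e-values."""
    e = np.asarray(evalues, dtype=float)
    m = len(e)
    order = np.argsort(-e)                      # indices by decreasing e-value
    thresh = m / (alpha * np.arange(1, m + 1))  # m / (alpha * i)
    passing = np.nonzero(e[order] >= thresh)[0]
    if len(passing) == 0:
        return np.array([], dtype=int)
    k = passing.max() + 1
    return np.sort(order[:k])

def stabilized_selection(run_base_algorithm, n_runs=50, alpha=0.1):
    """Average e-values across independent runs of a randomized base
    algorithm (e.g., a knockoffs variant); the average is still a valid
    e-value, so e-BH on the aggregate retains FDR control."""
    e_runs = np.stack([run_base_algorithm() for _ in range(n_runs)])
    return ebh(e_runs.mean(axis=0), alpha=alpha)
```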
High-Dimensional Scale Invariant Discriminant
Analysis
Shurong Zheng
Northeast Normal University
Abstract: In this paper, we propose a scale-invariant linear discriminant analysis classifier for high-dimensional data. The method is valid whether the data dimension is smaller or greater than the sample size, and it is also suitable for missing data. Based on recent advances on the sample correlation matrix in random matrix theory, we derive the asymptotic limits of the error rate, which characterize the influence of the data dimension and the tuning parameter. The major advantage of our proposed classifier is its scale invariance: it is applicable to features with arbitrary variances. Several numerical studies are conducted, and our proposed classifier performs favorably in comparison with some existing methods.
On Functional Processes with Multiple Discontinuities
Yaguang Li
University of Science and Technology of China
Abstract: We consider the problem of estimating multiple change points for a functional data process. There are numerous examples in science and finance in which the process of interest may be subject to sudden changes in the mean. The process data that are not in a close vicinity of any change point can be analysed by the usual nonparametric smoothing methods. However, the data close to change points, which contain the most pertinent information about the structural breaks, need to be handled with special care. This paper considers a half-kernel approach that addresses the inference of the total number, locations, and jump sizes of the changes. Convergence rates and asymptotic distributional results for the proposed procedures are thoroughly investigated. Simulations are conducted to examine the performance of the approach, and a number of real data sets are analysed to provide an illustration.
Structured Feature Ranking for Genomic Marker
Identification Accommodating Multiple Types of
Networks
Xingdong Feng
Shanghai University of Finance and Economics
Abstract: Numerous statistical methods have been developed to search for genomic markers associated with the development, progression, and response to treatment of complex diseases. Among them, feature ranking plays a vital role due to its intuitive formulation and computational efficiency. However, most of the existing methods are based on the marginal importance of molecular predictors and share the limitation that the dependence (network) structures among predictors are not well accommodated, whereas a disease phenotype usually reflects various biological processes that interact in a complex network. In this paper, we propose a structured feature ranking method for identifying genomic markers, where such network structures are effectively accommodated using Laplacian regularization. The proposed method innovatively investigates multiple network scenarios, where the networks can be known a priori or estimated in a data-dependent manner. In addition, we rigorously explore the noise and uncertainty in the networks and control their impact with proper selection of tuning parameters. These characteristics give the proposed method especially broad applicability. Theoretical results for our proposal are rigorously established. Compared to the original marginal measure, the proposed network-structured measure can achieve the sure screening property with a faster convergence rate under mild conditions. Extensive simulations and analysis of The Cancer Genome Atlas melanoma data demonstrate the improved finite-sample performance and practical usefulness of the proposed method.
Joint work with Yeheng Ge, Tao Li, Mengyun Wu.
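For orientation, Laplacian regularization encodes a predictor network in the following standard way (generic form; the paper's ranking measures build on this):

```latex
% With adjacency matrix A and degree matrix D, the graph Laplacian is
% L = D - A, and for a vector b of importance measures,
\[
b^{\top} L\, b \;=\; \tfrac{1}{2} \sum_{j,k} A_{jk}\,(b_j - b_k)^2 ,
\]
% so penalizing b^T L b shrinks the measures of connected predictors toward
% each other, letting the network structure inform the ranking.
```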
Invited Session IS012: Data Science Methods for
Complex Data with Endogeneity and Heterogeneity
BELIEF in Dependence: Leveraging Atomic Linearity in Data Bits for Rethinking Generalized Linear Models
Kai Zhang
University of North Carolina, Chapel Hill
Abstract: Two linearly uncorrelated binary variables must also be independent, because non-linear dependence cannot manifest with only two possible states. This inherent linearity is the atom of dependency constituting any complex form of relationship. Inspired by this observation, we develop a framework called binary expansion linear effect (BELIEF) for understanding arbitrary relationships with a binary outcome. Models from the BELIEF framework are easily interpretable because they describe the association of binary variables in the language of linear models, yielding convenient theoretical insight and striking Gaussian parallels. With BELIEF, one may study generalized linear models (GLM) through transparent linear models, providing insight into how the choice of link affects modeling. For example, setting a GLM interaction coefficient to zero does not necessarily lead to the kind of no-interaction model assumption understood under its linear model counterpart. Furthermore, for a binary response, maximum likelihood estimation for GLMs paradoxically fails under complete separation, when the data are most discriminative, whereas BELIEF estimation automatically reveals the perfect predictor in the data that is responsible for complete separation. We explore these phenomena and provide related theoretical results. We also provide preliminary empirical demonstrations of some theoretical results.
Joint work with Benjamin Brown, Xiao-Li Meng.
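The opening claim admits a one-line verification, included here for completeness (a standard fact): for binary variables, zero covariance already pins down the joint distribution.

```latex
% For X, Y \in \{0, 1\} with p = P(X = 1) and q = P(Y = 1),
\[
\operatorname{Cov}(X, Y) \;=\; \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
\;=\; P(X = 1, Y = 1) - pq ,
\]
% so Cov(X, Y) = 0 forces P(X = 1, Y = 1) = pq; the remaining three cell
% probabilities are then fixed by the margins, giving full independence.
% With only two states, non-linear dependence has no room to hide.
```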
A Scalable, Interpretable, and Data-Driven Approach to Analyzing Unstructured Information
Wu Zhu
Tsinghua University
Abstract: We introduce a general framework for analyzing large-scale text-based data, combining the strengths of neural network language processing and generative statistical modeling to create a factor structure of unstructured data for downstream regressions used in the social sciences. We generate textual factors by (i) representing texts using vector word embeddings, (ii) clustering words using locality-sensitive hashing, and (iii) identifying spanning clusters/factors through topic modeling. Our data-driven approach captures complex linguistic structures while ensuring computational scalability and economic interpretability. We also discuss applications of textual factors in (i) prediction and inference, (ii) interpreting (non-text-based) models and variables, and (iii) constructing new text-based metrics and explanatory variables, with illustrations using topics in finance and economics such as macroeconomic forecasting and factor asset pricing. Finally, we provide a flexible statistical package of textual factors for online distribution to facilitate future applications.
Regularizing BELIEF for Smooth Dependency
Wan Zhang
University of North Carolina, Chapel Hill
Abstract: As the complexity of models and the volume of data increase, interpretable methods for modeling complicated dependence are in great need. A recent framework of binary expansion linear effect (BELIEF) provides a "divide and conquer" approach to decompose any complex form of dependency into small linear regressions over data bits. Although BELIEF can be used to approximate any relationship, it faces an important challenge of high dimensionality. To overcome this obstacle, we propose a novel definition of smoothness for binary interactions and create a regularization of BELIEF under smoothness interpretations. We prove that there is a one-to-one correspondence between each marginal binary interaction and the smoothness we define. Additionally, we show that in higher dimensions the smoothness can be expressed as a product of the smoothness of marginal binary interactions. Based on these observations, we propose to model the smooth form of dependency with a generalized LASSO model that places a larger penalty on less smooth terms. Numerical studies show that the smooth LASSO offers clear interpretability and effectiveness for nonlinear and high-dimensional data.
Joint work with Heyang Ni, Yufeng Liu, Kai Zhang.
Personalized Reinforcement Learning for
Healthcare: With Applications to Sepsis Management in ICU
Linda Zhao
University of Pennsylvania
Abstract: In numerous fields such as healthcare, public policy, and e-commerce, a primary objective is to make multiple decisions simultaneously in a dynamic and personalized fashion. This sequential decision-making process is especially relevant in healthcare for developing personalized treatment plans. The main challenge stems from the dynamic and personalized nature of the process: each patient's history and unique responses to treatments significantly influence their current and future care. To tackle these challenges, we develop a personalized reinforcement learning algorithm that provides optimal and interpretable personalized treatment decisions. Focusing on sepsis management in ICUs, a condition that is the main cause of mortality in hospitals, accounting for more than 20 billion in total costs, yet with no consensus on optimal treatment strategies, we demonstrate the value of our algorithm on ICU data from five Boston hospitals. We show that our algorithm can outperform standard care by providing more effective and personalized treatment plans for sepsis patients, showcasing the potential of our approach to improve outcomes and reduce costs in complex healthcare settings.
Invited Session IS004: Advancing Statistical Frontiers in Data Privacy Protection
Gaussian Differential Privacy on Riemannian
Manifolds
Linglong Kong
University of Alberta
Abstract: We develop an advanced approach for extending Gaussian Differential Privacy (GDP) to general Riemannian manifolds. The concept of GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings, due to its central limit properties. By harnessing the power of the renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian Gaussian distribution that integrates the Riemannian distance, allowing us to achieve GDP on Riemannian manifolds with bounded Ricci curvature. To the best of our knowledge, this work marks the first instance of extending the GDP framework to accommodate general Riemannian manifolds, encompassing curved spaces and circumventing the reliance on tangent space summaries. We provide a simple algorithm to evaluate the privacy budget μ on any one-dimensional manifold and introduce a versatile Markov Chain Monte Carlo (MCMC)-based algorithm to calculate μ on any Riemannian manifold with constant curvature. Through simulations on one of the most prevalent manifolds in statistics, the unit sphere S^d, we demonstrate the superior utility of our Riemannian Gaussian mechanism in comparison with the previously proposed Riemannian Laplace mechanism for implementing GDP.
Joint work with Yangdi Jiang, Xiaotian Chang, Yi Liu, Lei Ding, Bei Jiang.
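For context, μ-GDP is defined through the Gaussian trade-off function; this is the standard definition from the f-DP literature.

```latex
% A mechanism is \mu-GDP if distinguishing any two neighboring datasets is
% at least as hard as testing N(0, 1) vs. N(\mu, 1): at type-I error level
% \alpha, the optimal type-II error is bounded below by
\[
G_{\mu}(\alpha) \;=\; \Phi\!\left( \Phi^{-1}(1 - \alpha) - \mu \right),
\]
% where \Phi is the standard normal CDF; smaller \mu means stronger privacy.
```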
Unveiling Enhanced Privacy in Data Science via
Statistical Methods
Chendi Wang
University of Pennsylvania
Abstract: Recently, f-differential privacy (f-DP), which evaluates the privacy level of an algorithm from a hypothesis testing perspective using trade-off functions, has been established. However, accurately accounting for the privacy level of a privacy-preserving algorithm in practical applications is challenging due to the co-existence of multiple algorithmic modules. In this talk, we demonstrate that f-DP provides state-of-the-art privacy analysis for various applications. We first apply f-DP to assess the privacy level of the U.S. Census data, a critical application of differential privacy. Our analysis shows that achieving the same privacy level requires less noise when using f-DP compared with the zero-concentrated differential privacy method currently used by the Census Bureau, thereby enhancing the utility of privatized Census data. Additionally, we propose an inequality for f-DP to handle mixture distributions caused by machine learning algorithms, which implies the joint convexity of f-divergences. This inequality is shown to be tight in widely used shuffling models. Applying this inequality to federated learning, we demonstrate that f-DP can improve the privacy-utility tradeoff in federated learning.
Joint work with Buxin Su, Xiang Li, Jiayuan Ye, Qi Long, Reza Shokri, Weijie Su.
Differentially Private Estimation and Inference in
High-Dimensional Regression with FDR Control
Zhanrui Cai
The University of Hong Kong
Abstract: This paper presents novel methodologies for conducting practical differentially private (DP) estimation and inference in high-dimensional linear regression. We start by proposing a differentially private Bayesian Information Criterion for selecting the unknown sparsity parameter in DP-sparse linear regression, eliminating the need for prior knowledge of model sparsity, a requisite in the existing literature. We then propose a differentially private debiased algorithm that enables privacy-preserving inference on a particular subset of regression parameters. Our proposed method enables accurate and private inference on the regression parameters by leveraging the inherent sparsity of high-dimensional linear regression models. Additionally, we address private feature selection via multiple testing in high-dimensional linear regression, introducing a differentially private multiple testing procedure that controls the false discovery rate (FDR). This allows for accurate and privacy-preserving identification of significant predictors in the regression model. Through extensive simulations and real data analysis, we demonstrate the efficacy of our proposed methods in conducting inference for high-dimensional linear models while safeguarding privacy and controlling the FDR.
Joint work with Sai Li, Xintao Xia, Linjun Zhang.
Online Local Differential Private Quantile Inference via Self-normalization
Bei Jiang
University of Alberta
Abstract: Based on binary inquiries, we develop an algorithm to estimate population quantiles under Local Differential Privacy (LDP). By self-normalizing, our algorithm provides asymptotically normal estimation with valid inference, resulting in tight confidence intervals without the need to estimate nuisance parameters. Our proposed method can be conducted fully online, leading to high computational efficiency and minimal storage requirements. We also prove an optimality result through an elegant application of a central limit theorem of Gaussian Differential Privacy (GDP) when targeting the frequently encountered median estimation problem. With mathematical proofs and extensive numerical testing, we demonstrate the validity of our algorithm both theoretically and experimentally.
Joint work with Yi Liu, Qirui Hu, Lei Ding, Linglong Kong.
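To convey the flavor of quantile estimation from privatized binary inquiries, here is an illustrative stochastic-approximation sketch under randomized response. The flip probability, step size, and update rule are our assumptions for illustration, not the authors' exact algorithm or its self-normalized inference.

```python
import numpy as np

def ldp_quantile(stream, tau=0.5, eps=1.0, theta0=0.0, c=1.0):
    """Online quantile estimation under local differential privacy.

    Each user i answers the binary inquiry 1{x_i <= theta} via randomized
    response (truthful with probability e^eps / (1 + e^eps), which is
    eps-LDP for a binary answer); the server debiases the answer and takes
    a Robbins-Monro step toward the tau-quantile. Illustrative sketch only.
    """
    rng = np.random.default_rng(0)
    p_true = np.exp(eps) / (1.0 + np.exp(eps))    # prob of truthful report
    theta = theta0
    for t, x in enumerate(stream, start=1):
        b = float(x <= theta)                      # private binary answer
        report = b if rng.random() < p_true else 1.0 - b
        b_hat = (report - (1.0 - p_true)) / (2.0 * p_true - 1.0)  # unbiased for b
        theta -= (c / t) * (b_hat - tau)           # stochastic approximation step
    return theta

rng = np.random.default_rng(1)
print(ldp_quantile(rng.normal(size=20000), tau=0.5, eps=1.0))  # near 0
```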
Invited Session IS101: Statistical Inference and
Computation for Complex Data
Statistical Analysis of Averaged Weighted Gradient
Descent Algorithm for Decentralized Federated
Learning
Yue Chen
Capital University of Economics and Business
Abstract: In recent years, decentralized federated learning has become increasingly important for training collaborative models without sharing sensitive data. Although its numerical convergence theory and communication efficiency have been well developed in the literature, its statistical properties have received little attention, especially for unbalanced data, where the amounts of data across different clients vary greatly. To this end, in this paper we propose an innovative Averaged Weighted Gradient Descent algorithm, AWGD, based on a circle-type network structure. Theoretically, we start with a linear regression model and find that a larger learning rate leads to faster numerical convergence but worse statistical efficiency. The resulting AWGD estimator is asymptotically efficient if the learning rate is appropriate and the data are distributed homogeneously, even if the data are unbalanced. These interesting findings are further extended to general models, general loss functions, and heterogeneous data. Numerically, studies on simulated and real data demonstrate that, under the same convergence rate, the proposed AWGD estimator has superior statistical efficiency compared with existing competitors. More importantly, our numerical experiments show that, even for unbalanced data, the proposed AWGD estimators are statistically as efficient as the global ones if the learning rate is sufficiently small.
Hypothesis Testing in High Dimensional Linear
Regression via Wild Bootstrapping
Wenjuan Hu
Capital University of Economics and Business
Abstract: In recent years, U-statistic type of tests
have been proposed for testing linear hypotheses on
regression parameters in high dimensional linear
models. We investigate the distributional property of
the test statistic under a more general setting, under
both the null and a local alternative hypothesis. Different from
previous studies, we found that the test statistic's asymptotic distribution is given by the sum of a normal
random variable and a mixed chi-square random variable. Previous test theories based on asymptotic normality can be viewed as a special case of our more
general theory. We further propose using the wild bootstrap with U-centering for practical implementation of
the new test theory. Our new test is shown to more
accurately control type-I error rates under more general settings. Simulation and real data examples further demonstrate the merit of our tests.
Joint work with Nan Lin, Baoxue Zhang.
Sequential Quantile Regression for Stream Data by
Least Squares
Ye Fan
Capital University of Economics and Business
Abstract: Massive stream data are common in modern economics applications, such as e-commerce and
finance. They cannot be permanently stored due to
storage limitation, and real-time analysis needs to be
updated frequently as new data become available. In
this work, we develop a sequential algorithm, SQR, to
support efficient quantile regression (QR) analysis for
stream data. Due to the non-smoothness of the check
loss, popular gradient-based methods do not directly
apply. Our proposed algorithm, partly motivated by
the Bayesian QR, converts the non-smooth optimization into a least squares problem and is hence significantly faster than existing algorithms that all require
solving a linear programming problem in local processing. We further extend the SQR algorithm to
composite quantile regression (CQR), and prove that
the SQR estimator is unbiased, asymptotically normal
and enjoys a linear convergence rate under mild conditions. We also demonstrate the estimation and inferential performance of SQR through simulation
experiments and a real data example on a US used car
price data set.
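The abstract's key idea, replacing the non-smooth check loss by least-squares subproblems, can be illustrated (in a non-sequential, batch form that is not the SQR algorithm itself) by an iteratively reweighted least-squares surrogate: since ρ_τ(r) = |τ − 1{r<0}|·|r|, the check loss equals a weighted sum of squares with weights |τ − 1{r<0}|/|r|.
```python
import numpy as np

def qr_irls(X, y, tau=0.5, iters=50, delta=1e-6):
    """Quantile regression via an iteratively reweighted least-squares
    surrogate of the check loss (a batch illustration, not SQR itself)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        r = y - X @ beta
        w = np.abs(tau - (r < 0)) / np.maximum(np.abs(r), delta)
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta
```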
Summary Statistics-Based Association Test for
Identifying the Pleiotropic Effects with Set of Genetic Variants
Deliang Bu
Capital University of Economics and Business
Abstract: Traditional genome-wide association studies
focus on testing one-to-one relationships between
genetic variants and complex human diseases or traits.
Despite its success in the past decade, this one-to-one
paradigm lacks efficiency because it does not utilize
the information of intrinsic genetic structure and pleiotropic effects. Due to privacy reasons, only summary
statistics of current genome-wide association study
data are publicly available. Existing summary statistics-based association tests do not consider covariates
for regression model, while adjusting for covariates
including population stratification factors is a routine
issue.
In this work, we first derive the correlation coefficients between summary Wald statistics obtained from
linear regression model with covariates. Then, a new
test is proposed by integrating three-level information
including the intrinsic genetic structure, pleiotropy,
and the potential information combinations. Extensive
simulations demonstrate that the proposed test outperforms three other existing methods under most of
the considered scenarios. Real data analysis of polyunsaturated fatty acids further shows that the proposed
test can identify more genes than the compared existing methods.
Invited Session IS031: Machine Learning Methods:
Theory and Applications
Towards Non-Asymptotic Convergence for Diffusion-Based Generative Models
Gen Li
The Chinese University of Hong Kong
Abstract: Diffusion models, which convert noise into
new data instances by learning to reverse a Markov
diffusion process, have become a cornerstone in contemporary generative modeling. While their practical
power has now been widely recognized, the theoretical underpinnings remain far from mature. In this
work, we develop a suite of non-asymptotic theory
towards understanding the data generation process of
diffusion models in discrete time, assuming access to
ℓ2-accurate estimates of the (Stein) score functions.
For a popular deterministic sampler (based on the
probability flow ODE), we establish a convergence
rate proportional to 1/T (with T the total number of
steps), improving upon past results; for another mainstream stochastic sampler (i.e., a type of the denoising
diffusion probabilistic model), we derive a convergence rate proportional to 1/√T, matching the
state-of-the-art theory. Imposing only minimal assumptions on the target data distribution (e.g., no
smoothness assumption is imposed), our results characterize how ℓ2 score estimation errors affect the
quality of the data generation process.
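To make the probability flow ODE concrete, here is a toy sketch with a variance-preserving schedule in which the target is N(μ, I), so the true score is available in closed form (sidestepping score estimation entirely); all schedule constants are illustrative.
```python
import numpy as np

mu = np.array([2.0, -1.0])                  # toy target: N(mu, I), so the score is known
beta = lambda t: 0.1 + 19.9 * t             # VP noise schedule beta(t)
B = lambda t: 0.1 * t + 19.9 * t**2 / 2     # integral of beta from 0 to t
alpha = lambda t: np.exp(-0.5 * B(t))       # x_t = alpha_t x_0 + noise

def score(x, t):
    # marginal at time t is N(alpha_t mu, I) because alpha_t^2 + (1 - alpha_t^2) = 1
    return -(x - alpha(t) * mu)

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 2))              # start from the prior N(0, I) at t = 1
ts = np.linspace(1.0, 1e-3, 500)
for t0, t1 in zip(ts[:-1], ts[1:]):
    dt = t1 - t0                            # negative: integrating backwards in time
    drift = -0.5 * beta(t0) * x - 0.5 * beta(t0) * score(x, t0)
    x = x + drift * dt                      # deterministic Euler step of the PF ODE
# x is now approximately distributed as N(mu, I)
```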
Dual-Directed Algorithm Design for Efficient Pure
Exploration
Wei You
The Hong Kong University of Science and Technology
Abstract: We consider pure-exploration problems in
the context of stochastic sequential adaptive experiments with a finite set of alternative options. The goal
of the decision-maker is to accurately answer a query
question regarding the alternatives with high confidence with minimal measurement efforts. A typical
query question is to identify the alternative with the
best performance, leading to best-arm identification in
the machine learning literature. We focus on the
fixed-confidence setting and by incorporating the dual
variables directly, we characterize the necessary and
sufficient conditions for an allocation to be optimal.
The use of dual variables allows us to bypass the combinatorial structure of the optimality conditions that
relies solely on primal variables. Remarkably, these
optimality conditions enable an extension of the top-two
algorithm design principle (Russo, 2020), initially
proposed for best-arm identification. Furthermore, our
optimality conditions give rise to a straightforward yet
efficient selection rule, termed information-directed
selection, which adaptively picks from a candidate set
based on information gain of the candidates. We establish that, paired with information-directed selection,
top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a
glaring open problem in the pure exploration literature.
Moreover, our analysis also leads to a general principle to guide adaptations of Thompson sampling for
pure-exploration problems. Numerical experiments
highlight the exceptional efficiency of our proposed
algorithms relative to existing ones.
Joint work with Chao Qin.
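A minimal sketch of top-two Thompson sampling for Gaussian best-arm identification follows, using a fixed tuning parameter beta for the leader/challenger choice; the talk's contribution replaces this fixed beta with information-directed selection, which is not reproduced here.
```python
import numpy as np

def ttts_gaussian(means, sigma=1.0, budget=2000, beta=0.5, seed=0):
    """Top-two Thompson sampling for Gaussian best-arm identification
    with a flat prior, so the posterior of arm k is N(mean_k, sigma^2/n_k)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    n, s = np.zeros(K), np.zeros(K)          # pull counts and reward sums
    for k in range(K):                       # initialize each arm once
        n[k] += 1; s[k] += rng.normal(means[k], sigma)
    for _ in range(budget - K):
        post = rng.normal(s / n, sigma / np.sqrt(n))
        leader = int(np.argmax(post))
        challenger = leader
        while challenger == leader:          # resample until a different arm tops
            post = rng.normal(s / n, sigma / np.sqrt(n))
            challenger = int(np.argmax(post))
        arm = leader if rng.random() < beta else challenger
        n[arm] += 1; s[arm] += rng.normal(means[arm], sigma)
    return int(np.argmax(s / n))             # recommended best arm
```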
Inference for High Dimensional Proportional Hazards Model with Streaming Survival Data
Haijin He
Shenzhen University
Abstract: We propose an online inference procedure
for high dimensional streaming survival data based on
the proportional hazards model. We offer an online
Lasso method for regression parameter estimation and
establish the non-asymptotic error bounds of the corresponding Lasso estimators for the regression parameter vector. In addition, we study the pointwise and
group inference for the regression parameters by utilizing a debiased Lasso method. Extensive simulations
are conducted to evaluate the finite sample performance of the proposed method. The results show good
performance of the proposed method. An application
to a colon cancer dataset is provided to demonstrate
the practical utility of the proposed methodology.
Screen Then Select: A Strategy for Correlated Predictors in High-Dimensional Quantile Regression
Xuejun Jiang
Southern University of Science and Technology
Abstract: Strong correlation among predictors and
heavy-tailed noises pose a great challenge in the analysis of ultra-high dimensional data. Such challenge
leads to an increase in the computation time for discovering active variables and a decrease in selection
accuracy. To address this issue, we propose an innovative two-stage screen-then-select approach and its
derivative procedure based on a robust quantile regression with sparsity assumption. This approach
initially screens important features by ranking quantile
ridge estimation and subsequently employs a likelihood-based post-screening selection strategy to refine
variable selection. Additionally, we introduce an internal competition mechanism along the greedy search
path to enhance the robustness of the algorithm against
the design dependence. Our methods are simple to
implement and possess numerous desirable properties
from theoretical and computational standpoints. Theoretically, we establish the strong consistency of feature selection for the proposed methods under some
regularity conditions. In empirical studies, we assess
the finite sample performance of our methods by
comparing them with utility screening approaches and
existing penalized quantile regression methods. Furthermore, we apply our methods to identify genes
associated with anticancer drug sensitivities for practical guidance.
Joint work with Yakun Liang, Haofeng Wang.
Invited Session IS097: Statistical Learning for
Threshold Models and Applications
Lasso and Post-Lasso Inference for Multiple
Threshold Regressions with an Application to Return Predictability
Chenchen Ma
Peking University
Abstract: This paper considers a multiple threshold
regression model, where the coefficient parameters
can switch between regimes according to the value of
a threshold variable, and establishes the valid inference of a Lasso-type shrinkage estimation procedure
that consistently estimates the multiple thresholds.
The procedure is robust to both diverging number of
thresholds and shrinking threshold effects. Asymptotic
properties, including the consistency of the group
Lasso estimators and threshold number estimator, and
limiting distribution of the threshold estimators and
the likelihood ratio statistic, are established under a
set of regularity conditions. The focus is further
placed on the new development of the post-Lasso
inferential theory, which accounts for the randomness
of threshold selection and is achieved by characterizing the distribution of the coefficient estimators conditional on the selected model. Monte Carlo simulations demonstrate that the estimators are well-behaved
in finite samples. An empirical application to return
prediction further illustrates the practical merits of our
methodology.
Joint work with Yundong Tu.
Multi-Threshold Regression with Endogeneity
Chuang Wan
Nankai University
Abstract: This article develops a comprehensive estimation and inference framework for multi-threshold
regression, employing instrumental variables. One
major challenge is determining the number of threshold points. We first propose a modified information
criterion and show its consistency under mild conditions. However, its practical utility is sometimes compromised due to its sensitivity to the choice of penalization magnitude, creating a gap between theory and
practice. To bridge this gap, we exploit a
cross-validation criterion alongside an order-preserved
sample-splitting strategy tailored specifically for
threshold regression. The new criterion is completely
data-driven and therefore more convenient for practical use. We then formulate hypotheses serving distinct
purposes: the presence of threshold effects and the
existence of endogeneity. In cases where regressors
and threshold variable are both endogenous, the proposed approaches remain applicable with slight adjustments using the control function framework. Extensive simulation experiments validate the reliable
performance of our methodologies in finite sample
cases. We finally conduct an empirical application to
explore the 401(k) retirement plans dataset for which
some new findings are discovered.
Robust Estimation of Structural Instability in the
Large-Dimensional Factor Model
Wei Wang
Shandong University of Finance and Economics
Abstract: Numerous empirical studies in economics and finance have verified that the distributions of
many economic variables exhibit heavy-tailedness
and structural instability. In this paper, we consider
the estimation of structural instability in a
large-dimensional factor model with heavy-tailed
distributions, where an unknown structural
break exists in the factor loadings. We estimate the structural
break by minimizing a piecewise Huber principal
component analysis (HPCA) criterion and, under relaxed
conditions such as weaker higher-order moment requirements,
show that the estimator of the break date is consistent.
Monte Carlo simulations are designed to
compare the finite-sample performance with that of classical
estimators based on principal component analysis. Finally,
we also estimate the structural break in U.S.
stock market and U.S. macroeconomic data, respectively.
Common Threshold Estimation in Large Heterogeneous Panels with a Multifactor Error Structure
Yimeng Xie
Xiamen University
Abstract: This paper studies large panel heterogeneous models with common threshold effect and multifactor error structure, where the threshold effect is
allowed to influence both coefficients of observed
regressors and loadings of latent factors. To estimate
the coefficients and the threshold parameter, we consider auxiliary regressions where cross-sectional
averages of observed individual-specific covariates
are used to augment regressors, and propose a simple
concentrated least squares estimation procedure, in
which estimation of the factor number is not needed. It is
shown that the estimator of the threshold parameter is super-consistent
and that the convergence rate depends on the dimensions of both time periods (T) and cross-sectional
units (n), while the estimators of the coefficients are
√T-consistent and asymptotically normal. In addition, we propose a test of linearity to examine the
existence of the common threshold. Monte Carlo simulations are provided and show that our proposed
estimators and test have satisfactory finite-sample
performance. Finally, an empirical application to
asset pricing is presented to illustrate how to adjust
portfolio via our proposed methods.
Joint work with Yanbo Liu, Rui Chen.
Invited Session IS075: Statistical Learning for
Large Foundation Models
Learning Prediction Function of Prior Measures
for Statistical Inverse Problems of Partial Differential Equations
Junxiong Jia
Xi'an Jiaotong University
Abstract: In this study, we formulate statistical inverse problems involving partial differential equations
(PDEs) as PDE-constrained regression problems, with
a focus on learning the predictive functions of prior
probability measures. Adopting this viewpoint, we
introduce general generalization bounds for infinite-dimensional prior measures within the framework
of probably approximately correct (PAC) learning theory. Our theoretical framework is meticulously constructed on infinite-dimensional separable Banach
spaces, closely linking it to conventional infinite-dimensional Bayesian inverse methods. Motivated by the notion of α-differential privacy, we advance
a more general condition that includes the standard
Gaussian measures prevalent in statistical inverse
problems. This condition permits the learned prior
measures to be contingent on the observed data.
Through a series of pivotal theoretical demonstrations,
we derive concrete generalization bounds suitable for
both linear and nonlinear inverse problems, in a form
that can integrate typical PDEs. Utilizing these derived bounds, we construct well-defined practical
algorithms in infinite-dimensional spaces. To illustrate
the potential applications of our proposed methodology, we present numerical examples that showcase its
effectiveness in learning the predictive functions of
prior probability measures.
Advancements in Understanding
Over-Parameterized Deep Equilibrium Models:
Bridging Theory and Practice
Zenan Ling
Huazhong University of Science and Technology
Abstract: A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection.
Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes
gradients with implicit differentiation. As a typical
implicit neural network (NN), DEQ has recently
emerged as a new neural network design paradigm,
demonstrating remarkable success on various tasks.
Nevertheless, the theoretical understanding of DEQs
is still limited. In this talk, we will introduce several
recent advancements in the theoretical comprehension
of over-parameterized DEQs: (1) a novel
non-asymptotic framework to establish the global
convergence of the gradient descent (GD) associated
with an over-parameterized DEQ; (2) a novel asymptotic framework for establishing the equivalence between implicit DEQs and explicit NNs in high dimensions. These findings leverage recent advances in
high-dimensional analysis and random matrix theory.
Simulation Learning Methodology: Theory, Algorithms, and Applications
Jun Shu
Xi'an Jiaotong University
Abstract: In recent years, one of the breakthroughs in artificial intelligence research has been the remarkable development of large models, exemplified by ChatGPT. Compared with traditional deep learning models built to solve specific tasks, large models exhibit striking capabilities (so-called emergent abilities) on complex problems such as cross-task generalization. However, there is a gap between the scale-driven mode in which large models achieve their success and the resource-sparse mode of academic research. Starting from the problem of reducing large models, this talk introduces a meta-learning framework based on the Simulation Learning Methodology (SLeM), presents the underlying statistical learning theory for task-transfer generalization, and, taking automated machine learning as a typical application scenario, describes a family of fundamental algorithms for machine learning automation, revealing the potential applicability of the SLeM learning paradigm to real-world settings.
Understanding and Improving LLM Training:
Insights into Adam and Advent of Adam-Mini
Ruoyu Sun
The Chinese University of Hong Kong (Shenzhen)
Abstract: Adam is the default algorithm for training
large foundation models. In this talk, we aim to understand why Adam is better than SGD on training
large foundation models, and propose a
memory-efficient alternative called Adam-mini. First,
we provide an explanation of the failure of SGD on
transformers: (i) Transformers are "heterogeneous": the
Hessian spectrum across parameter blocks varies dramatically; (ii) Heterogeneity hampers SGD: SGD
performs badly on problems with block heterogeneity.
Second, motivated by this finding, we introduce Adam-mini, which partitions the parameters according to
the Hessian structure and assigns a single second
momentum term to all weights in a block. We empirically show that Adam-mini saves 45-50% memory
over Adam without compromising performance, on
various models including 7B-size language models
and ViT.
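A toy NumPy sketch of the core Adam-mini update follows: first moments remain per-element, but each parameter block shares a single scalar second moment. The real Adam-mini partitions parameters by Hessian block structure; the flat per-block partition here is a simplification.
```python
import numpy as np

def adam_mini_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-mini-style step over a dict of parameter blocks.
    m[name] is a per-element array; v[name] is a single float per block
    (caller initializes m to zero arrays and v to 0.0 per block)."""
    for name in params:
        g = grads[name]
        m[name] = b1 * m[name] + (1 - b1) * g
        v[name] = b2 * v[name] + (1 - b2) * float(np.mean(g * g))  # scalar per block
        m_hat = m[name] / (1 - b1**t)
        v_hat = v[name] / (1 - b2**t)
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)
```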
Invited Session IS080: Theoretical Foundations for
Machine Learning
Enhanced Topic Modeling using Entry-Wise Eigenvector Analysis
Tracy Ke
Harvard University
Abstract: Topic modeling is a widely used tool in text
analysis, aiming to extract meaningful 'topics' from a
collection of documents. This paper investigates the
optimal statistical guarantees for estimating a topic
model. The authors introduce a new normalized word
count matrix and provide a sharp entry-wise eigenvector analysis for this matrix. These results are used
to enhance an existing spectral algorithm, Topic-SCORE, for topic modeling. The authors demonstrate that the error rate of the enhanced algorithm is
minimax optimal across the entire parameter regime
of interest. Compared to existing results, the improvement is particularly significant in the challenging regime where all documents are short. Joint work
with Jingming Wang.
Network Tight Community Detection
Huimin Cheng
Boston University
Abstract: Conventional community detection methods often categorize all nodes into clusters. However,
the presumed community structure of interest may
only be valid for a subset of nodes (named "tight
nodes"), while the rest of the network may consist
of noninformative "scattered nodes". For example, a
protein-protein network often contains proteins that do
not belong to specific biological functional modules
but are involved in more general processes, or act as
bridges between different functional modules. Forcing
each of these proteins into a single cluster introduces
unwanted biases and obscures the underlying biological implication. To address this issue, we propose a
tight community detection (TCD) method to identify
tight communities excluding scattered nodes. The
algorithm enjoys a strong theoretical guarantee of
tight node identification accuracy and is scalable for
large networks. The superiority of the proposed
method is demonstrated by various synthetic and real
experiments.
Joint work with Jiayi Deng, Xiaodong Yang, Jun Yu,
Jun Liu, Zhaiming Shen, Danyang Huang.
Is Your Data Alignable? A Geometric Approach to
Single-Cell Data Integration
Rong Ma
Harvard University
Abstract: Single-cell data integration can provide a
comprehensive molecular view of cells, and many
algorithms have been developed to remove unwanted
technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental
limitations. In particular, we lack a rigorous statistical
test for whether two high-dimensional single-cell
datasets are alignable (and therefore should even be
aligned). Moreover, popular methods can substantially
distort the data during alignment, making the aligned
data and downstream analysis difficult to interpret. To
overcome these limitations, we present a spectral
manifold alignment and inference (SMAI) framework,
which enables principled and interpretable alignability
testing and structure-preserving integration of single-cell data with the same type of features. SMAI
provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference
and is justified by high-dimensional statistical theory.
On a diverse range of real and simulated benchmark
datasets, it outperforms commonly used alignment
methods. Moreover, we show that SMAI improves
various downstream analyses such as identification of
differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI’s interpretability also enables
quantification and a deeper understanding of the
sources of technical confounders in single-cell data.
Joint work with Eric Sun, David Donoho, James
Zou.
Invited Session IS015: Experimental Design and
Big Data Subsampling
Optimal Designs for Order-of-Addition Two-Level
Factorial Experiments
Fasheng Sun
Northeast Normal University
Abstract: A new type of experiment, called the order-of-addition factorial experiment, has recently
received considerable attention in medical science
and bioengineering. These experiments aim to simultaneously optimize the order of addition and dose
levels of drug components. In the experimental design
literature, the idea of dual-orthogonal arrays (DOAs)
was recently introduced for such experiments. However, constructing flexible DOAs is a challenging task.
In this paper, we propose a novel theory-guided search
method that efficiently identifies DOAs of any size (if
present). We also provide an algebraic construction
that instantly leads to certain DOAs. Moreover, to
address the potential issue that DOA ignores interaction effects, we propose to construct a new type of
optimal designs under the expanded compound model,
named the strong DOA (SDOA). We provide two
algebraic constructions of the SDOA. We establish
theoretical results on the optimality of both DOAs and
SDOAs. Simulation studies are performed to demonstrate the superiority of our proposed designs.
Joint work with Qiang Zhao, Qian Xiao, Abhyuday
Mandal.
An Improved K-farthest Neighbor Detection
Methodology for Covariate Shift
Yu Tang
Soochow University
Abstract: In supervised learning, there is often a discrepancy between the data distribution during training
(source distribution) and the data distribution when
the model is used for testing (target distribution). We
aim to achieve a sensitive response from the software
system with minimal testing samples, even in the
presence of subtle covariate shift. To address covariate
shift in multivariate two-sample testing, this paper
proposes a novel Half K-Farthest Neighbor
(Half-KFN) test that demonstrates superior sensitivity
compared to the traditional K-Nearest Neighbor (KNN)
test when detecting data distributions with small sample sizes and small-magnitude shifts. The underlying
idea comes from the fact that when only a small
proportion of samples shifts to another distribution,
the farthest neighbors can better describe this shift. Furthermore, it
effectively controls the Type I error rate under the null hypothesis. To evaluate our proposed methodology, numerical experiments are conducted on two open data
sets, namely MNIST and CIFAR-10. Various shift
forms are preset in the test data, with different sample
sizes and shift ratios. A comparative analysis of different multivariate two-sample testing methods is
performed. The results demonstrate that our proposed
Half-KFN algorithm consistently exhibits superior
sensitivity across various scenarios.
Joint work with Bingbing Wang.
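The exact Half-KFN statistic is not reproduced here; in its spirit, the following sketch builds a two-sample permutation test from farthest-neighbor label agreement, which is valid by exchangeability under the null. The statistic and direction of the real test may differ, so a two-sided permutation p-value is used.
```python
import numpy as np

def farthest_neighbor_stat(D, labels, k=5):
    """Mean fraction of same-label points among each point's k farthest neighbors."""
    far = np.argsort(D, axis=1)[:, -k:]      # indices of the k farthest points
    return (labels[far] == labels[:, None]).mean()

def fn_permutation_test(X, Y, k=5, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X), int), np.ones(len(Y), int)]
    D = np.linalg.norm(Z[:, None] - Z[None], axis=-1)
    obs = farthest_neighbor_stat(D, labels, k)
    perms = np.array([farthest_neighbor_stat(D, rng.permutation(labels), k)
                      for _ in range(n_perm)])
    center = perms.mean()                    # two-sided permutation p-value
    return (1 + np.sum(np.abs(perms - center) >= np.abs(obs - center))) / (1 + n_perm)
```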
Model-Free Subsampling Method for Massive Data
Based on Uniform Designs
Yongdao Zhou
Nankai University
Abstract: Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most
existing studies focus on model-based subsampling
methods which significantly depend on the model
assumption. In this paper, we consider the model-free
subsampling strategy for generating subdata from the
original full data. In order to measure the goodness of
representation of a subdata with respect to the original
data, we propose a criterion, generalized empirical
F-discrepancy (GEFD), and study its theoretical
properties in connection with the classical generalized
ℓ2-discrepancy in the theory of uniform designs.
These properties allow us to develop a kind of
low-GEFD data-driven subsampling method based on
the existing uniform designs. By simulation examples
and a real case study, we show that the proposed subsampling method is superior to the random sampling
method. Moreover, our method remains robust under
diverse model specifications while other popular
model-based subsampling methods underperform. In practice, such a model-free property is more appealing than model-based subsampling, which may perform poorly when the model is misspecified, as demonstrated in our simulation studies. In addition, our
method is orders of magnitude faster than other model-free subsampling methods, which makes it more
applicable for subsampling of big data.
Joint work with Mei Zhang, Zheng Zhou, Aijun
Zhang.
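The low-GEFD method itself minimizes the proposed discrepancy criterion; as a simplified stand-in for the general idea of design-based, model-free subsampling, the sketch below picks the data point nearest to each point of a space-filling (Sobol) design on the rescaled data domain. SciPy's qmc and cKDTree utilities are used; all names are illustrative.
```python
import numpy as np
from scipy.stats import qmc
from scipy.spatial import cKDTree

def design_based_subsample(X, m, seed=0):
    """Model-free subsampling sketch: map data to the unit cube and keep
    the data point nearest to each point of a space-filling design."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = (X - lo) / (hi - lo + 1e-12)                 # rescale data to [0, 1]^d
    design = qmc.Sobol(d=X.shape[1], seed=seed).random(m)  # may warn if m is not 2^k
    _, idx = cKDTree(U).query(design)                # nearest data point per design point
    return np.unique(idx)                            # duplicates removed, so size <= m
```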
Focus Subsampling: A More Efficient Subsampling
Method for Large-scale Linear Classification
Jun Yu
Beijing Institute of Technology
Abstract: Subsampling is one of the popular methods
to balance statistical efficiency and computational
efficiency in the big data era. Most of these aim at
selecting informative or representative sample points
to achieve good overall information of the full data.
Examples include OSMAC, IBOSS, OSS, leverage-score subsampling, and low-GEFD subsampling,
along with many suitable variations. The present talk
takes the view that sampling techniques are recommended for the region we focus on and summary
measures are enough to collect the information for the
rest according to a well-designed data partitioning. We
propose a focus subsampling strategy that combines
the summary measures and selected subdata points.
We will show that the proposed method will lead to a
more efficient estimation for general large-scale linear
classification problems. We investigate and discuss
some properties of the method, establish some connections to the OSMAC subsampling method, and
illustrate its use via a real-world example.
Joint work with Haolin Chen.
Special Session SS2: Bernoulli Session on Stochastic Methods for Data Science
Multilevel Particle Filters for Partially Observed
McKean-Vlasov Stochastic Differential Equations
Ajay Jasra
The Chinese University of Hong Kong (Shenzhen)
Abstract: In this talk we consider the filtering problem associated with partially observed McKean-Vlasov
stochastic differential equations (SDEs). The model
consists of data that are observed at regular and discrete times and the objective is to compute the conditional expectation of (functionals) of the solutions of
the SDE at the current time. This problem, even in the
ordinary SDE case, is challenging and requires numerical approximations. We develop a new particle filter
(PF) and multilevel particle filter (MLPF) to approximate the aforementioned expectations. We prove
under assumptions that, for ε > 0, to obtain a mean
square error of O(ε²) the PF has a cost per-observation
time of O(ε⁻⁵) and the MLPF costs O(ε⁻⁴) (best case)
or O(ε⁻⁵ log(ε)²) (worst case).
Our theoretical results are supported by numerical
experiments.
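For concreteness, a minimal interacting-particle Euler scheme for a toy McKean-Vlasov SDE, dX_t = (E[X_t] − X_t)dt + σ dW_t, is sketched below; the empirical mean of the particle system stands in for the law dependence. This is only the forward-simulation ingredient, not the particle filter itself.
```python
import numpy as np

def mv_euler(N=1000, T=1.0, steps=200, sigma=0.5, seed=0):
    """Interacting-particle Euler scheme for a toy McKean-Vlasov SDE:
    dX_t = (E[X_t] - X_t) dt + sigma dW_t, with E[X_t] replaced by the
    empirical mean of the N-particle system."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = rng.normal(size=N)
    for _ in range(steps):
        drift = x.mean() - x                  # law dependence via the empirical measure
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
    return x
```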
Bayesian Fixed-Domain Asymptotics for Covariance Parameters in Spatial Gaussian Process Models
Cheng Li
National University of Singapore
Abstract: Gaussian process models typically contain
finite dimensional parameters in the covariance function that need to be estimated from the data. We study
the Bayesian fixed-domain asymptotics for the covariance parameters in spatial Gaussian process regression models with an isotropic Matérn covariance
function, which has many applications in spatial statistics. For the model without nugget, we show that
when the dimension of the domain is less than or
equal to three, the microergodic parameter and the
range parameter are asymptotically independent in the
posterior. While the posterior of the microergodic
parameter is asymptotically close in total variation
distance to a normal distribution with shrinking variance, the posterior distribution of the range parameter
does not converge to any point mass distribution in
general. For the model with nugget, we derive a new
evidence lower bound and consistent higher-order
quadratic variation estimators, which lead to explicit
posterior contraction rates for both the microergodic
parameter and the nugget parameter. We further study
the asymptotic efficiency and convergence rates of
Bayesian kriging prediction. All the new theoretical
results are verified in numerical experiments and real
data analysis.
Bootstrap-Assisted Inference for Weakly Stationary Time Series
Yunyi Zhang
The Chinese University of Hong Kong
Abstract: The literature often adopts two types of
stationarity assumptions in the analysis of time series,
i.e., the weak stationarity, suggesting that the mean
and the autocovariance function of a time series are
time invariant; and strict stationarity, indicating that
the marginal distributions of the time series are time
invariant. While the strict stationarity assumption is
vital from theoretical aspect, it is hard to verify in
practice. On the other hand, the weak stationarity is
relatively feasible to ensure and verify, as it only relies
on the second-order structures of the time series.
Concerning this, while various weak stationarity assumptions are typically adopted in time series modeling, statisticians may want to avoid relying on strict
stationarity assumptions during statistical inference.
This presentation focuses on the analysis of
quadratic forms within a weakly, but not necessarily
strictly stationary (vector) time series. In the context
of scalar time series, it establishes the Gaussian approximation for quadratic forms of a short-range dependent weakly stationary scalar time series. Building
upon this result, it derives the asymptotic distributions
of the sample autocovariances, the sample autocorrelations, and the sample autoregressive coefficients.
Transitioning to vector time series, this presentation
tackles statistical inference within high-dimensional
vector autoregressive models with white noise innovations. Given the complicated covariance structures
inherent in non-stationary time series, this presentation adopts the dependent wild bootstrap method to
facilitate statistical inference. Numerical results verify the consistency of the proposed theories and
methods.
Strict stationarity is hard to ensure and verify for
a real-life dataset. Therefore, our work should be able
to assist statisticians in capturing the inherent
non-stationarity of real-life time series.
Joint work with Efstathios Paparoditis and Dimitris
N. Politis.
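A minimal sketch of the dependent wild bootstrap for the sample mean of a weakly stationary series follows: the multipliers are Gaussian with a Bartlett-kernel covariance of bandwidth l, so short-range dependence is preserved. The bandwidth and kernel are illustrative choices, not the presentation's.
```python
import numpy as np

def dependent_wild_bootstrap_mean(x, l=10, n_boot=1000, seed=0):
    """Dependent wild bootstrap for the sample mean of a weakly
    stationary series: centered observations are multiplied by Gaussian
    multipliers with Bartlett-kernel covariance of bandwidth l."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = np.maximum(0.0, 1.0 - lags / l)     # Bartlett kernel: a valid covariance
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))
    xc = x - x.mean()
    boots = np.empty(n_boot)
    for b in range(n_boot):
        w = L @ rng.normal(size=n)            # locally dependent multipliers
        boots[b] = x.mean() + (xc * w).mean() # bootstrap replicate of the mean
    return boots
```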
Invited Session IS090: Intersection Research of
Statistics and Computer Science (统计学与计算机
科学的交叉研究)
Optimal One-Pass Nonparametric Estimation under Memory Constraint
Zhenhua Lin
National University of Singapore
Abstract: For nonparametric regression in the
streaming setting, where data constantly flow in and
require real-time analysis, a main challenge is that
data are cleared from the computer system once processed due to limited computer memory and storage.
We tackle the challenge by proposing a novel
one-pass estimator based on penalized orthogonal
basis expansions and developing a general framework
to study the interplay between statistical efficiency
and memory consumption of estimators. We show that
the proposed estimator is statistically optimal under
memory constraint, and has asymptotically minimal
memory footprints among all one-pass estimators of
the same estimation quality. Numerical studies
demonstrate that the proposed one-pass estimator is
nearly as efficient as its nonstreaming counterpart that
has access to all historical data.
Joint work with Mingxue Quan.
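The talk's estimator uses penalized orthogonal basis expansions; the sketch below shows the one-pass mechanics with a cosine basis and a plain ridge penalty (a simplification): only a J x J Gram matrix and a J-vector are retained, so memory never grows with the sample size.
```python
import numpy as np

def one_pass_series_estimator(stream, J=20, lam=1e-3):
    """One-pass nonparametric regression sketch: keep only the J x J Gram
    matrix and a J-vector, so memory is O(J^2) regardless of sample size."""
    G, b = np.zeros((J, J)), np.zeros(J)
    basis = lambda x: np.cos(np.arange(J) * np.pi * x)   # cosine basis on [0, 1]
    for x, y in stream:                  # each pair is processed once, then dropped
        p = basis(x)
        G += np.outer(p, p)
        b += y * p
    coef = np.linalg.solve(G + lam * np.eye(J), b)       # ridge-penalized solve
    return lambda x: basis(x) @ coef
```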
Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing
Ting Li
Shanghai University of Finance and Economics
Abstract: Many modern tech companies, such as
Google, Uber, and Didi, utilize online experiments
(also known as A/B testing) to evaluate new policies
against existing ones. While most studies concentrate
on average treatment effects, situations with skewed
and heavy-tailed outcome distributions may benefit
from alternative criteria, such as quantiles. However,
assessing dynamic quantile treatment effects (QTE)
remains a challenge, particularly when dealing with
data from ride-sourcing platforms that involve sequential decision-making across time and space. In
this paper, we establish a formal framework to calculate QTE conditional on characteristics independent of
the treatment. Under specific model assumptions, we
demonstrate that the dynamic conditional QTE
(CQTE) equals the sum of individual CQTEs across
time, even though the conditional quantile of cumulative rewards may not necessarily equate to the sum of
conditional quantiles of individual rewards. This crucial insight significantly streamlines the estimation
and inference processes for our target causal estimand.
We then introduce two varying coefficient decision
process (VCDP) models and devise an innovative
method to test the dynamic CQTE. Moreover, we
expand our approach to accommodate data from spatiotemporal dependent experiments and examine both
conditional quantile direct and indirect effects. To
showcase the practical utility of our method, we apply
it to three real-world datasets from a ride-sourcing
platform. Theoretical findings and comprehensive
simulation studies further substantiate our proposal.
Joint work with Chengchun Shi, Zhaohua Lu, Yi Li,
Hongtu Zhu.
Inverse Constrained Reinforcement Learning:
From Theory to Practice
Guiliang Liu
The Chinese University of Hong Kong (Shenzhen)
Abstract: In recent years, Reinforcement Learning
(RL) has achieved remarkable performance in some
tasks, receiving widespread attention from academia
and industry. However, successful applications of RL
in tasks that can bring significant value to society
(such as autonomous driving, medical diagnosis, and
robot control) are still relatively limited. The main
reason for this is the difficulty in ensuring the safety
of the control policies. To ensure the reliability of RL
algorithms in critical applications, agents must understand their constraints. However, in many real-world
tasks, considering that constraints can change over
time and scenarios, and are highly related to the inherent experience of human experts, the optimal constraints are often difficult to specify accurately with prior
knowledge and formulas. To address these
challenges, we propose Inverse Constrained Reinforcement Learning (ICRL), which aims to learn the
constraints followed by experts from their demonstration data. This helps determine the constraints in different scenarios, allowing the imitating agents to
achieve performance like human experts. Compared
to static, artificially designed constraints, constraints
learned through data-driven methods can generalize
more effectively across multiple environments, provide a more comprehensive explanation for expert
behavior, and promote the safety of downstream applications. In this report, I will discuss the latest progress in inverse constrained reinforcement learning
research, delve into theoretical algorithmic achievements and applications, and introduce how to combine
inverse constrained reinforcement learning with large
decision-making models and human feedback to obtain policies with better generalization under evolving
constraints.
Reinforcement Learning for Precision Medicine in
HIV
Yanxun Xu
Johns Hopkins University
Abstract: The use of antiretroviral therapy (ART) has
significantly reduced HIV-related mortality and morbidity, transforming HIV infection to a chronic disease
with the care now focusing on treatment adherence,
comorbidities including mental health, and other
long-term outcomes. Since combination ART with
three or more drugs of different mechanisms or
against different targets is recommended for all people
living with HIV (PWH) and they must continue on it
indefinitely once started, understanding the long-term
ART effects on health outcomes and personalizing
ART treatment based on individuals’ characteristics is
crucial for optimizing PWH’s health outcomes and
facilitating precision medicine in HIV. In this talk, I
will present reinforcement learning (RL) methods
designed to learn and understand the impact of ART
on the health outcomes of PWH, and explore the future of HIV care through innovative and individualized approaches.
Invited Session IS028: Interface Between Statistics
and Neuro and Cognitive Science
Impact of Zealots on Cooperation: A Study Based
on the Behavioral Experiments of One-Shot Prisoner’s Dilemma Games
Lei Shi
Yunnan University of Finance and Economics
Abstract: The emergence, maintenance, and evolution of cooperative behavior in selfish groups have long
been an important research question in the fields of
natural and social sciences. To investigate whether
zealots can enhance and stabilize cooperation levels in
social dilemma experiments, we utilize a mixed design in which humans interact with machines (zealots)
to study the conditions that promote cooperation and
the associated evolutionary mechanism. The between-subjects variable is the strength of the dilemma
in the prisoner's dilemma game, and the within-subjects variable is whether zealots are incorporated or
not. Participants were randomly assigned to the
one-shot and anonymous prisoner's dilemma game
with either the high dilemma strength or low dilemma
strength. For each game, participants will experience
three conditions: a control condition where they are
paired with other participants, a treatment condition
where they are paired with zealots but are unaware of
this information, and a treatment condition where they
are paired with zealots and are aware of this information. In order to counterbalance the order effects,
we employ a Latin Square method for our within-subjects experiments. We found that zealots are
indeed able to increase cooperation among players; we
also found evidence that a minimum number of
zealots is needed to promote cooperation; finally, we
found a bottleneck effect in enhancing cooperation.
Lifespan Connectome Growth Modeling
Yong He
Beijing Normal University
Abstract: The emergence, development, and aging of
the connectome architecture enable the dynamic reorganization of network specialization and integration
throughout the lifespan, contributing to continuous
changes in human cognition and behavior. Understanding the spatiotemporal growth process of the
typical connectome is critical for elucidating network-level developmental principles in healthy individuals and for pinpointing periods of heightened
vulnerability or potential. In this talk, I will present
our recent work on lifespan normative growth modeling of the human connectome derived from multimodal MRI data from 33,250 individuals aged 32
postmenstrual weeks to 80 years from 132 global sites.
Furthermore, I will demonstrate how
these connectome-based normative models can be
employed to identify individual heterogeneities
in brain network phenotypes in patients with neurological or psychiatric disorders, including autism
spectrum disorder, major depressive disorder, and
Alzheimer's disease. It is anticipated that the connectome-based growth modeling will assist in elucidating the lifespan evolution of the brain networks and
serve as a normative reference for quantifying individual variation in development, aging, and neuropsychiatric disorders.
Learning Network-Structured Dependence from
Non-Stationary Multivariate Point Process Data
Chunming Zhang
University of Wisconsin-Madison
Abstract: Understanding sparse network dependencies among nodes from multivariate point process
data has broad applications in information transmission, social science, and computational neuroscience.
This paper introduces new continuous-time stochastic
models for conditional intensity processes, revealing
network structures within non-stationary multivariate
counting processes. Our model's stochastic mechanism is crucial for inferring graph parameters relevant
to structure recovery, distinct from commonly used
processes like the Poisson, Hawkes, queuing, and
piecewise deterministic Markov processes. This leads
to proposing a novel marked point process for intensity discontinuities. We derive concise representations
of their conditional distributions and demonstrate
cyclicity of the counting processes driven by recurrence time points. These theoretical properties enable
us to establish statistical consistency and convergence
properties for proposed penalized M-estimators in
graph parameters under mild regularity conditions.
Simulation evaluations showcase the method's computational simplicity and improved estimation accuracy compared to existing approaches. Real neuron
spike train recordings are analyzed to infer connectivity in neuronal networks.
Debiased Estimation and Inference for Spatial-Temporal EEG/MEG Source Imaging
Peifeng Tong
Peking University
Abstract: The development of accurate electroencephalography (EEG) and magnetoencephalography
(MEG) source imaging algorithm is of great importance for functional brain research and
non-invasive presurgical evaluation of epilepsy. In
practice, the challenge arises from the fact that the
number of measurement channels is far less than the
number of candidate source locations, rendering the
inverse problem ill-posed. A widely used approach is
to introduce a regularization term into the objective
function, which inevitably biases the estimated amplitudes towards zero, leading to an inaccurate estimate of the estimator's variance. This study proposes a
novel debiased EEG/MEG source imaging (DeESI)
algorithm for detecting sparse brain activities, which
corrects the estimation bias in signal amplitude, dipole
orientation and depth. The DeESI extends the idea of
group Lasso by incorporating both the matrix Frobenius norm and the L1-norm, which guarantees that the
estimators are sparse only over sources while maintaining smoothness in time and orientation. We also
derive the variance of the debiased estimators for standardization and hypothesis testing. A fast alternating
direction method of multipliers (ADMM) algorithm is
proposed for solving the matrix form optimization
problem directly without the need for vectorization.
The proposed algorithm is compared with nine existing ESI methods using simulations and an open source
EEG dataset whose stimulation locations are known
precisely. The DeESI exhibits the best performance in
peak localization and amplitude reconstruction.
Joint work with Haoran Yang, Xinru Ding, Yuchuan
Ding, Xiaokun Geng, Shan An, Guoxin Wang, Song
Xi Chen.
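One standard ingredient of such a group-sparse solver is the proximal operator of the row-wise Frobenius/group norm, sketched below; the full DeESI ADMM updates are not reproduced, and the function name is illustrative.
```python
import numpy as np

def prox_group_rows(B, lam):
    """Proximal operator of lam * sum_i ||B[i, :]||_2: row-wise group
    soft-thresholding, which zeroes entire sources while shrinking the
    time courses of surviving ones smoothly."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * B
```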
Contributed Session CS003: Recent Advances in
Mixture Model
A Gaussian Mixture Model for Multiple Instance
Learning with Partially Subsampled Instances
Baichen Yu
Peking University
Abstract: Multiple instance learning is a powerful
machine learning technique, which is found useful
when numerous instances can be naturally grouped
into different bags. Accordingly, a bag-level label can
be created for each bag according to whether the instances contained in the bag are all negative or not.
Thereafter, how to train a statistical model with
bag-level labels with/without partially labeled instances becomes a problem of great interest. To this
end, we develop a Gaussian mixture model (GMM)
framework to describe the stochastic behavior of the
instance-level feature vectors. Both the instance-based
maximum likelihood estimator (IMLE) and the
bag-based maximum likelihood estimator (BMLE) are
theoretically investigated. We found that the statistical
efficiency of the IMLE could be much better than that
of the BMLE if the instance-level labels are relatively
hard to predict. To address this problem, we develop
here a subsampling-based maximum likelihood estimation (SMLE) approach, where the instance-level
labels are partially provided through careful subsampling. This leads to a significantly reduced labeling cost with little sacrifice in terms of statistical efficiency. To demonstrate the finite sample performance,
extensive simulation studies are presented. A real data
example using whole-slide images (WSIs) to diagnose
metastatic breast cancer is illustrated.
Joint work with Xuetong Li, Jing Zhou, Hansheng
Wang.
Semi-Implicit Variational Inference via Score
Matching
Longlin Yu
Peking University
Abstract: Semi-implicit variational inference (SIVI)
greatly enriches the expressiveness of variational families by considering implicit variational distributions
defined in a hierarchical manner. However, due to the
intractable densities of variational distributions, current SIVI approaches often use surrogate evidence
lower bounds (ELBOs) or employ expensive inner-loop MCMC runs for direct ELBO maximization
for training. In this paper, we propose SIVI-SM, a
new method for SIVI based on an alternative training
objective via score matching. Leveraging the hierarchical structure of semi-implicit variational families,
the score matching objective allows a minimax formulation where the intractable variational densities
can be naturally handled with denoising score matching. We show that SIVI-SM closely matches the accuracy of MCMC and outperforms ELBO-based SIVI
methods in a variety of Bayesian inference tasks.
Joint work with Cheng Zhang.
Estimating IRT Models under Gaussian Mixture
Modelling of Latent Traits: An Application of
MSAEM Algorithm
Siyao Cheng
Northeast Normal University
Abstract: The assumption of a normal distribution for
latent traits is a common practice in item response
theory (IRT) models. Numerous studies have demonstrated that this assumption is often inadequate, impacting the accuracy of statistical inferences in IRT
models. To mitigate this issue, Gaussian mixture
modeling (GMM) for latent traits, known as
GMM-IRT, has been proposed. Moreover, the
GMM-IRT models can also serve as powerful tools
for exploring the heterogeneity of latent traits. However, the computation of GMM-IRT model estimation
encounters several challenges, impeding its widespread application. The purpose of this paper is to
propose a reliable and robust computing method for
GMM-IRT model estimation. Specifically, we develop
a mixed stochastic approximation EM (MSAEM)
algorithm for estimating the three-parameter normal
ogive model with GMM for latent traits
(GMM-3PNO). Crucially, the GMM-3PNO is augmented to be a complete data model within the exponential family, thereby substantially streamlining the
computation of the MSAEM algorithm. Furthermore,
the MSAEM algorithm adeptly avoids the label-switching issue, ensuring its convergence. Finally,
simulation and empirical studies are conducted to
validate the performance of the MSAEM algorithm
and demonstrate the superiority of the GMM-IRT
models.
Joint work with Xiangbin Meng.
Gaussian Mixture Model with Rare Events
Xuetong Li
Peking University
Abstract: We study here a Gaussian Mixture Model
(GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits an extremely slow numerical convergence
rate. To theoretically understand this phenomenon, we
formulate the numerical convergence problem of the
EM algorithm with rare events data as a problem
about a contraction operator. Theoretical analysis
reveals that the spectral radius of the contraction operator in this case could be arbitrarily close to 1 asymptotically. This theoretical finding explains the
empirical slow numerical convergence of the EM
algorithm with rare events data. To overcome this
challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. We find that MEM algorithm significantly improves the numerical convergence rate as
compared with the standard EM algorithm. The finite
sample performance of the proposed method is illustrated by both simulation studies and a real-world
dataset of Swedish traffic signs.
Joint work with Jing Zhou, Hansheng Wang.
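As a toy illustration of the MEM idea, the sketch below runs EM for a univariate two-component GMM in which some observations carry known labels; known labels pin their responsibilities to 0 or 1 in the E-step. The one-dimensional, fixed-variance setting is a simplification of the paper's model.
```python
import numpy as np
from scipy.stats import norm

def mixed_em(x, labels, pi=0.5, mu=(0.0, 1.0), sd=1.0, iters=100):
    """EM for a two-component univariate GMM where labels[i] is 0 or 1
    if known and -1 if unlabeled; known labels fix the responsibilities."""
    mu0, mu1 = mu
    for _ in range(iters):
        # E-step: posterior probability of component 1
        f0 = (1 - pi) * norm.pdf(x, mu0, sd)
        f1 = pi * norm.pdf(x, mu1, sd)
        r = f1 / (f0 + f1)
        r[labels == 0] = 0.0                 # labeled points contribute exactly
        r[labels == 1] = 1.0
        # M-step: update mixing weight and component means
        pi = r.mean()
        mu0 = np.sum((1 - r) * x) / np.sum(1 - r)
        mu1 = np.sum(r * x) / np.sum(r)
    return pi, mu0, mu1
```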
Mixed Models for Longitudinal Binary Outcomes
with Crossed Random Effects
Shi Zhang
University of Manitoba
Abstract: Longitudinal studies are important for understanding patterns of change and the effectiveness
of interventions. These studies sometimes involve
data collected from different levels, such as firms
being analyzed by multiple analysts and analysts
providing forecasts for multiple firms. This reflects
the interaction of factors across different levels. It is
crucial to use accurate statistical methods to analyze
such complex nested longitudinal data to ensure the
scientific validity of study results and conclusions.
This proposal aims to develop new statistical techniques specifically designed for analyzing longitudinal binary outcomes with crossed nested structures.
An appropriate analysis of this type of data should
consider random effects at all levels. In this thesis, we
include partially crossed random effects in mixed
models for longitudinal binary outcomes. We predict
the random effects using the orthodox best linear unbiased predictor method and obtain consistent estimators for the regression parameters. This method relies
only on the first and second moments of the random
effects, making it robust against distributional assumptions. We demonstrate the usefulness of our approach through simulation and application to US firms
to investigate factors from analysts and firms that are
linked to long-term growth forecasts for firms.
Joint work with Depeng Jiang.
Contributed Session CS004: Statistical Hypothesis
Testing in Complex Data
Hypothesis Testing in Gaussian Graphical Models:
Goodness-of-Fit and Conditional Randomization
Tests
Xiaotong Lin
National University of Singapore
Abstract: We introduce novel hypothesis testing
methods for Gaussian graphical models by generating
exchangeable copies. We utilize the copies to formulate a goodness-of-fit test, which is valid in both low
and high-dimensional settings and flexible in choosing
the test statistic. This test exhibits superior power
performance, especially in scenarios where the true
precision matrix violates the null hypothesis with
many small entries. Furthermore, we adapt the sampling algorithm for constructing a new conditional
randomization test for the conditional independence
between a response Y and a vector of covariates X
given some other variables Z without requirement of
any modeling assumption about Y. It also relaxes the
assumptions of conditional randomization tests by
allowing the number of unknown parameters of the
distribution of X to be much larger than the sample
size. For both of our testing procedures, we propose
several test statistics and conduct comprehensive simulation studies to demonstrate their superior performance in controlling the Type-I error and achieving
high power. The usefulness of our methods is further
demonstrated through real-world applications.
Joint work with Dongming Huang, Fangqiao Tian.
Robust Estimation and Testing for GARCH Models via Exponentially Tilted Empirical Likelihood
Yashuang Li
Yunnan University
Abstract: The GARCH model has become one of the
most powerful and widespread tools for dealing with
time series heteroskedastic models. A commonly employed approach for inference on GARCH models is
via the quasi-maximum likelihood. However, unless
the data are sampled regularly, the quasi-maximum
likelihood estimator is inconsistent due to density
misspecification or the presence of outliers. The main
aim of this paper is to present a robust nonparametric
likelihood analysis of GARCH models including estimation of the coefficient parameters as well as model specification testing of the GARCH process. A set
of identifying moment functions are specified by applying the idea of quantile regression models to the
GARCH process. Our moment restrictions not only
allow the GARCH innovations to follow a general distribution but are also less sensitive to outliers. We then explore the use of exponentially tilted empirical likelihood (ETEL) to effectively combine these quantile
related moment restrictions. The ETEL framework
allows for imposing over-identifying restrictions and
offers implied probabilities for efficient and robust
moment estimation and inference. Asymptotic properties of the resultant ETEL estimators and test statistics
are investigated under mild conditions on the innovation distributions. We illustrate and evaluate the proposed strategies through numerical experiments on
simulated and real datasets.
Joint work with Puying Zhao, Niansheng Tang.
A Bayesian Phase I/II Platform Design for Multiple
Indications with Mixed Types of Endpoints of Toxicity and Efficacy
Xian Shi
East China Normal University
Abstract: For a new targeted or immunotherapy agent,
studying phase I/II behavior by combining multiple
indications with cancer-specific standard of care becomes a new direction. In this article, we propose
a Bayesian phase I/II platform design to co-develop
combination therapies in multiple indications with
a binary toxicity endpoint and a survival efficacy endpoint with a generic master protocol in the evaluation
of each indication. Bayesian hierarchical models are
used to borrow information across indications for
more efficient indication-specific decision-making.
A sequential design for optimal biological dose finding
through a utility function is provided. A simulation study
shows that the proposed design has desirable operating characteristics and is superior to a design that uses
a dichotomized efficacy endpoint.
Elevating Federated Clustering: Deep Generative
Models and Contrastive Learning Strategies
Jie Yan
Central University of Finance and Economics
Abstract: Federated clustering (FC) is an essential
extension of centralized clustering designed for the
federated setting, wherein the challenge lies in constructing a global similarity measure without sharing
private data. Conventional approaches to FC typically
adopt extensions of centralized methods, like
K-means and fuzzy c-means. However, these methods
are susceptible to non-independent-and-identically-distributed (non-IID) data among clients, leading to
suboptimal performance, particularly with
high-dimensional data. To handle these issues, we first
bridged FC and deep generative models. By executing
an autoencoder-based deep clustering method on the
generated data, one can make the model immune to
the non-IID problem and significantly enhance its
performance, especially when dealing with
high-dimensional data. Nevertheless, this bridge still
manifests a substantial gap in clustering quality compared to state-of-the-art centralized clustering methods due to the inadequate representation learning capacity of the autoencoder, and the generated data could
elevate the risk of privacy breaches. Then, we bridged
FC and contrastive models using the classic FedAvg
framework. By learning more clustering-friendly representations, the gap has been notably reduced in certain federated scenarios. However, our empirical and
theoretical analyses indicate that an increased non-IID
level is often accompanied by increased correlations across
multiple dimensions of the learned representations,
leading to poor and non-robust performance. To address
this, we introduce a decorrelation regularizer, which
effectively mitigates the detrimental effects of the
non-IID problem, and achieves superior performance,
as evidenced by a marked increase in NMI scores,
with the gain reaching as high as 0.32 in the most
pronounced case. Moreover, these methodologies also
show superior performance in handling device failures
from a practical viewpoint.
Joint work with Jing Liu, Ji Qi, Yizi Ning, Zhongyuan Zhang.
The Lose-Lose or All-Lose Consequences: Assessing the International Economic Impact of Sino–U.S. Technological Decoupling
Yutao Jiang
Anhui University of Finance and Economics
Abstract: In response to the U.S. chip embargo, China has proposed export controls on crucial materials
like gallium, germanium, and graphite. However, few
studies have explored the economic impacts of these
trade sanctions policies. This study addresses this gap
by examining theoretical mechanisms and constructing a global input-output database for the chip, gallium-germanium, and graphite sectors. Using a dynamic computable general equilibrium model, we quantitatively evaluate the economic impacts of the Sino–
U.S. technological competition and conduct robustness tests. The results show that in the most extreme
scenario of the chip embargo, the GDP of China, U.S.,
and the world decreases by 1.051%, 0.006%, and
0.201%, respectively; that of Japan, South Korea, and
Taiwan, which follow the U.S. in implementing chip
sanctions, decreases by 0.109%, 0.177%, and 0.330%,
respectively. China’s export controls on crucial raw
materials will reduce its own economic damage and
have a large negative impact on Japan, South Korea,
and Taiwan, whereas the U.S. suffers relatively limited negative impacts. Our findings reveal that the
Sino–U.S. technological competition is unfavorable to
the economic interests of the two countries and poses
challenges to global economic recovery in the
post-pandemic era.
Joint work with Lianbiao Cui.
Contributed Session CS001: Recent Advances in
Reinforcement Learning
Strategy Evaluation in Non-Stationary Reinforcement Learning Environment
Wei Wang
Shandong University
Abstract: Reinforcement learning is one of the frontier directions of machine learning and has been widely applied in fields such as healthcare and economics. The classical reinforcement learning framework typically considers a stationary decision environment, one whose distribution does not change over time. This is unsuitable for real-world environments driven by non-stationary unit-root processes, where classical reinforcement learning methods often yield suboptimal policies. We therefore consider Markov decision processes in non-stationary unit-root environments and propose a model-free Q-learning algorithm and a model-based maximum likelihood estimation method. We analyze the consistency and efficiency of the algorithm and the estimation method, verify the properties of the proposed methods through simulations, and apply them to a real problem.
Joint work with Xiaodong Yan.
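For orientation, a minimal tabular Q-learning update of the generic kind the abstract builds on is sketched below; the unit-root-specific algorithm in the talk is more involved, and the state/action encoding here is purely illustrative.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One-step Q-learning: move Q(s, a) toward the bootstrapped target
    # r + gamma * max_a' Q(s_next, a'). Q is a (n_states, n_actions) table.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```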
Reinforcement Learning in Interval-Censored
Data
Zhimiao Cao
Shandong University
Abstract: Reinforcement learning is a general technique enabling an agent to learn an optimal policy and
interact with an environment in sequential decision-making problems. Interval-censored data is a
common form of data encountered in practical data
analysis, where observed results are only known to lie
within certain intervals rather than as exact values. The
characteristics of this data structure pose challenges to
traditional data analysis and decision-making, requiring appropriate strategies for handling and decision-making based on incomplete information. This
paper proposes a framework that applies reinforcement learning to interval-censored data processing to
develop an intelligent decision system capable of
offering personalized behavioral recommendations
based on observers' state and activity variables. By
combining reinforcement learning with interval-censored data, we can devise effective intervention strategies to address observers' emotional fluctuations and enhance the overall quality of their emotional states. Experimental results demonstrate that this
integrated approach effectively optimizes observers'
emotional states, providing a new method for personalized interventions and recommendations. This research is significant for the development of intelligent
and personalized emotion management systems, offering valuable insights for future health sciences and
intelligent decision systems.
Joint work with Xiaodong Yan, Chengchun Shi,
Jianqi Feng.
Unobserved Structural Changes in the Factor-Augmented Panel Quantile Model
Fanyu Meng
Shandong University
Abstract: Panel data models, known for their varied
structural patterns, have increasingly attracted attention in the fields of econometrics and statistics. This
paper addresses the issue of structural changes within
the factor-augmented panel quantile model. We introduce a unified method that detects the structural instabilities and sparsity in the model by employing a double penalized loss function. This approach guarantees
that the penalized estimators exhibit oracle properties,
providing adaptability without the necessity for stringent moment conditions on the errors. Our simulations
validate that the method is effective with finite samples, and its application to real-world data proves the
approach's capability to accurately detect structural
instability.
Joint work with Wei Wang, Xiaodong Yan, Xinbing
Kong.
Reinforcement Learning for Survival Analysis
Jianqi Feng
Shandong University
Abstract: Reinforcement learning aims to optimize
the mapping of states to actions in order to maximize
rewards. In survival analysis, determining the appropriate policy based on an individual's state is crucial.
To address this, we propose combining reinforcement
learning with survival analysis to achieve optimal
treatment outcomes. In order to accommodate the data
structure in survival analysis, we introduce a Markov
decision process that handles censored and recurrent
events. We propose a Q-function that utilizes reinforcement learning to determine the best treatment at
each step. To address the complexities associated with
multi-stage, censored and recurrent events, we redesign the data structure for single events and extend
finite data to infinite-stage data structures to accommodate reinforcement learning algorithms. By estimating the duration of recurrent events for all individuals at each stage, we maximize the probability of
exceeding a specified value as our target function in
reinforcement learning. Experimental results demonstrate that our proposed framework achieves and
maintains a high level of accuracy, even in the presence of right-censored data. This presents a novel
approach to decision-making in survival analysis.
Joint work with Wei Zhao, Chengchun Shi, Zhenke
Wu, Xiaodong Yan.
A Fast Optimal Hyperparameter Selection Based
on Bandits for Streaming Data
Zhang Yu
Shandong University
Abstract: In this paper, we propose an algorithm to
address the problem of efficiently selecting optimal
hyperparameters in continuous data streams. Specifically, we introduce an online algorithm named the
Online Hyperparameter Selection (OHS) based on the
multi-armed bandit framework. The implementation
of this algorithm requires only the availability of the
current data batch at each stage of the data stream,
without the need to observe the entire dataset. We
develop a dynamic procedure to select the optimal
hyperparameters at each arrival of a new data batch,
enabling adaptive adjustment of parameter values
along the data stream. Extensive numerical experiments are conducted to evaluate the performance of
OHS, covering a range of models including, but not
limited to, linear regression models, quantile regression models, and non-parametric models. These experiments demonstrate the effectiveness of our algorithm and are supported by theoretical results.
Joint work with Xiaodong Yan.
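A minimal sketch of the bandit framing, with one epsilon-greedy arm per candidate hyperparameter value updated batch by batch; this illustrates the setting, not the OHS algorithm itself, and fit_predict is a user-supplied stand-in.

```python
import numpy as np

def online_hyperparam_select(batches, lambdas, fit_predict, eps=0.1, seed=0):
    # batches: iterable of (X_new, y_new) arriving over time.
    # lambdas: candidate hyperparameter values (the bandit arms).
    # fit_predict(X_old, y_old, X_new, lam): hypothetical helper that refits
    # a model with hyperparameter lam on past data and predicts on X_new.
    rng = np.random.default_rng(seed)
    K = len(lambdas)
    counts, rewards = np.zeros(K), np.zeros(K)
    history = []
    for X_new, y_new in batches:
        if history and rng.random() > eps:
            k = int(np.argmax(rewards / np.maximum(counts, 1)))  # exploit
        else:
            k = int(rng.integers(K))                             # explore
        if history:
            X_old = np.vstack([b[0] for b in history])
            y_old = np.concatenate([b[1] for b in history])
            y_hat = fit_predict(X_old, y_old, X_new, lambdas[k])
            counts[k] += 1
            rewards[k] += -np.mean((y_new - y_hat) ** 2)  # reward = -loss
        history.append((X_new, y_new))
    return lambdas[int(np.argmax(rewards / np.maximum(counts, 1)))]
```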
Contributed Session CS006: Interdisciplinary and
Applied Research: Statistical Analysis on Medical
Data and Models
Research on Convex Clustering for Multi-Source
Data
Jianxi Zhao
Beijing Information Science and Technology University
Abstract: In recent years, convex clustering has attracted intensive attention because it largely
overcomes three shortcomings of traditional clustering methods: non-global convergence, poor robustness, and the need for prior information. Nowadays, data for a large number of problems can be obtained from multiple sources. Therefore, in this paper,
I propose a convex clustering model for multiple
sources, give its theoretical recovery guarantee theorem, present its solving algorithm, and analyze the
convergence of the algorithm in theory. Numerical
experiments performed on several multi-source datasets show that the proposed method achieves
better clustering performance compared with
some state-of-the-art clustering methods.
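For context, the single-source convex clustering objective that the multi-source model in this talk builds on is typically written as

```latex
% Standard convex clustering: each observation x_i gets a centroid u_i;
% fused centroids (u_i = u_j) define clusters.
\min_{u_1,\dots,u_n}\; \frac{1}{2}\sum_{i=1}^{n}\lVert x_i - u_i\rVert_2^2
\;+\; \gamma \sum_{i<j} w_{ij}\,\lVert u_i - u_j\rVert_2 ,
```

and the objective is convex, which is the source of the global convergence property mentioned above.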
Generalization Analysis of Deep CNNs under
Maximum Correntropy Criterion
Zhiying Fang
Shenzhen Polytechnic University
Abstract: Convolutional neural networks (CNNs)
have gained immense popularity in recent years, finding their utility in diverse fields such as image recognition, natural language processing, and
bio-informatics. Despite the remarkable progress
made in deep learning theory, most studies on CNNs,
especially in regression tasks, tend to heavily rely on
the least squares loss function. However, there are
situations where such learning algorithms may not
suffice, particularly in the presence of heavy-tailed
noises or outliers. This predicament emphasizes the
necessity of exploring alternative loss functions that
can handle such scenarios more effectively, thereby
unleashing the true potential of CNNs. In this paper,
we investigate the generalization error of deep CNNs
with the rectified linear unit (ReLU) activation function for robust regression problems within an information theoretic learning framework. Our study
demonstrates that when the regression function exhibits an additive ridge structure and the noise possesses
a finite pth moment, the empirical risk minimization
scheme, generated by the maximum correntropy criterion and deep CNNs, achieves fast convergence rates.
Notably, these rates align with the minimax optimal
convergence rates attained by fully connected neural
network models with the Huber loss function up to a
logarithmic factor. Additionally, we further establish
the convergence rates of deep CNNs under the maximum correntropy criterion when the regression function resides in a Sobolev space on the sphere.
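For reference, the correntropy-induced loss underlying the maximum correntropy criterion, with scale parameter sigma, is

```latex
% Correntropy-induced loss with scale \sigma (as in the MCC
% regression literature):
\ell_\sigma(u) \;=\; \sigma^2\left(1 - e^{-u^2/\sigma^2}\right),
```

which behaves like the squared loss for small residuals but is bounded by sigma^2 for large ones, explaining the robustness to heavy-tailed noise and outliers discussed above.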
Construction and Validation of an Imaging Omics
Prediction Model for the Efficacy of Methylprednisolone in the Treatment of Radiation-Induced
Brain Injury
Xiaohuang Zhuo
Tianjin Huanhu Hospital
Abstract: Objective: Intravenous methylprednisolone is the main treatment for radiation-induced brain injury after radiotherapy for nasopharyngeal carcinoma. However, some patients fail to benefit from methylprednisolone, and their condition may even worsen. The aim of this study was therefore to establish a radiomics model to predict the efficacy of methylprednisolone in patients with radiation-induced brain injury. Subjects and methods: This study enrolled 66 patients with radiation-induced brain injury treated with methylprednisolone. All patients underwent cranial magnetic resonance imaging (MRI) before and after steroid therapy. From each patient's pre-treatment MRI, 961 imaging features were extracted. LASSO regression was then applied to select the imaging features associated with steroid efficacy and to construct a radiomics signature classifier; combining this with clinical predictors of steroid efficacy, a clinical-radiomics prediction model was built using multivariable logistic regression, and the model's discrimination, calibration, and clinical applicability were evaluated. Ten-fold cross-validation was used for internal validation. Results: The radiomics signature classifier, composed of 16 selected imaging features, achieved good predictive performance on the whole dataset and in different subgroups. The prediction model combining the radiomics signature classifier with the time interval between radiotherapy and the diagnosis of radiation-induced brain injury showed good discrimination, with an AUC of 0.966 and a 10-fold cross-validation-corrected AUC of 0.967. The calibration curve also showed good agreement. Decision curve analysis indicated that the radiomics prediction model has clinical application value. Conclusion: This study proposes a radiomics prediction model combining radiomic features with the time interval between radiotherapy and the diagnosis of radiation-induced brain injury, which can be conveniently used to predict in advance the efficacy of intravenous methylprednisolone in patients with radiation-induced brain injury.
A Copula-Based Approach on Optimal Allocation
of Hot Standbys in Series Systems
Jiandong Zhang
Northwest Normal University
Abstract: In this talk, we propose a copula-based
approach to study the allocation problem of hot
standbys in series systems composed of two heterogeneous and dependent components. By assuming that
the lifetimes of components and spares are dependent
and linked via a general survival copula, optimal allocation strategies are presented for the case of one
and two redundancies at the component level. Further,
and two redundancies at the component level. Further,
redundancies allocation mechanisms are also compared between the allocations at the component level
and the system level. For the case of one hot standby,
we find that the performance of the redundant system
at the component level is always worse than that at the
system level. For the case of two hot standbys, the
reversed allocation principle (i.e., Barlow–Proschan
principle) is valid. Numerical examples and applications are also provided as illustrations. A real application on improving the tensile strength of cables in high-voltage electricity transmission network systems is
presented to show the applicability of our results.
Joint work with Yiying Zhang and Rongfang Yan.
A Monte Carlo Classification Method Based on
Partial Variables and Its Application in Clinical
Data
Shanjun Mao
Hunan University
Abstract: In some specific classification problems,
the input features X are divided into two parts: one
part is X1 that can be measured before constructing
the model, and the other part is X2 that cannot be
measured before constructing the model. In these
specific problems, the input features often cannot be
all measured before constructing the classification
model, and modeling using only the features that can
be observed will lose classification information. To
address this class of problems, this paper proposes a
Monte Carlo classification method based on partial
variables. The method models the sufficient dimension reduction of X2 at y = 0 and y = 1 by learning the
relationship between variable X1 and variable X2 in
the training set, and obtains the sufficient dimension
reduction result R(X2). Then, a Markov chain Monte Carlo method is used to build a sampling model
of R(X2), and finally a classification model is built
using X1 and R(X2) as input features. When a new sample arrives, the dimension-reduced intraoperative data are sampled, so that both preoperative and intraoperative feature information can be taken into account at classification time; as a result, the
method is able to predict more accurately the category
to which the samples belong.
Joint work with Yao Cui, LingYi Hu.
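A minimal sketch of the plug-in idea in the abstract, with a residual-bootstrap draw standing in for the paper's sufficient-dimension-reduction-plus-MCMC sampler of R(X2); all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_partial(X1, X2, y):
    # Learn the link X2 | X1 and a classifier on the full feature set.
    link = LinearRegression().fit(X1, X2)
    resid = X2 - link.predict(X1)            # kept for bootstrap draws
    clf = LogisticRegression().fit(np.hstack([X1, X2]), y)
    return link, resid, clf

def predict_partial(link, resid, clf, X1_new, n_draws=200, seed=0):
    # At prediction time only X1 is observed: draw plausible X2 values
    # and average the classifier's probabilities over the draws.
    rng = np.random.default_rng(seed)
    probs = np.zeros(len(X1_new))
    for _ in range(n_draws):
        X2_draw = link.predict(X1_new) + resid[
            rng.integers(len(resid), size=len(X1_new))]
        probs += clf.predict_proba(np.hstack([X1_new, X2_draw]))[:, 1]
    return probs / n_draws
```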
July 12, 16:00-17:40
Invited Session IS072: Statistical Interdisciplinary
Studies I
Solving Large-Scale Sparse Equations with Tree
Structures and Its Applications to Optical Fiber
Networks
Bingyi Jing
Southern University of Science and Technology
Abstract: In communication networks, detecting
asymmetric links is of significant practical importance
and has been a long-standing problem in industry.
From a statistical perspective, this problem can be
transformed into that of solving large-scale sparse
equations with tree structures. In this talk, we will
discuss how to embed sparsity into this problem and
provide effective and reliable solutions. Finally, we
will demonstrate the effectiveness of this approach in
finding asymmetric links in optical fiber networks. In
particular, we show that our proposed approach can
achieve an accuracy rate of 100% in realistic settings
and the method has already been deployed in the industry.
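The statistical core described above can be caricatured as sparse recovery on a tree: assuming (our notation) a 0/1 routing matrix A whose rows mark which tree edges each measurement traverses, and delay asymmetries x supported on only a few edges, a generic l1 approach is:

```python
import numpy as np
from sklearn.linear_model import Lasso

def recover_asymmetric_links(A, b, alpha=0.01):
    # A: (n_measurements, n_edges) 0/1 routing matrix over the tree.
    # b: observed delay differences; sparsity encodes that only a few
    # links are asymmetric. A Lasso fit is one standard way to embed
    # sparsity (the deployed system in the talk is far more involved).
    model = Lasso(alpha=alpha, fit_intercept=False).fit(A, b)
    return np.flatnonzero(np.abs(model.coef_) > 1e-8)  # candidate links
```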
Integrating Statistical Learning and Deep Learning for Efficient and Interpretable Analysis of
Complex Unstructured Data
Ke Deng
Tsinghua University
Abstract: The great success of large deep learning
models in various applications in recent years has
encouraged many researchers to seek improved performance by utilizing larger models and bigger data in
practical problems involving unstructured data, leading to an increasingly strong inclination
to pursue large models everywhere. However, the
fundamental principle of statistical modelling tells us
that an over-flexible large model without a clear focus
on unique features of the problem of interest would
often lead to inefficient utilization of data and
sub-optimal results. In this talk, we will provide concrete examples, in the context of video analysis, showing that deep
learning can be greatly enhanced by statistical learning
once we integrate them wisely. We hope these examples could inspire more research efforts on developing
advanced statistical approaches with competitive performance and transparent interpretation for analyzing
complex unstructured data on top of deep learning.
Joint work with Haifeng Wang.
Exploring Novel Uncertainty Quantification
through Forward Intensity Function Modeling
Cheng Yong Tang
Temple University
Abstract: Predicting future time-to-event outcomes is a
foundational task in statistical learning. While various
methods exist for generating point predictions, quantifying the associated uncertainties poses a more substantial challenge. In this study, we introduce an innovative approach specifically designed to address
this challenge, accommodating dynamic predictors
that may manifest as stochastic processes. Our investigation harnesses the forward intensity function in a
novel way, providing a fresh perspective on this intricate problem. The framework we propose demonstrates remarkable computational efficiency, enabling
efficient analyses of large-scale investigations. We
validate its soundness with theoretical guarantees, and
our in-depth analysis establishes the weak convergence of function-valued parameter estimations. We
illustrate the effectiveness of our framework with two
comprehensive real examples and extensive simulation studies.
Dynamic Synthetic Control Method for Semiparametric Time-Varying Additive Autoregression
Model
Shouxia Wang
Peking University
Abstract: Motivated by evaluating the treatment effects of a policy for nonlinear time-varying confounding variables, we propose a dynamic synthetic
control (DSC) method under the semiparametric
time-varying additive autoregression outcome model.
The proposed method allows for micro-level data with
nonlinear time-varying confounders, multiple treated
units and spatial correlations in the data.
A spline-backfitted-kernel estimation method is used to
obtain good estimates of the unknown additive
functions, which are then used for matching when we
construct the DSC weights. The DSC weights are
constructed by the empirical likelihood, which guarantees a unique solution and a consistent estimation of
the average treatment effect on the treated group.
The semiparametric additive model provides more flexibility in modelling and estimation, making it more
favorable when either the parametric form of the
model is unknown or the model is incorrectly specified. We have developed a test of the unconfoundedness
assumption based on the estimated effects in the
pre-treatment period and a normalized placebo test to
determine the significance of the estimated treatment
effects. The proposed DSC method is demonstrated by
both numerical simulations and real data examples
that highlight the effects of the air pollution alerts in
Beijing and the COVID-19 lockdown in Shanghai.
Joint work with Song Xi Chen, Xiangyu Zheng.
Invited Session IS050: Recent Advances in Functional and Complex Data
Functional Principal Component Analysis of Spatially and Temporally Indexed Point Processes
Yehua Li
University of California, Riverside
Abstract: We model spatially and temporally indexed
point process data as a multi-level log-Gaussian Cox
process where the log intensity function depends on a
partially linear single-index structure of spatio-temporal covariates and three latent functional
random effects representing the spatial and temporal
random effects as well as their interactions. We assume that the latent functional effects are Gaussian
processes with Karhunen-Loève representations, and
model the unknown link function of the single-index
as well as the covariance functions of the latent functional effects as splines. We propose to estimate the
partially linear coefficients and the single-index link
function using a Poisson maximum likelihood method,
and the covariance functions of the latent processes
using maximum composite likelihood methods. We
also propose approaches to predict the functional
principal component scores. Under the multi-level
dependence structure and allowing the spatio-temporal covariates to be non-stationary, the proposed estimators follow rather unconventional convergence rates which depend on both the number of
locations and the number of repeated measures in time.
We illustrate the proposed method through a simulation study and a real-data application in modeling
bike-sharing events.
Joint work with Kun Huang, Yongtao Guan.
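Schematically, the intensity model described above can be written (simplifying the partially linear single-index part into a single index for brevity; notation ours) as

```latex
% Multi-level log-Gaussian Cox process intensity with spatial (U),
% temporal (V), and interaction (W) latent Gaussian effects:
\lambda(s,t) \;=\; \exp\big\{\, g\big(\mathbf{x}(s,t)^{\top}\boldsymbol{\beta}\big)
\;+\; U(s) \;+\; V(t) \;+\; W(s,t) \,\big\},
```

where g is the unknown link and U, V, W each admit a Karhunen-Loève expansion.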
Functional Neural Networks
Jiguo Cao
Simon Fraser University
Abstract: Functional data analysis (FDA) is a growing statistical field for analyzing curves, images, or
any multidimensional functions, in which each random function is treated as a sample element. Functional data is found commonly in many applications
such as longitudinal studies and brain imaging. In this
talk, I will present a methodology for integrating
functional data into deep neural networks. The model
is defined for scalar responses with multiple functional and scalar covariates. A by-product of the method is
a set of dynamic functional weights that can be visualized during the optimization process. This visualization leads to greater interpretability of the relationship
between the covariates and the response relative to
conventional neural networks. The model is shown to
perform well in a number of contexts including prediction of new data and recovery of the true underlying relationship between the functional covariate and
scalar response; these results were confirmed through
real data applications and simulation studies.
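A minimal sketch of the core ingredient, a first layer whose weights are functions of t represented in a basis (the basis choice and names here are ours, not the paper's):

```python
import numpy as np

def functional_layer(X_curves, t_grid, coefs):
    # X_curves: (n, T) curves observed on a common grid t_grid (length T).
    # coefs: (K, J) basis coefficients defining J functional weights.
    B = np.vander(t_grid, N=coefs.shape[0], increasing=True)  # (T, K) basis
    betas = B @ coefs                  # (T, J): functional weights beta_j(t)
    dt = t_grid[1] - t_grid[0]
    # Riemann-sum approximation of the integrals  int x_i(t) beta_j(t) dt,
    # which feed into the subsequent dense layers of the network.
    return X_curves @ betas * dt       # (n, J) scalar features
```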
Causal Mediation Analysis for Multilevel and
Functional Data
Xi Luo
The University of Texas Health Science Center at
Houston
Abstract: Causal mediation analysis typically involves conditions that may not be applicable in neuroimaging studies. We introduce a multilevel causal
mediation framework to overcome this limitation and
more accurately quantify information flow in brain
pathways. This framework is designed to tackle several challenges: unmeasured mediator-outcome confounding, multilevel time series analysis, and the estimation of functional causal effects. Our approach is
grounded in multilevel structural equation modeling,
complemented by relaxed likelihood estimation
methods. Interestingly, certain causal estimates, typically unobtainable in simpler data structures, become
identifiable in our more complex data setting. We
provide proof of the asymptotic properties of our estimators and illustrate the numerical properties
through empirical analysis. Additionally, we utilize
real fMRI data to demonstrate the practical effectiveness of our proposed framework.
Joint work with Yi Zhao, Michael Sobel, Martin
Lindquist, Brian Caffo.
Frequent-Voting Independence Screening for Data
of Different Types or Different Dimensions
Kehui Chen
University of Pittsburgh
Abstract: Modern datasets often include different
types of variables with complex features, making
variable selection particularly challenging. For example, a measure of dependence with the response variable may not be directly comparable among predictor
variables of different types such as functional data. To
address this challenge, this work proposes a frequent-voting based independent screening method for
variable selection, which avoids a direct comparison
of the dependence measure among different variables.
Asymptotic analyses show that the proposed method
selects all of the active variables with probability
converging to one. We also demonstrate its great finite
sample performance through numerical experiments
and the application to an ADHD study.
Joint work with Haeun Moon.
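An illustrative reading of the idea, assuming (our notation) one dependence measure per variable type so that ranking happens only within type; the paper's exact voting and threshold rules may differ.

```python
import numpy as np

def frequent_voting_screen(X_groups, y, dep_fns, n_sub=100, frac=0.5,
                           top_k=5, freq=0.8, seed=0):
    # X_groups: list of (n, p_g) arrays, one per variable type.
    # dep_fns: per-type functions (X_g, y) -> length-p_g dependence scores
    #          (e.g., correlation for scalars, distance cov. for curves),
    #          so dependence is never compared directly across types.
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = [np.zeros(X.shape[1]) for X in X_groups]
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        for g, (X, f) in enumerate(zip(X_groups, dep_fns)):
            scores = f(X[idx], y[idx])
            votes[g][np.argsort(scores)[-top_k:]] += 1  # top-k within type
    # keep variables that are frequently top-ranked across subsamples
    return [np.where(v / n_sub >= freq)[0] for v in votes]
```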
Invited Session IS071: Statistical Inference for
High-Dimensional Data
Some Recent Results for P-Value Free FDR Controls
Jun Liu
Harvard University
Abstract: There has been significant interest among
researchers in false discovery rate (FDR) control
methods partially due to the strong desire from the
scientific community for reproducibility and replicability of scientific discoveries. I will discuss our recent efforts trying to go beyond the recently popular
p-value-free FDR control methods such as the
knockoff filter (KF), data splitting (DS), and Gaussian
mirror (GM). We present some power analysis of
these methods under the weak-and-rare signal framework and discuss its implications under different correlation structures of the design matrix. We then focus
on the DS procedure and its variant. In particular, we
reformulate the DS method into a two-step procedure:
using part of the data for estimation and feature ranking (in regression setting) and using the other part as
checking/validation. FDR control can be achieved by
monitoring how well the validation goes along the
feature ranking. Under this setup, we may utilize external information and apply any procedure, such as a
Bayesian method with spike-and-slab priors, to work
on the first part of the data. We show that substantial
power gain can be achieved in this way.
Joint work with Buyu Lin, Tracy Ke, Yuanchuan
Guo.
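The two-step DS recipe sketched above leads to a simple selection rule; a minimal version of the usual mirror-statistic threshold (shown without the +1 offset some variants add) is:

```python
import numpy as np

def mirror_fdr_threshold(M, q=0.1):
    # M: mirror statistics from data splitting; large positive values
    # indicate signals, and the symmetric negative tail estimates the
    # number of false discoveries among {j : M_j >= t}.
    for t in np.sort(np.abs(M[M != 0])):
        if np.sum(M <= -t) / max(np.sum(M >= t), 1) <= q:
            return t
    return np.inf

# selected features: {j : M_j >= mirror_fdr_threshold(M, q)}
```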
Enhancing Integrative Association Tests: Optimal
Weighting Approaches in Whole-Genome Sequencing Studies
Zheyang Wu
Worcester Polytechnic Institute
Abstract: Integrative association tests are a crucial
method for detecting association signals and have
broad applications. Weighting is an important strategy
for incorporating useful information to increase statistical power. For example, in genetic association studies using whole-genome sequencing (WGS) data, SNP
allele frequencies and annotations are considered indicative of the likelihood and effect size of genetic
causal variants. Consequently, they are widely utilized
in weighted integrative association tests to enhance
the identification of novel genes associated with human complex traits. However, the rationale for their
use is mostly based on biological motivations. In this
study, we reveal the statistical mechanisms by which
weighting contributes to increased power, deduce the
optimal weights based on signal and data correlation,
and discuss the advantages and limitations of
weighting. In particular, we establish the asymptotically optimal weights for a general framework of
weighted p-value combination tests, which include
prevalent methods used in genetic association studies.
We also explore the principles for estimating optimal
weights in practice. Our findings are validated through
extensive simulations and real data analysis.
Joint work with Hong Zhang, Ming Liu, John
Landers.
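As one concrete member of the weighted combination family studied here, the weighted Stouffer statistic with fixed weights w_j combines independent z-values as

```latex
T_w \;=\; \frac{\sum_{j=1}^{p} w_j Z_j}{\sqrt{\sum_{j=1}^{p} w_j^{2}}}
\;\sim\; N(0,1) \quad \text{under } H_0 ,
```

and the optimal-weighting question is how to choose the w_j given beliefs about signal locations, effect sizes, and the correlation structure; the paper's asymptotically optimal weights address exactly this.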
Detection and Statistical Inference on Informative
Core and Periphery Structures in Weighted Directed Networks
Wen Zhou
New York University
Abstract: In network analysis, noises and biases,
which are often introduced by peripheral or
non-essential components, can mask pivotal structures
and hinder the efficacy of many network modeling
and inference procedures. Recognizing this, identification of the core-periphery (CP) structure has
emerged as a crucial data pre-processing step. While
the identification of the CP structure has been instrumental in pinpointing core structures within networks,
its application to directed weighted networks has been
underexplored. Many existing efforts either fail to
account for the directionality or lack the theoretical
justification of the identification procedure. In this
work, we seek answers to three pressing questions: (i)
How to distinguish the informative and
non-informative structures in weighted directed networks? (ii) What approach offers computational efficiency in discerning these components? (iii) Upon the
detection of CP structure, can uncertainty be quantified to evaluate the detection? We adopt the signal-plus-noise model, categorizing uniform relational
patterns as non-informative, by which we define the
sender and receiver peripheries. Furthermore, instead
of confining the core component to a specific structure,
we consider it complementary to either the sender or
receiver peripheries. Based on our definitions on the
sender and receiver peripheries, we propose spectral
algorithms to identify the CP structure in directed
weighted networks. Our algorithm stands out with
statistical guarantees, ensuring the identification of
sender and receiver peripheries with overwhelming
probability. Additionally, our methods scale effectively for expansive directed networks. Implementing our
methodology on faculty hiring network data revealed
captivating insights into the informative structures and
distinctions between informative and non-informative
sender/receiver nodes across various academic disciplines.
Invited Session IS006: AI and Machine Learning in
Single Cell Genomic
Integrating Transcriptomic and Pathomic Features
to Reconstruct 3D Tissue Maps with Super-Resolution
Mingyao Li
University of Pennsylvania
Abstract: Solid tissues form complex 3D structures,
and examining the tissue microenvironment in 3D
context allows researchers to gain a comprehensive
understanding of how cells interact within the original
tissue context. This 3D information also reveals spatial relationships between different cell types and
signaling pathways that are not observable in 2D tissue sections. In this talk, I will present our recently
developed tool that is aimed at generating single-cell
resolution 3D ST tissue maps while significantly reducing experimental costs. By integrating information
from spatial transcriptomics and pathology imaging
data, our method gradually increases gene expression
resolution down to the single-cell level. Additionally,
we have developed an algorithm to register tissue
sections obtained from serial tissue cuts and impute
missing gene expression data between tissue gaps,
enabling the construction of accurate 3D tissue volumes. The resulting analysis will not only generate a
single-cell resolution spatial transcriptomics tissue
map but also facilitate detailed characterization and
quantification of tissue structures of interest in 3D.
Joint work with Daiwei Zhang.
A Hybrid Approach for Selecting Highly Variable
Genes in Single-Cell RNA-Seq
Hongkai Ji
Johns Hopkins University
Abstract: Selecting highly variable genes (HVGs) or
features (HVFs) is a key component of many single
cell RNA-seq data analysis pipelines. Here we conduct a systematic benchmark study of 47 existing and
new HVG selection methods using 19 benchmark
datasets and an average of 18 evaluation criteria per
method. We found that a hybrid approach integrating
features from multiple methods robustly outperformed
existing individual methods, yielding more accurate
cell clustering, label transfer, and improved
cross-modality correlation. We developed an R package mixhvg that delivers this hybrid solution. Users
can conveniently use this package to perform HVG
selection independently or as part of their custom data
analysis pipelines.
Joint work with Ruzhang Zhao, Jiuyao Lu, Weiqiang
Zhou, Ni Zhao.
Supervised Deep Learning with Gene Annotation
for Cell Classification
Wei Sun
Fred Hutchinson Cancer Center
Abstract: Gene-by-gene differential expression analysis is a widely used supervised-learning method for
analyzing single-cell RNA sequencing (scRNA-seq)
data. However, due to the large number of cells in
scRNA-seq studies, such analysis can lead to many
differentially expressed genes with extremely small
p-values but minimal effect sizes, making interpretation challenging. To address this issue, we proposed
an alternative method called Supervised Deep Learning with Gene Annotation (SDAN). SDAN integrates
gene annotation and gene expression data using a
graph neural network, which identifies gene sets that
accurately classify cells. By using SDAN, we have
successfully identified gene sets associated with severe COVID-19, Alzheimer's disease, and cancer
patients' response to immunotherapy.
Joint work with Zhexiao Lin.
Invited Session IS054: Recent Advances in Statistical Machine Learning
A Bayesian Framework for Leveraging Pretrained
Large Diffusion Models
Jian Huang
The Hong Kong Polytechnic University
Abstract: Diffusion-based generative models have
achieved remarkable successes in learning complex
probability measures for various types of data, including image, video, audio, and biomedical data.
Researchers have taken steps to leverage pre-trained
large-scale models with a significantly reduced
amount of data, enabling them to generate samples
that align with the dataset's support and achieve comparable quality. The combination of learnable modules
and large models has shown impressive generation
capabilities. Therefore, it is useful to understand how
we can leverage a large model for analyzing data from
\"a small probability space\" with a limited amount of
data. In this work, we formulate a Bayesian framework for leveraging large diffusion models in generative tasks. We clarify the meaning behind leveraging a
large model for analyzing data from a "small probability space" and explore the task of leveraging
pre-trained models using learnable modules from a
Bayesian perspective.
Joint work with Ding Huang, Ting Li.
Unsupervised Federated Learning: A Federated
Gradient EM Algorithm for Heterogeneous Mixture Models with Robustness Against Adversarial
Attacks
Yang Feng
New York University
Abstract: While supervised federated learning approaches have enjoyed significant success, the domain
of unsupervised federated learning remains relatively
underexplored. In this paper, we introduce a novel
federated gradient EM algorithm designed for the
unsupervised learning of mixture models with heterogeneous mixture proportions across tasks. We begin
with a comprehensive finite-sample theory that holds
for general mixture models, then apply this general
theory on Gaussian Mixture Models (GMMs) and
Mixture of Regressions (MoRs) to characterize the
explicit estimation error of model parameters and
mixture proportions. Our proposed federated gradient
EM algorithm demonstrates several key advantages:
adaptability to unknown task similarity, resilience
against adversarial attacks on a small fraction of data
sources, protection of local data privacy, and computational and communication efficiency.
Joint work with Ye Tian, Haolei Weng.
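A minimal sketch of a federated gradient EM round for a spherical Gaussian mixture, with global mixture proportions and known variance for brevity; the paper's heterogeneous per-task proportions and adversarial robustness are beyond this sketch.

```python
import numpy as np

def local_estep_stats(X, mu, sigma2, pi):
    # E-step on one client: responsibilities r_ik for a K-component
    # spherical GMM, summarized as per-component counts and sums.
    d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)     # (n, K)
    logp = -0.5 * d2 / sigma2 + np.log(pi)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    return r.sum(axis=0), r.T @ X                      # (K,), (K, d)

def federated_gradient_em(clients, mu0, sigma2=1.0, lr=0.5, rounds=50):
    mu = mu0.astype(float).copy()
    pi = np.full(mu.shape[0], 1.0 / mu.shape[0])
    n_total = sum(len(X) for X in clients)
    for _ in range(rounds):
        stats = [local_estep_stats(X, mu, sigma2, pi) for X in clients]
        counts = sum(s[0] for s in stats)              # server aggregation
        sums = sum(s[1] for s in stats)
        grad = (sums - counts[:, None] * mu) / sigma2  # d loglik / d mu
        mu += lr * grad / n_total                      # one gradient EM step
    return mu
```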
Value Enhancement of Reinforcement Learning
via Efficient and Robust Trust Region Optimization
Fan Zhou
Shanghai University of Finance and Economics
Abstract: Reinforcement learning (RL) is a powerful
machine learning technique that enables an intelligent
agent to learn an optimal policy that maximizes the
cumulative rewards in sequential decision making.
Most methods in the existing literature are developed in online settings where the data are easy to collect or simulate. Motivated by high-stakes domains
such as mobile health studies with limited and
pre-collected data, in this article, we study offline
reinforcement learning methods. To efficiently use
these datasets for policy optimization, we propose a
novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when
the initial policy is not consistent, our method will
output a policy whose value is no worse and often
better than that of the initial policy. When the initial
policy is consistent, under some mild conditions, our
method will yield a policy whose value converges to
the optimal one at a faster rate than the initial policy,
achieving the desired "value enhancement" property.
The proposed method is generally applicable to any
parameterized policy that belongs to certain
pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to
demonstrate the superior performance of our method. Supplementary materials for this article are available online.
Taming \"Data-Hungry\" Reinforcement Learning?
Stability in Continuous State-Action Spaces
Yaqi Duan
New York University
Abstract: We introduce a novel framework for analyzing reinforcement learning (RL) in continuous
state-action spaces, and use it to prove fast rates of
convergence in both off-line and on-line settings. Our
analysis highlights two key stability properties, relating to how changes in value functions and/or policies
affect the Bellman operator and occupation measures.
We argue that these properties are satisfied in many
continuous state-action Markov decision processes,
and demonstrate how they arise naturally when using
linear function approximation methods. Our analysis
offers fresh perspectives on the roles of pessimism
and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer
learning.
Joint work with Martin Wainwright.
Invited Session IS060: Recent Developments in
Complex Time Series Analysis
Matrix Denoising and Completion Based on Kronecker Product Approximation
Han Xiao
Rutgers University
Abstract: We consider the problem of matrix denoising and completion induced by the Kronecker
product decomposition. Specifically, we propose to
approximate a given matrix by the sum of a few
Kronecker products of matrices, which we refer to as
the Kronecker product approximation (KoPA). Because the Kronecker product is an extension of the
outer product from vectors to matrices, KoPA extends
the low rank matrix approximation, and includes it as
a special case. Comparing with the latter, KoPA also
offers greater flexibility, since it allows the user to
choose the configuration, namely the dimensions of
the two smaller matrices forming the Kronecker
product. On the other hand, the configuration to be
used is usually unknown, and needs to be determined
from the data in order to achieve the optimal balance
between accuracy and parsimony. We propose to use
extended information criteria to select the configuration. Under the paradigm of high dimensional analysis,
we show that the proposed procedure is able to select
the true configuration with probability tending to one,
under suitable conditions on the signal-to-noise ratio.
We demonstrate the superiority of KoPA over the low
rank approximations through numerical studies, and
several benchmark image examples.
Joint work with Chencheng Cai, Rong Chen.
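A minimal sketch of the single-term case via the classical Van Loan-Pitsianis rearrangement, on which KoPA builds: for a fixed configuration (m1, n1, m2, n2), the best A kron B approximation comes from the leading singular pair of a block-rearranged matrix.

```python
import numpy as np

def kopa_rank1(X, m1, n1, m2, n2):
    # Best single-term approximation X ~ A kron B with A: (m1, n1),
    # B: (m2, n2). Blocks of X are raveled into rows; for X = A kron B
    # the rearranged matrix equals vec(A) vec(B)^T, so its leading
    # singular pair recovers A and B (up to a sign pair).
    R = np.stack([
        X[i*m2:(i+1)*m2, j*n2:(j+1)*n2].ravel(order="F")
        for j in range(n1) for i in range(m1)   # column-major vec(A) order
    ])                                          # (m1*n1, m2*n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = (np.sqrt(s[0]) * U[:, 0]).reshape(m1, n1, order="F")
    B = (np.sqrt(s[0]) * Vt[0]).reshape(m2, n2, order="F")
    return A, B

# sanity check on the output: np.kron(A, B) should approximate X
```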
Simultaneous Decorrelation of Matrix Time Series
Yuefeng Han
University of Notre Dame
Abstract: We propose a contemporaneous bilinear
transformation for a p x q matrix time series to alleviate the difficulties in modeling and forecasting matrix
time series when p and/or q are large. The resulting
transformed matrix assumes a block structure consisting of several small matrices, and those small matrix
series are uncorrelated across all times. Hence an
overall parsimonious model is achieved by modelling
each of those small matrix series separately without
the loss of information on the linear dynamics. Such a
parsimonious model often has better forecasting performance, even when the underlying true dynamics
deviates from the assumed uncorrelated block structure after transformation. The uniform convergence
rates of the estimated transformation are derived,
which vindicate an important virtue of the proposed
bilinear transformation, i.e. it is technically equivalent
to the decorrelation of a vector time series of dimension max(p,q) instead of p x q. The proposed method
is illustrated numerically via both simulated and real
data examples.
Joint work with Rong Chen, Cun-Hui Zhang, Qiwei
Yao.
Tensor Factor Model Estimation by Iterative Projection
Dan Yang
The University of Hong Kong
Abstract: Tensor time series, which is a time series
consisting of tensorial observations, has become ubiquitous. It typically exhibits high dimensionality. One
approach for dimension reduction is to use a factor
model structure, in a form similar to Tucker tensor
decomposition, except that the time dimension is
treated as a dynamic process with a time dependent
structure. In this paper we introduce two approaches
to estimate such a tensor factor model by using iterative orthogonal projections of the original tensor time
series. These approaches extend the existing estimation procedures and improve the estimation accuracy
and convergence rate significantly as proven in our
theoretical investigation. Our algorithms are similar to
the higher order orthogonal projection method for
tensor decomposition, but with significant differences
due to the need to unfold tensors in the iterations and
the use of autocorrelation. Consequently, our analysis
is significantly different from the existing ones.
Computational and statistical lower bounds are derived to prove the optimality of the sample size requirement and convergence rate for the proposed
methods. Simulation study is conducted to further
illustrate the statistical properties of these estimators.
Joint work with Yuefeng Han, Rong Chen, Cun-Hui
Zhang.
Invited Session IS045: Recent Advancements in
Large Network and Tensor Data Analysis
Statistical Foundations of Deep Generative Models
Lizhen Lin
University of Maryland
Abstract: Deep generative models are probabilistic
generative models where the generator is parameterized by a deep neural network. They are popular models for modeling high-dimensional data such as texts,
images and speeches, and have achieved impressive
empirical success. Despite demonstrated success in
empirical performance, theoretical understanding of
such models is largely lacking. We investigate statistical properties of deep generative models from a
nonparametric distribution estimation viewpoint. In
the considered model, data are assumed to be observed in some high-dimensional ambient space but
concentrate around some low-dimensional structure
such as a lower-dimensional manifold. This talk will
provide an explanation of why deep generative models can perform well from the lens of statistical theory.
In particular, we will provide insights into i) how
deep generative models can avoid the curse of dimensionality and outperform classical nonparametric estimates, and ii) how likelihood approaches work for
high-dimensional distribution estimation, especially in
adapting to the intrinsic geometry of the data.
Autoregressive Networks with Dependent Edges
Qiwei Yao
London School of Economics
Abstract: We propose an autoregressive framework
for modelling dynamic networks with dependent edges. It encompasses the models which accommodate,
for example, transitivity, density-dependent and other
stylized features often observed in real network data.
By assuming the edges of network at each time are
independent conditionally on their lagged values, the
models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and
maximum likelihood estimation in a straightforward
manner. Due to the possibly large number of parameters in the models, the initial MLEs may suffer from
slow convergence rates. An improved estimator for
each component parameter is proposed based on an
iterative projection that mitigates the
impact of the other parameters. Based on a martingale
difference structure, the asymptotic distribution of the
improved estimator is derived without the stationarity
assumption. The limiting distribution is not normal in
general, and it reduces to normal when the underlying
process satisfies some mixing conditions. The approach is illustrated with a transitivity model in both simulations and two real network data sets.
Analysis of Large Networks
Jiashun Jin
Carnegie Mellon University
Abstract: The block-model family has four popular
network models: SBM, MMSBM, DCBM, and
DCMM. A fundamental problem is how well each of
these models fits real networks. We propose
GoF-MSCORE as a new Goodness-of-Fit (GoF) metric for DCMM (the broadest one among the four),
with two main ideas. The first is to use cycle count
statistics as a general recipe for GoF. The second is a
novel network fitting scheme. GoF-MSCORE is a
flexible GoF approach. We adapt it to all four models
in the block-model family. We show that for each of
the four models, if the assumed model is correct, then
the corresponding GoF metric converges to the standard normal as the network sizes diverge. We also analyze the powers and show that these metrics are optimal in many settings. For 11 frequently-used real
networks, we use the proposed GoF metrics to show
that DCMM fits well with almost all of them. We also
show that SBM, DCBM, and MMSBM do not fit well
with many of these networks, especially when the
networks are relatively large.
High-Order Singular Value Decomposition in Tensor Analysis
Anru Zhang
Duke University
Abstract: The analysis of tensor data, i.e., arrays with
multiple directions, is motivated by a wide range of
scientific applications and has become an important
interdisciplinary topic in data science. In this talk, we
discuss the fundamental task of performing Singular
Value Decomposition (SVD) on tensors, exploring
both general cases and scenarios with specific structures like smoothness and longitudinality. Through the
developed frameworks, we can achieve accurate denoising for 4D scanning transmission electron microscopy images; in longitudinal microbiome studies,
we can extract key components in the trajectories of
bacterial abundance, identify representative bacterial
taxa for these key trajectories, and group subjects
based on the change of bacteria abundance over time.
We also showcase the development of statistically
optimal methods and computationally efficient algorithms that harness valuable insights from
high-dimensional tensor data, grounded in theories of
computation and non-convex optimization.
Invited Session IS083: Limit Theory of Large Dimensional Random Matrices (大维随机矩阵极限
理论)
Nonlinear Principal Component Analysis with
Random Bernoulli Features for Process Monitoring
Dandan Jiang
Xi'an Jiaotong University
Abstract: This paper proposes a new random map,
the random Bernoulli feature, which captures nonlinear patterns in the process efficiently and quickly.
First, we derive its convergence bound for approximating the Gaussian kernel and apply the random
Bernoulli features to PCA to obtain its nonlinear variant: random Bernoulli PCA (RBPCA). Second, the
framework for implementing process monitoring using
RBPCA is described and related to other tools such as
time-lagged structures and moving windows. As a result,
three nonlinear process monitoring methods based on
RBPCA are proposed, which can extract the dynamic
properties of the process or make the model adaptive.
These methods utilizing random Bernoulli features
offer scalability and lower computational cost compared to kernel-based methods. Finally, the performance of process monitoring is demonstrated by a
numerical example, the Tennessee Eastman Process, and the
Server Machine Dataset. The superiority of the nonlinear process monitoring methods based on the random Bernoulli feature is confirmed.
Joint work with Ke Chen, Shurong Zheng.
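As a hedged analogue (using the classical random Fourier features in place of the paper's random Bernoulli map, whose exact construction we do not reproduce here), random-feature PCA looks like:

```python
import numpy as np

def random_feature_pca(X, n_features=200, gamma=1.0, n_comp=5, seed=0):
    # Random Fourier features approximating the Gaussian kernel
    # k(x, y) = exp(-gamma * ||x - y||^2); the random Bernoulli
    # features in the talk play the same role with a cheaper map.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    Z = np.sqrt(2 / n_features) * np.cos(X @ W + b)
    Z -= Z.mean(axis=0)                  # center the feature map
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_comp].T           # nonlinear PC scores for monitoring
    return scores, Vt[:n_comp]
```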
An Integrative Multi-Context Mendelian Randomization Method for Identifying Risk Genes
Across Human Tissues
Fan Yang
Tsinghua University
Abstract: Mendelian randomization (MR) provides
valuable assessments of the causal effect of exposure
on outcome, yet the application of conventional MR
methods for mapping risk genes encounters new challenges. One of the issues is the limited availability of
expression quantitative trait loci (eQTLs) as instrumental variables (IVs), hampering the estimation of
sparse causal effects. Additionally, the often context/tissue-specific eQTL effects challenge the MR
assumption of consistent IV effects across eQTL and
GWAS data. To address these challenges, we propose
a multi-context multivariable integrative MR framework, mintMR, for mapping expression and molecular
traits as joint exposures. It models the effects of molecular exposures across multiple tissues in each gene
region, while simultaneously estimating across multiple gene regions. It uses eQTLs with consistent effects
across more than one tissue type as IVs, improving IV
consistency. A major innovation of mintMR involves
employing multi-view learning methods to collectively model latent indicators of disease relevance across
multiple tissues, molecular traits, and gene regions.
The multi-view learning captures the major patterns of
disease-relevance and uses these patterns to update the
estimated tissue relevance probabilities. The proposed
mintMR iterates between performing a multi-tissue
MR for each gene region and jointly learning the disease-relevant tissue probabilities across gene regions,
improving the estimation of sparse effects across
genes. We apply mintMR to evaluate the causal effects of gene expression and DNA methylation for 35
complex traits using multi-tissue QTLs as IVs. The
proposed mintMR controls genome-wide inflation and
offers new insights into disease mechanisms.
Joint work with Yihao Lu, Lin Chen.
Limit Theorems for U-Statistics of Determinantal
Point Process via Cumulant Estimates
Dong Yao
Jiangsu Normal University
Abstract: In this talk, we will derive the first and
second order Wiener chaos decomposition for
the U-statistics of determinantal processes associated
with spectral projection kernels on the d-dimensional
unit spheres. We first derive a graphical representation
for the cumulants of the U-statistics of any determinantal process. The main results are established by
combining precise estimates on the graph structure of
this representation with the spectral projection kernels.
The approach can be adapted to other determinantal
point processes, and similar results may hold. We also
compare our results with Hoeffding decomposition for
U-statistics of i.i.d. random variables.
Joint work with Renjie Feng and Friedrich Götze.
Quantitative Tracy-Widom Laws for Wigner and
Sample Covariance Matrices
Yuanyuan Xu
Chinese Academy of Sciences
Abstract: This talk will discuss a quantitative Tracy-Widom law for the largest eigenvalue of Wigner
matrices. More precisely, we will prove that the fluctuations of the largest eigenvalue of a Wigner matrix
of size N converge to the Tracy-Widom limit at a rate
of nearly N^{-1/3}, as N tends to infinity. Moreover, we
also establish a small deviation from the Tracy-Widom distribution for the largest eigenvalue of
Wigner matrices. The same results also hold true for
the largest eigenvalue of sample covariance matrices,
which plays a significant role in the principal component analysis. These are based on several joint works
with Kevin Schnelli (KTH).
Joint work with Kevin Schnelli.
Invited Session IS099: Economic Statistics and
Research on High-Quality Development (经济统计
与高质量发展研究)
Research on the Construction of Industrial Chain
and Supply Chain Network and Resilience Evaluation
Shaohua Ge
Zhongnan University of Economics and Law
Abstract: China is an important engine of global industrial growth and plays a key role in global industrial chains, yet problems such as industrial and supply chains being "large but not strong, complete but not superior" have not been fundamentally resolved, and industrial chain development faces a series of challenges, so research on guarding against potential risks and improving industrial and supply chain resilience is highly necessary. Based on industrial linkage theory and complex network theory, this study analyzes the meaning, development, and related characteristics of industrial and supply chain resilience. It then takes a micro-level perspective and constructs an industrial and supply chain linkage network from inter-firm relationships, analyzes node-level and whole-network characteristics, and identifies important nodes, laying a foundation for resilience evaluation. Next, it comprehensively evaluates industrial and supply chain resilience from two aspects, the capacity to withstand risk and the capacity to recover after shocks, and further analyzes the influencing factors. The study finds that the overall risk resistance of industrial and supply chains has been strengthening; recovery capacity first declined and then rose; the resilience level has improved year by year; and firms' financing constraints, equity ratios, and comprehensive tax rates, among other factors, have positive effects on the development of nodes in the industrial and supply chain network.
ESG Performance, Financing Cost, and
High-Quality Development of Enterprises
Yating Gui
Zhongnan University of Economics and Law
Abstract: Financing costs have a decisive influence on the survival and development of enterprises. As capital markets pay increasing attention to corporate ESG disclosure, the importance of ESG for the high-quality development of enterprises has become ever more prominent. Based on sample data from 3,467 A-share listed companies over 2009-2020, this study examines the effect of ESG performance on the high-quality development of enterprises and, by constructing a moderated mediation model, focuses on the role of financing costs between the two. The results show that good ESG performance promotes the high-quality development of enterprises. Mechanism analysis shows that financing costs partially mediate the relationship between ESG performance and high-quality development, and that corporate innovation capability and market competition structure moderate this mediation. Heterogeneity analysis shows that the promoting effect of ESG performance on high-quality development is more pronounced in central and western regions, lightly polluting industries, and non-state-owned enterprises. The study provides new perspectives for accurately understanding and assessing the social and economic effects of corporate ESG performance, enriches the literature on the economic consequences of ESG performance and the determinants of high-quality enterprise development, and offers empirical evidence and policy references for high-quality development from the perspective of financing costs.
Discussion on the Statistical Monitoring System of
Financial Security
Zihuan Gao
Zhongnan University of Economics and Law
Abstract: As an important component of national security, financial security is a strategic and fundamental matter bearing on the overall economic and social development of China. What, then, is financial security, and how should it be measured? Based on a deep understanding of financial security and a review of domestic and international research on financial security monitoring, this paper distills the statistical meaning of financial security. Centering on two aspects, the normal operation of the macroeconomy and the financial system, and the interdependence between the financial sector and the real economy, it sets up three subsystems: external risk resistance, financial system stability, and economic sector operation, and then constructs a statistical monitoring system for financial security comprising 9 dimensions and 22 indicators. The monitoring system is characterized by a tight focus on the theme of "security"; it fits China's national conditions while maintaining an international perspective; its indicators are few but well-chosen; and the required data are obtainable.
Joint work with Hu Zhang.
Influencing Factors and Potential Measurement of
China's Export Trade under the \"Belt and Road\"
Vision
Qinqin Zhu
Zhongnan University of Economics and Law
Abstract: Based on panel data for 2012-2022, this paper uses a stochastic frontier gravity model to measure the efficiency and potential of China's export trade with 66 countries along the "Belt and Road", and analyzes the influencing factors using a one-step approach. The results show that the GDP of both parties, the trading partner's population, and trade dependence have positive effects on China's export trade, whereas geographic distance and China's population have negative effects; outward foreign direct investment and free trade agreements promote China's exports, while infrastructure construction is an impediment; and China's export trade efficiency and potential differ markedly across countries along the "Belt and Road". The study offers reference value for the foreign investment and regional policies of countries along the "Belt and Road" and worldwide, has particular guiding significance for Chinese enterprises "going global", and provides policy suggestions for raising the level of "Belt and Road" economic and trade cooperation.
The Mechanism and Impact of Returning Home to
Start a Business on Improving County-Level Total
Factor Productivity: Based on the Investigation of
the Pilot Policy of Returning Home to Start a
Business
Huicong Wang
Zhongnan University of Economics and Law
Abstract: Based on panel data for 1,789 counties nationwide over 2012-2020, this paper uses a staggered difference-in-differences model to analyze how the pilot policy of government support for returning home to start businesses affects county-level total factor productivity (TFP) and through what mechanisms. The results show that the pilot policy significantly raises county-level TFP. Mechanism analysis finds that the policy raises county TFP by alleviating the negative effects of government intervention, improving the level of financial services, and stimulating county-level innovation. In addition, the policy's effects differ markedly by geographic location, economic base, level of industrial agglomeration, and level of human capital. Accordingly, counties should build new growth poles through return-home entrepreneurship in light of their own development characteristics, and make good use of the looser market environment, financial support effects, and technological innovation effects it brings to raise county-level TFP.
Joint work with Xian Zhu.
Invited Session IS091: Model Averaging and Related Topics
Model Averaging for Decomposed Data
Yuying Sun
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Abstract: The decomposition-ensemble algorithm has
received increasing attention in forecasting and related
fields, especially in capturing the nonlinear and nonstationary characteristics of time series data. A conventional strategy involves decomposing the target
time series into various oscillation modes from the
frequency domain and assigning equal weights to all
decomposed modes for aggregated prediction. However, disparities in forecasting performance arise
among different decomposed modes due to their distinct attributes and forecast horizons. This paper proposes a novel frequency decomposition-based model
averaging approach to combine decomposed modes
with appropriate weights, thereby enhancing the accuracy of the target time series forecast. It is shown
that the proposed model averaging estimator is asymptotically optimal in the sense of achieving the
lowest possible quadratic prediction risk. The rate at which
the selected weights converge to the optimal
weights minimizing the expected quadratic loss is
established. Simulation studies and empirical applications to consumption and exchange rate forecasting
highlight the merits of the proposed method.
Model Averaging in Multivariate Spatial Autoregressive Model for Social Network Analysis
Fang Fang
East China Normal University
Abstract: In social network analysis, a crucial issue is
to determine how network nodes interact with each
other, that is, which spatial weight matrix to use under the framework of spatial autoregressive models. When the dependent variable is multivariate and there are multiple candidate weight matrices, this paper proposes a model averaging method
based on a Mallows type criterion to obtain a
weighted weight matrix estimate. When the candidate
weight matrices are all mis-specified, the method is
asymptotically optimal in the sense of minimizing the
prediction error for the dependent variable. When the
correct weight matrix is included in the candidates,
the weighted estimation is consistent. Numerical simulations verify the theoretical results and the superiority of the proposed method. Two examples on social
networks and financial networks are presented for
illustration.
Averaging Method of Poisson Regression Model
with Divergent Dimensions
Jiahui Zou
Capital University of Economics and Business
Abstract: This paper proposes a new model averaging method to address model uncertainty in Poisson regression, allowing the dimension of the covariates to grow with the sample size. Based on the Kullback-Leibler (KL) divergence, we derive an unbiased criterion for computing the model averaging weights. The results show that when all candidate models are misspecified, the proposed model averaging estimator is asymptotically optimal, that is, asymptotically equivalent to the theoretically optimal averaging estimator in the sense of KL loss. When a correct model is included in the candidate set, the model averaging parameter estimator is consistent. Finally, we apply the method to studying the determinants of corporate innovation and to prediction.
Joint work with Wendun Wang, Xinyu Zhang, Guohua Zou.
Optimal Distributed Prediction by Model Averaging
Jun Liao
Renmin University of China
Abstract: In this paper, we develop a new distributed
prediction approach. To obtain the data-driven weights
for averaging, an unbiased estimator of the squared
prediction risk is derived as the weight choice criterion. The proposed distributed prediction is obtained by
combining the local predictions using the weights that
minimize such a criterion. The proposed method is
shown to be asymptotically optimal in the sense of
squared loss and the convergence rate is also studied.
The simulation and real data analysis show that the
prediction based on the new method is remarkably
superior to that based on the naive divide-and-conquer
averaging method, and usually has a close performance to the prediction using full data. Moreover, our
method even leads to better results than the prediction
using full data.
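A minimal sketch of the weight-selection step, with plain validation error standing in for the paper's unbiased risk estimator; here preds would hold the local machines' predictions stacked column-wise.

```python
import numpy as np
from scipy.optimize import minimize

def averaging_weights(preds, y):
    # preds: (n, M) candidate predictions; find simplex weights minimizing
    # squared prediction error (a stand-in for the unbiased risk criterion
    # in the paper, which adds a correction for overfitting).
    M = preds.shape[1]
    obj = lambda w: np.mean((y - preds @ w) ** 2)
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1},)
    res = minimize(obj, np.full(M, 1 / M), bounds=[(0, 1)] * M,
                   constraints=cons)
    return res.x

# combined prediction on new data: preds_new @ averaging_weights(preds, y)
```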
Optimal Weighted Random Forests
Dalei Yu
Xi'an Jiaotong University
Abstract: The random forest (RF) algorithm has become a very popular prediction method for its great
flexibility and promising accuracy. In RF, it is conventional to put equal weights on all the base learners
(trees) to aggregate their predictions. However, the
predictive performances of different trees within the
forest can be very different due to the randomization
of the embedded bootstrap sampling and feature selection. In this paper, we focus on RF for regression
and propose two optimal weighting algorithms, namely the 1-Step Optimal Weighted RF (1step-WRFopt) and the 2-Steps Optimal Weighted RF (2steps-WRFopt), that combine the base learners
through the weights determined by weight choice
criteria. Under some regularity conditions, we show
that these algorithms are asymptotically optimal in the
sense that the resulting squared loss and risk are asymptotically identical to those of the infeasible but
best possible weighted RF. Numerical studies conducted on real-world data sets and semi-synthetic data
sets indicate that these algorithms outperform the
equal-weight forest and two other weighted RFs proposed in the existing literature in most cases.
Joint work with Xinyu Chen, Xinyu Zhang.
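The following sketch illustrates the general idea of optimally weighting trees; it uses held-out validation error as a surrogate weight-choice criterion rather than the paper's 1step-WRFopt/2steps-WRFopt criteria:

    import numpy as np
    from scipy.optimize import nnls
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=600, n_features=10, noise=10.0,
                           random_state=0)
    Xtr, Xval, ytr, yval = train_test_split(X, y, test_size=0.3, random_state=0)

    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)

    # Predictions of every individual tree on the held-out set.
    P = np.column_stack([tree.predict(Xval) for tree in rf.estimators_])

    # Tree weights minimizing held-out squared error; in the paper the weights
    # come from dedicated weight-choice criteria rather than a held-out set.
    w, _ = nnls(P, yval)
    print("equal-weight MSE:", np.mean((P.mean(axis=1) - yval) ** 2))
    print("weighted MSE:    ", np.mean((P @ w - yval) ** 2))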
Invited Session IS079: The Interplay Between Statistical Inference and Data-Driven Decision Making
Controlling the False Discovery Rate in Transformations: Split Knockoffs
Yuan Yao
The Hong Kong University of Science and Technology
Abstract: Controlling the False Discovery Rate (FDR)
in a variable selection procedure is critical for reproducible discoveries and has been studied extensively in sparse linear models. However, it remains largely
open in the scenarios where the sparsity constraint is
not directly imposed on the parameters, but on a linear
transformation of the parameters to be estimated.
Examples include total variation, wavelet transforms, the fused LASSO, and trend filtering. In this work, we propose a data-adaptive FDR control method for this transformational sparsity setting, the Split Knockoff method. The proposed scheme exploits both variable and
data splitting. The linear transformation constraint is
relaxed to its Euclidean proximity in a lifted parameter space, yielding an orthogonal design for improved
power and orthogonal Split Knockoff copies. To
overcome the challenge that exchangeability fails due
to the heterogeneous noise brought by the transformation, new inverse supermartingale structures are developed to provably control the FDR with directional effects. Simulation experiments show that the
proposed methodology achieves desired (directional)
FDR and power. An application to an Alzheimer's Disease study shows that atrophied brain regions and their abnormal connections can be discovered from a structural Magnetic Resonance Imaging dataset (ADNI).
Joint work with Yang Cao and Xinwei Sun.
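For orientation, the standard knockoff+ filtering step that Split Knockoffs adapt to transformed parameters can be sketched as follows (the simulated statistics W are illustrative):

    import numpy as np

    def knockoff_threshold(W, q=0.2):
        # Knockoff+ threshold: smallest t with estimated FDP <= q, where large
        # positive W_j is evidence for a true effect. Split Knockoffs apply an
        # analogous step to the transformed parameters.
        for t in np.sort(np.abs(W[W != 0])):
            if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
                return t
        return np.inf

    rng = np.random.default_rng(4)
    W = np.concatenate([rng.normal(3.0, 1.0, 20),    # signals
                        rng.normal(0.0, 1.0, 80)])   # nulls, symmetric about 0
    t = knockoff_threshold(W)
    print("features selected:", int(np.sum(W >= t)))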
Balancing Personalization and Pooling: Decision-Making and Statistical Inference with Limited
Time Horizons
Yongyi Guo
University of Wisconsin-Madison
Abstract: In contrast to traditional clinical trials, digital health interventions facilitate adaptive personalized treatments delivered in near real-time to manage
health risks and promote healthy behaviors. Integrating Reinforcement Learning (RL) algorithms into
mHealth (mobile health) studies presents numerous
challenges; a critical one is the constrained time horizon, which leads to data scarcity and affects decision quality as well as the autonomy and stability of RL algorithms in practical applications. To address
this challenge, we propose a solution for online decision-making and post-study statistical inference. Leveraging the mixed-effects reward model in Thompson
sampling, we efficiently utilize user data to expedite
informed decision-making. The online algorithm
makes traditional statistical analysis of the treatment effect invalid: the user histories are not independent even if we assume the potential outcomes are i.i.d.
This is because the RL algorithm makes decisions
using pooled user information in addition to the user
state variables. We provide valid asymptotic confidence intervals for the average causal excursion effect
using the idea of decomposing the policy into "population statistics" and decisions based on "(expanded) user states". As an example, I will also present the
MiWaves clinical trial, which is an AI-based mobile
health intervention to reduce cannabis use amongst
emerging adults.
Joint work with Susobhan Ghosh, Pei-Yao Hung,
Lara Coughlin, Erin Bonar, Inbal Nahum-Shani,
Maureen Walton, Susan Murphy.
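The paper's algorithm pools users through a mixed-effects reward model; as a simplified single-user stand-in, here is Gaussian-linear Thompson sampling with conjugate posterior updates (all parameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    d, sigma2 = 3, 1.0
    theta_true = np.array([1.0, -0.5, 0.3])          # unknown reward coefficients

    A, b = np.eye(d), np.zeros(d)                    # N(0, I) prior, precision form
    for t in range(200):
        feats = rng.normal(size=(2, d))              # features of two candidate actions
        Sigma = np.linalg.inv(A)
        theta_draw = rng.multivariate_normal(Sigma @ b, Sigma)
        a = int(np.argmax(feats @ theta_draw))       # sample a belief, act greedily
        r = feats[a] @ theta_true + rng.normal(scale=np.sqrt(sigma2))
        A += np.outer(feats[a], feats[a]) / sigma2   # conjugate posterior update
        b += feats[a] * r / sigma2
    print("posterior mean:", np.round(np.linalg.solve(A, b), 2))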
Conformal Alignment: Knowing When to Trust
Foundation Models with Guarantees
Ying Jin
Stanford University
Abstract: Before deploying outputs from foundation
models in high-stakes tasks, it is imperative to ensure
that they align with human values. For instance, in
radiology report generation, reports generated by a
vision-language model must align with human evaluations before their use in medical decision-making. We
present Conformal Alignment, a general framework
for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that
on average, a prescribed fraction of selected units
indeed meet the alignment criterion, regardless of the
foundation model or the data distribution. Given any
pre-trained model and new units with model-generated outputs, Conformal Alignment leverages
a set of reference data with ground-truth alignment
status to train an alignment predictor. It then selects
new units whose predicted alignment scores surpass a
data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to
question answering and radiology report generation,
we demonstrate that our method is able to accurately
identify units with trustworthy outputs via lightweight
training over a moderate amount of reference data. En
route, we investigate the informativeness of various
features in alignment prediction and combine them
with standard models to construct the alignment predictor.
Joint work with Yu Gui, Zhimei Ren.
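A rough sketch of the conformal selection machinery underlying this guarantee, under an assumed binary alignment label and a logistic alignment predictor (all data-generating details are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)

    def simulate(n, d=5):
        X = rng.normal(size=(n, d))                  # features of model outputs
        aligned = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
        return X, aligned

    # Reference data with ground-truth alignment labels: train, then calibrate.
    Xr, ar = simulate(400)
    g = LogisticRegression().fit(Xr[:200], ar[:200])
    s_cal, a_cal = g.predict_proba(Xr[200:])[:, 1], ar[200:]

    Xnew, a_new = simulate(100)                      # a_new is unseen in practice
    s_new = g.predict_proba(Xnew)[:, 1]

    # Conformal p-values of new units against misaligned calibration units.
    s0 = s_cal[a_cal == 0]
    pvals = np.array([(1 + np.sum(s0 >= s)) / (len(s0) + 1) for s in s_new])

    # Benjamini-Hochberg selection at FDR level alpha.
    alpha, m = 0.2, len(pvals)
    order = np.argsort(pvals)
    passed = np.where(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    selected = order[:passed[-1] + 1] if passed.size else np.array([], dtype=int)
    print("selected units:", selected.size, "| realized FDP:",
          round(1 - a_new[selected].mean(), 2) if selected.size else 0.0)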
Large Covariance Matrix Estimation with Factor-Assisted Variable Clustering
Cheng Yu
Tsinghua University
Abstract: In the field of large covariance matrix estimation, several methods have been developed based
on the factor models, assuming the existence of a few
common factors that can explain the co-movement of
asset pricing. However, many studies have demonstrated the presence of a cross-sectional correlation
between assets after removing the common factors. To
account for this effect, we propose an approximate
observable factor model with latent cluster structure,
along with a three-step estimator to accurately estimate the large covariance matrix for high-dimensional
time series. The rates of convergence of the residual
covariance with latent cluster structure and the whole
large covariance matrix are studied under various
norms. Additionally, we introduce a novel ratio-based criterion for determining the latent cluster structure,
which can achieve clustering consistency with probability approaching one. The asymptotic results are
supported by simulation studies, and we demonstrate
the practical application of our approach through real
data analysis of minimum variance portfolio allocation.
Joint work with Dong Li, Xinghao Qiao.
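A toy version of the three-step idea, with observed factors, KMeans on residual correlations standing in for the paper's ratio-based clustering criterion, and a blockwise residual covariance (all simulation settings are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(7)
    T, N, K, G = 500, 30, 2, 3                       # time, series, factors, clusters
    F = rng.normal(size=(T, K))                      # observed common factors
    B = rng.normal(size=(N, K))                      # factor loadings
    labels_true = np.repeat(np.arange(G), N // G)
    U = rng.normal(size=(T, N))
    for c in range(G):                               # within-cluster residual comovement
        U[:, labels_true == c] += rng.normal(size=(T, 1))
    Y = F @ B.T + U

    # Step 1: regress each series on the factors and keep residuals.
    Bh = np.linalg.lstsq(F, Y, rcond=None)[0].T
    R = Y - F @ Bh.T

    # Step 2: recover the latent clusters from residual correlation profiles.
    labels = KMeans(n_clusters=G, n_init=10,
                    random_state=0).fit_predict(np.corrcoef(R.T))

    # Step 3: factor part plus blockwise residual covariance (zero across clusters).
    mask = labels[:, None] == labels[None, :]
    Sigma_hat = Bh @ np.cov(F.T) @ Bh.T + np.cov(R.T) * mask
    print("estimated covariance shape:", Sigma_hat.shape)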
Enhancing Decision Making with Causal Inference