The 2nd Joint Conference on Statistics and Data Science in China

approximate message passing approach to this goal. The proposed algorithm achieves both asymptotic minimum MSE under an idealized model and state-of-the-art practical performance on benchmark multi-modal single-cell datasets. Time permitting, we shall also discuss probabilistic querying of the resulting cell atlases.

Joint work with Sagnik Nandy.

Eigenvector Fluctuations and Limit Results for Graphon Estimation

Minh Tang

North Carolina State University

Abstract: We derive error bounds in the two-to-infinity norm as well as row-wise normal approximations for the leading eigenvectors U of an inhomogeneous Erdős–Rényi random graph whose edge probability matrix is generated from a kernel that, when viewed as an integral operator, can have infinite rank. We apply these results to the hypothesis testing problem of whether two vertices i and j in an inhomogeneous Erdős–Rényi graph A have the same latent positions, and we propose a test statistic based on the Euclidean distance between the ith and jth rows of U that converges in distribution to a weighted sum of independent chi-square random variables under the null hypothesis.
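
To fix ideas, a minimal numerical sketch of the statistic described above, assuming an adjacency spectral embedding of a fixed dimension d (here d = 2) and a toy two-block graph; the weighted chi-square calibration of the null distribution derived in the work is not reproduced.

```python
import numpy as np

def two_vertex_statistic(A, i, j, d):
    """Squared Euclidean distance between rows i and j of the matrix U of
    d leading eigenvectors of the adjacency matrix A. Calibration against
    the weighted chi-square null from the paper is omitted here."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]   # d leading eigenpairs by magnitude
    U = vecs[:, idx]
    return float(np.sum((U[i] - U[j]) ** 2))

# toy usage: a two-block stochastic block model
rng = np.random.default_rng(0)
n, p_in, p_out = 200, 0.6, 0.2
P = np.full((n, n), p_out)
P[: n // 2, : n // 2] = p_in
P[n // 2 :, n // 2 :] = p_in
A = np.triu(rng.binomial(1, P), 1)
A = A + A.T                                    # symmetric, no self-loops
print(two_vertex_statistic(A, 0, 1, d=2))      # same block: small
print(two_vertex_statistic(A, 0, n - 1, d=2))  # different blocks: larger
```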

Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift

Kaizheng Wang

Columbia University

Abstract: We develop and analyze a principled approach to kernel ridge regression under covariate shift.

The goal is to learn a regression function with small

mean squared error over a target distribution, based on

unlabeled data from there and labeled data that may

have a different feature distribution. We propose to

split the labeled data into two subsets and conduct

kernel ridge regression on them separately to obtain a

collection of candidate models and an imputation

model. We use the latter to fill the missing labels and

then select the best candidate model accordingly. Our

non-asymptotic excess risk bounds show that in quite

general scenarios, our estimator adapts to the structure

of the target distribution as well as the covariate shift.

It achieves the minimax optimal error rate up to a

logarithmic factor. The use of pseudo-labels in model

selection does not have major negative impacts.
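
A rough sketch of the split-and-impute idea described in the abstract, assuming scikit-learn's KernelRidge; the RBF kernel, the equal split, the alpha grid, and the rule for choosing the imputation model are placeholder assumptions, not the choices analyzed in the paper.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def pseudo_label_krr(X_lab, y_lab, X_tar, alphas=(1e-3, 1e-2, 1e-1, 1.0)):
    """Illustrative sketch: fit candidate KRR models on one half of the labeled
    source data, fit an imputation model on the other half, pseudo-label the
    unlabeled target covariates, and pick the candidate with the smallest
    pseudo risk. Kernel, split ratio and alpha grid are placeholders."""
    n = len(y_lab)
    idx = np.random.permutation(n)
    i1, i2 = idx[: n // 2], idx[n // 2 :]

    candidates = [KernelRidge(kernel="rbf", alpha=a).fit(X_lab[i1], y_lab[i1])
                  for a in alphas]

    # imputation model fitted on the second half fills the missing target labels
    imputer = KernelRidge(kernel="rbf", alpha=min(alphas)).fit(X_lab[i2], y_lab[i2])
    pseudo_y = imputer.predict(X_tar)

    # select the candidate with the smallest mean squared error on pseudo-labels
    risks = [np.mean((m.predict(X_tar) - pseudo_y) ** 2) for m in candidates]
    return candidates[int(np.argmin(risks))]
```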

Invited Session IS014: Dynamic and Reinforcement Learning

Multi-Objective Tree-Based Reinforcement

Learning for Estimating Tolerant Dynamic Treatment Regimes

Lu Wang

University of Michigan

Abstract: A dynamic treatment regime (DTR) is a

sequence of treatment decision rules that dictate individualized treatments based on evolving treatment and

covariate history. It provides a vehicle for optimizing

a clinical decision support system and fits well into

the broader paradigm of personalized medicine.

However, many real-world problems involve multiple

competing priorities, and decision rules differ when

trade-offs are present. Correspondingly, there may be

more than one feasible decision that leads to empirically sufficient optimization. In this talk, we present the concept of a "tolerant regime," which provides a set of individualized feasible decision rules under a prespecified tolerance rate. We then demonstrate a couple of recently developed methods, including multi-objective tree-based reinforcement learning (MOT-RL), to directly estimate the tolerant DTR (tDTR) that optimizes multiple objectives in a multi-stage, multi-treatment setting. The algorithms are implemented in a backward inductive manner through multiple decision stages, and they estimate the optimal DTR and tDTR depending on the decision-maker's preferences. The

proposed methods for multi-objective reinforcement

learning are robust, efficient, easy-to-interpret, and

flexible in various settings.

Joint work with Yao Song and Chang Wang.

Interim Analysis in Sequential Multiple Assignment Randomized Trials for Survival Outcomes

Yu Cheng

University of Pittsburgh

Abstract: Sequential Multiple Assignment Randomized Trials (SMARTs) have been conducted to mimic

the actual treatment processes experienced by physicians and patients in clinical settings and inform

comparative effectiveness of adaptive treatment strategies (ATSs). In a SMART design, patients are involved in multiple stages of treatment, and the treatment assignment is adapted over time based on the

patient's characteristics such as disease status and

treatment history. In this work, we develop and evaluate statistically valid interim monitoring (IM) approaches to allow for early termination of SMART

trials for efficacy and/or futility for survival outcomes.

The development is nontrivial. First, in comparing

estimated survival rates from different ATSs, log-rank

statistics need to be carefully weighted to account for

overlapping treatment paths. At a given time point, we

can then test for the null hypothesis of no difference

among all ATSs based on a weighted log-rank

Chi-square statistic. With multiple stages, the number

of ATSs is much larger than the number of treatments

involved in a typical randomized trial, resulting in

many parameters to estimate for the variance matrix

of the weighted log-rank statistics. More challengingly,

for IM, we need to quantify how the log-rank statistics

at two different time points are correlated, and each

component of the covariance matrix depends on a

mixture of event processes which can jump at multiple

time points due to the nature of multiple assignments.

The covariance matrix is meticulously derived based

on martingale theory. Efficacy boundaries at multiple

interim analyses can then be established using the

Pocock boundary, the O'Brien–Fleming (OBF) boundary, or some pre-specified continuous error spending

function. We run extensive simulations to evaluate

and compare type I error for our proposed weighted

log-rank Chi-square statistic for ATSs under different

boundary specifications.

Joint work with Zi Wang and Abdus Wahed.

Functional Linear Operator Quantile Regression

for Sparse Longitudinal Data

Xingcai Zhou

Nanjing Audit University

Abstract: We propose a functional linear operator

quantile regression (FLOQR) framework, which includes many important and useful functional data

models, and apply the new framework to longitudinal data with typically sparse and irregular designs. The non-smooth quantile loss and functional linear operator pose new challenges to functional data analysis for longitudinal data in both computation and theoretical development. To address the

challenge, we propose the iterative surrogate least

squares estimation approach for the FLOQR model,

which transforms the response trajectories and establishes a new connection between FLOQR and functional linear operator model. In addition, we use

Karhunen–Loève expansion to alleviate the problem

of the nonexistence of the inverse of the covariance in

the infinite-dimensional Hilbert space. Then, the approach is applied to classic functional varying coefficient

QR, functional linear QR and functional varying coefficient QR with history index function for sparse longitudinal data by using functional principal components analysis through conditional expectation. The

resulting technique is flexible and allows the prediction of an unobserved quantile response trajectory

from sparse measurements of a predictor trajectory.

Theoretically, we show that, after a constant number of iterations, the proposed estimator is asymptotically consistent for sparse designs. Moreover, asymptotic

pointwise confidence bands are obtained for predicted

quantile individual trajectories based on their asymptotic distributions. The proposed algorithms perform

well in simulations, and are illustrated with longitudinal primary biliary liver cirrhosis data and time-course

gene expression data for the yeast cell cycle.

Joint work with Tingyu Lai and Linglong Kong.

Invited Session IS043: Panel Data and Microeconometrics

On the Inconsistency of Cluster-Robust Inference

and How Subsampling Can Fix It

Yulong Wang

Syracuse University

Abstract: Conventional methods of cluster-robust

inference are inconsistent in the presence of unignorably large clusters. We formalize this claim by establishing a necessary and sufficient condition for the

consistency of the conventional methods. We find that this consistency condition is rejected for a majority of empirical research papers. In this light, we propose a novel score subsampling method, which remains robust even under conditions where the conventional methods fail. Simulation studies support these claims.

With real data used by an empirical paper, we showcase that the conventional methods conclude significance while our proposed method concludes insignificance.

Joint work with Harold Chiang, Yuya Sasaki.

Three-Dimensional Heterogeneous Panel Data

Models with Multi-Level Interactive Fixed Effects

Xun Lu

The Chinese University of Hong Kong

Abstract: We consider a three-dimensional (3D) panel data model with heterogeneous slope coefficients

and multi-level interactive fixed effects consisting of

latent global factors and two types of local factors.

Our model nests many commonly used 3D panel data

models. We propose an iterative estimation procedure

that relies on initial consistent estimators obtained

through a novel defactored approach. We study the

asymptotic properties of our estimators and show that

our iterative estimators of the slope coefficients are

\"oracle efficient\" in the sense that they are asymptotically equivalent to those when the factors were known.

Some specification testing issues are also considered.

Monte Carlo simulations demonstrate that our estimators and tests perform well in finite samples. We

apply our new method to the international trade dataset.

DeepMed: Semiparametric Causal Mediation

Analysis with Debiased Deep Learning

Zhonghua Liu

Columbia University

Abstract: Causal mediation analysis can unpack the

black box of causality and is therefore a powerful tool

for disentangling causal pathways in biomedical and

social sciences, and also for evaluating machine

learning fairness. To reduce bias for estimating Natural Direct and Indirect Effects in mediation analysis,

we propose a new method called DeepMed that uses

deep neural networks (DNNs) to cross-fit the infinite-dimensional nuisance functions in the efficient

influence functions. We obtain novel theoretical results that our DeepMed method (1) can achieve semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and (2) can

adapt to certain low-dimensional structures of the

nuisance functions, significantly advancing the existing literature on DNN-based semiparametric causal

inference. Extensive synthetic experiments are conducted to support our findings and also expose the gap

between theory and practice. As a proof of concept,

we apply DeepMed to analyze two real datasets on

machine learning fairness and reach conclusions consistent with previous findings.

Smoothed Quantile Regression for Panel Data with

Mixed Group Structure

Haiqi Li

Hunan University

Abstract: This study aims to identify and estimate the

common parameters and latent grouped heterogeneity

in a panel quantile regression model. The quantile

slope coefficients possess a mixed group structure;

that is, only an unknown part of regression coefficients share common values across groups. To address

the non-differentiable objective function, we propose

a grouped effect smoothed quantile regression

(GE-SQR) method to identify the latent group structure and common parameters simultaneously. We

demonstrate that the proposed GE-SQR method consistently recovers the group structures and identifies

the common parameters. However, the post-selection

estimator is shown to asymptotically follow a normal

distribution with an asymptotic bias. Thus, we propose two bias-corrected estimators using analytical

and half-panel jackknife methods such that they asymptotically follow a zero-mean normal distribution.

Monte Carlo simulation studies show that the proposed methods have superior finite-sample performance. An empirical application to cross-country

economic growth illustrates the practical merits of the proposed GE-SQR methods.

Joint work with Xingyi Chen, Yongmiao Hong, Zhijie Xiao.

Invited Session IS034: Modern Statistical Methods

for Causal Inference

Multiply Robust Off-Policy Evaluation and

Learning under Truncation by Death

Shu Yang

North Carolina State University

Abstract: Typical off-policy evaluation (OPE) and

off-policy learning (OPL) are not well-defined problems under "truncation by death", where the outcome

of interest is not defined after some events, such as

death. The standard OPE no longer yields consistent

estimators, and the standard OPL results in suboptimal

policies. In this paper, we formulate OPE and OPL

using principal stratification under "truncation by death". We propose a survivor value function for a

subpopulation whose outcomes are always defined

regardless of treatment conditions. We establish a

novel identification strategy under principal ignorability, and derive the semiparametric efficiency bound of

an OPE estimator. Then, we propose multiply robust

estimators for OPE and OPL. We show that the proposed estimators are consistent and asymptotically

normal even with flexible semi/nonparametric models

for nuisance functions approximation. Moreover, under mild rate conditions of nuisance functions approximation, the estimators achieve the semiparametric efficiency bound. Finally, we conduct experiments

to demonstrate the empirical performance of the proposed estimators. If time permits, I will discuss policy

learning without the typical positivity condition.

Joint work with Jianing Chu, Wenbin Lu.

Combining Probability and Non-probability Samples Using Semi-parametric Quantile Regression

and a Non-parametric Estimator of the Participation Probability

Cindy Yu

Iowa State University

Abstract: Non-probability samples are prevalent in

various fields, such as biomedical studies, educational

research, and business investigations, owing to the

escalating challenges associated with declining response rates and the cost-effectiveness and convenience of utilizing such samples. However, relying on

naive estimates derived from non-probability samples,

without adequate adjustments, may introduce bias into

study outcomes. Addressing this concern, data integration methodologies, which amalgamate information from both probability and non-probability

samples, have demonstrated effectiveness in mitigating selection bias. Commonly employed data integration approaches encompass mass imputation, propensity score weighting, and hybrid methodologies.

Nonetheless, the efficacy of these methods hinges

upon the assumptions underlying the models. This

paper introduces innovative and robust data integration approaches, notably a semi-parametric quantile

regression-based mass imputation approach and a

doubly robust approach that integrates a

non-parametric estimator of the participation probability for non-probability samples. Our proposed

methodologies exhibit greater robustness compared to

existing parametric approaches, particularly concerning model misspecification and outliers. Theoretical

results are established, including variance estimators

for our proposed estimators. Through comprehensive

simulation studies and real-world applications, our

findings demonstrate the promising performance of

the proposed estimators in reducing selection bias and

facilitating valid statistical inference. This research

contributes to the advancement of robust methodologies for handling non-probability samples, thereby

enhancing the reliability and validity of research outcomes across diverse domains.

Joint work with Emily Berg, Sixia Chen.

Discovery and Inference of Possibly Bi-directional

Causal Relationships with Invalid Instrumental

Variables

Wei Li

Renmin University of China

Abstract: Learning causal relationships between pairs

of complex traits from observational studies is of great

interest across various scientific domains. However, most existing methods assume the absence of unmeasured confounding and restrict causal relationships between two traits to be uni-directional, which

may be violated in real-world systems. In this paper,

we address the challenge of causal discovery and effect inference for two traits while accounting for unmeasured confounding and potential feedback loops.

By leveraging possibly invalid instrumental variables,

we provide sufficient identification conditions for

causal parameters in a model that allows for

bi-directional relationships, and we also establish identifiability of the causal direction under the

introduced conditions. Then we propose a data-driven

procedure to detect the causal direction and provide

inference results about causal effects along the identified direction. We show that our method consistently

identifies the true direction and produces valid confidence intervals for the causal effect. We conduct extensive simulation studies to show that our proposal

outperforms existing methods. We finally apply our

method to analyze real data sets from UK Biobank.

Generalized Entropy Calibration for Doubly Robust Estimation under Missing at Random

Jae-kwang Kim

Iowa State University

Abstract: We introduce a new class of calibration

methods that use generalized entropy to achieve double robustness and improve efficiency.

The calibration weights are derived by maximizing

generalized entropy, defined as the sum of weighted

convex functions. The proposed generalized entropy

calibration (GEC) method surpasses the efficiency of

traditional doubly robust estimators due to the enhanced stability of the calibration weights. It also

maintains double robustness by altering the calibration

constraint initially developed in the empirical likelihood. By taking into account variance components for

heterogeneous models, we seek to reflect the unequal

variance in the context of missing data analysis.

Moreover, we consider a particular class of entropy,

Rényi entropy, and aim to identify the optimal entropy

by using a tuning parameter determined by K-fold cross-validation.

Joint work with Yonghyun Kwon, Yumou Qiu.

Invited Session IS018: Financial Big Data

The Fine Structure of Volatility Dynamics

Carsten Chong

The Hong Kong University of Science and Technology

Abstract: We develop a nonparametric test for deciding whether volatility of an asset follows a standard

semimartingale process, with paths of finite quadratic

variation, or a rough process with paths of infinite

quadratic variation. The test utilizes the fact that volatility is rough if and only if volatility increments are

negatively autocorrelated at high frequencies. It is

based on the sample autocovariance of increments of

spot volatility estimates computed from

high-frequency asset return data. By showing a feasible CLT for this statistic under the null hypothesis of

semimartingale volatility paths, we construct a test

with fixed asymptotic size and an asymptotic power

equal to one. The test is derived under very general

conditions for the data-generating process. In particular, it is robust to jumps with arbitrary activity and to

the presence of market microstructure noise. In an

application of the test to SPY high-frequency data, we

find evidence for rough volatility.

Joint work with Viktor Todorov.
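
For intuition only, a toy version of the ingredient the test is built on, assuming clean, equally spaced log returns; the block size, the naive local realized-variance estimator, and the lag-one autocovariance below are illustrative assumptions, and the noise- and jump-robust construction and feasible CLT from the paper are not reproduced.

```python
import numpy as np

def volatility_increment_autocovariance(returns, k):
    """Form local spot variance estimates from blocks of k high-frequency
    returns, take their increments, and compute the lag-one sample
    autocovariance (negative values point toward rough volatility in the
    sense described above). Purely a toy sketch, not the paper's statistic."""
    r = np.asarray(returns)
    n_blocks = len(r) // k
    spot_var = (r[: n_blocks * k] ** 2).reshape(n_blocks, k).sum(axis=1)  # local realized variance
    inc = np.diff(spot_var)                                               # volatility increments
    return float(np.mean(inc[1:] * inc[:-1]))                             # lag-1 autocovariance
```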

Estimating the Efficient Frontier

Yingying Li

The Hong Kong University of Science and Technology

Abstract: This paper introduces an estimator named

CORE (Constrained sparse Regression for Efficient

portfolio), designed to estimate mean-variance efficient portfolios with all risky assets in high dimensions. Collectively, these portfolios form an estimated

efficient frontier. Specifically, we decompose the optimal portfolio on individual assets and factors into a

portfolio on factors and a portfolio on idiosyncratic

components. We apply linear constrained LASSO in

our estimation by deriving a constrained regression

problem, where the solution is the portfolio weight

vector on idiosyncratic returns. We establish theoretical support for the portfolios' asymptotic

mean-variance efficiency. Extensive simulation and

empirical studies are conducted to thoroughly evaluate

the performance of our proposed method.

Joint work with Leheng Chen, Xinghua Zheng.

Can Machines Learn Weak Signals?

Dacheng Xiu

University of Chicago

Abstract: In high-dimensional regression scenarios

with low signal-to-noise ratios, we assess the predictive performance of several prevalent machine learning algorithms. Theoretical insights show Ridge regression's superiority in exploiting weak signals, surpassing a zero benchmark. In contrast, Lasso fails to

exceed this baseline, indicating its learning limitations.

Simulations reveal that Random Forest generally outperforms Gradient Boosted Regression Trees when

signals are weak. Moreover, Neural Networks with

ℓ2-regularization excel in capturing nonlinear functions of weak signals. Our empirical analysis across

six economic datasets suggests that the weakness of

signals, not necessarily the absence of sparsity, may

be Lasso's major limitation in economic predictions.

Joint work with Zhouyu Shen.
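
A small self-contained simulation in the spirit of the comparison above, assuming a dense weak-signal linear model; the dimensions, signal-to-noise ratio, and penalty levels are arbitrary choices, so the output only illustrates the qualitative contrast with the zero benchmark rather than the paper's results.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
n, p, snr = 200, 500, 0.2                  # low signal-to-noise, dense weak coefficients
beta = rng.normal(0, 1, p)
beta *= np.sqrt(snr / (beta @ beta))       # scale so that var(X beta) is about snr

def simulate(m):
    X = rng.normal(size=(m, p))
    y = X @ beta + rng.normal(size=m)      # noise variance 1
    return X, y

X_tr, y_tr = simulate(n)
X_te, y_te = simulate(5000)

for name, model in [("ridge", Ridge(alpha=50.0)), ("lasso", Lasso(alpha=0.1))]:
    model.fit(X_tr, y_tr)
    mse = np.mean((model.predict(X_te) - y_te) ** 2)
    zero_mse = np.mean(y_te ** 2)          # the "predict zero" benchmark
    print(f"{name}: out-of-sample MSE {mse:.3f} vs zero benchmark {zero_mse:.3f}")
```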

Estimation of Out-of-Sample Sharpe Ratio for

High Dimensional Portfolio Optimization

Weicheng Wang

The University of Hong Kong

Abstract: Portfolio optimization aims at constructing

a realistic portfolio with significant out-of-sample

performance, which is typically measured by the

out-of-sample Sharpe ratio. However, due to

in-sample optimism, it is inappropriate to use the

in-sample estimated covariance to evaluate the

out-of-sample Sharpe, especially in the high dimensional settings. In this paper, we propose a novel

method to estimate the out-of-sample Sharpe ratio

using only in-sample data, based on random matrix

theory. Furthermore, portfolio managers can use the

estimated out-of-sample Sharpe as a criterion to decide the best tuning for constructing their portfolios.

Specifically, we consider the classical framework of

Markowitz mean-variance portfolio optimization with known mean vector and the high-dimensional regime of p/n → c ∈ (0, ∞), where p is the portfolio dimension and n is the number of samples or time

points. We propose to correct the sample covariance

by a regularization matrix and provide a consistent

estimator of its Sharpe ratio. The new estimator works

well under any of three conditions: (1) bounded

covariance spectrum, (2) arbitrary number of diverging spikes when c < 1, and (3) fixed number of diverging spikes when c ≥ 1. We can also extend the

results to construct global minimum variance portfolio

and correct out-of-sample efficient frontier. We

demonstrate the effectiveness of our approach through

comprehensive simulations and real data experiments.

Our results highlight the potential of this methodology

as a powerful tool for portfolio optimization in high

dimensional settings.

Joint work with Xuran Meng, Yuan Cao.
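
To fix notation, a minimal sketch of the plug-in quantities involved, assuming i.i.d. return rows and a known mean vector mu; the ridge level lam and the gross-exposure normalization are placeholders, and the random-matrix-theory correction that yields a consistent out-of-sample Sharpe estimator is not shown here.

```python
import numpy as np

def regularized_portfolio_and_insample_sharpe(R, mu, lam):
    """R : n x p matrix of returns, mu : known mean vector, lam : ridge level.
    Returns mean-variance weights built from the regularized sample covariance
    and the (optimistic) in-sample Sharpe ratio; correcting this in-sample
    optimism is the subject of the paper and is not reproduced here."""
    n, p = R.shape
    S = np.cov(R, rowvar=False)              # sample covariance (p x p)
    S_reg = S + lam * np.eye(p)              # ridge-regularized covariance
    w = np.linalg.solve(S_reg, mu)           # unnormalized mean-variance weights
    w = w / np.sum(np.abs(w))                # normalize gross exposure (a convention)
    insample_sharpe = (w @ mu) / np.sqrt(w @ S @ w)
    return w, float(insample_sharpe)
```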

Invited Session IS069: Statistical Inference Beyond

Euclidean Spaces

Multiple Sample Studentized Tests: Random-Lifter

Approach

Rouling Wang

East China Normal University

Abstract: Measuring discrepancies among complex

random objects is a fundamental challenge in scientific discovery and statistical inference. Traditional

statistical test procedures often yield test statistics that

converge to complex asymptotic null distributions,

such as second-order Wiener chaos, requiring computationally intensive approximation or permutation

techniques to establish rejection regions. This work

introduces a novel approach, the 'Random-Lifter,' to

generate test statistics that approach standard normal

limits under null hypotheses without sample splitting.

We demonstrate through numerical simulations and

real-data analysis that the Random-Lifter method not

only simplifies the testing process but also maintains

robust competitiveness with minimal adjustments

compared to existing methods.

Joint work with Xueqin Wang, Baisuo Jin.

Variational Bayesian Logistic Tensor Regression

with Application in Image Recognition

Yanqing Zhang

Yunnan University

Abstract: Image recognition is an important research

direction and has attracted a lot of attention in various

fields including video surveillance, biometric identification, unmanned vehicles, human-computer interaction, and medical image recognition. Existing

recognition methods often ignore structural information of image data or depend heavily on sample

size of image data. To make full use of the prior

structural information of image data in limited sample size, we develop a novel variational Bayesian

method on logistic tensor model for classification with

tensor predictors. Specifically, we build a logistic

tensor regression model and adopt tensor decomposition to approximate tensor regression. We develop a

variational Bayesian approach with multiway shrinkage priors for marginal factor vectors of tensor coefficients to obtain a sparse tensor estimator and build a predictive density approximation based on variational posteriors for classification prediction. The key idea of

the proposed method is to use variational Bayesian

approach and tensor regression model for combining

the prior structural information of tensor data efficiently and to adopt the matricization of tensor decomposition for simplifying the complexity of tensor

coefficient estimating. Moreover, we explore some

simulation studies and flower image and chest X-ray

image recognition analyses to assess the classification

performance of the proposed method.

Joint work with Yunzhi Jin, Niansheng Tang.

Functional Clustering for Longitudinal Associations between Social Determinants of Health and

Stroke Mortality in the US

Hui Huang

Renmin University of China

Abstract: Understanding the longitudinally changing

associations between Social Determinants of Health

(SDOH) and stroke mortality is essential for effective

stroke management. Previous studies have uncovered

significant regional disparities in the relationships

between SDOH and stroke mortality. However, existing studies have not utilized longitudinal associations

to develop data-driven methods for regional division

in stroke control. To fill this gap, we propose a novel

clustering method to analyze SDOH–stroke mortality

associations in US counties. To enhance the interpretability and statistical efficiency of the clustering outcomes, we introduce a new class of smoothness-sparsity pursued penalties for simultaneous clustering and variable selection in longitudinal associations. As a result, we can identify crucial SDOH that

contribute to longitudinal changes in stroke mortality.

This facilitates the clustering of US counties into different regions based on the relationships between

these SDOH and stroke mortality. The effectiveness of

our proposed method is demonstrated through extensive numerical studies. By applying our method to

longitudinal data on SDOH and stroke mortality at the

county level, we identify 18 important SDOH for

stroke mortality and divide the US counties into two

clusters based on these selected SDOH. Our findings

unveil complex regional heterogeneity in the longitudinal associations between SDOH and stroke mortality,

providing valuable insights into region-specific

SDOH adjustments for mitigating stroke mortality.

Joint work with Fangzhi Luo, Jianbin Tan.

Multiple Threshold Detection and Subgroup Identification

Baisuo Jin

University of Science and Technology of China

Abstract: We propose an improved algorithm based

on parameter estimation of negative binomial distribution. By combining the MM algorithm with

non-convex penalty group coordinate descent, we

simplify the iterative steps for estimating the mean

and dispersion parameters, and provide a continuous

path solution for the parameters. Subsequently, we

extend the method to the negative binomial multi-threshold model. Using the two-stage method, we

transform the multi-threshold problem into a variable

selection problem and conduct multi-threshold detection using the improved algorithm. We also focus on

the multi-threshold change-plane subgroup model by the two-stage method. Additionally, we establish consistency of the estimated number of thresholds and upper bounds

for parameter estimation.

Invited Session IS041: Omics and Big Data in

Medical Research (组学与医学大数据)

Deconvoluting Cell State Distribution From Bulk

RNA-Seq Data

Jian Yang

Westlake University

Abstract: Deconvoluting cell-state abundances from

bulk RNA-seq data can add considerable value to

existing data, but achieving fine-resolution and

high-accuracy deconvolution remains a challenge.

Here, we introduce MeDuSA, a mixed model-based

method that leverages single-cell RNA-seq data as a

reference to estimate cell-state abundances along a

one-dimensional trajectory in bulk RNA-seq data. The

advantage of MeDuSA lies primarily in estimating

cell abundance in each state while fitting the remaining cells of the same type individually as random

effects. Extensive simulations and real-data benchmark analyses demonstrate that MeDuSA greatly improves the estimation accuracy over existing methods

for one-dimensional trajectories. Applying MeDuSA

to cohort-level RNA-seq datasets reveals associations

of cell-state abundances with disease or treatment

conditions and cell-state-dependent genetic control of

transcription. Our study provides a high-accuracy and

fine-resolution method for cell-state deconvolution

along a one-dimensional trajectory and demonstrates

its utility in characterizing the dynamics of cell states

in various biological processes.

Conditional Transcriptome-Wide Association

Study for Fine-Mapping Candidate Causal Genes

Zhongshang Yuan

Shandong University

Abstract: Transcriptome-wide association studies

(TWASs) aim to integrate genome-wide association

studies with expression-mapping studies to identify

genes with genetically predicted expression (GReX)

associated with a complex trait. In the present report,

we develop a method, GIFT (gene-based integrative

fine-mapping through conditional TWAS), that performs conditional TWAS analysis by explicitly controlling for GReX of all other genes residing in a local

region to fine-map putatively causal genes. GIFT is

frequentist in nature, explicitly models both expression correlation and cis-single nucleotide polymorphism linkage disequilibrium across multiple genes

and uses a likelihood framework to account for expression prediction uncertainty. As a result, GIFT

produces calibrated P values and is effective for fine-mapping. We apply GIFT to analyze six traits in

the UK Biobank, where GIFT narrows down the set

size of putatively causal genes by 32.16–91.32%

compared with existing TWAS fine-mapping approaches. The genes identified by GIFT highlight the

importance of vessel regulation in determining blood

pressures and lipid metabolism for regulating lipid

levels.

Statistical Inference on Road Traffic Injury Based

on the Media Data

基于媒体报道的道路交通伤害伤亡数据的统计推断

Guoqing Hu

Central South University

Abstract: Objective: To build statistical inference models for the number of road traffic crashes based on large-scale online text data, and to provide methodological support for statistical inference from biased online text big data on road traffic injury.

Methods: Using a previously built online big-data platform for road traffic injury, we automatically crawled and structured information on media-reported road traffic crashes from September 2019 to December 2022 and assessed the stability of each news publisher's posting volume. Taking the monthly number of road traffic crashes reported by qualified media outlets as the predictor and the monthly number of road traffic crashes that actually occurred as the outcome, we evaluated the feasibility and performance of existing non-probability-sample inference models, traditional statistical inference models, and machine learning models for inferring the number of road traffic crashes. The coefficient of determination (R2) was used to select the best model for the whole country and for each region, and the selected models were used to estimate the number of road traffic crashes over the following 9 months.

Results: (1) After screening, 695 news outlets met the criterion of stable posting volume; publicly available predictor information covered only the whole country and 11 regions including Hunan Province, so inference models were built only for these regions. (2) After model selection, the monthly number of road traffic crashes could be estimated with neural network models for the whole country (R2 = 0.99) and for Guangdong (R2 = 0.94), Hebei (R2 = 0.90), Henan (R2 = 0.94), Hubei (R2 = 0.94), Jiangsu (R2 = 0.94), and the Guangxi Zhuang Autonomous Region (R2 = 0.96), and with regression tree models for Sichuan (R2 = 0.93), Zhejiang (R2 = 0.91), Shandong (R2 = 0.82), Shaanxi (R2 = 0.88), and Hunan (R2 = 0.96). (3) The model estimates indicated that over the next 9 months the monthly number of road traffic crashes will trend upward for the whole country and for Hunan, Zhejiang, Sichuan, Hubei, Hebei, Shaanxi, and Shandong, and downward for Guangdong, Henan, Jiangsu, and the Guangxi Zhuang Autonomous Region.

Conclusion: The models identified in this study suit the characteristics of online text big data, perform well, and have strong potential for application.

Joint work with Min Zhao, Peixia Cheng, Wangxin

Xiao, Lei Yang, Shuying Zhao.

A Study of Confounder Adjusting Methods in ITE

Estimation Based on CF under High Dimensions

Yang Zhao

Nanjing Medical University

Abstract: We proposed the weighted causal forest

(wCF) to correct for the confounding effects to obtain

a debiased estimation of individual treatment effect

(ITE) in high dimensions. With the assumption of

complex confounding scenario (e.g., confounders

affecting both predictors and treatment effects), we

investigate and demonstrate the advantages of introducing the inverse probability weights (IPW) method

on each node of CF to improve the node splitting rule

over other existing models in terms of the identification of causal effect modifiers and the accuracy of ITE

statistical inference. Furthermore, we propose an analytical framework and provide a visualization platform

for filtering pre-treatment covariates and defining

heterogeneous subgroup simultaneously. Desirable performance is demonstrated through simulations of the proposed method under different degrees of confounding, compared with the previous study. This method is applied to an empirical study

concerning the identification of potential predictive

biomarkers and heterogeneous treatment effect (HTE)

in advanced squamous non-small cell lung cancer.

Invited Session IS042: Optimality Consideration in

Modern Statistical Inference

Residual Permutation Test for High-Dimensional

Regression Coefficient Testing

Tengyao Wang

London School of Economics and Political Science

Abstract: We consider the problem of testing whether

a single coefficient is equal to zero in fixed-design

linear models under a moderately high-dimensional

regime, where the dimension of covariates ? is allowed to be in the same order of magnitude as sample

size ?. In this regime, to achieve finite-population

validity, existing methods usually require strong distributional assumptions on the noise vector (such as

Gaussian or rotationally invariant), which limits their

applications in practice. In this paper, we propose a

new method, called residual permutation test (RPT),

which is constructed by projecting the regression residuals onto the space orthogonal to the union of the

column spaces of the original and permuted design

matrices. RPT can be proved to achieve finite-population size validity under fixed design with

just exchangeable noises, whenever p < n/2. Moreover, RPT is shown to be asymptotically powerful for heavy-tailed noises with bounded (1 + t)-th order moment when the true coefficient is at least of order n^{-t/(1+t)} for t ∈ [0,1]. We further prove that this

signal size requirement is essentially rate-optimal in

the minimax sense. Numerical studies confirm that

RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.

Joint work with Kaiyue Wen, Yuhao Wang.
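
A rough structural sketch of the projection step described in the abstract, assuming the design is split into a covariate of interest x and the remaining columns Z; how the projected quantities are combined across permutations into a finite-population-valid test is specified in the paper and is not reproduced here.

```python
import numpy as np

def projected_alignment(y, x, Z, perm):
    """One ingredient of the RPT construction described above: project onto the
    orthogonal complement of the union of the column spaces of Z and its
    row-permuted copy Z[perm], then measure the alignment between the projected
    covariate of interest and the projected response (the residual after that
    union is removed). Requires p < n/2 so the complement is nontrivial; the
    actual test statistic and p-value construction follow the paper."""
    M = np.hstack([Z, Z[perm]])
    Q, _ = np.linalg.qr(M)                    # orthonormal basis of col([Z, Z_perm])
    P = np.eye(len(y)) - Q @ Q.T              # projector onto its orthogonal complement
    return float((P @ x) @ (P @ y))

# toy usage with one random permutation under the null (x has no effect)
rng = np.random.default_rng(0)
n, p = 100, 10
Z, x = rng.normal(size=(n, p)), rng.normal(size=n)
y = Z @ rng.normal(size=p) + rng.normal(size=n)
print(projected_alignment(y, x, Z, rng.permutation(n)))
```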

Inference for Changing Periodicity, Smooth Trend

and Covariate Effects in Nonstationary Time Series

Lucy Xia

The Hong Kong University of Science and Technology

Abstract: Traditional analysis of a periodic time series assumes its pattern remains the same over the

entire time range. However, some recent empirical

studies in climatology and other fields find that the

amplitude may change over time, and this has important implications. We develop a formal procedure

to detect and estimate change-points in the periodic

pattern. Often, there is also a smooth trend, and sometimes the period is unknown, with potential other

covariate effects. Based on a new model that takes all

of these factors into account, we propose a three-step

estimation procedure to accurately estimate the unknown period, change-points, and varying amplitude

in the periodic component, as well as the trend and the

covariate effects. First, we adopt penalized segmented

least squares estimation for the unknown period, with

the trend and covariate effects approximated by

B-splines. Then, given the period estimate, we construct a novel SupF statistic and use it in binary segmentation to estimate change-points in the periodic

component. Finally, given the period and change-point

estimates, we estimate the entire periodic component,

trend, and covariate effects using B-splines. Asymptotic results for the proposed estimators are derived,

including consistency of the period and change-point

estimators, and the asymptotic normality of the estimated periodic sequence, trend and covariate effects.

Simulation results demonstrate the appealing performance of the new method.

Joint work with Ming-Yen Cheng, David Siegmund,

Shouxia Wang.

Community Extraction of Network Data under

Stochastic Block Models

Danning Li

Northeast Normal University

Abstract: Most existing community discovery methods focus on partitioning all nodes of the network into

communities. However, many real networks contain

background nodes that do not belong to any community. To handle this, some community extraction

methods have been developed to achieve community

discovery with background nodes, which are based on

searching algorithms, hence have difficulties in handling large-scale networks due to high computational

complexity. In this paper we propose a fast algorithm

with polynomial complexity to achieve community

extraction of large-scale networks. And we prove that

the estimator of the community labels using the proposed algorithm reaches the asymptotic minimax risk

under the community extraction model, a specific

stochastic block model.

Invited Session IS061: Recent Developments in

Conformal Inference and Causal Inference

An Adaptive Null Proportion Estimator for False

Discovery Rate Control

Zijun Gao

University of Southern California

Abstract: False discovery rate (FDR) is a commonly

used criterion in multiple testing and the Benjamini-Hochberg (BH) procedure is a standard approach

for FDR control. To boost power, the adaptive BH

procedure has been proposed by incorporating null

proportion estimators, among which Storey's estimator has gained substantial popularity. The performance

of Storey's estimator hinges on a critical hyper-parameter, where a pre-fixed configuration may

lack power and existing data-driven hyper-parameters

compromise the FDR control. In this work, we propose a novel class of adaptive hyper-parameters and

establish the FDR control of the associated adaptive

BH procedure using a martingale argument. Within

this class of data-driven hyper-parameters, we further

present a specific configuration designed to maximize

the number of rejections and characterize its convergence to the optimal hyper-parameter under a mixture

model. Our proposal exhibits power gains in extensive

simulations and a motivating protein dataset, particularly in cases with a conservative null distribution

common in composite null testing or a moderate proportion of weak non-nulls typically observed in biological experiments with an enrichment process.

Conformal Inference, Covariate Shift, and

De-biased Two-Sample U-Statistics

Jing Lei

Carnegie Mellon University

Abstract: We consider the problem of detecting potential deviation from the covariate shift assumption,

using a test statistic obtained by combining many

dependent conformal p-values. The resulting statistic

becomes a weighted two-sample U-statistic and can be cast as a semiparametric inference problem with

two nuisance functional parameters. We show that

the use of conformal p-values can guarantee valid

inference with an accurate estimate of just one nuisance parameter and provides a natural way to avoid

degeneracy commonly encountered in such

two-sample inference problems. The potential bias

carried in the nuisance estimate can be corrected using

a general framework of de-biased two-sample

U-statistic.

False Discovery Rate Control for Structured Multiple Testing: Asymmetric Rules and Conformal

Q-values

Zinan Zhao

Zhejiang University

Abstract: The effective utilization of structural information in data while ensuring statistical validity

poses a significant challenge in false discovery rate

(FDR) analyses. Conformal inference provides rigorous theory for grounding complex machine learning

methods without relying on strong assumptions or

highly idealized models. However, existing conformal

methods have limitations in handling structured multiple testing, as their validity often requires the deployment of symmetric decision rules, which assume

the exchangeability of data points and permutation-invariance of fitting algorithms. To overcome

these limitations, we introduce the pseudo local index

of significance (PLIS) procedure, which is capable of

accommodating asymmetric rules and requires only

pairwise exchangeability between the null conformity

scores. We demonstrate that PLIS offers finite-sample

guarantees in FDR control and the ability to assign

higher weights to relevant data points. Numerical

results confirm the effectiveness and robustness of

PLIS and demonstrate improvements in power compared to existing model-free methods in various scenarios.

Confidence on the Focal: Conformal Prediction

with Selection-Conditional Coverage

Zhimei Ren

University of Pennsylvania

Abstract: Conformal prediction builds marginally

valid prediction intervals which cover the unknown

outcome of a randomly drawn new test point with a

prescribed probability. However, a common scenario

in practice is that, after seeing the data, practitioners

decide which test unit(s) to focus on in a data-driven

manner, and seek for uncertainty quantification of the

focal unit(s). In such cases, marginally valid conformal prediction intervals may not provide valid coverage for the focal unit(s) due to selection bias. This

paper presents a general framework for constructing a

prediction set with finite-sample exact coverage conditional on the unit being selected by a given procedure. The general form of our method works for arbitrary selection rules that are invariant to the permutation of the calibration units, and generalizes Mondrian

Conformal Prediction to multiple test units and

non-equivariant classifiers. We then work out computationally efficient implementation of our framework

for a number of realistic selection rules, including

top-K selection, optimization-based selection, selection based on conformal p-values, and selection based

on properties of preliminary conformal prediction sets.

The performance of our methods is demonstrated via

applications in drug discovery and health risk prediction.

Joint work with Ying Jin.
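
As a baseline for contrast, the standard marginally valid split conformal interval with absolute-residual scores and a pre-fitted regression model; the selection-conditional construction presented in the talk modifies this calibration and is not shown here.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Marginally valid split conformal prediction interval. Coverage
    conditional on a data-driven selection of focal units (the subject of
    the talk) requires a different calibration, not reproduced here."""
    scores = np.abs(y_cal - model.predict(X_cal))            # conformity scores
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)     # finite-sample quantile level
    q = np.quantile(scores, level, method="higher")
    pred = float(model.predict(np.atleast_2d(x_new))[0])
    return pred - q, pred + q
```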

Invited Session IS039: Nonlinear Probability and

Statistics for Machine Learning

Strategic Statistical Learning

策略统计学习

Xiaodong Yan

Shandong University

Abstract: Nonlinear expectation is an original research direction pioneered in China and is becoming increasingly important for scientific research across fields; in particular, the rise of big data and artificial intelligence provides stronger momentum for innovation in the theory and applications of nonlinear expectation. Recently, building on the simplest model in reinforcement learning, the multi-armed bandit, our team developed a "strategic limit theory," a major interdisciplinary breakthrough between nonlinear probability theory and reinforcement learning that changes the traditional research paradigm of statistical methodology. Follow-up work has achieved major advances in interpretable and trustworthy statistical theory and methods for big-data sampling, experimental design, transfer learning, online learning, and meta learning.

Nonlinear Central Limit Theorem for the Gambling Machine Problem

赌博机问题的非线性极限定理

Guodong Zhang

Shandong University

Abstract: Motivated by the study of asymptotic behaviour of the bandit problems, we obtain several

strategy-driven limit theorems including the law of

large numbers and the central limit theorem. Different

from the classical limit theorems, we develop sampling strategy-driven limit theorems that generate the

maximum or minimum average reward. The law of

large numbers identifies all possible limits that are

achievable under various strategies. To describe the

fluctuations around averages, we obtain strategy-driven central limit theorems under optimal strategies. The limits in these theorems are identified explicitly and depend heavily on the structure of the

events or the integrating functions and strategies.

This demonstrates the key signature of the learning

structure. Our results can be used to estimate the

maximal (minimal) rewards, and lay the theoretical

foundation for statistical inference in determining the

arm that offers the higher mean reward.

Joint work with Zengjing Chen, Shui Feng.

Application of Quantum Technology in Reinforcement Learning

量子技术在强化学习中的应用

Chengzhi Xing

Shandong University

Abstract: Quantum random number generators (QRNGs) can produce high-frequency random bits and, compared with pseudo-random numbers, offer true randomness and device independence. We propose a new method, the quantum random number multi-armed bandit (QRN-MAB) algorithm, for random-bit-based multi-user access in wireless communication systems. QRN-MAB uses random bits to learn channel characteristics and performs concurrent exchanges to attain stability. In addition, we study the central limit theorem for three-state quantum walks; based on special unitary operators, the Grover matrix and the Givens matrix, we obtain explicit forms of the limiting distributions under specific parameters. In particular, we show that the limiting distribution can degenerate to classical distributions such as the normal and uniform distributions, a property that can serve as a theoretical foundation for a class of QRNGs.

Joint work with Zengjing Chen.

Strong Law of Large Numbers and Ergodic Properties

under Sublinear Expectation

次线性期望下的强大数定律与遍历性

Yongsheng Song

Chinese Academy of Sciences

Abstract: Peng (2007) proved the law of large numbers under sublinear expectations (LLN*) with convergence in distribution. After that, much literature

was devoted to the strong version of LLN*, as well as

ergodicity under sublinear expectations, in the sense

of almost sure convergence.

In this talk, we first give a characterization of the

continuous ergodic capacities, based on which we

prove the ergodicity result that the time means are

bounded by the upper and lower expectations. Then,

we investigate under which conditions the upper and

lower expectations can be obtained by the time means.

The continuity of the capacities is a rather restrictive assumption for LLN* and ergodicity. To get

rid of this assumption, we give a version of strong

LLN under regular sublinear expectations defined on

a Polish space, which shows that any cylindrical random variables taking values in the mean interval can

be considered as a limit of the empirical averages.

Invited Session IS010: Data Privacy and Statistical

Modeling

Learning from Vertically Distributed Data across

Multiple Sites: An Efficient Privacy-Preserving

Algorithm for Cox Proportional Hazards Model

with Variable Selection

Samuel Wu

University of Florida

Abstract: We propose a novel distributed algorithm

for fitting regularized Cox proportional hazards model

when data sharing among different data providers is

restricted. Based on cyclical coordinate descent, the

proposed algorithm computes intermediary statistics

by each site and then exchanges them to update the

model parameters in other sites without accessing

individual patient-level data. We evaluate the performance of the proposed algorithm with (1) a simulation study and (2) a real-world data analysis predicting the

risk of Alzheimer’s dementia from the Religious Orders Study and Rush Memory and Aging Project

(ROSMAP). Our algorithm achieves privacy-preserving variable selection for time-to-event data without degradation of accuracy compared

with a centralized approach. Simulation demonstrates

that our algorithm is highly efficient in analyzing

high-dimensional datasets. Real-world data analysis

reveals that our distributed Cox model yields higher

accuracy in predicting the risk of Alzheimer’s dementia than the conventional Cox model built by

each data provider without data sharing.

Joint work with Guanhong Miao, Lei Yu, Jingyun

Yang, David Bennett and Jinying Zhao.

New Composite Score to Detect Disease Progression in Alzheimer’s Disease

Guogen Shan

University of Florida

Abstract: Composite scores have been increasingly

used in trials for Alzheimer’s disease (AD) to detect

disease progression, such as the AD Composite Score

(ADCOMS) in the lecanemab trial. We proposed to

develop a new composite score based on the statistical

model in the ADCOMS, by removing duplicated

sub-scales and adding the model selection in the partial least squares (PLS) regression. The new AD

composite Score with variable Selection (ADSS) includes 7 cognitive sub-scales. ADSS can increase the

sensitivity to detect disease progression as compared

to the existing total scores, which leads to smaller

sample sizes using the ADSS in trial designs. ADSS

can be utilized in AD trials to improve the success rate

of drug development with a high sensitivity to detect

disease progression in early stages.

Differentially Private Data Collection with Matrix

Masking

Adam Ding

Northeastern University

Abstract: Differential privacy schemes have been

widely adopted in recent years to protect privacy

when doing statistical analysis. To fully ensure privacy protection, the differential privacy guarantee is not

only needed on the final released data set but is also

needed during the whole data collection procedure.

While local differential privacy procedures with noise

perturbation are often used for data collection, the

added noise magnitude is often so large that it reduces the

statistical utility of the collected data set. We propose

a new Gaussian local differential privacy scheme that

utilizes random orthogonal matrix masking. We prove

that the required additive noise variance to achieve

differential privacy guarantee in data collection is

much lower in the proposed scheme, thus can significantly improve the scope of application for differential privacy in practice.

Joint work with Samuel S. Wu and Shigang Chen.
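
A minimal sketch of the masking idea in the abstract: left-multiply the data matrix by a Haar-distributed random orthogonal matrix and add Gaussian noise. The noise scale sigma needed for a formal differential privacy guarantee, and how the mask is shared or inverted downstream, are worked out in the paper and only assumed here.

```python
import numpy as np

def mask_and_perturb(X, sigma, rng=None):
    """Matrix-masked Gaussian data collection sketch: apply a uniformly random
    orthogonal matrix to the rows of X, then add N(0, sigma^2) noise. The sigma
    required for the stated differential privacy guarantee is not computed here."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # random orthogonal matrix via QR of a Gaussian matrix, with sign fix for Haar law
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    Q = Q * np.sign(np.diag(R))
    return Q @ X + sigma * rng.normal(size=X.shape)
```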

Joint Modeling in Presence of Informative Censoring on the Retrospective Time Scale with Application to Palliative Care Research

Zhigang Li

University of Florida

Abstract: Joint modeling of longitudinal data such as

quality of life data and survival data is important for

palliative care researchers to draw efficient inferences

because it can account for the associations between

those two types of data. Modeling quality of life on a

retrospective-from-death time scale is useful for investigators to interpret the analysis results of palliative

care studies which have relatively short life expectancies. However, informative censoring remains a complex challenge for modeling quality of life on the

retrospective time scale although it has been addressed for joint models on the prospective time scale.

To fill this gap, we develop a novel joint modeling

approach that can address the challenge by allowing

informative censoring events to be dependent on patients' quality of life and survival through a random

effect. There are two sub-models in our approach: a

linear mixed effect model for the longitudinal quality

of life and a competing-risk model for the death time

and dropout time that share the same random effect as

the longitudinal model. Our approach can provide

unbiased estimates for parameters of interest by appropriately modeling the informative censoring time.

Model performance is assessed with a simulation

study and compared with existing approaches. A real-world study is presented to illustrate the application

of the new approach.

Joint work with Quran Wu and Michael Daniels.

Invited Session IS058: Recent Development in

Complex Data Analysis

Linear Discriminant Regularized Regression

Xin Bing

University of Toronto

Abstract: Linear Discriminant Analysis (LDA) is an

important classification approach. Its simple linear

form makes it easy to interpret, and it is capable of handling multi-class responses. It is closely related to

other classical multivariate statistical techniques, such

as Fisher's discriminant analysis, canonical correlation

analysis and linear regression. In this paper we

strengthen its connection to multivariate response

regression by characterizing the explicit relationship

between the discriminant directions and the regression

coefficient matrix. This key characterization leads to a

new regression-based multi-class classification procedure that is flexible enough to deploy any existing

structured, regularized, and even non-parametric,

regression methods. Moreover, our new formulation is

amenable to analysis: we establish a general strategy

of analyzing the excess misclassification risk of the

proposed classifier for all aforementioned regression

techniques. As applications, we provide complete

theoretical guarantees for using the widely used

ℓ1-regularization as well as for using the reduced-rank

regression, neither of which has yet been fully analyzed in the LDA context. Our theoretical findings are

corroborated by extensive simulation studies and real

data analysis.
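
For orientation, a naive regress-then-argmax baseline with a ridge penalty on one-hot class indicators; the paper's classifier instead converts the estimated coefficient matrix into discriminant directions through the explicit LDA-regression relationship it derives, so this sketch only illustrates the plug-in regression step.

```python
import numpy as np
from sklearn.linear_model import Ridge

def regression_based_classifier(X, y, X_new, alpha=1.0):
    """Illustrative baseline: regress one-hot class indicators on X with a
    ridge penalty and predict the class with the largest fitted value. The
    mapping from the coefficient matrix to discriminant directions used in
    the paper is not reproduced here."""
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # n x K one-hot matrix
    fit = Ridge(alpha=alpha).fit(X, Y)
    return classes[np.argmax(fit.predict(X_new), axis=1)]
```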

Tracy-Widom Law of Ridge-Regularized F-Matrix

and Applications

Haoran Li

Auburn University

Abstract: In Multivariate Data Analysis, many central

problems can be formulated as a double Wishart

problem where two Wishart matrices, W1 and W2, are

involved. Important cases include MANOVA, CCA,

and tests for linear hypotheses in multivariate linear

regression. The traditional Roy's largest root test relies

on the largest eigenvalue of the F-matrix F = W1W2^{-1}.

In a high-dimensional setting, the test is infeasible due

to the singularity of W2. To fix the singularity, we

propose a ridge-regularized test where a ridge term is

added to W2. We derive the asymptotic Tracy-Widom

distribution of the largest eigenvalue of the regularized F-matrix. Efficient methods for estimating the

asymptotic mean and variance are designed through

the Marchenko-Pastur equation. The power characteristics are studied under a class of local alternatives. A

simulation study is carried out to examine the numerical performance of the proposed tests.
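
A minimal sketch of the regularized Roy-type statistic, assuming W1 and W2 are given Wishart (cross-product) matrices and lam is the ridge level; the Tracy-Widom centering and scaling estimated via the Marchenko-Pastur equation are omitted here.

```python
import numpy as np

def regularized_roy_statistic(W1, W2, lam):
    """Largest eigenvalue of the ridge-regularized F-matrix W1 (W2 + lam I)^{-1}.
    Centering and scaling (estimated in the paper through the Marchenko-Pastur
    equation) are still needed before referring it to the Tracy-Widom law."""
    p = W1.shape[0]
    F = W1 @ np.linalg.inv(W2 + lam * np.eye(p))
    # F need not be symmetric, but its eigenvalues are real for PSD W1 and W2 + lam I
    return float(np.max(np.linalg.eigvals(F).real))
```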

Gradient Synchronization for Multivariate Functional Data, with Application to Brain Connectivity

Yaqing Chen

Rutgers University

Abstract: Quantifying the association between components of multivariate random curves is of general

interest and is a ubiquitous and basic problem that can

be addressed with functional data analysis. An important application is the problem of assessing functional connectivity based on functional magnetic resonance imaging (fMRI), where one aims to determine

the similarity of fMRI time courses that are recorded

on anatomically separated brain regions. In the functional brain connectivity literature, the static temporal

Pearson correlation has been the prevailing measure

for functional connectivity. However, recent research

has revealed temporally changing patterns of functional connectivity, leading to the study of dynamic

functional connectivity. This motivates new similarity

measures for pairs of random curves that reflect the

dynamic features of functional similarity. Specifically,

we introduce gradient synchronization measures in a

general setting. These similarity measures are based

on the concordance and discordance of the gradients

between paired smooth random functions. Asymptotic

normality of the proposed estimates is obtained under

regularity conditions. We illustrate the proposed synchronization measures via simulations and an applica-

第165页

156

tion to resting state fMRI signals from the Alzheimer's

Disease Neuroimaging Initiative (ADNI) and they are

found to improve discrimination between subjects

with different disease status.

Joint work with Shu-Chin Lin, Yang Zhou, Owen

Carmichael, Hans-Georg Müller, Jane-Ling Wang.
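
A toy, assumed version of the concordance idea for two curves observed on a common grid, using numerical gradients and sign agreement; the gradient synchronization measures in the talk are defined for smoothed random functions and come with asymptotic theory, neither of which is reproduced here.

```python
import numpy as np

def gradient_concordance(y1, y2, t):
    """Average sign agreement of the numerical derivatives of two curves on a
    common grid t: close to +1 when their gradients move together, close to -1
    when they move oppositely. Only an illustrative stand-in for the measures
    proposed in the talk."""
    g1, g2 = np.gradient(y1, t), np.gradient(y2, t)
    return float(np.mean(np.sign(g1) * np.sign(g2)))

# toy usage: a curve, a slightly phase-shifted copy, and its mirror image
t = np.linspace(0, 1, 200)
f = np.sin(4 * np.pi * t)
print(gradient_concordance(f, np.sin(4 * np.pi * t + 0.2), t))  # close to 1
print(gradient_concordance(f, -f, t))                            # -1
```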

High-Dimensional Latent Semiparametric Mixture

Model for Clustering Large Non-Gaussian Data

Lvou Zhang

Shanghai University of Finance and Economics

Abstract: Cluster analysis is a fundamental task in

machine learning. Several clustering algorithms have

been extended to handle high-dimensional data by

incorporating a sparsity constraint in the estimation of

a mixture of Gaussian models. Though it makes some

neat theoretical analysis possible, this type of approach is arguably restrictive for many applications.

In this work we propose a novel latent variable transformation mixture model for clustering in which we

assume that after some unknown monotone transformations the data follows a mixture of Gaussians. Under the assumption that the optimal clustering admits a

sparsity structure, we develop a new clustering algorithm named CESME for high-dimensional clustering.

The use of unspecified transformation makes the

model far more flexible than the classical mixture of

Gaussians. On the other hand, the transformation also

brings quite a few technical challenges to the model

estimation as well as the theoretical analysis of

CESME. We offer a comprehensive analysis of

CESME including identifiability, algorithmic convergence, and statistical guarantees on clustering. In addition, we design a provable spectral initialization that

makes our algorithm easily implemented with fairly

mild initial conditions. Extensive numerical study and

real data analysis show that CESME outperforms the

existing high-dimensional clustering algorithms including CHIME, sparse spectral clustering, sparse

K-means, sparse convex clustering, and IF-PCA.

Joint work with Lulu Wang, Wen Zhou, Boxiang

Wang, Hui Zou.

Contributed Session CS015: High Dimensional

Statistical Inference

Beyond Detection Boundary: Minimax Deficiency

for Two-Sample Mean Tests in High Dimensions

Jingkun Qiu

Peking University

Abstract: The detection boundary is an effective tool

for evaluating a high-dimensional test procedure.

However, it is not a comprehensive performance

measure as the criterion of the detection boundary is

built upon trivial cases where the sum of type I and

type II errors converges to zero or one, which cannot

distinguish between the L2 and higher criticism (HC)

tests under dense signals nor between the L∞ and HC

tests under highly sparse signals. To overcome the

limitation of the detection boundary, we derive the

nontrivial minimax type II error under a controlled

type I error for two-sample hypotheses of means, and

prove a one-to-one correspondence between the signal

strength and the type II error in a non-asymptotic

framework for Gaussian data. Based on this result, we

propose two sharper discordant measures than the

detection boundary, the minimax relative deficiency

and the minimax absolute deficiency, to quantify the

differences in the signal strength such that a high dimensional test and the minimax optimal test could

have the same nontrivial type I and type II errors.

Those two measures are able to recover the higher

order terms not shown in the detection boundary. Using the proposed discordant measures, we provide a

full evaluation of three basic high dimensional tests

for two-sample means and respectively show the superiority of the L2, HC and L∞ tests under the dense,

moderately sparse and highly sparse signal regimes.

To guarantee the adaptation and robustness against the

unknown signal sparsity in practice, we further propose a novel power enhancement test by combining

the L2 and HC tests, which is optimal in terms of the

minimax relative deficiency over the whole signal

sparsity regime. Simulation studies are conducted to

evaluate the proposed test and demonstrate its superiority.

Joint work with Song Xi Chen and Yumou Qiu.
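For concreteness, the sketch below computes one common version of the three baseline statistics discussed above (the L2 sum-of-squares statistic, the higher-criticism statistic, and the L∞ maximum statistic) from coordinate-wise two-sample z-scores; the exact normalizations used in the paper may differ.

import numpy as np
from scipy.stats import norm

def two_sample_stats(X, Y):
    """L2, higher-criticism (HC), and L-infinity statistics for two-sample means."""
    n1, p = X.shape
    n2 = Y.shape[0]
    diff = X.mean(0) - Y.mean(0)
    se = np.sqrt(X.var(0, ddof=1) / n1 + Y.var(0, ddof=1) / n2)
    z = diff / se
    l2 = np.sum(z ** 2)                      # sum-of-squares type statistic
    linf = np.max(np.abs(z))                 # maximum type statistic
    pvals = np.sort(2 * norm.sf(np.abs(z)))  # ordered two-sided p-values
    k = np.arange(1, p + 1)
    hc = np.max(np.sqrt(p) * (k / p - pvals) /
                np.sqrt(pvals * (1 - pvals) + 1e-12))
    return l2, hc, linf

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 500))
Y = rng.standard_normal((90, 500)) + np.r_[np.full(5, 0.8), np.zeros(495)]  # five signals
print(two_sample_stats(X, Y))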

Mediation Analysis in Longitudinal Study with


High-Dimensional Methylation Mediators

Yidan Cui

Shanghai Jiao Tong University

Abstract: Mediation analysis has been widely utilized

to identify potential pathways connecting exposures

and outcomes. However, there remains a lack of analytical methods for high-dimensional mediation analysis in longitudinal data. To tackle this concern, we

proposed an effective and novel approach with variable selection and the indirect effect assessment based

on both linear mixed-effect model and generalized

estimating equation. Initially, we employ intersect

sure independence screening to reduce the dimension

of candidate mediators. Subsequently, we implement

the Sobel test with the Bonferroni correction for indirect effect hypothesis testing. Through extensive simulation studies, we demonstrate the performance of

our proposed procedure with a higher F1 score

(0.8056 and 0.9983 at sample sizes of 150 and 500

respectively) compared with the linear method

(0.7779 and 0.9642 at the same sample sizes), along

with more accurate parameter estimation and a significantly lower false discovery rate. Moreover, we apply

our methodology to explore the mediation mechanisms involving over 730,000 DNA methylation sites

with potential effects between paternal BMI and offspring growing BMI in the Shanghai sleeping birth

cohort data, leading to the identification of two previously undiscovered mediating CpG sites.

Joint work with Zhangsheng Yu.
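The Sobel test mentioned above combines the two path coefficients of a mediator into a z-statistic for the indirect effect. A minimal sketch for a single mediator is shown below; the hypothetical coefficient estimates stand in for the linear mixed-effects or GEE fits used in the paper, and Bonferroni correction is applied across the screened mediators.

import numpy as np
from scipy.stats import norm

def sobel_test(alpha_hat, se_alpha, beta_hat, se_beta):
    """Sobel z-test for the indirect effect alpha*beta of one mediator.

    alpha_hat: exposure -> mediator coefficient (with its standard error)
    beta_hat:  mediator -> outcome coefficient, adjusted for the exposure
    """
    indirect = alpha_hat * beta_hat
    se = np.sqrt(alpha_hat ** 2 * se_beta ** 2 + beta_hat ** 2 * se_alpha ** 2)
    z = indirect / se
    return indirect, z, 2 * norm.sf(abs(z))

# hypothetical estimates for two screened mediators, Bonferroni at level 0.05
estimates = [(0.30, 0.08, 0.25, 0.07), (0.05, 0.09, 0.02, 0.08)]
m = len(estimates)
for est in estimates:
    _, z, p = sobel_test(*est)
    print(f"z = {z:.2f}, raw p = {p:.4f}, Bonferroni-significant: {p < 0.05 / m}")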

Estimation and Inference for High-Dimensional

Quantile Regression with Knowledge Transfer

under Distribution Shift

Ruiqi Bai

Fudan University

Abstract: Information from related source studies can

often enhance the findings of a target study. However,

the distribution shift between the source and target

studies may severely impact the efficiency of

knowledge transfer. In this paper, we focus on the

high-dimensional quantile regression with knowledge

transfer under three types of distribution shift: parameter shift, covariate shift, and residual shift. A novel

transferable set and its corresponding detection algorithm are proposed to address this challenge. With

known transferable sources, an estimation algorithm is

designed under an L1-minimization framework. Detection consistency and nonasymptotic estimation

error bounds are established to validate the availability of our method in the presence of distribution shift.

Additionally, a debiased approach is proposed for

statistical inference with knowledge transfer, leading

to sharper asymptotic results. Simulation results as

well as applications on gene expression data and airplane data further demonstrate the effectiveness of our

proposed procedure.

Joint work with Zhongyi Zhu.

Enhancing High-Dimensional Dynamic Covariance

Estimation Model Based on GARCH Family and

Comparative Performance Analysis

Zhangshuang Sun

Shanghai University of Engineering Science

Abstract: With the progressive sophistication of social economies, the accurate estimation of high-dimensional covariance matrices has gained paramount

importance due to its wide-ranging applications in risk

management, portfolio optimization, and financial

forecasting. However, the inherent complexity and scalability issues of high-dimensional dynamic covariance structures, commonly known as the curse of dimensionality, pose significant challenges to traditional estimation techniques. This paper presents a

novel extension of the Dynamic Conditional Angular

Correlation (DCAC) framework, harnessing the power

of Generalized Autoregressive Conditional Heteroscedasticity (GARCH)-family models to enhance the

precision in capturing volatility dynamics.

This study addresses this challenge by extending

the DCAC framework through integration with

GARCH-family models, enabling enhanced precision

in volatility modeling. We judiciously select influential GARCH variants and pair them with two distinct

DCAC functional forms, expanding the framework's

applicability to various market conditions and dependencies. Extensive simulation experiments are

conducted to evaluate and compare the estimation


performance of dynamic correlation matrices produced by the different extended models. These experiments reveal context-specific superiority, with the

DCAC-FIGARCH model excelling in markets that exhibit long-term memory effects following unforeseen events, effectively capturing persistent volatility clustering and leverage effects.

Joint work with Guoqiang Wang.

Invariance-Based Inference in High-Dimensional

Regression with Finite-Sample Guarantees

Wenxuan Guo

University of Chicago

Abstract: In this paper, we develop invariance-based

procedures for testing and inference in

high-dimensional regression models. These procedures, also known as randomization tests, provide

several important advantages. First, for the global null

hypothesis of significance, our test is valid in finite

samples. It is also simple to implement and comes

with finite-sample guarantees on statistical power.

Remarkably, despite its simplicity, this testing idea

has escaped the attention of earlier analytical work,

which mainly concentrated on complex

high-dimensional asymptotic methods. Under an additional assumption of Gaussian design, which is common in the literature, we show that our test achieves a

Type II error rate that is minimax optimal against

certain non-sparse alternatives. Second, for partial null

hypotheses, we propose residual-based tests and derive theoretical conditions for their validity. These

tests can be made powerful by constructing the test

statistic in a way that, first, selects the important covariates (e.g., through Lasso) and then orthogonalizes

the nuisance parameters. We illustrate our results

through extensive simulations and applied examples.

One consistent finding is that the strong finite-sample

guarantees associated with our procedures result in

added robustness when it comes to handling multicollinearity and heavy-tailed covariates.

Joint work with Panos Toulis.
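As a simple generic instance of the invariance idea for the global null (not the authors' exact construction), the sketch below exploits the fact that under H0: beta = 0 with errors symmetric about zero, flipping the signs of the response leaves its distribution unchanged, so comparing a statistic with its sign-flipped copies yields a finite-sample valid p-value.

import numpy as np

def global_null_randomization_test(X, y, n_draws=999, seed=None):
    """Finite-sample valid randomization p-value for H0: beta = 0 in y = X beta + eps,
    assuming the errors are symmetric about zero."""
    rng = np.random.default_rng(seed)
    stat = lambda v: np.max(np.abs(X.T @ v))   # max |covariate-response inner product|
    t_obs = stat(y)
    t_null = np.array([stat(rng.choice([-1.0, 1.0], size=len(y)) * y)
                       for _ in range(n_draws)])
    return (1 + np.sum(t_null >= t_obs)) / (n_draws + 1)

rng = np.random.default_rng(2)
n, p = 100, 300
X = rng.standard_normal((n, p))
y_null = rng.standard_normal(n)                 # global null holds
y_alt = 0.7 * X[:, 0] + rng.standard_normal(n)  # one active covariate
print(global_null_randomization_test(X, y_null, seed=3))  # typically not small
print(global_null_randomization_test(X, y_alt, seed=3))   # typically small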

Contributed Session CS031: Advances in Statistical

Methods for Large and Complex Data

A New Model Free High-Dimensional Variable

Selection Method

一种新的免模型高维变量选择方法

Changcheng Li

Dalian University of Technology

Abstract: We propose a model-free variable selection

approach, namely constrained kernel regression. Instead of relying on model-based loss functions, the

proposed approach is developed based on conditional

independence relationship measured by conditional

distance covariance/correlation. The constrained kernel regression coefficient vector is defined to be the

vector satisfying the conditional independence constraints. We prove that the proposed approach can

consistently identify the true important predictor set

under high-dimensional model-free settings. We further develop a data driven approach to select the tuning parameter of the proposed approach. The advantage of the proposed procedure is further shown by

various numerical studies. More specifically, the proposed model-free procedure surpasses the existing

model-based methods in the presence of model misspecification, while outperforming or at least matching the existing ones when the models are correctly specified.

Probabilistic Embedding, Clustering, and Alignment for Integrating Spatial Transcriptomics Data

with PRECAST

Wei Liu

Sichuan University

Abstract: Spatially resolved transcriptomics involves

a set of emerging technologies that enable the transcriptomic profiling of tissues with the physical location of expressions. Although a variety of methods

have been developed for data integration, most of

them are for single-cell RNA-seq datasets without

consideration of spatial information. Thus, methods

that can integrate spatial transcriptomics data from

multiple tissue slides, possibly from multiple individuals, are needed. Here, we present PRECAST, a data

integration method for multiple spatial transcriptomics

datasets with complex batch effects and/or biological

effects between slides. PRECAST unifies spatial factor analysis simultaneously with spatial clustering and


embedding alignment, while requiring only partially

shared cell/domain clusters across datasets. Using

both simulated and four real datasets, we show improved cell/domain detection with outstanding visualization, and the estimated aligned embeddings and

cell/domain labels facilitate many downstream analyses. We demonstrate that PRECAST is computationally scalable and applicable to spatial transcriptomics

datasets from different platforms. The software that

enables the implementation of PRECAST can be accessed at https://feiyoung.github.io/PRECAST/.

Local False Discovery Rate Estimation with Competition-Based Procedures for Variable Selection

Xiaoya Sun

Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Abstract: Multiple hypothesis testing has been widely

applied to problems dealing with high-dimensional

data, for example, the selection of important variables

or features from a large number of candidates while

controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is

the false discovery rate (FDR). In recent years, the

local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence

of individual hypotheses. However, most methods

estimate fdr through P-values or statistics with known

null distributions, which are sometimes unavailable or

unreliable. Adopting the innovative methodology of

competition-based procedures, for example, the

knockoff filter or the target-decoy competition, this

paper proposes a new approach, named TDfdr, to fdr

estimation, which is free of P-values or known null

distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with

two competition-based procedures. We applied the

TDfdr method to two real biomedical tasks. One is to

identify significantly differentially expressed proteins

related to the COVID-19 disease, and the other is to

detect mutations in the genotypes of HIV-1 that are

associated with drug resistance. Higher discovery

power was observed compared to existing popular

methods.

Joint work with Xiaoya Sun.

Placebo Tests for Difference-in-Differences

Qiang Chen

Shandong University

Abstract: This paper provides a systematic and rigorous treatment of placebo tests for difference-in-differences (DID), which have become increasingly popular in applied work. We classify DID

placebo tests into in-time, in-space, mixed and external placebo tests and discuss their algorithms under

various DID designs, including standard DID, staggered DID, general DID, continuous DID, DDD, and

cohort DID designs. We formally show that in-space

and mixed placebo tests have exact sizes in finite

samples under the symmetry assumption and can asymptotically control the probability of false rejection

under the approximate symmetry assumption. Moreover, we investigate the size and power of these tests

in finite samples via Monte Carlo simulations. Even if

the (approximate) symmetry assumption fails, placebo

tests are still useful as falsification tests for assessing

the plausibility of underlying assumptions for identification. We also develop the Stata command didplacebo for easy implementation of DID placebo tests, as

illustrated by two empirical applications.

Joint work with Ji Qi, Guanpeng Yan.

Tobit Models for Count Time Series

Fukang Zhu

Jilin University

Abstract: Several models for count time series have

been developed during the last decades, often inspired

by traditional autoregressive moving average (ARMA)

models for real-valued time series, including integer-valued ARMA (INARMA) and integer-valued

generalized autoregressive conditional heteroscedasticity (INGARCH) models. Both INARMA and

INGARCH models exhibit an ARMA-like autocorrelation function (ACF). To achieve negative ACF values within the class of INGARCH models, log and

softplus link functions are suggested in the literature,

where the softplus approach leads to conditional linearity in good approximation. However, the softplus


approach is limited to the INGARCH family for unbounded counts, i.e. it can neither be used for bounded

counts, nor for count processes from the INARMA

family. In this paper, we present an alternative solution, named the Tobit approach, for achieving approximate linearity together with negative ACF values,

which is more generally applicable than the softplus

approach. A Skellam-Tobit INGARCH model for

unbounded counts is studied in detail, including stationarity, approximate computation of moments,

maximum likelihood and censored least absolute deviations estimation for unknown parameters and corresponding simulations. Extensions of the Tobit approach to other situations are also discussed, including

underlying discrete distributions, INAR models, and

bounded counts. Three real-data examples are considered to illustrate the usefulness of the new approach.

Contributed Session CS032: Complex Data Analysis

Bayesian Semiparametric Joint Model of Multivariate Longitudinal and Survival Data with Dependent Censoring

Anmin Tang

Yunnan University

Abstract: We consider a novel class of semiparametric joint models for multivariate longitudinal and survival data with dependent censoring. In these models,

cumulative baseline hazard functions of unspecified form are fitted by a novel class of penalized splines

(P-splines) with linear constraints. The dependence

between the failure time of interest and censoring time

is accommodated by a normal transformation model,

where both nonparametric marginal survival function

and censoring function are transformed to standard

normal random variables with bivariate normal joint

distribution. Based on a hybrid algorithm together

with the Metropolis-Hastings algorithm within the

Gibbs sampler, we propose a feasible Bayesian method to simultaneously estimate unknown parameters of

interest, and to fit baseline survival and censoring

functions. Intensive simulation studies are conducted

to assess the performance of the proposed method.

The use of the proposed method is also illustrated in

the analysis of a data set from the International Breast

Cancer Study Group (IBCSG).

Joint work with Niansheng Tang, Dalei Yu.

Construction and Evaluation of Large Language

Models for Pediatrics

儿科大语言模型的构建与评估

Ximing Xu

Children's Hospital of Chongqing Medical University

Abstract: In recent years, large language models have achieved breakthrough progress in natural language processing and have shown great application potential in many fields, including healthcare. However, existing general-purpose large language models have limitations in handling pediatric text and cannot fully meet the needs of pediatric clinical practice and research. This talk introduces methods for constructing and evaluating large language models for pediatrics. We first discuss the construction process, including key steps such as data collection, model training, and parameter tuning. We then describe evaluation methods, covering both quantitative metrics and qualitative analysis. Finally, we share application cases of pediatric large language models in clinical and research settings and discuss future directions.

Joint work with Zhensheng Hu, Li Xiao, Qiuhong

Wei, Ying Cui.

Improving Cell-Type-Specific Cis-EQTLs Prioritization by Integrating Bulk RNA-Seq and

ScRNA-Seq Data

Jiashun Xiao

Shenzhen Research Institute of Big Data

Abstract: Most expression quantitative trait loci

(eQTL) studies have traditionally operated at the tissue level, employing bulk RNA-seq data. This approach, however, has been associated with signal loss

and distortion due to unaddressed cellular heterogeneity. Recently, a handful of studies have sought to enhance precision by quantifying gene expression

through the application of single-cell

RNA-sequencing (scRNA-seq) techniques, enabling

cell-type-specific eQTL analyses. Despite these advancements, the limited sample sizes in these studies

are a consequence of the associated high costs. To

address this challenge, we present a novel statistical

method aimed at improving the statistical power of

prioritizing cell-type-specific cis-eQTLs. Our approach involves integrating eQTL summary statistics


from both bulk RNA-seq and scRNA-seq data,

providing a comprehensive strategy to mitigate the

constraints associated with individual techniques.

Joint work with Xinyi Yu.

Gmcoda: Graphical Model for Multiple Compositional Vectors in Microbiome Studies

Huaying Fang

Academy for Multidisciplinary Studies, Capital Normal University

Abstract: Microbes are essential components in the

ecosystem and participate in most biological procedures in environments. The high-throughput sequencing technologies help researchers directly quantify the

abundance of microbes in a natural environment. Microbiome studies explore the construction, stability,

and function of microbial communities with the aid of

sequencing technology. However, sequencing technologies only provide relative abundances of microbes,

and this kind of data is called compositional data in

statistics. The constraint of the constant sum requires

flexible statistical methods for analyzing microbiome

data. Current statistical analysis of compositional data

mainly focuses on one compositional vector such as

bacterial communities. Fungi are also an important component of microbial communities and are typically measured by sequencing the internal transcribed spacer (ITS) region rather than the 16S rRNA genes used for bacteria. The

different sequencing methods between fungi and bacteria bring two compositional vectors in microbiome

studies. We propose a novel statistical method, called

gmcoda, based on an additive logistic normal distribution for estimating the partial correlation matrix for

cross-domain interactions. A majorization-minimization algorithm is proposed to solve the

optimization problem involved in gmcoda. Through

simulation studies, gmcoda is demonstrated to work

well in estimating partial correlations between two

compositional vectors. Gmcoda is also applied to infer

cross-domain interactions in a real microbiome dataset

and finds potential interactions between bacteria and

fungi.

Real-Time Inference for Streaming Survival Data

from Multiple Heterogeneous Studies with a Cure

Fraction

Bo Han

Yunnan University

Abstract: In survival analysis, it often happens that a

certain fraction of subjects will never experience the

event of interest and thus they are considered to be

cured. This paper proposes a novel procedure to draw

real-time inference for covariate effects on survival

with a cure fraction. For the promotion time cure

model, we first investigate an online method by combining the likelihood function from current data with

the confidence density function of summary statistics

from historical data. It enables borrowing of strength

from summary-level information, thereby relaxing the

assumption of model homogeneity among data batches and enjoying computational efficiency. The consistency and weak convergence of the online estimator

are established and it is shown to achieve statistical

efficiency. We then propose an online data fusion

method for streaming data from multiple heterogeneous studies, which synthesizes inferential information

from all studies to make more effective inference than

from any study alone. The inference procedure is easy

to implement using standard statistical software and

computationally fast without involving resampling.

Our methods are illustrated via simulation studies and

an application to breast cancer data.

Joint work with Niansheng Tang, Liuquan Sun, Ingrid Van Keilegom.

Contributed Session CS033: Statistical Modeling

and Application of Complex Data

Integration of Probability and Non-Probability Samples Based on Large-Scale Imputation

基于大规模插补的概率与非概率样本的整合问题

Xiaoning Wang

Communication University of China

Abstract: In this paper, a random subsample is drawn from the non-probability sample to obtain a set of observations similar to the probability sample. Using the explanatory variables of this subsample, we build a prediction model that predicts the value of the target variable from the explanatory variables of the probability sample. We then use this prediction model to impute the missing target variable in the probability sample. By combining the predicted target values with the other observations in the non-probability sample, we obtain a complete imputed dataset, which can then be analyzed with standard statistical methods. By analyzing multiple imputed datasets and combining the results, an integrated estimate of the target variable is obtained.

Monitoring and Application of Several Types of Bivariate Integer-Valued Time Series Models

几类二元整数值时间序列模型的监控及其应用

Cong Li

Jilin University

Abstract: Bivariate count series that are autocorrelated and cross-correlated are common in practice, such as the numbers of complaints received by two units in a mayor's public hotline system, the numbers of cases of an infectious disease in neighboring regions, or the numbers of occurrences of two types of crime in the same region. Such data not only reflect the efficiency and quality of operations but may also signal potential problems and risks, so monitoring them is of great importance. The bivariate first-order integer-valued time series model (BINAR(1)) fits such data well. This paper applies control chart methods to monitor the process mean and process correlation of the BINAR(1) model, and studies the monitoring efficiency of several control charts, the choice of chart parameters, and the influence of the initial in-control parameter values on monitoring performance. In addition, the effect of parameter estimation error on control chart design is considered, and a real-data analysis is presented at the end.

Equivalence Assessment via the Difference between

Two AUCs in a Matched-Pair Design with Nonignorable Missing Endpoints

Yunqi Zhang

Yunnan University

Abstract: Equivalence assessment via various indices

such as relative risk has been widely studied in a

matched-pair design with discrete or continuous endpoints over the past years. But existing studies mainly

focus on the fully observed or missing at random

endpoints. Nonignorable missing endpoints are commonly encountered in a matched-pair design. To this

end, this paper proposes several novel methods to

assess equivalence of two diagnostics via the difference between two correlated areas under ROC curves

(AUCs) in a matched-pair design with nonignorable

missing endpoints. An exponential tilting model is

utilized to specify the nonignorable missing endpoint

mechanism. Three nonparametric approaches and

three semiparametric approaches are developed to

estimate the difference between two correlated AUCs

based on the kernel-regression imputation, inverse

probability weighted (IPW), and augmented IPW

methods. Under some regularity conditions, we show

the consistency and asymptotic normality of the proposed estimators. Simulation studies are conducted to

study the performance of the proposed estimators.

Empirical results show that the proposed methods

outperform the complete-case method. An example

from clinical studies is illustrated by the proposed

methodologies.

Joint work with Weili Cheng, Puying Zhao.

Modeling Cluster Dynamics in Panel Data via

Time-Varying Mixture Models

Youquan Pei

Shandong University

Abstract: Understanding the dynamic behavior of

clusters in panel data is crucial in economics and finance. This paper introduces a novel time-varying

mixture model designed to capture the potential dynamics of cluster evolution for panel data. Our approach assumes that relationships between variables

are not only time-varying but also exhibit a latent

group structure, which itself may evolve over time.

We estimate the unknown time-varying functions by

employing a local maximum likelihood method facilitated by kernel smoothing, and establish the asymptotic properties of the resultant estimators. Empirical

validation is conducted using both simulated and real-world datasets. In our simulations, the proposed

method effectively identifies true cluster transitions,

as well as variations in coefficients and mixing probabilities over time. For real-world applications, we

analyze the classical Environmental Kuznets Curve

(EKC) data, verifying that the relationships between

income and pollution levels are dynamic and exhibit

changes in group membership over time.

Joint work with Zongxu Li, Heng Peng, Jinfeng Xu.

Dimension Reduction of High-Dimensional Categorical Data with Two or Multiple Responses Considering Interactions Between Responses

Yuehan Yang

Central University of Finance and Economics

Abstract: This paper focuses on modeling the categorical data with two or multiple responses. We study

the interactions between the responses and propose an

efficient iterative procedure based on sufficient dimension reduction. We show that the proposed method reaches the local and global dimension reduction

efficiency. The theoretical guarantees of the method

are provided under the two- and multiple-response

models. We demonstrate the uniqueness of the proposed estimator; further, we prove that the iteration converges to the oracle least squares solution within the first two steps for the two-response model and within a few steps for the multiple-response model. For data analysis, the proposed

method is efficient in the multiple-response model and

performs better than some existing methods designed for multiple-response models. We apply this modeling

and the proposed method to an adult dataset and a

right heart catheterization dataset. Results show that

both datasets are suitable for the multiple-response

model and the proposed method always performs

better than the compared methods.

Contributed Session CS034: Statistical Applications in Interdisciplinary Research

Ensemble LDA via the Modified Cholesky Decomposition

Zhenguo Gao

Shanghai Jiao Tong University

Abstract: A binary classification problem in high-dimensional settings is studied via ensemble learning, with each base classifier constructed from linear discriminant analysis (LDA) and the base classifiers integrated by weighted voting. The

precision matrix in the LDA rule is estimated by the

modified Cholesky decomposition (MCD), which is

able to provide us with a set of precision estimates by

considering multiple variable orderings, and hence

yield a group of different LDA classifiers. Such

available LDA classifiers are then integrated to improve the classification performance. The simulation

and the application studies are conducted to demonstrate the merits of the proposed method.

Joint work with Xinye Wang, Xiaoning Kang.
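A compact sketch of the two ingredients follows, assuming a plain majority vote in place of the paper's weighted voting: the modified Cholesky decomposition builds a precision estimate from sequential regressions under a given variable ordering, and different random orderings yield different LDA rules that are then aggregated.

import numpy as np

def mcd_precision(X, order, ridge=1e-3):
    """Modified Cholesky precision estimate under one variable ordering.

    Each variable is regressed on those preceding it in the ordering;
    Omega = T' D^{-1} T, where T holds the negated regression coefficients
    and D the residual variances. A tiny ridge keeps the regressions stable.
    """
    Xo = X[:, order]
    n, p = Xo.shape
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Xo[:, 0].var()
    for j in range(1, p):
        Z = Xo[:, :j]
        coef = np.linalg.solve(Z.T @ Z + ridge * np.eye(j), Z.T @ Xo[:, j])
        T[j, :j] = -coef
        d[j] = np.mean((Xo[:, j] - Z @ coef) ** 2)
    Omega_o = T.T @ np.diag(1.0 / d) @ T
    inv = np.argsort(order)                 # undo the permutation
    return Omega_o[np.ix_(inv, inv)]

def ensemble_lda_predict(X_train, y_train, X_test, n_orders=20, seed=0):
    """Majority vote over LDA rules built from different MCD variable orderings."""
    rng = np.random.default_rng(seed)
    mu0 = X_train[y_train == 0].mean(0)
    mu1 = X_train[y_train == 1].mean(0)
    votes = np.zeros(len(X_test))
    for _ in range(n_orders):
        Omega = mcd_precision(X_train, rng.permutation(X_train.shape[1]))
        w = Omega @ (mu1 - mu0)
        votes += ((X_test - (mu0 + mu1) / 2) @ w > 0)
    return (votes > n_orders / 2).astype(int)

rng = np.random.default_rng(1)
X_train = np.vstack([rng.standard_normal((100, 30)),
                     rng.standard_normal((100, 30)) + 0.5])
y_train = np.r_[np.zeros(100), np.ones(100)]
X_test = np.vstack([rng.standard_normal((50, 30)),
                    rng.standard_normal((50, 30)) + 0.5])
y_test = np.r_[np.zeros(50), np.ones(50)]
print("accuracy:", np.mean(ensemble_lda_predict(X_train, y_train, X_test) == y_test))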

4D Trajectory Modeling and Prediction Based on

Conditional Markov Process

基于条件式马尔可夫过程的 4D 航迹建模与预测

Yao Rong

Yunnan University

Abstract: This report introduces an innovative 4D

flight trajectory modeling method based on Conditionally Markov (CM) processes. This method offers

enhanced physical and mathematical interpretability

compared to existing methods. It effectively utilizes

waypoint information extracted from historical trajectory data and provides optimal estimation of unknown

parameters in the 4D trajectory model. It performs

well in predicting trajectories with varied flight durations on real trajectory data. Meanwhile, the proposed

approach exhibits rapid computation speeds and high

predictive accuracy.

Joint work with Mengjiao Tang, Sanfeng Hu.

Signal Plus Noise Models in the Log Proportional

Regime: When Does Debiasing Help?

Yicheng Zeng

Shenzhen Research Institute of Big Data

Abstract: We consider the popular signal plus noise

model with a low-rank signal and heterogeneous noise

in the log proportional regime where the sample size

and data dimension are comparable in logarithm. In

this work, we aim to quantify the effect of the noise

on recovering the low-rank signal, and find out the

situations where debiasing singular values could improve the statistical accuracy, and also understand the

mechanism behind it. We first derive an explicit form

for the bias in the leading singular values of the

noise-corrupted data matrix and then add a debiasing

step into the signal recovery procedure. Under an

entrywise loss measuring the recovery error, we show

that this new estimation has a higher convergence rate

to zero than the classical hard thresholding estimation

when the aspect ratio, i.e., the ratio of dimension and

sample size, is divergent. Thus, we conclude from this

interesting phenomenon that the debiasing procedure


on the singular values could sometimes help us significantly improve the statistical accuracy although

the singular vectors remain unchanged, which is a

new finding in the case of divergent aspect ratio that is

common in the log proportional regime. Furthermore,

we apply updated concentration inequalities to the local laws from random matrix theory and then derive finite-sample, i.e., non-asymptotic, results for the singular

values and vectors, as well as the estimation error.

Lastly, we study matrix denoising, multi-dimensional

scaling, and clustering as applications.

Joint work with Xin Chen, Qiang Sun, Runze Li.

Riemannian Lp Centroid for Estimating the Scatter Matrix under Complex Elliptical Distributions

复椭球分布下散度矩阵估计的黎曼 Lp质心

Mengjiao Tang

Yunnan University

Abstract: A popular method for fusing a set of covariance matrix estimates (with unavailable correlation)

is to solve their geometrical mean or median, which is

defined by a Riemannian geometry of Hermitian positive-definite (HPD) matrices. The most well-known

such geometry is identical to the Fisher information

geometry of multivariate Gaussian distributions with a

fixed mean. This paper identifies the space of HPD

matrices with the manifold of centered (i.e., zero-mean) complex elliptically symmetric (CES) distributions. First, the Fisher information matrix for the

CES distributions defines a different Riemannian

metric on HPD matrices, and the induced Riemannian

geometry is studied. Then, the Riemannian Lp mean of

some HPD matrices is calculated to produce a final

estimation for the scatter matrix (proportional to the

covariance matrix) of a CES distribution. While the

corresponding objective function is proven to be

g-convex, a Riemannian gradient descent algorithm is

given to compute the solution. Finally, numerical

examples are provided to illustrate the derived geometrical structure and its application to target detection.

Joint work with Yao Rong.

Innovating Grass Seed Sampling and Analysis with

AI and Robotics

Yanming Di

Oregon State University

Abstract: Grass seed testing—used for determining

seed lot quality and establishing seed value—is a

fundamental phase of the agricultural marketing system.

Seed sampling—obtaining a representative working

subsample from a primary bulk sample—is a critical

first step in seed testing and is required by the Association of Official Seed Analysis before seed analysis

can commence. Following seed sampling is seed

analysis. For example, in purity tests, seed analysts

need to separate weed seeds and other impurities from

grass seeds. Currently, in the seed lab, it usually takes

months to train seed analysts and years for them to

become experts in seed analysis. Many experienced

analysts later take other positions. Seed analysis is

labor-intensive, requiring the analyst to sit in front of

a workstation looking through a microscope, which

can lead to work-related head and neck issues.

With recent advances in robotics, computer vision,

and AI, an opportunity presents itself for a new wave

of innovations. Our group utilizes AI and robotics to

innovate devices and protocols for sampling grass

seeds and a computer vision system for automated

seed analysis. In this talk, I discuss the challenges that

we faced when using image analysis for grass seed

analysis.

Contributed Session CS035: Model Averaging / Cross-Disciplinary Research in Statistics

Spectrally-Corrected and Regularized LDA for

Spiked Model

Hua Li

Changchun University

Abstract: This paper proposes an improved linear

discriminant analysis called spectrally-corrected and

regularized LDA (SRLDA). This method integrates

the design ideas of the sample spectrally-corrected

covariance matrix and the regularized discriminant

analysis. With the support of a large-dimensional

random matrix analysis framework, it is proved that

SRLDA achieves the globally optimal solution for linear classification under the spiked model assumption.

According to simulation data analysis, the SRLDA

classifier performs better than RLDA and ILDA and is

closer to the theoretical classifier. Experiments on

different data sets show that the SRLDA algorithm

performs better in classification and dimensionality

reduction than currently used tools.

Joint work with Wenya Luo, Zhidong Bai.

Synchronization of Delayed Neural Networks via

Intermittent Sampled-Data Control

Ying Yang

Yunnan University

Abstract: Delayed neural networks (NNs) have wide

engineering applications. By introducing the chaotic

characteristics into NNs, the chaotic neural network

models can better describe the chaotic features occurring in biological neurons. Considering the delay

phenomenon in chaotic NNs, the existence of time

delay may affect the performance of synchronization

for chaotic NNs, and the introduction of time delay in

chaotic NNs may result in more complex chaotic time

series. To achieve the synchronization of delayed

chaotic NNs, the methods of continuous control and

discontinuous control are employed. Combined with

the advantages of intermittent control and sampled-data control, intermittent sampled-data control

can further reduce the amount of data transmission on

the base of reducing the control cost. In this talk, the

synchronization problem for chaotic time-delay NNs

via intermittent sampled-data control is introduced. To

fully consider the characteristics of the controlled

systems, a mixed two-side-looped functional is constructed, and two inequalities are proposed to estimate

the norm of the system state. By using the novel inequalities, the positive-definite property of the functionals can be removed, which can derive less conservative results. After then, the synchronization criterion and the controller condition are obtained. To illustrate the effectiveness of the presented results, a

numerical example is given.

Robust Probabilistic Principal Component Analysis with Mixture of Exponential Power Distributions

Zhenghui Feng

Harbin Institute of Technology (Shenzhen)

Abstract: This paper introduces the EP-MPPCA

model, which serves as a flexible and robust alternative to conventional Gaussian-based mixtures of

probabilistic principal component analysis (MPPCA)

for high-dimensional data analysis. The EP-MPPCA

model utilizes the exponential power distribution family, making it more adept at handling heterogeneous

data distributions and outliers. We provide algorithms

and estimation methods for the EP-MPPCA model

and evaluate its performance through simulations. In

real data analysis, we demonstrate how the

EP-MPPCA model can be practically applied in two

important applications: unsupervised clustering and

image data reconstruction. Specifically, we show that

the EP-MPPCA model effectively handles outliers in

high-dimensional image data, leading to improved

reconstruction quality. Additionally, the model can

achieve superior clustering results in an unsupervised

manner for high-dimensional data.

Joint work with Xinyi Wang, Xiao Chen, Heng Peng.

Evaluating Distortion between as-Designed and

as-Built Geometries from a Superposition of Resonant Mode Shapes

Qing Li

Iowa State University

Abstract: An accurate, precise, and comprehensive

representation of the shape details of manufactured

components is critical for the design, simulation, control, and optimization of next-generation cyber-based

manufacturing systems. However, in many cases, the

design intent encoded in the mesh model cannot be

applied directly to the as-manufactured part because

the physical part never perfectly matches the geometry of the designed part. Therefore, a systematic

bi-directional mapping framework between

as-designed models and as-manufactured geometries

is needed to connect design and production. To address existing challenges, we establish the MeshFit

framework by finding the superposition of the resonant mode shapes obtained from finite element modal


vibration analysis, inspired by the idea that the object

is likely to deform in similar ways to how it would

naturally mechanically vibrate. In the experiment

study, we obtained the point cloud of the as-built part

and fitted the as-designed mesh to the shape of the

point cloud.

Our proposed work will facilitate data mapping from

different resources and modalities (e.g., dimensional

metrology and nondestructive evaluation methods

such as computed tomography and thermography) in

manufacturing and across the product lifecycle. Hence

our work will enable Manufacturing 4.0, especially

digital twin systems that rely on information exchange

between digital models and physical parts.

Joint work with Lijie Liu, Adarsh Krishnamurthy,

Stephen Holland.

Optimal Conditional Quantile Prediction via Model Averaging of Partially Linear Additive Models

Jing Lv

Southwest University

Abstract: Partially linear additive models (PLAMs)

have been considered one of the most popular semiparametric models for prediction, as they enjoy model

flexibility and interpretability. However, choosing the

linear and nonlinear parts in PLAMs is always a challenging task. In the literature, there are a few studies

that propose to choose the linear part by using a regularization method. As a result, they can identify a

single optimal PLAM. We propose a novel strategy

based on model averaging to obtain an optimal

weighted combination of a series of partially linear

additive candidate models. Our approach provides a

new perspective of accounting for structure uncertainty of PLAMs. It improves prediction accuracy

compared to the estimation method based on each

single PLAM, and reduces the risk of model

mis-specification. Moreover, we consider a conditional quantile process setting that provides a more comprehensive analysis on the relationships between the

response and covariates as well as a more robust prediction. Theoretically, we show that the proposed

method of choosing the weights is asymptotically

optimal in terms of minimizing the out-of-sample

quantile prediction error by allowing misspecification

of each candidate model. The numerical results

demonstrate that our method yields smaller prediction

errors than the conventional regularization methods of

selecting a single PLAM.

Joint work with Shujie Ma.

Contributed Session CS036: Feature Screening and

High Dimensional Data

Feature Screening for Metric Space Valued Responses Based on Fréchet Regression

Bing Tian

School of Economics, Xiamen University

Abstract: To fulfill the requirement of statistical approaches for complex data, we consider the feature

screening problem with more general types of response. Most of the existing feature screening procedures focus on the case that responses lie in a vector

space and, thus, may not meet the need in modern data

analysis. To address the demand, we propose a general

sure independence screening approach to discard

non-informative features when responses are complex

random objects in metric space. Specifically, the proposed method is built upon global Fréchet regression,

thus it does not require the ambient vector space assumption, and only a distance between data objects is

needed. We demonstrate the proposed procedure enjoys the sure screening property under mild regularity

conditions. Simulation studies and two data applications are included to illustrate our proposal.

Joint work with Wei Zhong.

Stability eBH: A Unified Stability Approach to

False Discovery Rate Control

Jiajun Sun

Xiamen University

Abstract: In recent years, various multiple hypothesis

testing methods with false discovery rate (FDR) control, such as knockoff, Gaussian mirror, and data splitting, have been gradually developed. However, these

methods are randomized, which is an undesirable

characteristic in practical applications. Derandomized

knockoffs, a method that utilizes e-values to aggregate

results, offers a viable derandomization approach.


Nonetheless, it suffers from notable drawbacks, including the need for parameter selection and a tendency to exhibit low power due to correlation. In this

paper, we introduce a general stabilization method

applicable to all algorithms with FDR control. Our

approach aggregates e-values generated from multiple

runs of the base algorithm. We employ the arithmetic

mean of these e-values to generate stabilized e-values,

which serve as a new ranking statistic. Subsequently,

we run the e-BH procedure on the stabilized e-values,

transforming the dependence of derandomized

knockoffs on the numerical value of e-values into a

dependence on the ranking. This transformation leads

to higher power without compromising stability. We

prove that for any base procedure with the FDR control property, our method can control the FDR and

achieve higher power under some conditions. Extensive numerical experiments and real-data applications

demonstrate that the proposed method generally exhibits higher power than competitors while maintaining competitive stability.

Joint work with Zhanrui Cai, Wei Zhong.
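The aggregation step described above can be sketched as follows: average the e-values returned by repeated runs of the base procedure (the arithmetic mean of e-values is again an e-value) and feed the averages to the e-BH rule. The toy e-values below are synthetic placeholders; generating them from knockoffs or another base procedure is not shown.

import numpy as np

def ebh(evalues, q=0.1):
    """e-BH: reject the k_hat hypotheses with the largest e-values, where k_hat is
    the largest k such that the k-th largest e-value is at least m / (q * k)."""
    e = np.asarray(evalues, dtype=float)
    m = len(e)
    order = np.argsort(-e)                       # indices sorted by decreasing e-value
    ok = e[order] >= m / (q * np.arange(1, m + 1))
    if not ok.any():
        return np.array([], dtype=int)
    k_hat = np.max(np.where(ok)[0]) + 1
    return np.sort(order[:k_hat])

rng = np.random.default_rng(0)
runs, m = 10, 100
# toy per-run e-values: hypotheses 0-4 are non-null and tend to get large e-values
base_evalues = np.vstack([
    np.r_[rng.exponential(300.0, size=5), rng.exponential(1.0, size=m - 5)]
    for _ in range(runs)
])
stabilized = base_evalues.mean(axis=0)           # arithmetic-mean aggregation
print("rejections:", ebh(stabilized, q=0.2))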

Robust Online Control Experiments for Multivariate Tests

Shaohua Xu

Nankai University

Abstract: Multivariate testing has recently emerged

as a promising technique in e-commerce, marketing

research, and clinical trials. In contrast to the standard

A/B testing, the goal of multivariate testing is to determine which combination of variations performs the

best out of all of the possible combinations. We consider the problem of robustly allocating treatments to

subjects in a multivariate test when the effects of the

treatments are confounded with a large number of

covariates and subjects are connected through a network. Under this background, we advocate for the

first time to use a mixed effect model to aggregate the

covariate uncertainty and the network structure. Based

on this model, a criterion that measures the regret of

incorrectly specifying the covariance structure is proposed. Under this criterion, the minimax robust experimental scheme for estimating the treatment effects

is derived. Furthermore, a novel experimental scheme

that optimally matches the design with the robust

covariance structure is systematically studied. All

proposed experimental schemes have the following

three strengths: (a) they are robust to the optimality

criteria of estimating treatment effects, (b) they can

efficiently guard against misspecification of the covariance structure of the mixed effect model, and (c) they

can be applied to various complex covariates distributions and network structures. In addition, a large

number of numerical simulations indicate that our

experimental schemes offer higher statistical efficiencies than the existing schemes and achieve the current

state of the art under mixed effect models. Finally, the

practicability of our experimental schemes is illustrated by analyzing the real dataset of coupons issued

to truck drivers.

Joint work with Yongdao Zhou.

ARTree: A Deep Autoregressive Model for Phylogenetic Inference

Tianyu Xie

Peking University

Abstract: Designing flexible probabilistic models

over tree topologies is important for developing efficient phylogenetic inference methods. To do that,

previous works often leverage the similarity of tree

topologies via hand-engineered heuristic features

which would require pre-sampled tree topologies and

may suffer from limited approximation capability. In

this paper, we propose a deep autoregressive model

for phylogenetic inference based on graph neural

networks (GNNs), called ARTree. By decomposing a

tree topology into a sequence of leaf node addition

operations and modeling the involved conditional

distributions based on learnable topological features

via GNNs, ARTree can provide a rich family of distributions over the entire tree topology space that have

simple sampling algorithms and density estimation

procedures, without using heuristic features. We

demonstrate the effectiveness and efficiency of our

method on a benchmark of challenging real data tree

topology density estimation and variational Bayesian

phylogenetic inference problems.


Joint work with Cheng Zhang.

Profiled Transfer Learning for High Dimensional

Linear Model

Ziqian Lin

Peking University

Abstract: We develop here a novel transfer learning

methodology called Profiled Transfer Learning (PTL).

The method is based on the approximate-linear assumption between the source and target parameters.

Compared with the commonly assumed vanishing-difference assumption and low-rank assumption in

the literature, the approximate-linear assumption is

more flexible and less stringent. Specifically, the PTL

estimator is constructed by two major steps. Firstly,

we regress the response on the transferred feature,

leading to the profiled responses. Subsequently, we

learn the regression relationship between profiled

responses and the covariates on the target data. The

final estimator is then assembled based on the approximate-linear relationship. To theoretically support

the PTL estimator, we derive the non-asymptotic upper bound and minimax lower bound. We find that the

PTL estimator is minimax optimal under appropriate

regularity conditions. Extensive simulation studies are

presented to demonstrate the finite sample performance of the new method. A real data example about

sentence prediction is also presented with very encouraging results.

Joint work with Junlong Zhao, Fang Wang, Hansheng Wang.
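The two-step construction can be sketched in a low-dimensional, penalty-free form. The reading used below (the profiled response is the response minus its least-squares fit on the transferred feature, and the final estimator is assembled as gamma_hat * beta_source plus the second-step coefficient) is one plausible interpretation of the abstract, not a verbatim reproduction of the paper's high-dimensional estimator.

import numpy as np

def ptl_sketch(X_t, y_t, beta_source):
    """Rough two-step profiling sketch under beta_target ~ gamma * beta_source + delta."""
    f = X_t @ beta_source                        # transferred feature from the source
    gamma = (f @ y_t) / (f @ f)                  # step 1: regress y on the transferred feature
    profiled = y_t - gamma * f                   # profiled responses
    delta, *_ = np.linalg.lstsq(X_t, profiled, rcond=None)  # step 2: regress on covariates
    return gamma * beta_source + delta           # assemble via the approximate-linear relation

rng = np.random.default_rng(0)
n, p = 200, 10
beta_source = rng.standard_normal(p)
beta_target = 0.8 * beta_source + 0.1 * rng.standard_normal(p)   # approximately linear link
X_t = rng.standard_normal((n, p))
y_t = X_t @ beta_target + rng.standard_normal(n)
print("estimation error:", np.linalg.norm(ptl_sketch(X_t, y_t, beta_source) - beta_target))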

July 14, 14:00-15:40

Invited Session IS032: Model-Agnostic Statistical

Inference

Multiple Testing of Linear Forms for Noisy Matrix

Completion

Lilun Du

City University of Hong Kong

Abstract: Many important tasks of large-scale recommender systems can be naturally cast as testing multiple linear forms for noisy matrix completion. These problems, however, present unique challenges because of the subtle bias-and-variance tradeoff and

an intricate dependence among the estimated entries

induced by the low-rank structure. In this paper, we

develop a general approach to overcome these difficulties by introducing new statistics for individual

tests with sharp asymptotics both marginally and

jointly, and utilizing them to control the false discovery rate (FDR) via a data splitting and symmetric

aggregation scheme. We show that valid FDR control

can be achieved with guaranteed power under nearly

optimal sample size requirements using the proposed

methodology. Extensive numerical simulations and

real data examples are also presented to further illustrate its practical merits.
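The data-splitting and symmetric-aggregation idea can be illustrated with the generic thresholding rule used for mirror-type statistics: signed statistics that are symmetric about zero for null entries are cut at the smallest threshold whose estimated false discovery proportion falls below the target level. The construction of such statistics for noisy matrix completion is the paper's contribution and is not reproduced here; the values below are toy numbers.

import numpy as np

def symmetric_aggregation_threshold(W, q=0.1):
    """Pick the smallest t with #{W <= -t} / max(#{W >= t}, 1) <= q,
    valid when the null statistics are (approximately) symmetric about zero."""
    for t in np.sort(np.abs(W)):
        if np.sum(W <= -t) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf

rng = np.random.default_rng(0)
W = np.r_[rng.normal(4, 1, size=20), rng.normal(0, 1, size=480)]  # 20 signals, 480 nulls
t = symmetric_aggregation_threshold(W, q=0.1)
print("threshold:", round(t, 3), "number of discoveries:", int(np.sum(W >= t)))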

CAP: A General Algorithm for Online Selective

Conformal Prediction with FCR Control

Yajie Bao

Shanghai Jiao Tong University

Abstract: We study the problem of post-selection

predictive inference in an online fashion. To avoid

devoting resources to unimportant units, a preliminary

selection of the current individual before reporting its

prediction interval is common and meaningful in

online predictive tasks. Since the online selection

causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time

false coverage statement rate (FCR) which measures

the overall miscoverage level. We develop a general

framework named CAP (Calibration after Adaptive

Pick) that performs an adaptive pick rule on historical

data to construct a calibration set if the current individual is selected and then outputs a conformal prediction interval for the unobserved label. We provide

tractable procedures for constructing the calibration

set for popular online selection rules. We prove that

CAP can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. To account for the distribution shift

in online data, we also embed CAP into some recent

dynamic conformal prediction algorithms and show

that the proposed method can deliver long-run FCR

control. Numerical results on both synthetic and real

data corroborate that CAP can effectively control FCR

around the target level and yield narrower prediction intervals than existing baselines across various

settings.

Joint work with Yuyang Huo, Haojie Ren, Changliang Zou.
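CAP itself is an online, selection-aware procedure; the static building block it relies on, a conformal prediction interval formed from a held-out calibration set, can be sketched as follows (generic split conformal, not the CAP algorithm).

import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Split-conformal interval from absolute calibration residuals |y_i - mu_hat(x_i)|.

    Uses the finite-sample quantile rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n the interval should be infinite, which this sketch does not handle.
    """
    n = len(residuals_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(residuals_cal)[min(k, n) - 1]
    return y_pred_new - q, y_pred_new + q

rng = np.random.default_rng(0)
x_cal = rng.uniform(size=500)
y_cal = 2 * x_cal + 0.3 * rng.standard_normal(500)   # toy data; fitted model mu_hat(x) = 2x assumed given
residuals_cal = np.abs(y_cal - 2 * x_cal)
print(split_conformal_interval(residuals_cal, y_pred_new=2 * 0.7, alpha=0.1))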

Real-Time Selection under General Constraints via

Predictive Inference

Haojie Ren

Shanghai Jiao Tong University

Abstract: Real-time decision-making has received increasing attention in the big data era. Here, we consider the

problem of sample selection in the online setting,

where one encounters a possibly infinite sequence of

individuals collected over time with covariate information available. The goal is to select samples of

interest that are characterized by their unobserved

responses until the user-specified stopping time. We

derive a new decision rule that enables us to find more

preferable samples that meet practical requirements by

simultaneously controlling two types of general constraints: individual and interactive constraints, which

include the widely utilized False Selection Rate (FSR),

cost limitations, diversity of selected samples, etc. The

key elements of our approach involve quantifying the

uncertainty of response predictions via predictive

inference and addressing individual and interactive

constraints in a sequential manner. Theoretical and

numerical results demonstrate the effectiveness of the

proposed method in controlling both individual and

interactive constraints.

Joint work with Yuyang Huo, Lin Lu, Changliang

Zou.

Algorithm-Agnostic Inference after Change Point

Detection

Guanghui Wang

East China Normal University

Abstract: To evaluate the validity of change points detected by certain algorithms, a natural approach is to conduct a two-sample test on data segments surrounding the identified change point. However, this method often yields invalid p-values due to the "double dipping" effect, where the same data is reused for

both change detection and subsequent testing, thereby

compromising statistical validity. While recent methodologies have incorporated selective inference tools

specifically designed for sequences of univariate normal means, their applicability remains limited. This

paper introduces a novel framework for post change

detection inference, offering broader applicability

across various change point models and detection

algorithms.

Joint work with Yinxu Jia, Jixuan Liu, Changliang

Zou.

Invited Session IS089: Statistical Research on the

Digital Economy and Its Impact

数字经济及其影响的统计研究

Compilation Methods for Input-Output Tables in

the Digital Economy

数字经济投入产出表编制方法

Yafei Wang

Beijing Normal University

Abstract: Digital economy input-output tables serve a dual role as both a statistical coordination framework and an analytical tool, and can accurately monitor the development of the digital economy. Starting from the overall architecture of the digital economy, and centering on the theoretical foundations and the basic compilation process and methods, this paper proposes a general framework for compiling digital economy input-output tables. Combining the Statistical Classification of the Digital Economy and Its Core Industries (2021) with the sector-definition rules of China's input-output tables, we design the sector categories and table structure of China's digital economy input-output table, integrate multiple data sources, and compile the 2020 China digital economy input-output table based on the 2020 China input-output table. Three application examples (the value added of the core digital economy industries, the final-use composition of the digital sectors, and the digitalized input structure of traditional economic sectors) show that the compiled table provides effective data for meso- and macro-level research on the digital economy and is a useful step toward further improving China's digital economy statistical accounting.

Joint work with Mingna Li.

Internet Usage and Economic Resilience of

Low-Income Households

互联网使用与低收入群体家庭经济韧性

Gang Peng

Southwestern University of Finance and Economics

Abstract: Low-income groups still face a certain risk of falling back into poverty. Focusing on the risk-resistance capacity of low-income households and examining how internet usage affects their economic resilience is of practical importance for the digital economy to substantively advance common prosperity. Using income quintile data from the national household survey and the China Family Panel Studies (CFPS), this paper calibrates low-income thresholds for each region, then measures the economic resilience of low-income households with CFPS data from 2012 to 2020 and empirically studies the effect of internet usage and the corresponding mechanisms. We find that internet usage has a significant positive effect on the economic resilience of low-income households, operating through human capital, social capital, employment promotion, and expanded information channels. At the same time, three digital divides (across regions, between urban and rural areas, and across population groups) weaken this positive effect; equalizing digital infrastructure, making substantive progress in new-type urbanization, and improving the digital literacy of the elderly are important ways to bridge the divides. The paper provides new empirical evidence on using the digital economy to promote common prosperity and offers guidance on strengthening the economic resilience of low-income households and preventing large-scale relapse into poverty.

Joint work with Ying Dai, Xiaoye Liao, Delin Yang.

International Progress and Implications of Research on Statistical Measurement of the Digital

Economy

数字经济统计测度的国际进展及对中国的启示

Meihui Zhang

Shandong University of Finance and Economics

Abstract: The digital economy poses serious challenges to economic and social statistics. In recent years, international organizations, official statistical agencies, and researchers have actively studied the statistical measurement of the digital economy and produced a rich body of results. This talk systematically reviews international progress on measuring the size of the digital economy and on the statistical measurement of e-commerce, digital intermediary services, and data assets, and on this basis offers several implications for research on digital economy measurement, aiming to provide a reference for improving the statistical measurement system of the digital economy and promoting its high-quality development.

Analysis of Industrial Correlation Effect of Digital

Economy in Northeast China

东北地区数字经济产业关联效应分析

Guorong Li

Jilin University of Finance and Economics

Abstract: Based on the multi-regional input-output tables for 2012, 2015, and 2017, this paper compiles multi-regional digital economy input-output tables and studies the industrial linkage characteristics and linkage effects of the digital economy in Northeast China. We find that digital product manufacturing and digital-factor-driven industries have the strongest pull on economic development in the three northeastern provinces; digital product services have the strongest pull on the total output of downstream industries in the region, while digital-factor-driven industries better stimulate upstream industries; the final demand of each province has the most significant induced effect on digital-factor-driven industries, and digital industries are largely investment-dependent and export-dependent.

Joint work with Fang Chen.

Invited Session IS046: Recent Advances in Causal

Inference

Principled Random Forests: Uncertainty Quantification for Tree Structure Models with Robustness

to Stratification Errors

Zhenyu Wang

Rutgers University

Abstract: Tree models are a class of machine learning models used to estimate the conditional mean

E[Y|X], known for their interpretability and straightforward application. However, a significant limitation

of these models is the lack of valid inferential tools,

especially in finite sample scenarios. This work introduces Principled Random Forests (PRF), a novel

methodology that uses synthetic random errors to help

address inference problems. Our theoretical development shows that the PRF method has inference performance guarantees under mild conditions. The PRF

method is further generalized to analyzing conditional average treatment effects within the realm of causal inference, where it proves particularly adept at accommodating heterogeneity across different subgroups. A novel

filtering technique is also proposed and the enhancement significantly improves the inference efficiency

of PRF by reducing the length of confidence intervals.

Both numerical simulations and real-world applications demonstrate the effectiveness of our method,

showcasing its potential to advance the utility of tree

models in complex analytical tasks.

Joint work with Minge Xie, Zijian Guo.

Engression: Extrapolation for Nonlinear Regression?

Xinwei Shen

Swiss Federal Institute of Technology Zurich

Abstract: Extrapolation is crucial in many statistical and machine learning applications, as it is common to encounter test data outside the training support. However, extrapolation is a considerable challenge for

nonlinear models. Conventional models typically

struggle in this regard: while tree ensembles provide a

constant prediction beyond the support, neural network predictions tend to become uncontrollable. This

work aims at providing a nonlinear regression methodology whose reliability does not break down immediately at the boundary of the training support. Our

primary contribution is a new method called "engression" which, at its core, is a distributional regression

technique for pre-additive noise models, where the

noise is added to the covariates before applying a

nonlinear transformation. Our experimental results

indicate that this model is typically suitable for many

real data sets. We show that engression can successfully perform extrapolation under some assumptions

such as a strictly monotone function class, whereas

traditional regression approaches such as least-squares

regression and quantile regression fall short under the

same assumptions. We establish the advantages of

engression over existing approaches in terms of extrapolation, showing that engression consistently provides a meaningful improvement. Our empirical results, from both simulated and real data, validate these

findings, highlighting the effectiveness of the engression method.

Joint work with Nicolai Meinshausen.
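As a toy illustration of the model class discussed in the abstract above (not of the engression estimator itself, which fits a generative network by distributional regression), the Python snippet below simulates a pre-additive noise model Y = g(X + eps) next to the usual post-additive model Y = g(X) + eps; the choice of g and the noise scale are arbitrary assumptions made here for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def g(x):
        # an arbitrary strictly monotone nonlinear transformation
        return np.log1p(np.exp(3.0 * x))

    n = 1000
    x = rng.uniform(-2.0, 2.0, size=n)
    eps = rng.normal(scale=0.5, size=n)

    y_pre = g(x + eps)    # pre-additive noise: the class targeted by engression
    y_post = g(x) + eps   # conventional post-additive noise model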

On the Possibility of Doubly-Robust Root-n Inference

Matteo Bonvini

Rutgers University

Abstract:We study the problem of constructing an

estimator of the average treatment effect (ATE) that

exhibits doubly-robust asymptotic linearity (DR-AL).

This is a stronger requirement than doubly-robust

consistency. In fact, a DR-AL estimator can yield

asymptotically valid Wald-type confidence intervals

even in the case when the propensity score or the outcome model is inconsistently estimated. By contrast, the celebrated doubly-robust, augmented-IPW

estimator requires consistent estimation of both nuisance functions for root-n inference. Previous authors

have considered this problem (van der Laan, 2014,

Benkeser et al, 2017, Dukes et al 2021) and provided

sufficient conditions under which the proposed estimators are DR-AL. Such conditions are typically

stated in terms of "high-level nuisance error rates"

needed for root-n inference. In this paper, we build

upon their work and establish sufficient and more

explicit smoothness conditions under which a DR-AL

estimator can be constructed. We also consider the

case of slower-than-root-n convergence rates and

study minimax optimality within the structure-agnostic framework proposed by Balakrishnan et

al (2023). Finally, we clarify the connection between

DR-AL estimators and those based on higher-order

influence functions (Robins et al, 2017) and complement our theoretical findings with simulations.

Joint work with Edward Kennedy, Oliver Dukes,

Sivaraman Balakrishnan.
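For readers unfamiliar with the baseline being compared against, the following minimal Python sketch (an illustration only, not the DR-AL construction studied in the talk) shows the standard augmented-IPW estimator of the ATE with a Wald-type interval; the nuisance fits mu1_hat, mu0_hat, and e_hat are assumed to come from any user-chosen models.

    import numpy as np

    def aipw_ate(y, a, mu1_hat, mu0_hat, e_hat):
        # Augmented-IPW (doubly robust) influence-function values for the ATE.
        # y: outcomes; a: binary treatment indicator; mu1_hat, mu0_hat: fitted
        # outcome regressions E[Y | A=1, X] and E[Y | A=0, X]; e_hat: fitted
        # propensity scores P(A=1 | X).  All are length-n numpy arrays.
        psi = (mu1_hat - mu0_hat
               + a * (y - mu1_hat) / e_hat
               - (1 - a) * (y - mu0_hat) / (1 - e_hat))
        ate = psi.mean()
        se = psi.std(ddof=1) / np.sqrt(len(y))   # Wald-type standard error
        return ate, ate - 1.96 * se, ate + 1.96 * se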

Long-Term Causal Inference under Persistent

Confounding via Data Combination

Yuhao Wang

Tsinghua University

Abstract:We study the identification and estimation

of long-term treatment effects by combining

short-term experimental data and long-term observational data subject to unobserved confounding. This

problem arises often when concerned with long-term

treatment effects since experiments are often

short-term due to operational necessity while observational data can be more easily collected over longer

time frames but may be subject to confounding. In this

paper, we uniquely tackle the challenge of persistent

confounding: unobserved confounders that can simultaneously affect the treatment, short-term outcomes,

and long-term outcome. In particular, persistent confounding invalidates identification strategies in previous approaches to this problem. To address this challenge, we exploit the sequential structure of multiple

short-term outcomes and develop three novel identification strategies for the average long-term treatment

effect. Based on these, we develop estimation and

inference methods with asymptotic guarantees. To

demonstrate the importance of handling persistent


confounders, we apply our methods to estimate the

effect of a job training program on long-term employment using semi-synthetic data.

Joint work with Guido Imbens, Nathan Kallus,

Xiaojie Mao.

Invited Session IS093: Mathematical Foundations

in AI

Two Phases of Scaling Laws for Nearest Neighbor

Classifiers

Pengkun Yang

Tsinghua University

Abstract:A scaling law refers to the observation that

the test performance of a model improves as the

number of training data increases. A fast scaling law

implies that one can solve machine learning problems

by simply boosting the data and the model sizes. Yet,

in many cases, the benefit of adding more data can be

negligible. In this work, we study the rate of scaling

laws of nearest neighbor classifiers. We show that a

scaling law can have two phases: in the first phase, the

generalization error depends polynomially on the data

dimension and decreases fast; whereas in the second

phase, the error depends exponentially on the data

dimension and decreases slowly. Our analysis highlights the complexity of the data distribution in determining the generalization error. When the data distributes benignly, our result suggests that nearest

neighbor classifier can achieve a generalization error

that depends polynomially, instead of exponentially,

on the data dimension.

Joint work with Zhang Jingzhao.
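One simple way to see the empirical side of the scaling behavior discussed above is to trace the nearest neighbor test error as the training size grows; the short Python sketch below does so on a toy benign distribution (the data-generating rule and dimension are arbitrary assumptions here, and the sketch is not the paper's analysis).

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    d = 10

    def sample(n):
        # a benign toy distribution: the label depends on one coordinate only
        X = rng.uniform(size=(n, d))
        y = (X[:, 0] > 0.5).astype(int)
        return X, y

    X_test, y_test = sample(5000)
    for n in [100, 1000, 10000]:
        X_train, y_train = sample(n)
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
        print(n, round(1.0 - clf.score(X_test, y_test), 3))   # test error vs n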

Accelerated Gradient Algorithms with Adaptive

Subspace Search for Instance-Faster Optimization

实例更快的加速梯度下降算法

Cong Fang

Peking University

Abstract: Machine learning problems call for fast optimization algorithms. A seminal line of work on first-order methods is the momentum-based accelerated gradient descent designed by Yurii Nesterov around 1985. Accelerated gradient descent is optimal for minimizing general convex functions with Lipschitz-continuous gradients, where optimality is measured against the hardest objective satisfying these conditions. In practice, however, the optimization problems arising in machine learning may not be that hard: a common observation is that the Hessian of the objective depends on the data and its singular values often decay rapidly. In this talk, we study the design of algorithms that adapt to each problem instance. For smooth convex optimization, we propose a new algorithm that optimizes over adaptively chosen subspace blocks and achieves instance-dependent faster convergence rates. When the sum of the singular values of the objective is of constant order, the algorithm is strictly faster than accelerated gradient descent.

Constrained Policy Optimization with Explicit

Behavior Density for Offline Reinforcement

Learning

Wenjia Wang

The Hong Kong University of Science and Technology (Guangzhou)

Abstract:Due to the inability to interact with the

environment, offline reinforcement learning (RL)

methods face the challenge of estimating the

Out-of-Distribution (OOD) points. Existing methods

for addressing this issue either control the policy to exclude OOD actions or make the Q-function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To

overcome this problem, we propose a Constrained

Policy optimization with Explicit Behavior density

(CPED) method that utilizes a flow-GAN model to

explicitly estimate the density of behavior policy. By

estimating the explicit density, CPED can accurately

identify the safe region and enable optimization within the region, resulting in less conservative learning

policies. We further provide theoretical results for

both the flow-GAN estimator and performance guarantee for CPED by showing that CPED can find the

optimal Q-function value. Empirically, CPED outperforms existing alternatives on various standard

offline reinforcement learning tasks, yielding higher

expected returns.

Joint work with Jing Zhang, Chi Zhang, Bing-Yi

Jing.

Understanding the In-context Learning Capabilities of Large Language Models

Yong Liu

Renmin University of China


Abstract:Large Language Models (LLM) exhibit

astonishing in-context learning (ICL) capabilities.

Given a few examples, the model demonstrates excellent learning performance on new tasks without updating its parameters. However, the inherent learning

mechanism of ICL remains unclear. Interpreting the

reasoning process of ICL as an implicit gradient update process under a contrastive learning paradigm,

we provide a novel explanation for ICL. Additionally,

from the perspective of contrastive learning, several

ideas are proposed to improve the original ICL method. This result will provide assistance in gaining a

deeper understanding of ICL mechanisms, and further

designing new ICL algorithms based on this understanding.

Invited Session IS059: Recent Developments in

Causal Learning

因果学习最新进展

Principal Stratification with Continuous

Post-Treatment Variables: Nonparametric Identification and Semiparametric Estimation

Zhichao Jiang

Sun Yat-sen University

Abstract:Post-treatment variables often complicate

causal inference. They appear in many scientific

problems, including noncompliance, truncation by

death, mediation, and surrogate endpoint evaluation.

Principal stratification is a strategy to address these

challenges by adjusting for the potential values of the

post-treatment variables, defined as the principal strata. It allows for characterizing treatment effect heterogeneity across principal strata and unveiling the

mechanism of the treatment's impact on the outcome

related to post-treatment variables. However, the existing literature has primarily focused on binary

post-treatment variables, leaving the case with continuous post-treatment variables largely unexplored.

This gap persists due to the complexity of infinitely

many principal strata, which present challenges to

both the identification and estimation of causal effects.

We fill this gap by providing nonparametric identification and semiparametric estimation theory for principal stratification with continuous post-treatment

variables. We propose to use working models to approximate the underlying causal effect surfaces and

derive the efficient influence functions of the corresponding model parameters. Based on the theory, we

construct doubly robust estimators and implement

them in an R package.

Joint work with Sizhu Lu, Peng Ding.

Extreme-Based Causal Effect Learning (EXCEL)

with Unmeasured Light-Tailed Confounding

Wang Miao

Peking University

Abstract: Unmeasured confounding poses a significant challenge in identifying and estimating causal

effects across various research domains. Existing

methods to address confounding often rely on either

parametric models or auxiliary variables, which

strongly rest on domain knowledge and could be fairly

restrictive in practice. In this paper, we propose a

novel strategy for identifying causal effects in the

presence of confounding under an additive structural

equation with light-tailed confounding. This strategy

uncovers the causal effect by exploring the relationship between the exposure and outcome at the extreme,

which can bypass the need for parametric assumptions

and auxiliary variables. The resulting identification is

versatile, accommodating a multi-dimensional exposure, and applicable in scenarios involving unmeasured confounders, selection bias, or measurement

errors. Building on this identification approach, we

develop an extreme-based causal effect learning

(EXCEL) method and further establish its consistency

and non-asymptotic error bound. The asymptotic

normality of the proposed estimator is established

under the linear model. The EXCEL method is applied

to causal inference problems with invalid instruments

to construct a valid confidence set for the causal effect.

Simulations and a real data analysis are used to illustrate the effectiveness of our method in addressing

confounding, showcasing its potential for broad application in causal inference.

Covariate Adjustment in Randomized Experiments

with Missing Outcomes and Covariates


Anqi Zhao

Duke University

Abstract:Covariate adjustment can improve precision

in analyzing randomized experiments. With fully observed data, regression adjustment and propensity

score weighting are asymptotically equivalent in improving efficiency over unadjusted analysis. When

some outcomes are missing, we consider combining

these two adjustment methods with inverse probability

of observation weighting for handling missing outcomes, and show that the equivalence between the

two methods breaks down. Regression adjustment no

longer ensures efficiency gain over unadjusted analysis unless the true outcome model is linear in covariates or the outcomes are missing completely at random. Propensity score weighting, in contrast, still

guarantees efficiency over unadjusted analysis, and

including more covariates in adjustment never harms

asymptotic efficiency. Moreover, we establish the

value of using partially observed covariates to secure

additional efficiency by the missingness indicator

method, which imputes all missing covariates by zero

and uses the union of the completed covariates and

corresponding missingness indicators as the new, fully

observed covariates. Based on these findings, we

recommend using regression adjustment in combination with the missingness indicator method if the linear outcome model or missing completely at random

assumption is plausible and using propensity score

weighting with the missingness indicator method otherwise.

Joint work with Peng Ding, Fan Li.
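The missingness indicator construction described above is simple to implement; a minimal Python sketch (using NaN to mark missing covariate entries, an encoding assumed here for illustration) is:

    import numpy as np

    def missingness_indicator_covariates(X):
        # X: (n, p) covariate matrix with np.nan marking missing entries.
        # Impute every missing covariate by zero and append the missingness
        # indicators, yielding an (n, 2p) fully observed covariate matrix
        # to be used in regression adjustment or propensity score weighting.
        R = np.isnan(X).astype(float)            # 1 = missing, 0 = observed
        X_zero = np.where(np.isnan(X), 0.0, X)   # zero-imputed covariates
        return np.hstack([X_zero, R])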

Flexible Sensitivity Analysis for Causal Inference

in Observational Studies Subject to Unmeasured

Confounding

Peng Ding

University of California, Berkeley

Abstract:Causal inference with observational studies

often suffers from unmeasured confounding, yielding

biased estimators based on the unconfoundedness

assumption. Sensitivity analysis assesses how the

causal conclusions change with respect to different

degrees of unmeasured confounding. Most existing

sensitivity analysis methods work well for specific

types of statistical estimation or testing strategies. We

propose a flexible sensitivity analysis framework that

can deal with commonly-used inverse probability

weighting, outcome regression, and doubly robust

estimators simultaneously. It is based on the

well-known parametrization of the selection bias as

comparisons of the observed and counterfactual outcomes conditional on observed covariates. It is attractive for practical use because it only requires simple

modifications of the standard estimators. Moreover, it

naturally extends to many other causal inference settings, including the causal risk ratio or odds ratio, the

average causal effect on the treated units, and studies

with survival outcomes. We also develop an R package saci to implement our sensitivity analysis estimators.

Joint work with Sizhu Lu.

Invited Session IS030: Large-Scale Inference and

Private Statistical Analysis

A Distribution-Free Empirical Bayes Approach to

Multiple Testing with Side Information

Wenguang Sun

Zhejiang University

Abstract:This article presents the Conformalized

Locally Adaptive Weighting (CLAW) approach for

multiple testing with side information. We propose

innovative data-driven strategies for constructing

pairwise exchangeable scores, which are incorporated

into a generic algorithm that utilizes a mirror process

to control the false discovery rate (FDR). CLAW successfully combines empirical Bayes concepts with

frequentist methodologies, providing a principled and

flexible tool for integrating structural information

from both the test data and auxiliary covariates. It

offers both theoretical rigor and practical adaptability,

ensuring valid and distribution-free inference with

high statistical power. Extensive numerical studies

using both simulated and real data demonstrate that

CLAW effectively controls the FDR and enjoys superior power compared to existing methods.

Joint work with Zinan Zhao.


Evidence Transportation with Aggregate Summary

Information

Ying Sheng

Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Abstract: With the increasing availability of data in the public domain,

there has been a growing interest in exploiting information from the source population to facilitate the

decision-making processes in the target population.

However, in real-world applications, particularly those

dealing with sensitive areas such as healthcare and

finance, individual-level data are often unavailable,

leaving only aggregate data from the target population.

This paper introduces two methods for transporting

evidence from the source population to the target

population using only covariate summary statistics to

account for distributional shifts and uncertainty in the

aggregate data.

Joint work with Chiung-Yu Huang and Yifei Sun.

Learning Invariant Representations for Algorithmic Fairness and Domain Generalization with

Minimax Optimality

Sai Li

Renmin University of China

Abstract:Machine learning methods often assume

that the test data have the same distribution as the

training data. However, this assumption may not hold

due to multiple levels of heterogeneity in applications,

raising issues in algorithmic fairness and domain generalization. In this work, we address the problem of

fair and generalizable machine learning by invariant

principles. We propose a training environment-based

oracle, FAIRM, which has desirable fairness and domain generalization properties under a diversity-type

condition. We then provide an empirical FAIRM with

finite-sample theoretical guarantees under weak distributional assumptions. We further develop efficient

algorithms to realize FAIRM in linear models and

demonstrate the nonasymptotic performance with

minimax optimality.

Joint work with Linjun Zhang.

Model-free Variable Importance Testing with Machine Learning Methods

Xu Guo

Beijing Normal University

Abstract: In this paper, we investigate the variable importance testing problem in a model-free framework. Some remarkable procedures have been developed recently.

Despite their success, existing procedures suffer from

a significant limitation, that is, they generally require

a larger training sample and do not have the fastest possible convergence rate under the alternative hypothesis.

In this paper, we propose a new procedure to test variable importance. Flexible machine learning methods

are adopted to estimate unknown functions. Under

the null hypothesis, our proposed test statistic converges to a standard chi-squared distribution, while under local alternative hypotheses it converges to a non-central chi-squared distribution. It has non-trivial power

against the local alternative hypothesis which converges to the null at the fastest possible rate. We also

extend our procedure to test conditional independence.

Asymptotic properties are also developed. Numerical

studies and two real data examples are conducted to

illustrate the performance of our proposed test statistic.

Joint work with Xinyu Zhang, Niwen Zhou, Xuejun

Jiang.

Invited Session IS068: Statistical Applications in

Behavioral Decision and Behavioral Experiments

统计学在行为决策及行为实验中的应用

The Symbiosis of Prosocial Behavior and Network

Groups

亲社会行为与网络群体的共生性

Danyang Jia

Tsinghua University

Abstract: In real social activities, humans display complex and varied capacities for interaction. It is widely recognized that individuals are aware of their social network environment and are able to use social relationships to their own advantage. Economic games combined with network science are therefore often used to study social behavior. However, many theoretical models and experimental studies artificially restrict individuals' ability to act differently toward different social neighbors (that is, they restrict their social networks). To address this issue, this study designs behavioral experiments with human subjects in which the degree of decision freedom within network groups can be adjusted, and applies them to the prisoner's dilemma, the trust game, and the ultimatum game to explore cooperation, trust, and fairness in social networks. The results show that, in all three types of games, granting individuals the freedom to make independent decisions during interactions elicits more prosocial behavior, higher wealth, and lower inequality than when such freedom is restricted. In addition, we analyze the behavioral characteristics of the groups and provide a deeper understanding of the experimental findings from the perspective of mathematical models. The study suggests that human behavior is in fact more prosocial than current science indicates, offering new directions for more diverse and realistic research in the behavioral sciences.

Inequality Shapes Strategies in the Infinite Repeated Prisoner's Dilemma Games

Xiaogang Li

Yunnan University of Finance and Economics

Abstract:The 'shadow of the future' refers to the

impact of future expectations on people's current

choices. In infinitely repeated prisoner's dilemma

games, a variety of behavioral rules come into play.

However, the factors that mobilize these rules are not

well understood, especially across different frames or

mechanisms. This paper proposes a solution derived

from prospect theory to explore these problems. Experiments reveal several key insights. First, equality

behavior emerges as the absolutely dominant rule,

with subjects expressing the most satisfaction with

equality, followed by advantageous inequality, and the

least satisfaction with disadvantageous inequality.

Second, punishment lowers the neutral reference point,

reinforcing persistence behavior and sometimes triggering a status quo bias among subjects. Third, reward

raises the neutral reference point, leading to an affinity for rewarding equality, as this approach satisfies

both equality and benefit-seeking simultaneously.

Additionally, a new Bush-Mosteller reinforcement

learning model incorporating multiple reference

points yields results consistent with the lab findings.

This model and the experimental results contribute to

a deeper understanding of behavioral rules in game

theory and have broader implications for economic

behavior and decision-making.

Joint work with Lei Shi.

Research on Traceability of Social Network Communication

社交网络传播溯源研究

Chao Gao

Northwestern Polytechnical University

Abstract: The rapid spread of malicious or harmful information on social networks has attracted wide attention. Identifying the origin of such information as early as possible and blocking its further spread is therefore an important task and a key part of propagation-chain forensics. We first propose a greedy full-order neighbor localization method based on the deployment of observation points, aimed at locating the propagation source quickly and accurately. To handle localization in large-scale networks more effectively, we further propose a lightweight greedy-coverage fast source localization method, which identifies the source in a more efficient and flexible way. These observer-deployment methods can exploit prior information about regions that are predictable or need protection in advance to locate the source quickly and accurately, but we also pursue the ability to capture snapshots and localize the source at any time in diverse environments. Accordingly, given the low cost of obtaining propagation snapshots, we further propose a time-series-based graph attention framework for source identification. By analyzing randomly acquired propagation snapshots and studying user interactions, the framework achieves accurate localization in complex scenarios and under transfer across propagation settings, thereby broadening the applicability of source localization strategies.

A New Framework for the Study of Multi-label

Classification

Wenchen Liu

Shanghai Lixin University of Accounting & Finance

Abstract: Nowadays, multi-label classification methods are increasingly required by modern applications, such as gene function classification, music categorization, and bird classification. In this paper, a new framework for multi-label classification is built. Starting from marginal loss functions not necessarily derived from probability distributions, we utilize an additive over-parametrization with shrinkage to incorporate label dependencies into the criterion. Non-convex robust loss functions are used to reduce the influence of mislabeled labels, a joint regularization combining sparsity and rank reduction handles high-dimensional data, and a masking method is used to handle missing labels. Simulation and real data analysis show the power of the new method. The new method not only builds a multi-label classification model for yeast gene data, but also captures the dependencies between the yeast gene data and the phylogenetic mapping.

Joint work with Yiyuan She, Yincai Tang.

Invited Session IS074: Statistical Learning for

Complex and Challenging Data

Unique Solution and Significance Test of Solution

in Factor Analysis

Haiming Lin

Guangzhou Huashang College

Abstract: This paper briefly reviews the work of Rao, Anderson, and others on factor models and factor similarity models, and raises the problems of improving factor models and of the uniqueness and significance testing of their solutions. Using a method that orders factors by descending variance contribution, a standardized principal component method, and a special factor solution method, we obtain a unique solution of the factor model. Guided by the purpose of factor analysis and by the factor model and its solution, abandoning three hidden biases of the factor model and introducing reasonable assumptions, we develop a new factorial minimum-error model with a unique, rotatable solution, which is the best model for obtaining an optimized factor analysis solution under the constraints of its purpose; its results coincide with those of the commonly used principal component and regression estimation methods. A new test criterion is also developed, which can be used to test the significance of factor solutions and addresses the issues raised in this paper.

Growth Curve Mixture Model with Toeplitz

Structured Covariance

Yating Pan

Yunnan University of Finance and Economics

Abstract:Though playing an important role in longitudinal data analysis, the uses of growth curve models are constrained by the crucial assumption that the

grouping design matrix is known. We propose a

Gaussian mixture model within the framework of

growth curve models which handles the problem

caused by the unknown grouping matrix. This allows

for a greater degree of flexibility in specifying the

model and freeing the response matrix from following

a single multivariate normal distribution. The new

model is considered under a Toeplitz structured covariance, which covers most within-individual correlation types. The maximum likelihood estimation of

the proposed model is studied using the ECM and

ADMM algorithms, which cluster growth curve data simultaneously. Data-driven methods are proposed to

find various model parameters so as to create an appropriate model for complex growth curve data. Simulation studies are conducted to assess the performance of the proposed methods and real data analysis

on gene expression clustering is made, showing that

the proposed procedure works well in both model fitting and growth curve data clustering.

Joint work with Yuerong Li, Jianxin Pan.

Modeling Heat Transfer Properties in an ORC

Direct Contact Evaporator Using RBF Neural

Network Combined with EMD

Qingtai Xiao

Kunming University of Science and Technology

Abstract: Without an intervening wall, the direct contact evaporator (DCE) has already been technically proven to improve the overall thermal efficiency of the

organic Rankine cycle (ORC) used to recover

low-grade heat sources and transform them into power.

In the estimation of volumetric heat transfer coefficient (VHTC) which is assumed to vary with flow rate,

noise signals caused by various unstable factors (e.g.,

measurement errors) often corrupt the time series of

VHTC. For forecasting the heat transfer performance

of DCE in ORC more accurately, this paper proposes

a novel approach (referred to as EMD-RBF-NN), which

combines multi-input radial basis function (RBF)

neural network (NN) and empirical mode decomposition (EMD) method. Specifically, the original VHTC

time series is firstly decomposed by EMD method that

is fully data-driven. Then, the proposed method models the resultant decomposition series with flow rates

of two fluids (dispersed and continuous phases) and

VHTC by using RBF neural network. This simple

technique was illustrated by using the ORC direct

contact evaporator (ORC-DCE) and data processing


system. Using the experimental datasets of

ORC-DCE, this paper demonstrates that the proposed

EMD-RBF-NN model that associates flow rates of

two phases with VHTC improves the forecasting accuracy of VHTC noticeably compared with existing

models.

Joint work with Kai Yang, Xiaoxue Zhang, Yinzhen

Tan, Hua Wang.

Joint Mean-Angle Model for Spatial Binary Data

Renwen Luo

BNU-HKBU United International College

Abstract:The analysis of spatially correlated binary

data has received substantial attention in

geo-statistical research but is very challenging due to

the intricacy of the distributional form. Two principal

objectives include examining the dependence of binary response on covariates of interest and quantifying

the covariances or correlations between pairs of outcomes. While the literature has sufficiently addressed

the modelling issue of the mean structure of a binary

response, the characterization of the covariances between pairs of binary responses in terms of covariates

is not clear. In this paper, we propose methods to explain such characterizations by using a latent Gaussian

copula model with alternative hypersphere decomposition of the covariance matrix. Correctly specifying

the covariance matrix is crucial not only for the high

efficiency of mean parameters but also for scientific

interest. The key is to model the marginal mean and

pairwise covariance, simultaneously, for spatial binary

data. Two generalized estimating equations are proposed to estimate the parameters, and the asymptotic

properties of the resulting estimators are investigated.

To evaluate the performance of the methods, we conduct simulation studies and provide real data analysis

for illustration.

Joint work with Cheng Peng, Yang Han, Jianxin Pan.

Invited Session IS081: Theory, Method and Application for Major Problems in Statistical Modernization of China

Path Selection for Achieving "One Network" Economic Statistical Accounting From the Source of

Methodology, System, and Technology Application

从方法制度与技术应用源头实现经济统计核算“一

网统”的路径选择

Xinhong Yang

Statistics Bureau of Guangdong Province

Abstract: Facing the new challenges that the current wave of digitalization poses to statistics, this talk considers how official statistics can proactively engage with digital government reform. Taking Guangdong as an example, it explores using modern information technology to digitally integrate and share the economic operation information and administrative records scattered across statistical, development-and-reform, industry-and-information and other departments at all levels of government. With a view to building smart statistics, it actively promotes the "One Network" reform, grounded in management, centered on unified statistics, and oriented toward use, so that major economic indicators are collected in a multi-dimensional, comprehensive, dynamic, and traceable way that conforms to statistical standards. Economic activity is then decomposed at a fine granularity and dynamically monitored along industry, sector, region, and time dimensions, providing comprehensive, accurate, and timely data support and early warning for decision-making, and a quantitative basis for optimizing industrial structure and formulating industrial policy.

Research on Modernization and Innovation in

Cultural and Tourism Statistics

文化和旅游业统计现代化发展问题与创新研究

Xiang Li

Beijing Union University

Abstract: The establishment of the Ministry of Culture and Tourism marked the administrative integration of culture and tourism, yet the modernization of cultural and tourism statistics still has a long way to go. From a market perspective, culture and tourism are highly complementary and their integrated development is inevitable, which calls for innovations in the statistical system to measure precisely the overall scale, quality, and efficiency of the cultural and tourism industries. Based on the National Statistical Survey System for Culture, Cultural Relics and Tourism, this paper introduces the content, indicators, methods, and latest data of China's cultural statistics and tourism statistics; identifies problems of administrative boundaries, conceptual boundaries, and statistical operability in the modernization of these statistics, issues of scientific rigor, comparability, big-data quality, survey methodology, and statistical workforce development in tourism statistics, and problems in statistics for culture-tourism integration; and discusses research directions for innovation in the integrated development of cultural and tourism statistics, in the use of new data sources, and in local statistical practice.

Joint work with Shaohua Shi, Taiyue Wu, Yingjie

Chen.

Measurement, Identification and Response Mechanism of China's Economic Growth Risk under

High Dimensional Data


高维数据下中国经济增长风险的测度、识别与应对

机制

Xiaobin Tang

University of International Business and Economics

Abstract: Against the backdrop of growing complexity and uncertainty in the economic environment, strengthening the monitoring and early warning of economic risks and the mechanisms for risk contagion and governance is particularly important. Starting from uncertainty theory, this paper first proposes a combined probability-distribution forecasting model that uses high-dimensional real-time economic data to measure China's economic growth risk. It then focuses on the evolution and contagion mechanisms of downside growth risk, constructing a time-varying directed spillover network of downside risks across economic subsystems to clarify how such risks propagate. Finally, it evaluates how different types of macroeconomic policy respond to downside growth risk. The findings are as follows. As China's economy has entered a "great moderation" phase, growth risk has gradually stabilized over the long run, with clear deterministic downside features. Industrial output is the core driver of growth risk, although its dominance has weakened since the supply-side structural reform, while the driving roles of consumption and investment on the demand side have gradually risen over the long run. Further analysis shows that the downside risks of real-economy subsystems such as industrial output, consumption, and investment are strongly affected by contagion from the downside risks of virtual-economy subsystems such as real estate and finance, with real estate the main source of downside risk and consumption the largest receiver. Compared with monetary policy, fiscal policy mitigates downside growth risk more effectively and for longer, so counter-cyclical and cross-cyclical macro adjustment should continue to rely on proactive fiscal policy combined with prudent monetary policy.

Joint work with Maosheng Cui.

Literature Review and Development Prospect of

Economic Statistics Papers

经济统计论文文献评析及发展展望

Jingping Li

Renmin University of China

Abstract: In the era of the digital economy, research in economic statistics faces new challenges and opportunities. Traditional economic statistics no longer meet the needs of studying real problems in this era; researchers need the ability to handle large-scale data and apply sophisticated analytical techniques, while the use and analysis of big data raise challenges of data quality and reliability. Challenges breed opportunities: the digital economy has generated a large number of new research topics, the integration of different data sources can provide more comprehensive material for economic statistics, and big data and artificial intelligence can capture economic activity, trends, and changes more accurately, providing stronger support for data analysis. Against this background, whether researchers in economic statistics study the new problems of the digital era and use big data and data science methods bears directly on the discipline's prospects. We conduct a bibliometric analysis of papers published in recent years by researchers in economic statistics, extract research topics, trace their evolution, and examine the data analysis methods used, in order to map the current state and dynamics of research in the field, identify frontier trends, research hotspots, and challenges, and thereby provide some help for academic research and discipline building in economic statistics.

Invited Session IS084: Statistical Modeling and

Inference of High-Dimensional Complex Data

高维复杂数据的统计建模与推断

High-dimensional Covariance Matrix Estimation

under Dynamic Volatility Models: Asymptotics and

Shrinkage Estimation

Yi Ding

University of Macau

Abstract: We study the estimation of

high-dimensional covariance matrices and their empirical spectral distributions under dynamic volatility

models. Data under such models have nonlinear dependency both cross-sectionally and temporally. We

establish the condition under which the limiting spectral distribution (LSD) of the sample covariance matrix under scalar BEKK models is different from the

i.i.d. case. We then propose a time-variation adjusted

(TV-adj) sample covariance matrix and prove that its

LSD follows the Marcenko-Pastur law. Based on the

asymptotics of the TV-adj sample covariance matrix,

we develop a consistent population spectrum estimator and an asymptotically optimal nonlinear shrinkage

estimator of the unconditional covariance matrix.

Joint work with Xinghua Zheng.

Robust Estimation of Number of Factors in High

Dimensional Factor Modeling via Spearman's

Rank Correlation Matrix

Zeng Li

Southern University of Science and Technology

Abstract:Determining the number of factors in

high-dimensional factor modeling is essential but

challenging, especially when the data are heavy-tailed.

In this paper, we introduce a new estimator based on


the eigenvalue asymptotic properties of Spearman’s

sample rank correlation matrix under the

high-dimensional setting, where both the dimension

and sample size tend to infinity proportionally. Our

estimator is applicable for scenarios where either the

common factors or idiosyncratic errors follow

heavy-tailed distributions. We prove that the proposed

estimator is consistent under mild conditions. Numerical experiments also demonstrate the superiority of

our estimator compared to existing methods, especially for the heavy-tailed case.

An Efficient Multivariate Volatility Model for

Many Assets

Wenyu Li

The University of Hong Kong

Abstract:This paper develops a flexible and computationally efficient multivariate volatility model,

which allows for dynamic conditional correlations and

volatility spillover effects among financial assets. The

new model has desirable properties such as identifiability and computational tractability for many assets.

A sufficient condition of the strict stationarity is derived for the new process. Two quasi-maximum likelihood estimation methods are proposed for the new

model with and without low-rank constraints on the

coefficient matrices respectively, and the asymptotic

properties for both estimators are established. Moreover, a Bayesian information criterion with selection

consistency is developed for order selection, and the

testing for volatility spillover effects is carefully discussed. The finite sample performance of the proposed methods is evaluated in simulation studies for

small and moderate dimensions. The usefulness of the

new model and its inference tools is illustrated by two

empirical examples for 5 stock markets and 17 industry portfolios, respectively.

Joint work with Yuchang Lin, Qianqian Zhu,

Guodong Li.

Online Change-Point Detection for Matrix-Valued

Time Series with Latent Two Way Factor Structure

Long Yu

Shanghai University of Finance and Economics

Abstract:This paper proposes a novel methodology

for the online detection of changepoints in the factor

structure of large matrix time series. Our approach is

based on the well-known fact that, in the presence of a

changepoint, the number of spiked eigenvalues in the

second moment matrix of the data increases (e.g., in

the presence of a change in the loadings, or if a new

factor emerges). Based on this, we propose two families of procedures - one based on the fluctuations of

partial sums, and one based on extreme value theory -

to monitor whether the first non-spiked eigenvalue

diverges after a point in time in the monitoring horizon, thereby indicating the presence of a changepoint.

Our procedure is based only on rates; at each point in

time, we randomise the estimated eigenvalue, thus

obtaining a normally distributed sequence which is

i.i.d. with mean zero under the null of no break,

whereas it diverges to positive infinity in the presence

of a changepoint. We base our monitoring procedures

on such sequence. Extensive simulation studies and

empirical analysis justify the theory. An R package

implementing the procedure is available on CRAN.

Joint work with Yong He, Xinbing Kong, Trapani

Lorenzo.

Contributed Session CS005: Recent Advances in

Bayesian Analysis

Variational Bayesian Approach for Analyzing Interval-Censored Data under the Proportional

Hazards Model

Wenting Liu

Yunnan University

Abstract:Interval-censored failure time data frequently occur in medical follow-up studies among

others and include right-censored data as a special

case. Their analysis is much more difficult than that of right-censored data because of their more complicated structure and the absence of a partial likelihood. This article presents a variational Bayesian (VB) approach for analyzing interval-censored data under the proportional hazards model.

The VB approach obtains a direct approximation of

the posterior density. Compared to the Markov chain

Monte Carlo (MCMC)-based sampling approaches,


the VB approach achieves enhanced computational

efficiency without sacrificing estimation accuracy.

The study includes an extensive simulation to compare the performance of the proposed methods with

two main Bayesian methods currently available and

the classic proportional hazards model. The results

indicate that the proposed methods are effective in

practical situations.

Joint work with Huiqiong Li, Niansheng Tang.

Bayesian Analysis of Partial Functional Linear

Additive Regression Model for Censored Data

Based on Heavy-Tailed Distributions

Zhexin Lu

Changchun University of Technology

Abstract:In functional data analysis (FDA), the

functional linear regression model (FLRM) is a popular method to describe the relationship between a scalar response and a functional predictor. The conventional approach for estimating the FLRM relies on

assuming normality of the error terms. However, such

analyses may not yield robust inference when these

assumptions are questionable. In this paper, we develop a partial functional linear additive regression

model (PFLARM) for handling right- or left-censored data,

where the normality assumptions for the random errors are replaced by scale mixtures of normal (SMN)

distributions. The proposed approach enables us to

flexibly model data, accommodating both multimodality and heavy-tailed distributions simultaneously.

The B-spline method and functional principal component analysis (FPCA) are employed to estimate

the additive and slope functions. Furthermore, a highly efficient Markov Chain Monte Carlo (MCMC)

algorithm has been developed for the estimation of

latent variables and other parameters. The performance of the proposed methodology is evaluated

through simulation studies, and the applicability of the

method is demonstrated through a study of Laryngeal

carcinoma.

Joint work with Chunjie Wang.

Bayesian Regression Analysis for Elliptical Dependent Variables

Yian Yu

Southern University of Science and Technology

Abstract:This paper proposes a novel parametric

hierarchical model for functional data with shape

constraints, leveraging a Gaussian process prior to

capture the data dependency and reflect systematic

error, while the underlying curved shape is modeled

through the von Mises-Fisher distribution. The model

definition, Bayesian inference and its information

consistency are discussed. The model's effectiveness

is demonstrated, through accurate reconstruction and

prediction of curved trajectories, by simulated and

real-world examples of functional data with an underlying ellipse shape. The discussion in this paper focuses on two-dimensional problems, but the framework is extendable to higher-dimensional spaces,

making it adaptable to a wide range of applications.

Joint work with Jian Qing Shi.

Emotions Heard: Unveiling the Influence of

Broadcasters’ Emotional Dynamics and Gender on

Live Streaming Conversion

Ziyu Xiong

Peking University

Abstract: Live streaming e-commerce is rapidly

becoming a pivotal channel, revolutionizing the marketing and sale of products and services online. Yet,

the question of what factors can impact the conversion

of casual viewers into committed followers persists as

an intricate challenge, demanding more in-depth scrutiny. By integrating the theories of emotional contagion and gender stereotyping, our paper explores the

impact of broadcasters' emotional expressions including broadcasters' pleasure emotions, arousal emotions,

and emotional fluctuations on customer conversion

and the moderating role of broadcasters’ gender. To

overcome the inability to observe broadcasters' expressions and movements in certain live streams, we

develop a convolutional neural network model to

extract emotions from the broadcasters' voices. Using

a large-scale sample of 5035 automotive live streams,

our study finds that pleasure emotions and arousal

emotions can bolster conversion, while emotional

fluctuations hinder it, through the intermediary effect


of viewers’ emotions and engagements. Additionally,

we discover that the broadcasters' gender serves as a

significant moderator, such that while emotional fluctuations displayed by female broadcasters result in a

more negative impact, their expressions of pleasure

and arousal emotions elicit a stronger positive effect

compared to their male counterparts. These findings

enhance the theoretical understanding of how, why,

and when emotions expressed by broadcasters can

impact customer conversion in live streaming

e-commerce as well as provide valuable guidance for

platforms and broadcasters to optimize marketing

interventions.

Joint work with Yutao Dong, Jing Zhou, Xuening

Zhu, Hansheng Wang.

Bayesian Analysis of Nonlinear Structured Latent

Factor Models Using a Gaussian Process Prior

Yimang Zhang

Southern University of Science and Technology

Abstract:Factor analysis models are widely utilized

in social and behavioral sciences, such as psychology,

education, and marketing, to measure unobservable

latent traits. In this article, we introduce a nonlinear

structured latent factor analysis model which is more

flexible in characterizing the relationship between manifest variables and latent factors. The confirmatory

identifiability of the latent factor is discussed, ensuring the substantive interpretation of the latent factors.

A Bayesian approach with a Gaussian process prior is

proposed to estimate the unknown nonlinear function

and the unknown parameters. Asymptotic results are

established, including structural identifiability of the

latent factors, consistency of the estimates of the unknown parameters and the unknown nonlinear function. Simulation studies and a real data analysis are

conducted to investigate the performance of the proposed method. Simulation studies show our proposed

method performs well in handling nonlinear model

and successfully identifies the latent factors. Our

analysis incorporates oil flow data, allowing us to

uncover the underlying structure of latent nonlinear

patterns.

Contributed Session CS037: Recent Advances in

Quantile Regression

Model Averaging Based Semiparametric Modelling

for Conditional Quantile Prediction

Chaohui Guo

Chong Qing Normal University

Abstract:In real data analysis, the underlying model

is usually unknown, and the modeling strategy plays a key

role in the success of data analysis. Stimulated by the

idea of model averaging, we propose a novel semiparametric modelling strategy for conditional quantile

prediction, without assuming the underlying model is

any specific parametric or semiparametric model.

Thanks to the optimality of the selected weights by

leave-one-out cross-validation, the proposed modeling

strategy results in a more accurate prediction than that

based on some commonly used semiparametric models, such as the varying coefficient models and additive models. Asymptotic properties are established of

the proposed modeling strategy together with its estimation procedure. Intensive simulation studies are

conducted to demonstrate how well the proposed

method works, compared with its alternatives under

various circumstances. The results show the proposed

method indeed leads to more accurate predictions than

its alternatives. Finally, the proposed modelling strategy together with its prediction procedure are applied

to the Boston housing data, which result in more accurate predictions of the quantiles of the house prices

than that based on some commonly used alternative

methods, therefore presenting a more accurate picture

of the housing market in Boston.

Joint work with Wenyang Zhang.
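A bare-bones Python version of the weight-selection step mentioned above (given leave-one-out predictions from the candidate semiparametric models, which are not constructed here) chooses simplex weights by minimizing the leave-one-out check loss; the optimizer and starting value are illustrative choices, not the paper's implementation.

    import numpy as np
    from scipy.optimize import minimize

    def check_loss(u, tau):
        # quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0}), averaged over u
        return np.mean(u * (tau - (u < 0).astype(float)))

    def averaging_weights(loo_preds, y, tau):
        # loo_preds: (n, M) leave-one-out predictions, one column per candidate model
        n, M = loo_preds.shape
        objective = lambda w: check_loss(y - loo_preds @ w, tau)
        constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
        res = minimize(objective, np.full(M, 1.0 / M),
                       bounds=[(0.0, 1.0)] * M, constraints=constraints)
        return res.x   # weights on the simplex used for the averaged quantile prediction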

Bootstrapping the Double-Weighted Predictability

Test for Predictive Quantile Regression

Xiaohui Liu

Jiangxi University of Finance and Economics

Abstract:In financial econometrics, it is empirically

challenging to test the predictability of lagged predictors with varying levels of persistence in predictive

quantile regression. A recent double-weighted method

developed by Cai et al. (2023) has demonstrated desirable local power properties for both non-stationary


and stationary predictors. In this paper, we propose a

strategy to improve the construction of the auxiliary

variables in the double-weighted method. This improvement makes it applicable to a broader range of

persistent types in empirical analysis. Furthermore, we

propose a novel random weighted bootstrap procedure

to address the challenges involved in conditional density estimation. Simulation results demonstrate the

effectiveness of the proposed test in correcting size

distortion at the lower and upper quantiles. Finally, we

apply the proposed test to respectively re-evaluate the

predictability of the return of S&P 500 and U.S. GDP

growth at various quantile levels.

Recent Developments of a General Minimum

Lower-Order Confounding Criterion

Zhiming Li

Xinjiang University

Abstract: In fractional factorial designs, the aliased effect-number pattern (AENP) is widely used to assess the confounding information of all effects aliased with other effects at varying degrees of severity.

Based on the AENP, a general minimum lower-order

confounding (GMC) criterion was proposed to choose

the optimal designs. The classification patterns of the

existing criteria, such as maximum resolution, minimum aberration, and clear effects, can be expressed as functions of the elements of the AENP. Up to now,

the GMC criterion has been extended to regular designs, block designs, split-plot designs, mixed-level

designs, and orthogonal designs.

Enhancing the Power of OOD Detection via Sample-Aware Model Selection

Falong Tan

Hunan University

Abstract:In this work, we present a novel perspective

on detecting out-of-distribution (OOD) samples and

propose an algorithm for sample-aware model selection to enhance the effectiveness of OOD detection.

Our algorithm determines, for each test input, which

pre-trained models in the model zoo are capable of

identifying the test input as an OOD sample. If no

such models exist in the model zoo, the test input is

classified as an in-distribution (ID) sample. We theoretically demonstrate that our method maintains the

true positive rate of ID samples and accurately identifies OOD samples with high probability when there

are a sufficient number of diverse pre-trained models

in the model zoo. Extensive experiments were conducted to validate our method, demonstrating that it

leverages the complementarity among single-model

detectors to consistently improve the effectiveness of

OOD sample identification. Compared to baseline

methods, our approach improved the relative performance by 65.40% and 37.25% on the CIFAR10 and

ImageNet benchmarks, respectively.
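A minimal reading of the selection rule described above can be sketched in Python as follows; the score convention (larger means more OOD-like) and the per-model threshold calibration are assumptions made here for illustration, not the paper's exact procedure.

    import numpy as np

    def sample_aware_ood_flag(x, detectors, thresholds):
        # detectors: list of callables, each mapping a test input to an OOD score
        #            from one pre-trained model in the model zoo.
        # thresholds: per-detector cutoffs (e.g. calibrated on ID validation data).
        # The input is flagged OOD if at least one model in the zoo identifies it
        # as OOD; otherwise it is treated as in-distribution (ID).
        scores = np.array([detect(x) for detect in detectors])
        return bool(np.any(scores > np.asarray(thresholds)))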

Unconditional Quantile Regression for Streaming

Data Sets

Rong Jiang

Shanghai Polytechnic University

Abstract: In this talk, we are concerned with the Unconditional Quantile Regression (UQR) method,

which has gained significant traction as a popular

approach for modeling and analyzing data. However,

much like Conditional Quantile Regression (CQR),

UQR encounters computational challenges when it

comes to obtaining parameter estimates for streaming

datasets. This is attributed to the involvement of unknown parameters in the logistic regression loss function used in UQR, which presents obstacles in both

computational execution and theoretical development.

To address this, we present a novel approach involving smoothing logistic regression estimation. Subsequently, we propose a renewable estimator tailored for

UQR with streaming data, relying exclusively on current data and summary statistics derived from historical data. Theoretically, our proposed estimators exhibit equivalent asymptotic properties to the standard

version computed directly on the entire dataset. Both

simulations and real data analysis are conducted to

illustrate the finite sample performance of the proposed methods.

Joint work with Keming Yu.
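For context, standard UQR is built on the recentered influence function (RIF) of the unconditional quantile (Firpo, Fortin and Lemieux, 2009); the renewable estimator in this talk targets a smoothed, streaming version of this construction. A minimal Python sketch of the classical batch RIF, with a kernel density estimate standing in for the density at the quantile, is:

    import numpy as np
    from scipy.stats import gaussian_kde

    def rif_quantile(y, tau):
        # Recentered influence function of the tau-th unconditional quantile:
        # RIF(y) = q_tau + (tau - 1{y <= q_tau}) / f_Y(q_tau).
        q = np.quantile(y, tau)
        f_q = gaussian_kde(y)(q)[0]           # density of Y evaluated at q_tau
        return q + (tau - (y <= q).astype(float)) / f_q

    # UQR then regresses rif_quantile(y, tau) on the covariates, e.g. by OLS
    # or by a logistic-regression-based variant.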

Contributed Session CS038: Advance in Statistical

Methods for Complex Data


On the Target-Kernel Alignment: A Unified Analysis with Kernel Complexity

Chao Wang

Shanghai University of Finance and Economics

Abstract: This paper investigates the impact of

alignment between the target function of interest and

the kernel matrix on a variety of kernel-based methods

based on a general loss belonging to a rich loss function family, which covers many commonly used

methods in regression and classification problems. We

consider the truncated kernel-based method (TKM)

which is estimated within a reduced function space

constructed by using the spectral truncation of the

kernel matrix and compare its theoretical behavior to

that of the standard kernel-based method (KM) under

various settings. By using the kernel complexity function that quantifies the complexity of the induced

function space, we derive the upper bounds for both

TKM and KM, and further reveal their dependencies

on the degree of target-kernel alignment. Specifically,

for the alignment with polynomial decay, the established results indicate that under the just-aligned and

weakly-aligned regimes, TKM and KM share the

same learning rate. Yet, under the strongly-aligned

regime, KM suffers the saturation effect, while TKM

can be continuously improved as the alignment becomes stronger. This further implies that TKM has a

strong ability to capture the strong alignment and

eliminate the saturation effect. The minimax lower

bounds are also established to further confirm the

optimality of TKM. Extensive numerical experiments

further support our theoretical findings.

Joint work with Xin He, Yuwen Wang, Junhui Wang.
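As a rough illustration of the spectral truncation idea above (the paper's TKM is defined for a general loss family; the ridge case below is only a convenient special case, and the truncation level r is an assumed tuning input), one can restrict kernel ridge regression to the top-r eigenpairs of the kernel matrix:

    import numpy as np

    def truncated_kernel_ridge_fit(K, y, r, lam):
        # K: (n, n) kernel matrix; y: responses; r: truncation level; lam > 0.
        evals, evecs = np.linalg.eigh(K)             # eigenvalues in ascending order
        idx = np.argsort(evals)[::-1][:r]            # keep the top-r eigenpairs
        U, d = evecs[:, idx], evals[idx]
        K_r = (U * d) @ U.T                          # rank-r spectral truncation of K
        alpha = np.linalg.solve(K_r + lam * np.eye(len(y)), y)
        return K_r @ alpha                           # fitted values on the training set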

Smooth Tests for the Equality of Conditional Distributions

Peiwen Jia

Peking University

Abstract:In this paper, we establish a Neyman's

smooth test for the equality of conditional distributions. Unlike the traditional smooth tests always based

on parametric residuals, our method requires nonparametric estimation of the conditional cumulative

distribution function (CDF). The proposed smooth test

statistic is asymptotically chi-square distributed under

the null hypothesis that the conditional distributions of

the two populations are equal. Such asymptotically

distribution-free (ADF) theory for the smooth test

statistic works for the general two sample sizes ?

and ?. We also discuss the cases where the conditioning variables follow the same or different distributions. To address the issue of stochastic denominators

in the CDF estimator, we introduce the density-weighted version of smooth test statistic which

imposes fewer restrictions on the asymptotic variance.

The finite sample size and power properties of our

proposed test are studied, showing that the test performs well in controlling the actual probability of

Type I error once the bandwidth used in the CDF estimator is selected within a reasonable range, and has

decent power under the alternatives.

Joint work with Xiaojun Song.

Reliable Multivariate Deep Regression using Moment-Matching Prior Networks

Qingyi Pan

Tsinghua University

Abstract:When deep neural networks are deployed in

high-stakes applications, uncertainty estimation is

crucial for reliable predictions and decision-making.

Despite rich studies in univariate deep regression,

multivariate deep regression with accurate uncertainty

estimation, especially concerning the covariance matrix, remains largely unexplored. In this paper, we

propose a scalable evidential prior to capture both

aleatoric and epistemic uncertainty, including the

correlation of the multivariate response vector. Our

method formulates a hierarchical probabilistic framework where the evidential prior is fitted using samples

generated by a neural network based on moment-matching. Extensive empirical results on real-world multivariate regression tasks demonstrate

that our method provides accurate prediction and uncertainty estimation with minimal computational

overhead, significantly outperforming existing methods.

Inference of Heterogeneous Effects in Single-Cell


Genetic Perturbation Screens

Zichu Fu

Tsinghua University

Abstract:The integration of CRISPR screening and

single-cell RNA sequencing has arisen as a powerful

tool for profiling the impact of genetic perturbations

on the entire transcriptome at the single-cell scale.

Although various methods have been developed for

analyzing data from such experiments, many of them

estimate the average perturbation effects across all

cells, overlooking potential heterogeneity induced by

cell state differences. Here we present scCAPE, a tool

designed to facilitate causal analysis of heterogeneous

perturbation effects at single-cell resolution. scCAPE

disentangles perturbation effects from the inherent

cell-state variations and provides nonparametric inferences of perturbation effects at single-cell resolution, permitting a range of downstream tasks including

perturbation effects analysis, genetic interaction analysis, perturbation clustering and prioritizing. We

benchmark scCAPE on several simulated and real

datasets to evaluate its disentangling effect and accuracy in estimating heterogeneous perturbation effects.

By applying scCAPE to data from human CD8+ T

cells and K562 cells, we reveal the heterogeneous

perturbation effects of genes involved in T cell proliferation, cell-cycle arrests and erythroid differentiation,

many of which were undetected by existing methods,

providing novel insights into the functions of these

genes.

Joint work with Lin Hou.

Estimation of Production Frontiers Based on Additive Models to Enhance Robustness against Outliers: A Relaxed Support Vector Regression Approach

Yuting Zhou

Shanghai University of Engineering Science

Abstract:In the field of production frontier estimation, outliers can distort measurement accuracy significantly. Recent research has combined

non-parametric techniques like Data Envelopment

Analysis (DEA) and Free Disposal Hull (FDH) with

Machine Learning (ML) to improve generalization.

However, there have been limited approaches from an

ML perspective to handle outliers in production frontier estimation.

This study introduces the Additive Support

Vector Frontier (ASVF) and Additive Convex Support

Vector Frontier (ACSVF) models as solutions to the

curse of dimensionality in multi-input scenarios compared to traditional Support Vector Frontier (SVF) and

Convex Support Vector Frontier (CSVF) models. The

ASVF and ACSVF approaches utilize additive regression splines to define transformation functions without

explicitly considering outliers, adhering to microeconomic axioms.

Furthermore, the Relaxed Additive Support

Vector Frontier (RASVF) and Relaxed Additive Convex Support Vector Frontier (RACSVF) are proposed

to enhance model robustness by introducing free

slacks to cope with outliers. Empirical evaluations

using synthetic datasets show that the new approaches

outperform SVF, FDH, CSVF, and DEA, especially in

scenarios involving multiple inputs or outliers. The

effectiveness of these approaches is influenced by the

relationship between variables and the curvature of

the production frontier.

Joint work with Guoqiang Wang.

Contributed Session CS039: Advance in Missing

Data and Treatment Effects

Semiparametric Estimation with Reduced Dimension for the Treatment Effect under Missing Data

Tao Tan

East China Normal University

Abstract: We study semi-supervised learning of

treatment effects under random missingness and propose a semi-parametric estimation method with extended double robustness, which achieves optimal

efficiency with lower requirements on model specifications compared to other estimators. For

high-dimensional covariates, we project multidimensional covariates onto one- or two-dimensional indices

through index models. We propose doubly robust

estimation based on kernels, which is consistent if

certain conditions are met. We also prove that the

proposed estimator satisfies asymptotic normality and


that its variance is smaller than that of the inverse

probability weighted estimator, and asymptotic variance is estimated through a two-step perturbation

resampling process. To verify that our proposed doubly robust estimator is more efficient than the inverse

probability weighted estimator, we conduct numerical

simulation studies and apply our estimator to the dataset of air pollution monitoring in Beijing.

Joint work with Shuyi Zhang, Yong Zhou.

Semi-Supervised Inference for Means under MAR

without Inverse Propensity Weighting

Jin Su

East China Normal University

Abstract: We propose a general semi-supervised framework for estimating population means, wherein only a small amount of labeled data is available along with large amounts of unlabeled data. In semi-supervised learning, missing at random (MAR) induces "decaying overlap", meaning that the propensity score π_n(x) → 0 uniformly in x as the sample size n → ∞. This aggravates the small-denominator problem for estimators based on inverse propensity weighting (IPW). We propose a class of efficient and adaptive estimators without IPW. An index model is used for dimension reduction under high dimensionality, combining the effectiveness of the model-based approach with the flexibility of nonparametric methods. The proposed estimators achieve √(nπ_n)-consistency, where π_n is the probability of observing labels, and are semiparametrically efficient if all specifications are correct.

A ratio-consistent plug-in estimator is developed for

variance estimation. Robustness against model misspecification is verified in both theory and simulation

studies. The results are further validated through applications to customer churn data.

Joint work with Alan T.K. Wan, Shuyi Zhang, Yong

Zhou.

Improving the Efficiency of Estimating Treatment

Effects with External Control Data

Zhisong Zhao

East China Normal University

Abstract: In clinical trials, we focus on the average treatment effect (ATE) and the quantile treatment effect (QTE). Due to small sample sizes, the power of statistical inference may be unsatisfactory. However, external control data can be obtained from other studies. Inspired by a study comparing daratumumab dosing regimens, we propose methods that combine trial data and external data to improve efficiency. For the QTE and ATE, we propose two-stage estimation methods that utilize external data under the exchangeability assumption of potential outcome quantiles or means. The proposed methods are semiparametrically efficient. Furthermore, we extrapolate the methods to the external population and to the overall population. We demonstrate the finite-sample performance of the proposed methods through simulations. In the daratumumab studies, our methods provide more effective statistical inference.

Joint work with Huijuan Ma, Yong Zhou.

Smoothed Estimation of Optimal Treatment Regimes under a Semi-Supervised Setting in Randomized Trials

Xiaoqi Jiao

East China Normal University

Abstract: A treatment regime refers to the process of assigning the most suitable treatment to a patient based on their observed information. However, prevailing research on treatment regimes predominantly relies on labeled data, which may leave out valuable information contained in unlabeled data such as historical records and healthcare databases. Current semi-supervised methods for deriving optimal treatment regimes either rely on model assumptions or struggle with a high computational burden even for moderate-dimensional covariates.

To address this concern, we propose a

semi-supervised framework that operates within a

model-free context to estimate the optimal treatment

regime by leveraging the abundant unlabeled data.

Our proposed approach encompasses three key

steps. Firstly, we employ a single index model to

achieve dimension reduction, followed by kernel regression to impute the missing outcomes in the unlabeled data. Secondly, we propose various forms of


semi-supervised value functions based on the imputed

values, incorporating both labeled and unlabeled data

components.

Lastly, the optimal treatment regimes are derived by

maximizing the semi-supervised value functions.

We establish the consistency and asymptotic normality of the estimators proposed in our framework.

Furthermore, we introduce a perturbation resampling

procedure to estimate the asymptotic variance. Simulations confirm the advantages of incorporating unlabeled data in the estimation of optimal

treatment regimes. A practical data example is also

provided to illustrate the application of our methodology.

This work is rooted in the framework of randomized trials, with additional discussions extending to

observational studies.
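A stylized sketch of the three steps named above, in the randomized-trial setting with a known treatment probability of 0.5: impute unlabeled outcomes by kernel regression on a single index, form a value function over labeled and imputed outcomes, and maximize it over a simple rule class. The index direction theta, the rule class 1{x'theta > c}, and all sample sizes are illustrative assumptions, not the authors' estimator.

```python
import numpy as np

def nw_impute(idx_lab, y_lab, idx_eval, h=0.3):
    """Kernel (Nadaraya-Watson) regression of labeled outcomes on a scalar index."""
    d = (idx_eval[:, None] - idx_lab[None, :]) / h
    w = np.exp(-0.5 * d ** 2)
    return (w * y_lab[None, :]).sum(axis=1) / w.sum(axis=1)

def ipw_value(rule, A, Y, p_treat=0.5):
    """Value of a 0/1 rule in a randomized trial with known P(A=1) = p_treat."""
    w = np.where(A == 1, rule / p_treat, (1 - rule) / (1 - p_treat))
    return np.mean(w * Y)

rng = np.random.default_rng(1)
n_lab, n_unlab, p_dim = 200, 2000, 3
N = n_lab + n_unlab
X = rng.normal(size=(N, p_dim))
A = rng.binomial(1, 0.5, size=N)
theta = np.array([1.0, -1.0, 0.5])            # assumed single-index direction
Y = X @ theta * (2 * A - 1) + rng.normal(size=N)
labeled = np.arange(n_lab)                    # outcomes observed only for this block

# Step 1: impute outcomes on the index X @ theta, separately within each arm.
Y_ss = np.empty(N)
for a in (0, 1):
    lab_a = labeled[A[labeled] == a]
    arm = np.where(A == a)[0]
    Y_ss[arm] = nw_impute(X[lab_a] @ theta, Y[lab_a], X[arm] @ theta)
Y_ss[labeled] = Y[labeled]                    # keep the observed labels

# Steps 2-3: maximize the semi-supervised value over rules d(x) = 1{x @ theta > c}.
grid = np.linspace(-2, 2, 81)
vals = [ipw_value((X @ theta > c).astype(float), A, Y_ss) for c in grid]
print("best threshold:", round(grid[int(np.argmax(vals))], 2),
      "estimated value:", round(max(vals), 3))
```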

Joint work with Mengjiao Peng, Yong Zhou.

Semiparametric Efficient Fusion of Individual and

Summary Data

Wenjie Hu

Peking University

Abstract:Suppose we have available individual data

from an internal study and various types of summary

statistics from relevant external studies. External

summary statistics have been used as constraints on

the internal data distribution, which promises to improve statistical inference for the internal data;

however, the additional use of external summary data

may lead to paradoxical results: efficiency loss may

occur if the uncertainty of summary statistics is not

negligible and a large estimation bias can emerge even

if the bias of external summary statistics is small. We

investigate these paradoxical results in a semiparametric framework. We establish the semiparametric

efficiency bound for estimating a general functional of

the internal data distribution, which is shown to be no

larger than that using only internal data. We propose a

data-fused efficient estimator that achieves this bound

so that the efficiency paradox is resolved. Besides, we

propose a debiased estimator that can achieve the

same asymptotic distribution as the oracle estimator as

if one knew whether the summary statistics were biased or not. Simulations and application to a Helicobacter pylori infection dataset are used to illustrate the

proposed methods.

Joint work with Ruoyu Wang, Wei Li, Wang Miao.

Contributed Session CS040: Bayesian and Machine Learning

Construction and Application of Quantile Factor

Augmented Quantile Regression Neural Network

Model

Yuting Huang

Lanzhou University of Finance and Economics

Abstract: Macroeconomic forecasting is an important basis for national macro-level regulation and firm-level decision-making. The factor-augmented regression model, as one of the main forecasting methods, has been widely used in academia and industry. However, macroeconomic series typically exhibit complex features such as nonlinearity and asymmetry, so it is important to account for these characteristics when modeling them. Building on the factor-augmented regression model, this paper introduces a nonlinear quantile regression model for the time series and constructs a new forecasting method, QFA-QRNN. Within the quantile regression framework, the method uses a neural network to capture nonlinear relations and to output conditional quantiles directly; without distributional assumptions, it can describe complex features such as heterogeneity and nonlinearity among variables, which improves prediction accuracy and efficiency. A quantile factor model is first constructed to extract common factors from high-dimensional data, and the common factors are then used as covariates in a quantile regression neural network to study the heterogeneous nonlinear relationships between variables. In numerical simulations and an empirical analysis, factor-augmented regression models are taken as benchmarks and accuracy is evaluated with QRMSE, QMAE, and DM tests. The results show that QFA-QRNN performs best under all three criteria across samples, exhibiting higher accuracy and robustness. The proposed QFA-QRNN method therefore not only forecasts core macroeconomic variables effectively but also provides new technical support for decision-making.
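A compact sketch of the two building blocks named in the abstract: common factors extracted from a high-dimensional panel (ordinary PCA stands in here for the quantile factor model) and a small one-hidden-layer network trained on the pinball (check) loss so that it outputs the conditional tau-quantile directly. The architecture, learning rate, and data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k, tau = 400, 50, 3, 0.9                  # periods, series, factors, target quantile

panel = rng.normal(size=(T, N))
F = np.linalg.svd(panel - panel.mean(0), full_matrices=False)[0][:, :k]  # PCA factor proxies
y = F @ np.array([1.0, -2.0, 0.5]) + (0.5 + 0.5 * np.abs(F[:, 0])) * rng.normal(size=T)

def pinball(u, tau):
    """Check loss rho_tau(u) averaged over observations."""
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

# one-hidden-layer network: yhat = tanh(F W1 + b1) w2 + b2, trained by subgradient descent
H, lr = 8, 0.05
W1, b1 = rng.normal(size=(k, H)) * 0.1, np.zeros(H)
w2, b2 = rng.normal(size=H) * 0.1, 0.0
for _ in range(2000):
    Z = np.tanh(F @ W1 + b1)
    u = y - (Z @ w2 + b2)
    g_out = -(tau - (u < 0)) / T                # d(mean pinball)/d(yhat)
    g_hidden = np.outer(g_out, w2) * (1 - Z ** 2)
    W1 -= lr * (F.T @ g_hidden); b1 -= lr * g_hidden.sum(0)
    w2 -= lr * (Z.T @ g_out);    b2 -= lr * g_out.sum()

yhat = np.tanh(F @ W1 + b1) @ w2 + b2
print("in-sample pinball loss at tau=0.9:", round(pinball(y - yhat, tau), 4))
```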

Joint work with Deyin Fu.

Hypothesis Testing for the Deep Cox Model

Qixian Zhong

Xiamen University

Abstract: Deep learning has become enormously

popular in the analysis of complex data, including

event time measurements with censoring. To date,

deep survival methods have mainly focused on prediction. Such methods are scarcely used in matters of

statistical inference such as hypothesis testing. Due to

their black-box nature, deep-learned outcomes lack

interpretability which limits their use for decision-making in biomedical applications. This paper

provides estimation and inference methods for the

nonparametric Cox model – a flexible family of models with a nonparametric link function to avoid model

misspecification. Here we assume the nonparametric

link function is modeled via a deep neural network. To

perform statistical inference, we utilize sample splitting and cross-fitting procedures to get neural network

estimators and construct the test statistic. These procedures enable us to propose a new significance test to

examine the association of certain covariates with

event times. We establish convergence rates of the

neural network estimators, and show that deep learning can overcome the curse of dimensionality in nonparametric regression by learning to exploit

low-dimensional structures underlying the data. In

addition, we show that our test statistic converges to a

normal distribution under the null hypothesis and

establish its consistency, in terms of the Type II error,

under the alternative hypothesis. Numerical simulations and a real data application demonstrate the usefulness of the proposed test.
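Since the test statistics rest on neural-network estimators that minimize the Cox partial likelihood with a nonparametric link, the snippet below spells out that objective for a given vector of risk scores g(X_i); any fitted network (here replaced by a linear score purely for brevity) can be plugged in and trained on this loss. Sample splitting, cross-fitting, and the test itself are not shown, and the synthetic data are assumptions.

```python
import numpy as np

def neg_partial_loglik(risk, time, event):
    """Negative log partial likelihood of the Cox model at risk scores g(X_i).
    time: observed times; event: 1 = failure, 0 = censored. Ties are ignored."""
    order = np.argsort(-time)                        # sort by descending time
    r, e = risk[order], event[order]
    c = r.max()                                      # stabilized log-sum-exp over risk sets
    log_risk_set = c + np.log(np.cumsum(np.exp(r - c)))
    return -np.sum(e * (r - log_risk_set))

rng = np.random.default_rng(0)
n, p = 300, 4
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 0.5, 0.0])
time = rng.exponential(scale=np.exp(-X @ beta))      # hazard proportional to exp(X beta)
event = (rng.uniform(size=n) < 0.7).astype(float)    # crude random censoring indicator
print("loss at true scores:", round(neg_partial_loglik(X @ beta, time, event), 2))
print("loss at zero scores:", round(neg_partial_loglik(np.zeros(n), time, event), 2))
```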

Testing Causal Effects in Observational Survival

Data using Propensity Score Matching Design

Dingjiao Cai

Henan University of Economics and Law

Abstract: Time-to-event data are very common in

observational studies. Unlike randomized experiments,

observational studies suffer from both observed and

unobserved confounding biases. To adjust for observed confounding in survival analysis, the commonly used methods are the Cox proportional hazards

(PH) model, the weighted log-rank test, and the inverse probability of treatment weighted Cox PH model. These methods do not rely on fully parametric

models, but their practical performances are highly

influenced by the validity of the PH assumption. Also,

there are few methods addressing the hidden bias in

causal survival analysis. We propose a strategy to test

for survival function differences based on the matching design and explore sensitivity of the P-values to

assumptions about unmeasured confounding. Specifically, we apply the paired Prentice-Wilcoxon (PPW)

test or the modified PPW test to the propensity score

matched data. Simulation studies show that the

PPW-type test has higher power in situations where the

PH assumption fails. For potential hidden bias, we

develop a sensitivity analysis based on the matched

pairs to assess the robustness of our finding, following

Rosenbaum's idea for nonsurvival data. For a real data

illustration, we apply our method to an observational

cohort of chronic liver disease patients from a Mayo

Clinic study. The PPW test based on observed data

initially shows evidence of a significant treatment

effect. But this finding is not robust, as the sensitivity

analysis reveals that the P-value becomes nonsignificant if there exists an unmeasured confounder with a

small impact.
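A rough sketch of the design step described above: fit a propensity model, form 1:1 nearest-neighbour matched pairs on the estimated score, and apply a paired test to the matched outcomes. The paired Prentice-Wilcoxon test for censored survival data is not implemented here; an ordinary Wilcoxon signed-rank test on uncensored outcomes serves only as a stand-in, and the data-generating model is an assumption.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 400, 3
X = rng.normal(size=(n, p))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))             # treatment depends on X
Y = np.exp(0.5 * X[:, 0] - 0.3 * A + rng.normal(size=n))    # uncensored event times

ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]  # estimated propensity scores
treated, control = np.where(A == 1)[0], np.where(A == 0)[0]

pairs, used = [], set()
for i in treated:                                            # greedy 1:1 nearest-neighbour match
    j = min((j for j in control if j not in used),
            key=lambda j: abs(ps[i] - ps[j]), default=None)
    if j is not None:
        pairs.append((i, j)); used.add(j)

t_idx, c_idx = map(list, zip(*pairs))
stat, pval = wilcoxon(Y[t_idx], Y[c_idx])                    # paired signed-rank test
print(len(pairs), "matched pairs, paired-test p-value =", round(pval, 3))
```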

Combining Dimensionality Reduction Methods

with Neural Networks for Realized Volatility

Forecasting

Lidan He

Nanjing University of Information Science and Technology

Abstract: The application of artificial neural networks

to finance has recently received a great deal of attention from both investors and researchers, particularly

as a forecasting tool. However, when dealing with a

large number of predictors, these methods may overfit


the data and provide poor out-of-sample forecasts.

Our paper addresses this issue by employing two

different approaches to predict realized volatility. On the one hand, we use a two-step procedure

where several dimensionality reduction methods, such

as Bayesian Model Averaging (BMA), Principal

Component Analysis (PCA), and Least Absolute

Shrinkage and Selection Operator (Lasso), are employed in the initial step to reduce dimensionality. The

reduced samples are then combined with artificial neural networks. On the other hand, we implement two single-step regularized neural networks that can

shrink the input weights to zero and effectively handle

high-dimensional data. Our findings on the volatility

of different stock asset prices indicate that the reduced

models outperform the compared models without

regularization in terms of predictive accuracy.
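As a toy illustration of the two-step route, the snippet below compresses a large predictor set with PCA and feeds the leading components to a small feed-forward network. The data-generating process, the number of components, and the network size are assumptions; the BMA/Lasso reductions and the single-step regularized networks are not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T, N = 600, 80                                     # observations, candidate predictors
X = rng.normal(size=(T, N))
rv = np.abs(X[:, :3] @ np.array([0.6, -0.4, 0.3]) + 0.3 * rng.normal(size=T))  # proxy realized volatility

split = 450                                        # simple train / out-of-sample split
pca = PCA(n_components=5).fit(X[:split])           # step 1: dimensionality reduction
Z_tr, Z_te = pca.transform(X[:split]), pca.transform(X[split:])

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)
net.fit(Z_tr, rv[:split])                          # step 2: neural network on the components
print("out-of-sample MSE:", round(mean_squared_error(rv[split:], net.predict(Z_te)), 4))
```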

Joint work with Andrea Bucci, Zhi Liu.

Contributed Session CS014: Ultrahigh Dimensional Statistical Inference

Mixture Conditional Regression with Ultrahigh

Dimensional Text Data for Estimating Extralegal

Factor Effects

Jiaxin Shi

Peking University

Abstract: Testing judicial impartiality is a problem of

fundamental importance in empirical legal studies, for

which standard regression methods have been popularly used to estimate the extralegal factor effects.

However, those methods cannot handle control variables with ultrahigh dimensionality, such as those

found in judgment documents recorded in text format.

To solve this problem, we develop a novel mixture

conditional regression (MCR) approach, assuming

that the whole sample can be classified into a number

of latent classes. Within each latent class, a standard

linear regression model can be used to model the relationship between the response and a key feature vector,

which is assumed to be of a fixed dimension. Meanwhile, ultrahigh dimensional control variables are then

used to determine the latent class membership, where

a naive Bayes type model is used to describe the relationship. Hence, the dimension of control variables is

allowed to be arbitrarily high. A novel expectation-maximization algorithm is developed for model

estimation. We show that the key parameters of interest can be estimated as efficiently as if the true class

membership were known in advance. Simulation

studies are presented to demonstrate the proposed

MCR method. A real dataset of Chinese burglary offenses is analyzed for illustration purposes.
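A deliberately simplified two-class sketch of the EM scheme outlined above: class membership is driven by control variables through a naive Bayes model (here Gaussian and low-dimensional), and within each latent class the response follows its own linear regression on the key features; the E-step computes responsibilities and the M-step runs weighted least squares. Dimensions, the Gaussian naive Bayes choice, and the initialization are illustrative assumptions, not the MCR implementation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, q, K = 1000, 2, 5, 2
X = rng.normal(size=(n, p))                      # key features of fixed dimension
W = rng.normal(size=(n, q))                      # stand-in for high-dimensional controls
z = (W[:, 0] + 0.5 * W[:, 1] > 0).astype(int)    # true latent class
betas_true = np.array([[1.0, -1.0], [-2.0, 0.5]])
Y = np.einsum('ij,ij->i', X, betas_true[z]) + 0.5 * rng.normal(size=n)

resp = rng.uniform(size=(n, K))
resp /= resp.sum(1, keepdims=True)               # random initial responsibilities
for _ in range(50):
    # M-step: class priors, Gaussian naive-Bayes parameters for W, weighted LS per class
    pi, nk = resp.mean(0), resp.sum(0)
    mu = (resp.T @ W) / nk[:, None]
    sd = np.sqrt(np.maximum((resp.T @ (W ** 2)) / nk[:, None] - mu ** 2, 1e-6))
    betas, sigmas = [], []
    for k in range(K):
        w = resp[:, k]
        b = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * Y))
        betas.append(b)
        sigmas.append(np.sqrt(np.average((Y - X @ b) ** 2, weights=w)) + 1e-6)
    # E-step: responsibilities combine the naive-Bayes term for W and the regression term for Y|X
    logp = np.column_stack([
        np.log(pi[k]) + norm.logpdf(W, mu[k], sd[k]).sum(1)
        + norm.logpdf(Y, X @ betas[k], sigmas[k]) for k in range(K)])
    logp -= logp.max(1, keepdims=True)
    resp = np.exp(logp)
    resp /= resp.sum(1, keepdims=True)

print("estimated class-wise coefficients:\n", np.round(np.array(betas), 2))
print("true class-wise coefficients:\n", betas_true)
```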

Joint work with Fang Wang, Yuan Gao, Xiaojun

Song, Hansheng Wang.

Variation of Conditional Mean and Its Application

in Ultrahigh Dimensional Feature Screening

Zhentao Tian

Beijing University of Technology

Abstract: A new metric, called variation of conditional mean (VCM), is proposed to measure the dependence of conditional mean of a response variable

on a predictor variable. The VCM has several appealing merits. It equals zero if and only if the conditional

mean of the response is independent of the predictor;

it can be used for both real vector valued variables and

functional data. An estimator of the VCM is given

through kernel smoothing, and a test for the conditional mean independence based on the estimated

VCM is constructed. The limit distributions of the test

statistic under the null and alternative hypotheses are derived, respectively. We further use VCM as

a marginal utility to do high-dimensional feature

screening to screen out variables that do not contribute

to the conditional mean of the response given the

predictors, and prove the validity of the sure screening

property. Furthermore, we find the cross variation of

conditional mean (CVCM), a variant of the VCM, has

a faster convergence rate than the VCM under the

conditional mean independence. Numerical comparisons show that the VCM and CVCM perform well in

both conditional independence testing and feature

screening. We also illustrate their applications to real

data sets.

Joint work with Tingyu Lai, Zhongzhan Zhang.

Feature Screening for Ultra High Dimensional

Data via Adapted Sliced Wasserstein Correlation


Coefficient

Juan Li

Yunnan University

Abstract: This paper proposes a novel model-free

feature screening procedure based on the Wasserstein

distance correlation coefficient. The proposed feature

screening procedure has three merits. First, it is applicable to univariate and multivariate data of continuous, discrete, and categorical types.

Second, since Wasserstein distance directly evaluates

the association between two distributions, the proposed screening procedure enjoys certain robustness to outliers and observations from heavy-tailed distributions. Third, numerical studies, including two real

data examples, demonstrate that the proposed screening method either surpasses or matches its competitors

in identifying informative features. Moreover, on the

theoretical side, we establish the sure independence screening and rank consistency properties of

the proposed screening procedure.
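The loop below illustrates the general shape of such a marginal screening procedure: each feature is ranked by a Wasserstein-distance-based measure of how different its conditional distributions are across a binary response, and the top n/log(n) features are retained. It is a generic stand-in using SciPy's one-dimensional Wasserstein distance, not the adapted sliced Wasserstein correlation coefficient of the talk, and the simulated design is an assumption.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n, p, d0 = 200, 2000, 10                          # n << p; the first d0 features are informative
y = rng.binomial(1, 0.5, size=n)
X = rng.standard_t(df=3, size=(n, p))             # heavy-tailed noise
X[:, :d0] += 1.5 * y[:, None]                     # location shift for the active features

# marginal utility: distance between the conditional distributions of X_j given y
score = np.array([wasserstein_distance(X[y == 1, j], X[y == 0, j]) for j in range(p)])
keep = np.argsort(score)[::-1][: int(n / np.log(n))]   # retain about n / log(n) features
print("active features recovered:", int(np.sum(keep < d0)), "of", d0)
```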

A Sequential Feature Selection Procedure for High

Dimensional Cox Proportional Hazards Model

with Main and Interaction Effects

Ke Yu

Shanghai Jiao Tong University

Abstract: The high-dimensional Cox proportional hazards model (Cox model) is a commonly used regression model in survival analysis. We address the feature selection problem in the high-dimensional Cox model, considering scenarios involving solely

main effects or encompassing both main and interaction effects. A novel sequential feature selection

method is proposed; its selection consistency is established and illustrated by comparing it with existing

methods through extensive numerical simulations. For

main effects, with a remarkably short computational

time, our method achieves higher selection accuracy.

For interaction effects, our method successfully selects features without hierarchical structure constraints.

Additionally, two real data applications are conducted

to demonstrate the advantage of our proposed procedure.

Joint work with Shan Luo.

High-Dimensional Ensemble Kalman Filter with

Localization, Inflation and Iterative Updates

Hao-Xuan Sun

Peking University

Abstract: Accurate estimation of forecast error covariance matrices is an essential step in data assimilation, which becomes a challenging task for

high-dimensional data assimilation. The standard

ensemble Kalman filter (EnKF) may diverge due to

both the limited ensemble size and the model bias. In

this paper, we propose to replace the sample covariance in the EnKF with a high-dimensional tapering

covariance matrix estimator to address the estimation problem in high dimensions. A high-dimensional

EnKF scheme combining the covariance localization

with the inflation method and the iterative update

structure is developed. The proposed assimilation

scheme is tested on the Lorenz-96 model with spatially correlated observation systems. The results

demonstrate that the proposed method could improve

the assimilation performance under multiple settings.
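The snippet below sketches the two devices emphasized above inside a single stochastic EnKF analysis step: multiplicative inflation of the forecast ensemble and Schur-product (tapering) localization of the sample covariance. The taper shape, observation operator, and toy state are assumptions; the Lorenz-96 experiments and the iterative update scheme of the paper are not reproduced.

```python
import numpy as np

def taper_matrix(n, L):
    """Simple exponential taper rho_ij = exp(-|i-j|/L) on a 1-D grid (a stand-in taper)."""
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / L)

def enkf_analysis(ens, y, H, R, rng, infl=1.05, L=4.0):
    """One stochastic EnKF analysis step. ens: (n_state, n_ens) forecast ensemble."""
    mean = ens.mean(axis=1, keepdims=True)
    ens = mean + infl * (ens - mean)                    # multiplicative inflation
    P = np.cov(ens) * taper_matrix(ens.shape[0], L)     # Schur-product localization
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)        # Kalman gain
    y_pert = y[:, None] + np.linalg.cholesky(R) @ rng.normal(size=(len(y), ens.shape[1]))
    return ens + K @ (y_pert - H @ ens)                 # perturbed-observation update

# tiny demo: 40-dimensional state, every second component observed, 20 members
rng = np.random.default_rng(1)
n, n_ens = 40, 20
truth = np.sin(np.linspace(0, 2 * np.pi, n))
H = np.eye(n)[::2]
R = 0.1 * np.eye(H.shape[0])
ens = truth[:, None] + rng.normal(size=(n, n_ens))
y = H @ truth + rng.multivariate_normal(np.zeros(H.shape[0]), R)
post = enkf_analysis(ens, y, H, R, rng)
rmse = lambda e: np.sqrt(np.mean((e.mean(axis=1) - truth) ** 2))
print("prior RMSE:", round(rmse(ens), 3), " posterior RMSE:", round(rmse(post), 3))
```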

Joint work with Shouxia Wang, Xiaogu Zheng and

Song Xi Chen.

Contributed Session CS042: Bayesian and Nonparametric Statistical Inferences

Flow Annealed Kalman Inversion with Sequential

Monte Carlo for Gradient-Free Bayesian Inference

Richard Grumitt

Tsinghua University

Abstract: In many scientific inverse problems we are

confronted by expensive forward models where we do

not have access to gradients of the forward model. In

this regime, traditional sampling methods quickly

become prohibitive, requiring a large number of serial

model evaluations. Ensemble Kalman Inversion (EKI)

has been proposed as an approximate method for

solving Bayesian inverse problems, with typically

rapid convergence properties. However, EKI is only

exact for linear forward models with Gaussian target

distributions. In this talk I will discuss the combination of Flow Annealed Kalman Inversion (FAKI),

which exploits normalizing flows (NF) to relax the


Gaussian ansatz of EKI, with Sequential Monte Carlo

(SMC). In this approach, we move from the prior to

the posterior through a sequence of annealed targets,

with FAKI being used to initialize the particle distribution at each temperature level. The NF is further

exploited to perform preconditioned Markov Chain

Monte Carlo (MCMC) iterations to distribute particles

according to each annealed target. By combining

FAKI with SMC, we are able to correct for both the

Gaussianity and model linearity assumptions of EKI. I

will demonstrate the performance of the method on a

number of challenging Bayesian inverse problems,

where we observe significant acceleration in the convergence rate compared to standard SMC with NF

preconditioning.
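For context, the sketch below implements one iteration of the basic Ensemble Kalman Inversion update that FAKI and the SMC scheme correct and build upon, applied to a toy gradient-free nonlinear inverse problem. The forward model, noise level, and iteration count are assumptions; the normalizing-flow reparameterization and annealed SMC targets are not shown.

```python
import numpy as np

def eki_step(U, forward, y, Gamma, rng):
    """One Ensemble Kalman Inversion iteration. U: (n_ens, d) parameter ensemble."""
    G = np.array([forward(u) for u in U])            # forward-model outputs, (n_ens, m)
    du, dg = U - U.mean(axis=0), G - G.mean(axis=0)
    C_ug = du.T @ dg / (len(U) - 1)                  # parameter-output covariance
    C_gg = dg.T @ dg / (len(U) - 1)                  # output-output covariance
    K = C_ug @ np.linalg.inv(C_gg + Gamma)
    noise = rng.multivariate_normal(np.zeros(len(y)), Gamma, size=len(U))
    return U + (y + noise - G) @ K.T                 # Kalman-type update of each member

# toy inverse problem: recover u from y = G(u) + noise without gradients of G
rng = np.random.default_rng(0)
forward = lambda u: np.array([u[0] ** 2 + u[1], np.sin(u[1])])
u_true = np.array([0.8, -0.5])
Gamma = 0.01 * np.eye(2)
y = forward(u_true) + rng.multivariate_normal(np.zeros(2), Gamma)

U = rng.normal(size=(100, 2))                        # ensemble drawn from the prior
for _ in range(20):
    U = eki_step(U, forward, y, Gamma, rng)
print("EKI ensemble mean:", np.round(U.mean(axis=0), 3), " truth:", u_true)
```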

Joint work with Minas Karamanis, Uros Seljak.

A Tree-Based Method for Bootstrapping in Data

Envelopment Analysis

Yu Zhao

Tokyo University of Science

Abstract: Data Envelopment Analysis (DEA) is

widely used in management and operational research

as a tool for evaluating efficiencies and productivity

changes. The DEA models are generally formulated as

linear or nonlinear optimization problems and can be

solved using mathematical programming solvers.

Although the standard DEA models are deterministic,

many efforts have been made in previous studies to

introduce statistical analysis into DEA, such as the

bootstrapping algorithm, regression-based approach

(stochastic nonparametric envelopment of data),

among others. A key issue when using bootstrapping

is replicating the data-generating process assumed to

generate the observed data. Conventional bootstrap

DEA takes advantage of the empirical distribution of

efficiencies, which are computed from a deterministic

DEA model, to generate replications of the observations. In this study, we apply a tree-based method that

clusters the observations based on the minimization of

entropy, resulting in a piece-wise Gaussian density

distribution for the observed data. The properties of

the bootstrap DEA estimates and experiments will be

discussed in detail in the presentation.
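The deterministic building block that any bootstrap DEA procedure resamples is the efficiency score itself; the sketch below computes the input-oriented CCR (constant returns to scale) score of one decision-making unit as a linear program with SciPy. The toy data are assumptions, and the tree-based resampling step of the talk is not shown.

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, k):
    """Input-oriented CCR efficiency of DMU k.
    X: (n_dmu, n_inputs), Y: (n_dmu, n_outputs); variables are [theta, lambda_1..lambda_n]."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.r_[1.0, np.zeros(n)]                          # minimize theta
    A_in = np.hstack([-X[[k]].T, X.T])                   # sum_j lambda_j x_j <= theta * x_k
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])          # sum_j lambda_j y_j >= y_k
    res = linprog(c, A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.r_[np.zeros(m), -Y[k]],
                  bounds=[(0, None)] * (n + 1))
    return res.fun

rng = np.random.default_rng(0)
X_in = rng.uniform(1, 10, size=(8, 2))                   # 8 DMUs, 2 inputs
Y_out = np.sqrt(X_in.sum(axis=1, keepdims=True)) * rng.uniform(0.5, 1.0, size=(8, 1))
print("CCR efficiency scores:", [round(ccr_efficiency(X_in, Y_out, k), 3) for k in range(8)])
```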

Statistical Summaries for Bayesian Analysis in

EoR Science

Tom Binnie

Tsinghua University

Abstract: We apply a variety of statistical summaries

of the 21cm signal for EoR analyses. In previous work,

we use the 21cm power spectrum (PS) to distinguish

inside-out and outside-in morphologies of reionization

with mock observations using Bayesian model selection.

We expand our previous work with a model that includes X-ray heating and look in detail at how many redshifts an observation must span to decisively distinguish a saturated from a non-saturated spin temperature. We also include UV luminosity functions to gain

synergy with HST and JWST observations. Recently,

we distinguished EoR morphologies by integrating

learnt-posterior distributions with pyDelfi when

summarising the light-cone with a 3D-CNN. "Likelihood-free" inference provides greater precision than

the 21cm PS and can distinguish morphologies decisively, but it is not as flexible as the PS when it comes

to successful parameter estimation on different models.

Lastly, we replace the standard Fourier transform

within the PS with the Morlet wavelet transform to

construct the Morlet PS, a statistic that is ergodic of

the entire light-cone. Our current work shows a significant precision increase in parameter estimation

when compared to the PS because we evolve wavelets

along the line-of-sight to remove bias from the

light-cone effect. However, as the statistic evolves, the

Bayesian likelihood must include a covariance term

which currently picks up simulation artefacts along

the line-of-sight caused by wrapping coeval cubes

throughout the light-cone length. We are developing a

version of 21cmFAST that contains structure modes

that span the line-of-sight length of the light-cone to

remedy this.

Bayesian Test of Equality of Binary Data for a 4×4

Crossover Trial under a Latin-Square Design

Mingan Yang
