The 2nd Joint Conference on Statistics and Data Science in China


and Unmeasured Confounders in a Bayesian

Framework

Yanxun Xu

Johns Hopkins University

Abstract: Effectively navigating decision-making

demands a comprehensive understanding of causal

relationships, especially with unmeasured confounders in the environment. Traditional causal inference

methods often rely on auxiliary data sources, such as instrumental variables or proxies, to identify true causal effects. Unfortunately, such data might be difficult

or impractical to acquire in observational studies,

leading to potential inaccuracies and incomplete inference. To address this limitation, we propose a novel

approach that integrates Bayesian joint modeling with

causal inference for effective decision-making under

the presence of unmeasured confounding. By taking

advantage of proper model design and assumptions,

the proposed framework can identify true causal effects without the reliance on additional data sources,

thereby leading to more informed and effective decisions in complex real-world observational scenarios.

Invited Session IS011: Data Science and Engineering

Active Machine Learning for Surrogate Modeling

in Complex Engineering Systems

Xiaowei Yue

Tsinghua University

Abstract: Active learning is a subfield of advanced

statistics and machine learning that focuses on improving the data collection efficiency in expensive-to-evaluate engineering systems. Surrogate models are indispensable in the analysis of complex engineering systems. The quality of surrogate models is

determined by the data quality and the model class, but achieving a high standard for both is challenging in

complex engineering systems. Heterogeneity, implicit

constraints, and extreme events are typical examples

of the factors that complicate systems, yet they have

been underestimated or disregarded in machine learning. This presentation is dedicated to tackling the

challenges in surrogate modeling of complex engineering systems by developing the following machine

learning methodologies. (i) Partitioned active learning

partitions the design space according to heterogeneity

in response features, thereby exploiting localized

models to measure the informativeness of unlabeled

data. (ii) For the systems with implicit constraints,

failure-averse active learning incorporates constraint

outputs to estimate the safe region and avoid undesirable failures in learning the target function. (iii) The

multi-output extreme spatial learning enables modeling and simulating extreme events in composite fuselage assembly. The proposed methods were applied to

real-world case studies and outperformed benchmark

methods. The code for these algorithms is open-sourced on our GitHub.

Nonparametric Statistical Inference via Metric

Distribution Function in Metric Spaces

Wenliang Pan

Chinese Academy of Sciences

Abstract: The distribution function is essential in

statistical inference and connected with samples to

form a directed closed loop by the correspondence

theorem in measure theory and the Glivenko-Cantelli

and Donsker properties. This connection creates a

paradigm for statistical inference. However, existing

distribution functions are defined in Euclidean spaces

and are no longer convenient to use for rapidly evolving data objects of complex nature. It is imperative to

develop the concept of the distribution function in a

more general space to meet emerging needs. Note that

the linearity allows us to use hypercubes to define the

distribution function in a Euclidean space. Still, without the linearity in a metric space, we must work with

the metric to investigate the probability measure. We

introduce a class of metric distribution functions

through the metric only. We overcome this challenging step by proving the correspondence theorem and

the Glivenko-Cantelli theorem for metric distribution

functions in metric spaces, laying the foundation for

conducting rational statistical inference for metric

space-valued data. Then, we develop a homogeneity

test and a mutual independence test for non-Euclidean

random objects and present comprehensive empirical


evidence to support the performance of our proposed

methods.

Joint work with Xueqin Wang, Jin Zhu, Junhao Zhu,

Heping Zhang.
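For orientation, one natural ball-based analogue of the Euclidean distribution function, consistent with the description above but not necessarily the authors' exact definition, replaces the hypercube event {X ≤ x} with a closed metric ball: for a Borel probability measure μ on a metric space (M, d), take F_μ(u, r) = μ({x ∈ M : d(x, u) ≤ r}), with empirical counterpart F̂_n(u, r) = n⁻¹ ∑_{i=1}^{n} 1{d(X_i, u) ≤ r}. A Glivenko–Cantelli-type result for such objects concerns the uniform convergence of F̂_n to F_μ over (u, r), which is the kind of foundation the abstract refers to.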

Subsampling and Rare Events Data Beyond Binary

Responses

Haiying Wang

University of Connecticut

Abstract: Rare events data are commonly encountered when the events of interest occur with small

probabilities. The responses in the observed data are

dominated by zeros. Subsampling is very effective in reducing the computational cost of analyzing rare events data without losing significant estimation efficiency.

Existing investigations on subsampling with rare

events data focus on binary response models. We investigate rare events data beyond binary responses. If

sufficient data points for the non-rare observations are

sampled, there will be no statistical efficiency loss. In

the scenario that there is estimation efficiency loss due

to downsampling, we develop an optimal sampling design to minimize the information loss.
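As a concrete toy illustration of this kind of subsampling (not the estimator proposed in the talk), the sketch below keeps all events and a small fraction of the zero responses in a logistic-regression setting, then corrects the intercept by the log of the subsampling rate; the function names, the 5% keep rate, and the use of statsmodels are assumptions made only for this example.

```python
# A minimal sketch, assuming a logistic model for a rare binary response:
# keep all events, subsample non-events, and correct the intercept by the
# log of the subsampling rate (standard case-control adjustment).
import numpy as np
import statsmodels.api as sm

def subsample_rare_events(X, y, keep_rate=0.05, seed=0):
    """Keep every event (y == 1) and a `keep_rate` fraction of non-events."""
    rng = np.random.default_rng(seed)
    keep = (y == 1) | (rng.random(len(y)) < keep_rate)
    return X[keep], y[keep]

def fit_with_offset_correction(X, y, keep_rate):
    """Fit logistic regression on subsampled data; shift the intercept back."""
    design = sm.add_constant(X)
    fit = sm.GLM(y, design, family=sm.families.Binomial()).fit()
    params = fit.params.copy()
    params[0] += np.log(keep_rate)  # undo the bias induced by subsampling non-events
    return params

# Toy usage with simulated rare-event data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200_000, 3))
logits = -6.0 + X @ np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
Xs, ys = subsample_rare_events(X, y, keep_rate=0.05)
print(fit_with_offset_correction(Xs, ys, keep_rate=0.05))
```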

Kernel Function Packet Decomposition: From

State Space Models to Compact Support Bases


Liang Ding

Fudan University

Abstract: It is well known that the state space (SS)

model formulation of a Gaussian process (GP) can

lower its training and prediction time both to O(n)

for n data points. We prove that an m-dimensional SS

model formulation of GP is equivalent to a concept we

introduce as the general right Kernel Packet (KP): a

transformation for the GP covariance K such that

∑_{i=0}^{m} a_i D_t^{(j)} K(t, t_i) = 0 holds for any t ≤ t_1, 0 ≤ j ≤ m − 1, and m + 1 consecutive points t_i, where D_t^{(j)} f(t) denotes the j-th derivative acting on t. We extend this idea to the backward SS model formulation, leading to the left KP for the next m consecutive points: ∑_{i=0}^{m} b_i D_t^{(j)} K(t, t_{m+i}) = 0 for any t ≥ t_{2m}.

By combining both left and right KPs, we can prove

that a suitable linear combination of these covariance

functions yields m KP functions compactly supported

on (t_0, t_{2m}). KPs improve GP prediction time to O(log n) or O(1), enable broader applications including GP's derivatives and kernel multiplications,

and can be generalized to multi-dimensional additive

and product kernels for scattered data.

A Class of Partition-Based Bayesian Optimization

Algorithms for Stochastic Simulations

Songhao Wang

Southern University of Science and Technology

Abstract: Bayesian optimization (BO) is a popular

simulation optimization approach. Despite its many

successful applications, there remain several practical

issues that have to be addressed. These include the

non-trivial optimization of the inner acquisition function (search criterion) to find the future evaluation

points and the over-exploitative behavior of some BO

algorithms. These issues can cause BO to select inferior points or get trapped in locally optimal regions

before exploring other more promising regions. This

work proposes a new partition-based BO algorithm

where the acquisition function is optimized over a

representative set of finite points in each partition

instead of the whole design space to reduce the computational complexity. Additionally, to overcome

over-exploitation, the algorithm considers regions of

different sizes simultaneously in each iteration,

providing focus on exploration in larger regions especially at the start of the algorithm. Numerical experiments show that these features help in faster convergence to the optimal point.

Joint work with Szu Hui Ng, Haowei Wang.
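The following toy sketch illustrates the general flavor of optimizing the acquisition function over a finite representative set within each partition of the design space, here with a scikit-learn GP surrogate, k-means partitions, and expected improvement; these specific choices and all names are illustrative assumptions rather than the algorithm proposed in the talk.

```python
# A toy sketch: fit a GP surrogate, then maximize expected improvement (EI)
# only over a finite representative candidate set within each partition,
# rather than over the whole continuous design space.
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    z = (best - mu) / np.maximum(sigma, 1e-12)          # minimization convention
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def next_point(X_obs, y_obs, candidates, n_partitions=4, reps_per_partition=20, seed=0):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_obs, y_obs)
    labels = KMeans(n_clusters=n_partitions, n_init=10, random_state=seed).fit_predict(candidates)
    rng = np.random.default_rng(seed)
    reps = []
    for k in range(n_partitions):                        # a few representatives per partition
        idx = np.flatnonzero(labels == k)
        reps.append(rng.choice(idx, size=min(reps_per_partition, idx.size), replace=False))
    reps = np.concatenate(reps)
    mu, sigma = gp.predict(candidates[reps], return_std=True)
    ei = expected_improvement(mu, sigma, best=y_obs.min())
    return candidates[reps[np.argmax(ei)]]

# Toy usage on a 1-d noisy quadratic.
rng = np.random.default_rng(1)
X_obs = rng.uniform(-2, 2, size=(8, 1))
y_obs = (X_obs[:, 0] - 0.7) ** 2 + 0.1 * rng.normal(size=8)
candidates = np.linspace(-2, 2, 400).reshape(-1, 1)
print(next_point(X_obs, y_obs, candidates))
```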

Invited Session IS051: Recent Advances in

High-Dimensional and Heterogeneous Data Analysis

A General Framework to Extend Sufficient Dimension Reductions to the Cases of the Mixture

Multivariate Elliptical Distributions

Fei Chen

Yunnan University of Finance and Economics

Abstract: In the sufficient dimension reduction

(SDR), many methods depend on some assumptions


on the distribution of the predictor vector, such as the

linear design condition (L.D.C.), the assumption of

constant conditional variance, and so on. The mixture

distributions emerge frequently in practice, but they

may not satisfy the above assumptions. In this article,

a general framework is proposed to extend various

SDR methods to the cases where the predictor vector

follows the mixture elliptical distributions, together

with the asymptotic property for the consistency of the

kernel matrix estimators. For illustration, the extensions of several classical SDR approaches under the

proposed framework are detailed. Moreover, a method

to estimate the structural dimension is given, together

with a procedure to check an assumption called homogeneity. The proposed methodology is illustrated

by simulated and real examples.

Joint work with Wenjuan Li, Hongming Pei, Ali

Jiang.

Quantile Regression and Homogeneity Detection of

a Semiparametric Panel Data Model

Rui Li

Shanghai University of International Business and

Economics

Abstract: In this article, we study the quantile regression and homogeneity detection of a varying index

coefficient panel data model with fixed individual

effect that shows a nonlinear time trend. Based on

spline approximation of nonparametric functions, we

get the estimates of trend function, link function and

index parameters, and establish the convergence rate

and asymptotic normality accordingly. Noting that the

subjects within a group may share the same trend

functions, we are motivated to further identify the possible homogeneity of such trend functions by integrating regularization with a binary segmentation algorithm. Consequently, we provide more efficient estimates via grouped observations and improve the large

sample properties. Simulation studies and a real analysis of Air pollution data and Integrated surface data

(APD & ISD) are conducted to illustrate the finite

sample performance of our approach.

Joint work with Tao Li, Huacheng Su, Jinhong You.

A Bayesian Mixture of Exponential Family Factor

Models for Uncovering Disease Progression Subtypes

Kai Kang

Sun Yat-sen University

Abstract: Patients affected by neurological disorders

usually present substantial heterogeneity in multi-domain biomarkers and clinical measures. This

heterogeneity arises from differences in disease stage,

unique characteristics, and membership in distinct

latent subtypes. Exploring such complex heterogeneity and identifying disease progression-related markers

is crucial for early diagnosis and developing timely

and targeted interventions. We propose a mixture exponential family trajectory model to integrate markers

from multiple modalities to learn the disease progression. We incorporate continuous neuroimaging and

micro-RNA sequencing biomarkers, categorical clinical symptoms, and ordinal cognitive markers using

appropriate exponential family distributions with

lower-dimensional latent factors. The mixture model

assigns subtype-specific parameters to these distributions for each mixture component, enabling the characterization of patients in heterogeneous latent subgroups. The proposed model can also describe the

nonlinear trajectory of disease deterioration and provide a temporal sequence of decline for each marker.

We develop a Bayesian estimation procedure coupled

with efficient Markov chain Monte Carlo (MCMC)

sampling schemes to perform statistical inference for

the mixture model. The proposed method is assessed

through extensive simulation studies and an application to Parkinson’s Progression Markers Initiative

(PPMI) to learn the temporal ordering and subtypes of

neurodegeneration of Parkinson's disease (PD).

Joint work with Qinxia Wang, Zhanpeng Xu, Yuanjia

Wang.

Linking Brain-Wide Gene Expression and Neuroimaging Data

Shuixia Guo

Hunan Normal University

Abstract: Integrating the data of different omics to

explore the pathogenesis of mental diseases and predict the disease is the mainstream research direction of

brain science. In this report, we first present a general

analytical framework for integrating genomic, transcriptomic, and brain-imaging data for brain science

research. Then, combined with the Alzheimer's Disease Neuroimaging Initiative (ADNI) project database

and Allen Human Brain Atlas (AHBA) databases, we

performed AD subtype studies based on individualized structural covariance network and further investigated the cognition, disease progression, morphological features, and gene expression profiles differences among subtypes. The two identified AD subtypes have implications for etiological mechanisms

and precision medicine.

Invited Session IS062: Recent Developments in the

Analysis of High-Dimensional and Complex Data

Edge Differentially Private Estimation in the Beta-Model via Jittering and Method of Moments

Fengting Yi

Yunnan University

Abstract: A standing challenge in data privacy is the

trade-off between the level of privacy and the efficiency of statistical inference. Here we conduct an

in-depth study of this trade-off for parameter estimation in the β-model (Chatterjee, Diaconis and Sly,

2011) for edge differentially private network data

released via jittering (Karwa, Krivitsky and Slavkovi´c, 2017). Unlike most previous approaches based

on maximum likelihood estimation for this network

model, we proceed via method of moments. This

choice facilitates our exploration of a substantially

broader range of privacy levels, corresponding to stricter privacy, than has been considered to date. Over this new range we discover that our proposed estimator for the parameters exhibits an interesting phase transition, with

both its convergence rate and asymptotic variance

following one of three different regimes of behavior

depending on the level of privacy. Because identification of the operable regime is difficult or impossible in

practice, we devise a novel adaptive bootstrap procedure to construct uniform inference across different

phases. In fact, leveraging this bootstrap we are able

to provide for simultaneous inference of all parameters in the β-model (i.e., equal to the number of vertices), which would appear to be the first result of its

kind. Numerical experiments confirm the competitive

and reliable finite sample performance of the proposed inference methods, next to a comparable maximum likelihood method, as well as significant advantages in terms of computational speed and

memory.

Joint work with Jinyuan Chang, Qiao Hu, Eric D.

Kolaczyk, Qiwei Yao.

Modelling Matrix Time Series via a Tensor

CP-Decomposition

Jing He

Southwestern University of Finance and Economics

Abstract: We consider modelling matrix time series

based on a tensor canonical polyadic

(CP)-decomposition. Instead of using an iterative

algorithm which is the standard practice for estimating

CP-decompositions, we propose a new and one-pass

estimation procedure based on a generalized eigenanalysis constructed from the serial dependence structure of the underlying process. To overcome the intricacy of solving a rank-reduced generalized eigenequation, we propose a further refined approach which

projects it into a lower dimensional full-ranked

eigenequation. This refined method can significantly

improve the finite-sample performance. We show that

all the component coefficient vectors in the

CP-decomposition can be estimated consistently. The

proposed model and the estimation method are also

illustrated with both simulated and real data, showing

effective dimension-reduction in modelling and forecasting matrix time series.

Joint work with Jinyuan Chang, Lin Yang, Qiwei

Yao.

High-Dimensional Knockoff-Assisted False Discovery Rate Control

Chenlong Li

Taiyuan University of Technology

Abstract: The two-stage knockoff-assisted Bonferroni-Benjamini-Hochberg-type procedure, introduced by Sarkar &

Tang (2022), stands as a new and effective tool for


multiple inference. Nevertheless, this approach was

not originally designed for variable selection in

high-dimensional settings, a crucial gap in its application. Our research extends the innovative approach of

Sarkar & Tang (2022) to the realm of

high-dimensional data, leveraging the design matrix and its MX knockoff copy (Candès et al., 2018),

to devise Bonferroni-Benjamini-Hochberg-type methods.

The proposed methods offer a robust control of the

false discovery rate, independent of any specific correlation patterns within the explanatory variables. The

technical novelty lies in the demonstration that the

novel multiple testing procedures can control the false

discovery rate in high-dimensional settings. Both

simulations and real-world data applications demonstrate the competitiveness of our proposed methods

against the false discovery rate-controlling method of

Candès et al. (2018).

Joint work with Jinyuan Chang, Cheng Yong Tang,

Zhengtian Zhu.

Online Bootstrap Inference for the Geometric Median

Guanghui Cheng

Guangzhou University

Abstract: In real-world applications, the geometric

median is a natural quantity to consider for robust

inference of location or central tendency, particularly

when dealing with non-standard or irregular data distributions. An innovative online bootstrap inference

algorithm, using the averaged nonlinear stochastic

gradient algorithm, is proposed to make statistical

inference about the geometric median from massive

datasets. The method is computationally fast and

memory-friendly, and it is easy to update as new data

is received sequentially. The validity of the proposed

online bootstrap inference is theoretically justified.

Simulation studies under a variety of scenarios are

conducted to demonstrate its effectiveness and efficiency in terms of computation speed and memory

usage. Additionally, the online inference procedure is

applied to a large publicly available dataset for skin

segmentation.
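A minimal sketch of the general recipe described above, assuming an averaged stochastic gradient recursion for the geometric median and a perturbation-based online bootstrap with mean-one multiplier weights; the step-size schedule, the exponential weights, and the percentile intervals are illustrative choices, not necessarily the authors' exact scheme.

```python
# A minimal sketch: averaged stochastic gradient updates for the geometric
# median, with B perturbed replicates (random multiplier weights) maintained
# online to approximate the sampling distribution of the averaged estimate.
import numpy as np

def online_geometric_median(stream, dim, n_boot=200, gamma0=1.0, alpha=0.75, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)                    # main SGD iterate
    theta_bar = np.zeros(dim)                # Polyak average of the main iterate
    boot = np.zeros((n_boot, dim))           # perturbed iterates
    boot_bar = np.zeros((n_boot, dim))       # their running averages
    for t, x in enumerate(stream, start=1):
        step = gamma0 * t ** (-alpha)
        grad = (theta - x) / max(np.linalg.norm(theta - x), 1e-12)
        theta -= step * grad
        theta_bar += (theta - theta_bar) / t
        w = rng.exponential(1.0, size=n_boot)            # mean-one multiplier weights
        g = (boot - x) / np.maximum(np.linalg.norm(boot - x, axis=1, keepdims=True), 1e-12)
        boot -= step * w[:, None] * g
        boot_bar += (boot - boot_bar) / t
    return theta_bar, boot_bar               # intervals can be read off bootstrap quantiles

# Toy usage: heavy-tailed 3-d data.
rng = np.random.default_rng(1)
data = rng.standard_t(df=2, size=(50_000, 3)) + np.array([1.0, -2.0, 0.5])
est, boot_avg = online_geometric_median(iter(data), dim=3)
lo, hi = np.percentile(boot_avg, [2.5, 97.5], axis=0)
print(est, lo, hi)
```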

Contributed Session CS007: Precision Medicine

and Survival Data

Flexible Regression Methods for Estimating Optimal Individualized Treatment Regimes with Scalar

and Functional Covariates

Kaidi Kong

Beijing University of Technology

Abstract: In personalized medicine study, how to

estimate the optimal individualized treatment regime

based on available individual information is a fundamental problem. In recent years, functional data analysis has appeared extensively in medical research, while optimal individualized treatment regimes based on a combination of scalar and functional covariates have rarely been studied, and the few existing studies are mostly conducted in the context

of randomized trials. In this paper, we propose a flexible regression-based method with scalar and functional covariates. Different from previous studies, our

approach is applicable to both randomized trials and

observational studies. The convergence rates of the

proposed optimal individualized treatment regime

estimators are presented for both situations, and extensive simulation studies and a real data analysis

are conducted to assess the performance of our proposed method.

Joint work with Li Guan, Zhongzhan Zhang.

A Doubly Robust Estimation in Learning Optimal

Individualized Treatment Regimes with Survival

Outcomes

Rujia Zheng

Northeast Normal University

Abstract: An essential aspect of precision medicine

involves the identification of optimal individual

treatment regimes to prolong survival time using survival data. Inverse probability weighting is a popular

method to estimate the value function of precision

medicine. However, it is essential to consider both

unbalanced covariates and the censoring variable when dealing with survival data. Moreover, it has been empirically

shown that the inverse probability weighting estimator

is sensitive to slight model misspecification. To address this issue, we propose the contrast value function under survival data and provide an estimator for

this function by using the derived optimal covariate

balancing conditions. The estimator is doubly robust,

that is, it is consistent when the propensity score model and the censoring model are correctly specified simultaneously or the outcome model is correctly specified. Asymptotic normality of the estimator can be established under standard regularity conditions. A large number of simulations demonstrate the superiority of our method, and we illustrate it with an application to the GSE6532 cohort.

Joint work with Wensheng Zhu.

Generalized Linear Mixed-Effects Joint Model for

Longitudinal and Bivariate Survival Data

Jiming Chen

Yunnan University

Abstract: Joint modeling of longitudinal and survival

data (JMLS) has been widely utilized to investigate

the relationship between longitudinal outcomes and

two survival times in cancer clinical trials. Existing

methods for analyzing JMLS mainly focus on the

assumptions that two time-to-event outcomes are uncorrelated and longitudinal outcomes are continuous.

However, in many HIV/AIDS clinical trials, bivariate

survival times and discrete longitudinal outcomes are

routinely encountered. To this end, this paper proposes

a novel generalized linear mixed-effects JMLS by

introducing a copula function to account for the correlation between two survival times and incorporating

an exponential family distribution to accommodate

either continuous or discrete longitudinal outcomes,

and allowing covariates to be time-dependent and

random effects to be shared for JMLS. A nonparametric maximum likelihood estimation procedure together

with expectation-maximization algorithm are developed to estimate parameters and nonparametric functions in JMLS. Under some regularity conditions, the

consistency and asymptotic normality of parameter

and nonparametric function estimators are shown.

Simulation studies and a real example from the Framingham Heart Study (FHS) are used to illustrate the proposed methodologies.

Joint work with Niansheng Tang.

A New and Unified Method for Regression Analysis

of Interval-Censored Failure Time Data under

Semiparametric Transformation Models with

Missing Covariates

Yichen Lou

Jilin University

Abstract: This paper discusses regression analysis of

interval-censored failure time data arising from semiparametric transformation models in the presence of

missing covariates. Although some methods have

been developed for the problem, they either apply

only to limited situations or may have some computational issues. To address these issues, we propose a

new and unified two-step inference procedure that can

be easily implemented using the existing or standard

software. The proposed method makes use of a set of

working models to extract partial information from

incomplete observations and yields a consistent estimator of regression parameters assuming missing at

random. An extensive simulation study is conducted

and indicates that it performs well in practical situations. Finally, we apply the proposed approach to an

Alzheimer’s Disease study that motivated this study.

Joint work with Yuqing Ma, Mingyue Du.

Bayesian Transformation Model for Spatial Partly

Interval-Censored Data

Mingyue Qiu

Capital Normal University

Abstract: The transformation model with partly interval-censored data offers a highly flexible modeling

framework that can simultaneously support multiple

common survival models and a wide variety of censored data types. However, the real data may contain

unexplained heterogeneity that cannot be entirely

explained by covariates and may be brought on by a

variety of unmeasured regional characteristics. Due to

this, we introduce the conditionally autoregressive

prior into the transformation model with partly interval-censored data and take the spatial frailty into account. An efficient Markov chain Monte Carlo method

is proposed to handle the posterior sampling and

model inference. The approach is simple to use and


does not include any challenging Metropolis steps

owing to four-stage data augmentation. Through several simulations, the suggested method's empirical

performance is assessed and then the method is used

in a leukemia study.

Joint work with Tao Hu.

Contributed Session CS008: Factor Models and

Bayesian Analysis

Forecasting Midprice Movement of HFT Data with

Bayesian Stochastic Attention-Based Neural Bag of

Implicit Feature Models

Xu Liu

Shandong University of Finance and Economics

Abstract: Classification of midprice movements presents a formidable challenge in the realm of

High-frequency Trading (HFT) data analysis. The

complex temporal dependence and invisible volatility

inherent in automated trading systems embed layers of

information that are not readily decipherable for direct

incorporation into traditional statistics and machine

learning processes. Recent studies, such as Passalis &

Tefas (2017), have demonstrated the efficacy of Bag

of Feature (BoF) models in reconstructing sequences

with features and representing them as histograms.

However, while BoF models treat each feature as equally

important, they overlook the nuanced relevance of

underlying, implicit features that emerge from the

dynamics of the trading operation. To address this

limitation, we propose the integration of an attention

module designed to prioritize salient features while

disregarding the irrelevant ones, enhancing the model’s sensitivity to informative but implicit data elements. Our approach innovates further by implementing a Bayesian Attention-based Neural Bag of mixed

features (BANBoMF) model that employs a stochastic

approach to align both explicit and implicit features.

In this model, attention weights are derived from a

reparameterizable Gamma distribution, facilitated

through variational Bayesian inference. This stochastic approach proves superior in capturing the intricate

temporal dependencies characteristic of HFT data, as

highlighted by Fan et al. (2020). The attention mechanism’s effectiveness is further augmented by optimizing the Evidence Lower Bound (ELBO), which is

semi-analytic and updated in conjunction with a sequential prior, thus refining the feature alignment

process. Empirical results from both simulation studies and the real HFT dataset (FI2010) validate that this

optimized, filtering feature-based model significantly

outperforms existing machine learning models. It

offers enhanced precision in classifying and interpreting the future dynamics of midprice movements in

HFT data, particularly by leveraging insights from

implicit feature interactions that were previously underutilized.

Forecasting Using Nonlinear Factor Models with

Nonparametrically Targeted Predictors

Siwei Wang

Hunan University

Abstract: This paper proposes a new factor model

that can summarize the nonlinear information content

contained in observed predictors into factor formation

and subsequent factor-augmented predictions. A nonparametric sieve method is adopted to approximate

the nonlinear functional space of the predictors, with

the partial correlation screening procedure used to

target the relevant nonlinear predictors to reduce predictor dimension and improve prediction efficiency.

Theoretical results including the asymptotic normality

of the predictive estimator and the sure screening

property of the predictor targeting are established.

Next, simulations demonstrate the nice forecasting

performance of our proposed methods. Finally, the

proposed method is successfully applied in inflation

prediction.

Joint work with Yundong Tu.

Sparse Multicategory Generalized Distance

Weighted Discrimination in Ultra-High Dimensions

Tong Su

Yunnan University

Abstract: Distance weighted discrimination (DWD)

is an appealing classification method that is capable of

overcoming data piling problems in high-dimensional

settings. Especially when various sparsity structures


are assumed in these settings, variable selection in

multicategory classification poses great challenges. In

this paper, we propose a multicategory generalized

DWD (MgDWD) method that maintains intrinsic

variable group structures during selection using a

sparse group lasso penalty. Theoretically, we derive

minimizer uniqueness for the penalized MgDWD loss

function and consistency properties for the proposed

classifier. We further develop an efficient algorithm

based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data

from an HIV study.

Joint work with Yafei Wang, Yi Liu, William G.

Branton, Eugene Asahchop, Christopher Power, Bei

Jiang, Linglong Kong, Niansheng Tang.

Efficient and Flexible LLMs-Generated Content

Detection

Wan Tian

Beihang University

Abstract: We propose an efficient detection method for content generated by large language models (LLMs),

referred to as MDCSD, that fully harnesses the universality of LLMs across multiple knowledge domains.

Its essence lies in utilizing the Mahalanobis distance-based confidence score (MDCS) of the text as the

ultimate discriminative feature. The prerequisite for

computing the MDCS is the efficient joint estimation

of the precision matrices corresponding to multiple

knowledge domains. We provide regularization techniques for constructing the joint estimator and offer

corresponding computationally feasible optimization

algorithms. The sparsity and consistency of the joint

estimator are demonstrated. Based on offset Rademacher complexity, we further derive the optimal

excess risk bound for the detection problem. Superiority of MDCSD is validated on the recently released

Human ChatGPT Comparison Corpus (HC3) dataset.

Joint work with Yijie Peng, Zhongfeng Qin.
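To make the central quantity concrete, the sketch below computes a Mahalanobis distance-based confidence score from per-domain feature means and a regularized precision matrix (here a single graphical-lasso estimate standing in for the joint estimator described above); all function names, the regularization level, and the synthetic features are illustrative assumptions.

```python
# A minimal sketch: Mahalanobis distance-based confidence score (MDCS) for
# feature vectors, using class/domain means and a regularized precision matrix.
import numpy as np
from sklearn.covariance import GraphicalLasso

def fit_mdcs(features_by_class, alpha=0.05):
    """Estimate class means and a shared sparse precision matrix."""
    means = {c: X.mean(axis=0) for c, X in features_by_class.items()}
    centered = np.vstack([X - means[c] for c, X in features_by_class.items()])
    precision = GraphicalLasso(alpha=alpha).fit(centered).precision_
    return means, precision

def mdcs(x, means, precision):
    """Confidence score: negative squared Mahalanobis distance to the nearest mean."""
    dists = [float((x - m) @ precision @ (x - m)) for m in means.values()]
    return -min(dists)

# Toy usage with two synthetic "domains" and one query vector.
rng = np.random.default_rng(0)
feats = {"news": rng.normal(0, 1, size=(500, 8)), "qa": rng.normal(2, 1, size=(500, 8))}
means, prec = fit_mdcs(feats)
print(mdcs(rng.normal(0, 1, size=8), means, prec))   # larger score = more "in-domain"
```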

An Algorithm to Assess Importance of Predictors

in Systematic Review of Prediction Models: A Case

Study with Simulations

Ruohua Yan

Beijing Children's Hospital

Abstract: Background: How to assess the importance

of predictors in systematic reviews (SR) of prediction

models remains largely unknown. The commonly

used indicators of importance for predictors in individual models include parameter estimates, information entropy, etc., but they cannot be quantitatively

synthesized through meta-analysis.

Methods: We explored the synthesis method of the

importance indicators in a simulation study, which

mainly solved the following four methodological issues: (1) whether to synthesize the original values of

the importance indicators or the importance ranks; (2)

whether to normalize the importance ranks to a common scale; (3) whether and how to impute the missing

values in importance ranks; and (4) whether to weight

the importance indicators according to the sample size

of the model during synthesis. Then we used an empirical SR to illustrate the feasibility and validity of

the synthesis method.

Results: According to the simulation experiments, we

found that ranking or normalizing the values of the

importance indicators had little impact on the synthesis results, while imputation of missing values in the

importance ranks had a great impact on the synthesis

results due to the incorporation of variable frequency.

Moreover, the results of means and weighted means of

the importance indicators were similar. In consideration of accuracy and interpretability, synthesis of the

normalized importance ranks by weighted mean was

recommended. The synthesis method was used in the

SR of prediction models for AKI. The importance

assessment results were approved by experienced

nephrologists, which further verified the reliability of

the synthesis method. Conclusions: An importance

assessment of predictors should be included in SR of

prediction models, using the weighted mean of importance ranks normalized to a common scale in

different models.

Joint work with Chen Wang, Chao Zhang, Xiaohang

Liu, Dong Zhang, Xiaoxia Peng.
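A minimal sketch of the recommended synthesis, assuming predictors are ranked within each model, ranks are normalized to a common scale, and the normalized ranks are averaged across models with sample-size weights; predictors absent from a model are simply skipped here rather than imputed, which is a simplification relative to the study's treatment of missing ranks.

```python
# A minimal sketch: weighted mean of importance ranks normalized to a common
# scale across models; predictors missing from a model are simply skipped here.
import numpy as np

def normalized_ranks(importance):
    """Map raw importance values to ranks scaled to (0, 1], larger = more important."""
    names = list(importance)
    order = np.argsort(np.argsort([-importance[n] for n in names]))  # 0 = most important
    p = len(names)
    return {n: 1.0 - r / p for n, r in zip(names, order)}

def synthesize(models):
    """models: list of (sample_size, {predictor: importance}) pairs."""
    totals, weights = {}, {}
    for n_obs, importance in models:
        for name, rank in normalized_ranks(importance).items():
            totals[name] = totals.get(name, 0.0) + n_obs * rank
            weights[name] = weights.get(name, 0.0) + n_obs
    return {name: totals[name] / weights[name] for name in totals}

# Toy usage: three prediction models reporting different importance metrics.
models = [
    (1200, {"creatinine": 0.40, "age": 0.25, "urine_output": 0.20, "sex": 0.05}),
    (800,  {"creatinine": 3.1,  "age": 1.2,  "diabetes": 0.8}),
    (500,  {"urine_output": 0.9, "creatinine": 0.7, "age": 0.3}),
]
print(sorted(synthesize(models).items(), key=lambda kv: -kv[1]))
```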

Contributed Session CS009: Statistical Machine Learning: Methodology and Applications

Adaptive Split Balancing for Optimal Random

Forest

Yuqian Zhang

Renmin University of China

Abstract: While random forests are commonly used

for regression problems, existing methods often lack

adaptability in complex situations or lose optimality

under simple, smooth scenarios. In this study, we

introduce the adaptive split balancing forest (ASBF),

capable of learning tree representations from data

while simultaneously achieving minimax optimality

under the Lipschitz class. To exploit higher-order

smoothness levels, we further propose a localized

version that attains the minimax rate under the Hölder

class ℋ^{q,β} for any q ∈ ℕ and β ∈ (0,1]. Rather than

relying on the widely-used random feature selection,

we consider a balanced modification to existing approaches. Our results indicate that an over-reliance on

auxiliary randomness may compromise the approximation power of tree models, leading to suboptimal

results. Conversely, a less random, more balanced

approach demonstrates optimality. Additionally, we

establish uniform upper bounds and explore the application of random forests in average treatment effect

estimation problems. Through simulation studies and

real-data applications, we demonstrate the superior

empirical performance of the proposed methods over

existing random forests.

Joint work with Weijie Ji, Jelena Bradic.

Split-and-Merge Based Simultaneous Input and

State Filter for Nonlinear Dynamic Systems

Sanfeng Hu

Yunnan University

Abstract: Systems with a variety of uncertainties are

often encountered in practice. The problem of simultaneous input and state estimation (SISE) for dynamic

systems has received a lot of attention, due to its

widespread presence in many application fields. This

paper investigates the SISE problem for nonlinear dynamic

systems. By the augmented state approach, we convert

this problem into a standard filtering problem. Then,

the split-and-merge technique is utilized for the augmented state estimation. Based on this, a novel

split-and-merge based simultaneous input and state

filter is developed in order to enhance the ability of

dealing with highly nonlinear systems. Simulations

demonstrate the effectiveness and efficiency of the

proposed filter.

Joint work with Liping Guo, Mengjiao Tang, Yao

Rong.

Alteration Detection of Tensor Dependence Structure via Sparsity-Exploited Reranking Algorithm

Li Ma

Xiamen University

Abstract: Tensor-valued data arise frequently from a

wide variety of scientific applications, and many

among them can be translated into an alteration detection problem of tensor dependence structures. In this

article, we formulate the problem under the popularly

adopted tensor-normal distributions and aim at

two-sample correlation/partial correlation comparisons of tensor-valued observations. Through decorrelation and centralization, a separable covariance

structure is employed to pool sample information

from different tensor modes to enhance the power of

the test. Additionally, we propose a novel Sparsity-Exploited Reranking Algorithm (SERA) to further

improve the multiple testing efficiency. Such efficiency gain is achieved by incorporating a carefully

constructed auxiliary tensor sequence to rerank the

p-values. Besides the tensor framework, SERA is also

generally applicable to a wide range of two-sample

large-scale inference problems with sparsity structures,

and is of independent interest. The asymptotic properties of the proposed test are derived and the algorithm

is shown to control the false discovery rate at the pre-specified level. We demonstrate the efficacy of the

proposed method through intensive simulations and

two scientific applications.

Joint work with Shenghao Qin, Yin Xia.

Mini-Batch Gradient Descent with Buffer

Haobo Qi

Beijing Normal University


Abstract: In this paper, we study a buffered

mini-batch gradient descent (BMGD) algorithm for

training complex models on massive datasets. The

algorithm studied here is designed for fast training on

a GPU-CPU system, which contains two steps: the

buffering step and the computation step. In the buffering step, a large batch of data (i.e., a buffer) are

loaded from the hard drive to the graphical memory of

GPU. In the computation step, a standard mini-batch

gradient descent (MGD) algorithm is applied to the

buffered data. Compared to the traditional MGD algorithm, the proposed BMGD algorithm can

be more efficient for two reasons. First, the BMGD

algorithm uses the buffered data for multiple rounds

of gradient update, which reduces the expensive

communication cost from the hard drive to GPU

memory. Second, the buffering step can be executed

in parallel so that the GPU does not have to stay idle

when loading new data. We first investigate the theoretical properties of BMGD algorithms under a linear

regression setting. The analysis is then extended to the

Polyak-Lojasiewicz loss function class.

The theoretical claims about the BMGD algorithm are

numerically verified by simulation studies. The practical usefulness of the proposed method is demonstrated by three image-related real data analyses.

Joint work with Du Huang, Yingqiu Zhu, Danyang

Huang, Hansheng Wang.
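A minimal NumPy sketch of the two-step structure for a linear-regression loss: load a large buffer, run several rounds of mini-batch gradient descent over it, then move to the next buffer; the step size, the sequential (non-parallel) buffer loading, and the loss are simplifications for illustration rather than the GPU-CPU implementation studied in the talk.

```python
# A minimal sketch of buffered mini-batch gradient descent (BMGD) for linear
# regression: repeatedly load a large "buffer" of data, then run several
# rounds of mini-batch gradient descent on that buffer before loading the next.
# (In the real setting the buffer load would be a disk-to-GPU transfer that
# can run in parallel with computation; here it is just array slicing.)
import numpy as np

def bmgd(X, y, buffer_size=10_000, batch_size=100, rounds_per_buffer=5, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for start in range(0, n, buffer_size):
        Xb, yb = X[start:start + buffer_size], y[start:start + buffer_size]  # "buffering step"
        m = len(yb)
        for _ in range(rounds_per_buffer):                                   # reuse the buffer
            for idx in np.array_split(rng.permutation(m), max(m // batch_size, 1)):
                grad = 2.0 * Xb[idx].T @ (Xb[idx] @ beta - yb[idx]) / len(idx)
                beta -= lr * grad
    return beta

# Toy usage.
rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100_000)
print(bmgd(X, y))
```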

Testing Non-inferiority of a New Treatment in

Three-Arm Clinical Trials with Binary Endpoints

Bin Yu

Yunnan University

Abstract: Two-arm non-inferiority trials without a

placebo are usually used to show that the experimental

treatment is not worse than the reference treatment by

a small pre-specified non-inferiority margin because

of ethical concerns. There are often problems in design, analysis and interpretation for two-arm

non-inferiority trials such as the selection of the

non-inferiority margin and the establishment of assay

sensitivity. A three-arm non-inferiority clinical trial

including a placebo is usually conducted to assess the

assay sensitivity and internal validation of a trial.

Consequently, some large sample approaches have

been developed to assess non-inferiority of a new

treatment in a three-arm trial over the years. However, these methods behave poorly when sample sizes in the three arms

are small. The objective is to develop some effective

approaches to test three-arm non-inferiority when

sample sizes are small. First, this paper derives the non-inferiority test assuming that assay sensitivity

is claimed. Saddlepoint approximation, exact and

approximate unconditional, and bootstrap-resampling

methods are developed to calculate p-values of the

Wald-type test, score test and likelihood ratio test.

Simulation studies are conducted to compare their

performance in terms of type I error rate and power.

Our empirical results show that the saddlepoint approximation method behaves better than the asymptotic method

for Wald-type test statistic; approximate unconditional

and bootstrap-resampling methods with score test

statistic perform better in the sense that their corresponding type I error rates are closer to the prespecified nominal level than those of other test procedures

when sample sizes are small. Secondly, the paper

develops a hybrid approach to construct simultaneous

confidence interval for assessing non-inferiority and

assay sensitivity in a three-arm trial. For comparison,

we present normal-approximation-based and bootstrap-resampling-based simultaneous confidence intervals. Simulation studies evidence that the hybrid

approach with the Wilson score statistic performs

better than other approaches in terms of empirical

coverage probability and mesial-noncoverage probability. Finally, we propose two fully Bayesian approaches, the posterior variance approach and Bayes

factor approach, to determine sample size required in

a three-arm non-inferiority trial with binary endpoints.

Through the simulation studies and a real example, we

found that the Bayes factor method always leads to smaller sample sizes than the posterior variance method; utilizing historical data can reduce the required sample size; the simultaneous test requires a larger sample size to achieve the desired power than the non-inferiority test; and the selection of the hyperparameters has a relatively large effect on the required sample size. When only controlling the posterior variance, the posterior


variance criterion is a simple and effective option for

obtaining a rough outcome. When a previous clinical trial has been conducted, it is recommended to use the Bayes

factor criterion in practical applications.

Joint work with Niansheng Tang.

Contributed Session CS010: Recent Advances in

Statistical Learning Methods

Efficient and Provable Online Reduced Rank Regression via Online Gradient Descent

Xiao Liu

Shanghai Jiao Tong University

Abstract: The Reduced Rank Regression (RRR)

model is frequently employed in machine learning. It

increases efficiency and interpretability by adding a

low-rank restriction to the coefficient matrix, which

can be used to cut down on the number of parameters.

In this paper, we study the RRR issue in an online

setting. Only a small batch of data can be utilized each

time, arriving in a stream. Previous analogous methods have relied on conventional least squares estimation, which is inefficient and does not theoretically

guarantee convergence rate or build connections with

offline strategies. We proposed an efficient online

RRR algorithm based on non-convex online gradient

descent. More importantly, based on a constant order

batch size and appropriate initialization, we theoretically prove the convergence result of the mean estimation error generated by our algorithm. Our result

achieves the optimal rate up to a logarithmic factor. We also propose an accelerated version of our algorithm. Our methods compete with existing methods in terms of accuracy and computation speed on synthetic data and in real applications.

Joint work with Weidong Liu, Xiaojun Mao.
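A minimal sketch, assuming the coefficient matrix is factored as C = U Vᵀ and updated by non-convex online gradient descent on streaming batches after a spectral-style initialization from the first batch; the squared loss, step size, and initialization are illustrative choices, not the exact algorithm analyzed in the talk.

```python
# A minimal sketch: online reduced rank regression with coefficient matrix
# C = U @ V.T (rank r), updated by online gradient descent on streaming batches.
import numpy as np

def init_from_first_batch(X0, Y0, rank):
    """Spectral-style initialization: truncated SVD of the least-squares fit."""
    C0, *_ = np.linalg.lstsq(X0, Y0, rcond=None)
    U, s, Vt = np.linalg.svd(C0, full_matrices=False)
    return U[:, :rank] * np.sqrt(s[:rank]), Vt[:rank].T * np.sqrt(s[:rank])

def online_rrr(batches, rank=2, lr=0.01):
    batches = iter(batches)
    X0, Y0 = next(batches)
    U, V = init_from_first_batch(X0, Y0, rank)
    for X, Y in batches:
        resid = X @ U @ V.T - Y                  # batch residual
        U -= lr * X.T @ resid @ V / len(X)       # gradient steps on both factors
        V -= lr * resid.T @ X @ U / len(X)
    return U @ V.T

# Toy usage: rank-2 truth, streamed in batches of 50.
rng = np.random.default_rng(0)
C_true = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))
def make_batches(n_batches=400, b=50):
    for _ in range(n_batches):
        X = rng.normal(size=(b, 6))
        yield X, X @ C_true + 0.1 * rng.normal(size=(b, 4))
print(np.linalg.norm(online_rrr(make_batches()) - C_true))
```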

Acceleration of Stochastic Gradient Descent with

Momentum by Averaging: Finite-Sample Rates

and Asymptotic Normality

Kejie Tang

Shanghai Jiao Tong University

Abstract: Stochastic gradient descent with momentum (SGDM) has been widely used in many machine

learning and statistical applications. Despite the observed empirical benefits of SGDM over traditional

SGD, the theoretical understanding of the role of

momentum for different learning rates in the optimization process remains widely open. We analyze the

finite-sample convergence rate of SGDM under the

strongly convex settings and show that, with a large

batch size, the mini-batch SGDM converges faster

than the mini-batch SGD to a neighborhood of the

optimal value. Additionally, our findings, supported

by theoretical analysis and numerical experiments,

indicate that SGDM permits broader choices of learning rates. Furthermore, we analyze the Polyak-averaging version of the SGDM estimator, establish its asymptotic normality, and justify its asymptotic equivalence to the averaged SGD. The asymptotic

distribution of the averaged SGDM enables uncertainty quantification of the algorithm output and statistical inference of the model parameters.

Joint work with Weidong Liu, Yichen Zhang, Xi

Chen.
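A minimal sketch of mini-batch SGD with heavy-ball momentum plus Polyak–Ruppert averaging of the iterates on a least-squares toy problem; the momentum value, step-size schedule, and loss are illustrative assumptions rather than the exact setting analyzed in the talk.

```python
# A minimal sketch: mini-batch SGD with heavy-ball momentum (SGDM) and
# Polyak-Ruppert averaging of the iterates for a least-squares objective.
import numpy as np

def sgdm_averaged(X, y, batch_size=64, momentum=0.9, gamma0=0.5, alpha=0.6,
                  n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    velocity = np.zeros(p)
    theta_bar = np.zeros(p)                     # running Polyak average
    for t in range(1, n_steps + 1):
        idx = rng.integers(0, n, size=batch_size)
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch_size
        velocity = momentum * velocity + grad   # heavy-ball momentum buffer
        theta -= gamma0 * t ** (-alpha) * velocity
        theta_bar += (theta - theta_bar) / t    # averaging is what enables inference
    return theta_bar

# Toy usage.
rng = np.random.default_rng(1)
X = rng.normal(size=(20_000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=20_000)
print(sgdm_averaged(X, y))
```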

An Adaptive ABC-MCMC with Global-Local

Proposals

Xuefei Cao

Nankai University

Abstract: In this paper, we address the challenge of

Markov Chain Monte Carlo (MCMC) algorithms

within the approximate Bayesian Computation (ABC)

framework, which often get trapped in local optima

due to their inherent local exploration mechanism,

particularly for multi-modal distributions and small

ABC thresholds. To remedy this, we propose a novel

global-local MCMC algorithm that combines the "exploration" capabilities of global proposals with the "exploitation" finesse of local proposals. We integrate

iterative importance resampling into

the likelihood-free framework to establish an effective

global proposal distribution, and adapt a normalizing

flow-based probabilistic distribution learning model to

iteratively improve the algorithmic performance. Furthermore, in order to optimize the efficiency of the

local sampler and overcome the limitations caused by

random walk behavior in high-dimensional space, we

utilize Langevin dynamics to propose candidate parameters and utilize common random numbers

(CRN) to enhance the stability of the gradient estimation. We numerically demonstrate that our method is

able to improve sampling efficiency and achieve more

reliable convergence for complex posteriors.

Joint work with Shijia Wang, Yongdao Zhou.

Large Deviation Algorithms for the Thresholding

Bandit Problem

Shan Dai

The Chinese University of Hong Kong (Shenzhen)

Abstract: The Thresholding Bandit problem (TB) is a

popular sequential decision-making problem, which

aims at identifying the systems whose means are

greater than a threshold. Instead of working on the

upper bound of a loss function, our approach stands

out from conventional practices by directly minimizing the loss itself and thus offering an asymptotically

optimal solution theoretically, by leveraging the principles of large deviation theory. To make the asymptotically optimal allocation rule implementable, we

propose a parameter-free Large Deviation (LD) algorithm based on the sample-based approximation to the

unknown rate function. An Approximated Large Deviation (ALD) algorithm is further proposed as a supplement to improve the computation efficiency by

using Taylor expansion approximation. Extensive

experiments are conducted to validate the superiority

of the algorithms compared to existing methods, and

demonstrate their broader applications to discrete

cases and various loss functions.

Joint work with Manjing Zhang, Guangwu Liu, Yulin

He, Philippe Fournier-viger, Zhexue Huang.
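As a toy illustration of a plug-in large-deviation allocation in the Gaussian case (not the LD or ALD algorithms themselves): the rate for misclassifying arm i relative to threshold τ is (μᵢ − τ)²/(2σ²), and each new sample goes to the arm whose current exponent nᵢÎᵢ is smallest; all names and constants below are illustrative assumptions.

```python
# A toy sketch for Gaussian arms: estimate each arm's large-deviation rate
# I_i = (mu_i - tau)^2 / (2 sigma^2) from samples and allocate the next pull
# to the arm with the smallest current exponent n_i * I_i_hat.
import numpy as np

def threshold_bandit(means, tau, sigma=1.0, budget=5000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for i in range(k):                                    # one initial pull per arm
        sums[i] += rng.normal(means[i], sigma); counts[i] += 1
    for _ in range(budget - k):
        mu_hat = sums / counts
        rate_hat = (mu_hat - tau) ** 2 / (2 * sigma ** 2) # plug-in LD rate per arm
        i = int(np.argmin(counts * np.maximum(rate_hat, 1e-12)))
        sums[i] += rng.normal(means[i], sigma); counts[i] += 1
    mu_hat = sums / counts
    return mu_hat > tau, counts                           # decisions and allocations

decisions, counts = threshold_bandit(means=[0.2, 0.45, 0.55, 0.9], tau=0.5)
print(decisions, counts)   # arms near the threshold receive most of the budget
```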

Computationally Efficient Methods for Estimating

Co-heritability of Multivariate Phenotypes Using

Biobank Data

Yuhao Deng

University of Michigan

Abstract: Biobank data provide a rich source to understand the degree of co-heritability among multiple

disease phenotypes from shared genetic etiology.

However, due to a large number of disease phenotypes

of heterogeneous types, estimating the co-heritability is

both statistically and computationally challenging. In

this work, we propose a joint model with latent polygenic effects for all the phenotypes, in which the random effects due to genetic etiology are modelled separately from the latent environmental effects. Computationally, we propose a two-stage procedure to first

estimate the heritability and environmental correlation

for a single phenotype, and then estimate the

co-heritability between any two phenotypes by maximizing a pairwise pseudo-likelihood function. We

extract nuclear families and apply divide-and-conquer

approaches so that our algorithm can easily scale to

analyzing the biobank data. Our numerical algorithms

involve at most five-dimensional integration regardless of the number of the disease phenotypes, so are

computationally efficient and reliable. Finally, the

proposed method is illustrated through simulations

and application to the UK biobank data.

Joint work with Yuanjia Wang, Donglin Zeng.

Contributed Session CS011: Causal Inference and

Applications

Efficient Nonparametric Inference of Causal Mediation Effects with Nonignorable Missing Confounders

Jiawei Shan

Renmin University of China

Abstract: We consider causal mediation analysis with

confounders subject to nonignorable missingness in a

nonparametric framework. Our approach relies on

shadow variables that are associated with the missing

confounders but independent of the missingness

mechanism. The mediation effect of interest is shown

to be a weighted average of an iterated conditional

expectation, which motivates our Sieve-based Iterative Outward (SIO) estimator. We derive the rate of

convergence and asymptotic normality of the SIO

estimator, which do not suffer from the ill-posed inverse problem. Essentially, we show that the asymptotic normality is not affected by the slow convergence rate of nonparametric estimators of nuisance

functions. Moreover, we demonstrate that our estimator is locally efficient and attains the semiparametric

efficiency bound under certain conditions. We accurately depict the efficiency loss attributable to missingness and identify scenarios in which efficiency loss

is absent. We also propose a stable and

easy-to-implement approach to estimate asymptotic

variance and construct confidence intervals for the

mediation effects. Finally, we evaluate the finite-sample performance of our proposed approach

through simulation studies, and apply it to the CFPS

data to show its practical applicability.

Joint work with Wei Li, Chunrong Ai.

Multiply Robust Estimation for General Multivalued Treatment Effects with Missing Outcomes

Xiaorui Wang

Southern University of Science and Technology

Abstract: Interventions with multivalued treatments

are common in medical and health research, leading to

a growing interest in developing estimators for multivalued treatment effects using observational data. In

practice, missing outcome data is a common occurrence, which poses significant challenges to the estimation of treatment effects. In this paper, we propose

two multiply robust estimators for estimating the

general multivalued treatment effects, with outcome

missing at random, including the average treatment

effect (ATE), quantile treatment effect (QTE) and

expectile treatment effect (ETE). The resulting estimators are root-n consistent and asymptotically normal, provided that the candidate models for the propensity score and the probability of being observed or

outcome regression contain the correct model. Extensive simulation studies are conducted to investigate

the finite-sample performance of the proposed estimators. The proposed methods are also applied to a

real-world dataset of CHARLS with about 21% outcome missing, estimating the ATE, QTE and ETE of

three types of social activities on the cognitive function of middle-aged and elderly people in China.

Joint work with Jing Yang, Yinfeng Wang, Yanlin

Tang, Jian Qing Shi.

Matching-Based Policy Learning with Observational Data

Xuqiao Li

Sun Yat-sen University

Abstract: Treatment heterogeneity is ubiquitous

across many areas, motivating practitioners to search

for the optimal policy that maximizes the expected

outcome based on individualized characteristics.

However, most existing policy learning methods rely

on weighting-based approaches, which may suffer

from high instability in observational studies. To enhance the robustness of the estimated policy, we propose a matching-based estimator of the policy improvement upon a randomized baseline and optimize

this estimation over a policy class after bias correction.

We derive a non-asymptotic high probability bound

for the regret of the learned policy and show that the

convergence rate is almost n^{-1/2}. The promising

finite sample performance of the proposed method is

demonstrated in extensive simulation studies and a

real data application.

Joint work with Ying Yan.

Prospective and Retrospective Causal Inferences

Based on the Potential Outcomes Framework

Chao Zhang

Beijing Technology and Business University

Abstract: In this paper, we discuss both prospective

and retrospective causal inference, building on Neyman's potential outcomes framework. For prospective

causal inference, we review criteria for confounders

and surrogates to avoid the Yule-Simpson paradox and

the surrogate paradox, respectively. Turning to retrospective causal inference, we introduce the concepts

of posterior causal effects given observed evidence to

quantify the causes of effects for the case with multiple causes that could affect each other. The posterior

causal effects provide a unified framework for deducing both effects of causes in prospective causal inference and causes of effects in retrospective causal inference. We compare the medical diagnosis approaches based on Bayesian posterior probabilities and posterior causal effects for classification and attribution.

Joint work with Zhi Geng, Shaojie Wei, Xueli Wang,

Chunchen Liu.

Causal Attribution with Confidence


Ping Zhang

Peking University

Abstract: To assess the causes of effects, researchers

have defined attribution parameters such as the probability of causation (PC), which is a parameter of the

joint distribution of potential outcomes. Unlike the

average causal effect, PC cannot be identified even in

randomized experiments or studies without unobserved confounding. Identification of PC requires

additional assumptions, typically the monotonicity or

no-prevention assumption, which may fail to hold in

many practical cases. Without the monotonicity assumption, PC is partially identified with lower and

upper bounds. However, these bounds are non-smooth

and the standard estimation and inference theory cannot be directly applied, which hinders the practical use

of PC to a certain extent. In this paper, we consider

conducting statistical inference for PC based on its

sharp bounds in two settings of randomized experiments. We develop an easy-to-implement method for

constructing a valid confidence interval for PC based

on the sharp bound without covariate information. We

propose a novel estimator of the sharp bound with

additional covariates, and derive the result for the

proposed estimator to be asymptotically normal. The

proposed estimator is based on the influence function

and allows the use of nonparametric or machine

learning methods for estimating nuisance functions by

using cross-fitting. Confidence intervals for the partially identified PC are further constructed based on

the proposed estimator.

Joint work with Ruoyu Wang, Wang Miao.
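For reference, the sharp bounds without covariates in the binary-treatment, binary-outcome randomized setting are commonly stated in the Tian–Pearl form: writing PC = P(Y_{X=0} = 0 | X = 1, Y = 1), exogeneity of X gives max{0, [P(Y=1|X=1) − P(Y=1|X=0)] / P(Y=1|X=1)} ≤ PC ≤ min{1, P(Y=0|X=0) / P(Y=1|X=1)}. Whether these are exactly the bounds underlying the two settings studied in this talk is our assumption for illustration; the abstract itself does not spell them out.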

Contributed Session CS012: Recent Advances in

Statistical Inference

Statistical Inference for Power Autoregressive

Conditional Duration Models with Stable Innovations

Yuxin Tao

Tsinghua University

Abstract: The paper proposes a first-order power

autoregressive conditional duration model with positive alpha-stable innovations (sPACD), and studies the

properties of maximum likelihood estimation within a

unified framework of stationary and explosive cases.

The proposed model effectively addresses the excess

kurtosis in the durations. Further, the power form of

the model structure mitigates the issue of overpredicting short durations in the standard ACD model. The

estimation of the asymptotic covariance matrix is discussed. Strict stationarity test statistics and a modified

Kolmogorov-type test statistic are established for

stationarity testing and diagnostic checking in both

stationary and explosive scenarios. Monte Carlo simulation studies illustrate the good performance of the

MLE in finite samples. An empirical example is analyzed to illustrate the usefulness of sPACD models.

Joint work with Dong Li.

Exploring Causal Effects of Hormone- and Radio-Treatments in an Observational Study of

Breast Cancer under Semi-competing Risks Setting

Tonghui Yu

Nanyang Technological University

Abstract: Breast cancer patients after surgery may

suffer from relapse or death in the duration of follow-up. This phenomenon, known as semi-competing

risk, demands advanced statistical tools for unbiased

analysis. Despite progress in estimation and inference

within semi-competing risks regression, its application

to causal inference is still in its early stages. This article aims to establish a frequentist and semi-parametric

framework based on copula models, facilitating valid

causal inference, net quantity estimation, and sensitivity analysis for unmeasured factors under

right-censored semi-competing risks data. We also

study the non-parametric identification and propose

novel procedures to enhance parameter estimation. We

apply the proposed framework to a breast cancer study

and detect the time-varying causal effects of hormone- and radio-treatments on patients' relapse-free survival

and overall survival.

Joint work with Mengjiao Peng, Yifan Cui, Elynn

Chen, Chixiang Chen.

A Statistical Review on the Optimal Fingerprinting

Approach in Climate Change Studies

Hanyue Chen


Peking University

Abstract: We provide a statistical review of the "optimal fingerprinting" approach presented in Allen and Tett (1999) in light of the severe criticism of McKitrick (2022). Our review finds that the "optimal fingerprinting" approach would survive much of McKitrick (2022)'s criticism by enforcing two conditions

related to the conduct of the null simulation of the

climate model, and the accuracy of the null setting

climate model. The conditions we proposed are simpler and easier to verify than those in McKitrick

(2022). We provide additional remarks on the residual

consistency test in Allen and Tett (1999), showing that

it is operational for checking the agreement between

the residual covariance matrices of the null simulation

and the physical internal variation under certain conditions. We further explain why the Feasible Generalized Least Squares method, much advocated by McKitrick (2022), is not regarded as operational

by geophysicists.

Joint work with Song Xi Chen.

A Novel Approach for Identifying Genetic Associations with Heterogeneous Cancer Subtypes Risk

Sheng Fu

Nankai University

Abstract: Breast cancer is a complex disease with

diverse molecular subtypes, each with distinct etiologies, clinical presentations, and outcomes. Detecting

the association between disease subtypes and common

germline variants is a challenging task due to the disease's heterogeneity. Existing methods are less effective in the presence of interaction effects, and they are

computationally intensive. To address this issue, we

propose a novel model, named TOPO, that efficiently

detects variants exhibiting subtype heterogeneity. Our

comprehensive test procedure combines three different model structures, including fixed-effect and random-effect two-stage polytomous models, and uses a

fast and powerful procedure to combine three

P-values. Through extensive simulation studies, we

show that TOPO has well-controlled type I error,

and superior performance in statistical power and

computational time. We apply TOPO to the largest genome-wide association study to date, comprising 138,209

breast cancer cases and 121,663 controls from the

Breast Cancer Association Consortium. After filtering

out known risk loci, we identified eight novel variants

(P-value < 5×10^-8). Our findings highlight the importance of considering tumor heterogeneity in identifying new loci, enhancing our understanding of

breast cancer's etiologic heterogeneity, and informing

subtype-specific genetic scores for precision prevention.

Joint work with Xihao Li, Zilin Li, Jin Jin, Kai Yu,

Haoyu Zhang.

Transfer Learning with General Estimating Equations

Han Yan

Peking University

Abstract: We consider statistical inference for parameters defined by general estimating equations under the covariate shift transfer learning. Different from

the commonly used density ratio weighting approach,

we undertake a set of formulations that make the statistical inference semiparametrically efficient yet simple to implement. It starts with reconstructing the estimating

equations to make them Neyman orthogonal, which

facilitates more robustness against errors in the estimation of two key nuisance functions, the density

ratio and the conditional mean of the moment function.

We present a divergence-based method to estimate the

density ratio function, which is amenable to machine

learning algorithms, including deep learning. To address the challenge that the conditional mean is parameter-dependent, we adopt a nonparametric multiple-imputation strategy that avoids regression at all possible parameter values. With the estimated nuisance functions and the orthogonal estimating equation, the inference for the target parameter is formulated via the empirical likelihood without sample splitting. We show that the proposed estimator attains the semiparametric efficiency bound, and that inference can be conducted via Wilks' theorem. The

proposed method is further evaluated by simulations

and an empirical study on a transfer learning inference

for ground-level ozone pollution.


Joint work with Song Xi Chen.
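
As background on the density-ratio nuisance function, the sketch below uses the generic probabilistic-classification trick for covariate shift; the talk's divergence-based estimator and Neyman-orthogonal estimating equations are not reproduced here, and the toy data are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x_source = rng.normal(0.0, 1.0, size=(1000, 2))      # source covariates
x_target = rng.normal(0.5, 1.2, size=(1000, 2))      # shifted target covariates

# Label source as 0 and target as 1, then fit a classifier on the pooled data.
x_pool = np.vstack([x_source, x_target])
z = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
clf = LogisticRegression(max_iter=1000).fit(x_pool, z)

# density ratio p_target(x)/p_source(x) = P(z=1|x)/P(z=0|x) * n_source/n_target.
p = clf.predict_proba(x_source)[:, 1]
ratio = p / (1.0 - p) * (len(x_source) / len(x_target))
print("estimated density ratio on source points (first 5):", np.round(ratio[:5], 3))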

July 13, 08:30-11:20

Plenary Talk 4: Build an End-to-End Scalable and

Interpretable Data Science Ecosystem by Integrating Statistics, ML, and Domain Sciences

Xihong Lin

Harvard University

Abstract: The data science ecosystem encompasses

data fairness, statistical and ML methods and tools, interpretable data analysis, and trustworthy decision-making. Rapid advancements in ML have revolutionized data utilization and enabled machines to

learn from data more effectively. Statistics, as the

science of learning from data while accounting for

uncertainty, plays a pivotal role in addressing complex

real-world problems and facilitating trustworthy decision-making. In this talk, I will discuss the challenges

and opportunities involved in building an end-to-end

scalable and interpretable data science ecosystem that

integrates statistics, ML, and domain science. I will

illustrate key points using the analysis of whole genome sequencing data and electronic health records

by discussing a few scalable and interpretable statistical and ML methods, tools and data science resources.

This talk aims to ignite proactive and

thought-provoking discussions, foster collaboration,

and cultivate open-minded approaches to advance

scientific discovery.


Plenary Talk 5: Statistics and its Applications in

Forensic Science and the Criminal Justice System

Alicia Carriquiry

Iowa State University

Abstract: Statistical thinking should play a critical

role in the civil and criminal justice systems in the

United States, yet many forensic methods that are still

admitted in US courts have no scientific or statistical

justification. Forensic applications present unique

challenges for statisticians. For example, much of the

data that arise in forensics are non-standard, so even

defining analytical variables may require

out-of-the-box thinking. As a result, the usual statistical approaches may not enable addressing many of the

questions of interest to jurors, legal professionals and

forensic practitioners. Today's presentation introduces

some of the statistical and algorithmic methods proposed by CSAFE (Center for Statistics and Applications in Forensic Evidence, www.forensicstats.org)

researchers that have the potential to impact forensic

practice in the US. Two examples are used for illustration: the analysis of questioned handwritten documents and of marks imparted by firearms on bullets or

cartridge cases. In both examples, the question we

address is one of source: do two or more items have

the same source? In the first case, we apply "traditional" statistical modeling methods, while in the

second case, we resort to algorithmic approaches.

Much of the research carried out in CSAFE is collaborative and, while mission-driven, is also academically

rigorous and novel.

Plenary Talk 6: Generative Adversarial Learning

with Optimal Input Dimension and Its Adaptive

Generator Architecture

Huazhen Lin

Southwestern University of Finance and Economics

Abstract: In this talk, we investigate the impact of the

input dimension on the generalization error in generative adversarial networks (GANs). We first provide

both theoretical and practical evidence to validate the

existence of an optimal input dimension (OID) that

minimizes the generalization error. Then, to identify

the OID, we introduce a novel framework called generalized GANs (G-GANs), which includes existing

GANs as a special case. By incorporating the group

penalty and the architecture penalty developed in the

paper, the proposed G-GANs have several intriguing

features. First, our framework offers adaptive dimensionality reduction from the initial dimension to a

dimension necessary for generating the target distribution. Second, this reduction in dimensionality also

shrinks the required size of the generator network

architecture, which is automatically identified by the

proposed architecture penalty. Both reductions in

dimensionality and the generator network significantly improve the stability and the accuracy of the estimation and prediction. Theoretical support for the

consistent selection of the input dimension and the

generator network is provided. Third, the proposed

algorithm involves an end-to-end training process, and

the algorithm allows for dynamic adjustments between the input dimension and the generator network

during training, further enhancing the overall performance of the G-GANs. Extensive experiments conducted with simulated and benchmark data demonstrate the superior performance of the proposed

G-GANs. Moreover, the features generated based on

the input dimensions identified by G-GANs align with

visually significant features.

Joint work with Zhiyao Tan and Ling Zhou.

July 13, 14:00-15:40

Invited Session IS053: Recent Advances in Statistical Learning

Dynamic Prediction with Individualized Feature

Selection

Lu Tian

Stanford University

Abstract: Today, physicians have access to a wide

array of tests for diagnosing and prognosticating

medical conditions. Ideally, they would apply a

high-quality prediction model, utilizing all relevant

features as input, to facilitate appropriate decisionmaking regarding treatment selection or risk assessment. However, not all features used in these prediction models are readily available to patients and physicians without incurring some costs. In practice, predictors are typically gathered as needed in a sequential

manner, while the physician continually evaluates

information dynamically. This process continues until

sufficient information is acquired, and the physician

gains reasonable confidence in making a decision.

Importantly, the prospective information to collect

may differ for each patient and depend on the predictor values already known. In this paper, we present a

novel dynamic prediction rule designed to determine

the optimal order of acquiring prediction features in

predicting a clinical outcome of interest. The objective

is to maximize prediction accuracy while minimizing

the cost associated with measuring prediction features

for individual subjects. To achieve this, we employ

reinforcement learning, where the agent must decide

on the best action at each step: either making a clinical

decision with available information or continuing to

collect new predictors based on the current state of

knowledge. To evaluate the efficacy of the proposed

dynamic prediction strategy, extensive simulation

studies have been conducted. Additionally, we provide

two real data examples to illustrate the practical application of our method.

Joint work with Bryan Cai, Ying Cui and Haoda Fu.

High-Order Statistical Expansion in Functional

Estimation

Cun-Hui Zhang

Rutgers University

Abstract: We study the estimation of a given function

of an unknown high-dimensional mean vector based

on independent observations. The key element of our

approach is a new method which we call High-Order

Degenerate Statistical Expansion. It leverages the use

of classical multivariate Taylor expansion and degenerate U-statistic and yields an explicit formula. In the

univariate case, the formula expresses the error of the

proposed estimator as the sum of a Taylor-Hoeffding

series and an explicit remainder term in the form of

the Riemann-Liouville integral as in the Taylor expansion around the true mean vector. The Taylor-Hoeffding series replaces the power of the average

noise in the classical Taylor series by its degenerate

version to give a Hoeffding decomposition as a

weighted sum of degenerate U-products of the noises.

A similar formula holds in general dimension. This

makes the proposed method a natural statistical version of the classical Taylor expansion. The proposed

estimator can be viewed as a jackknife estimator of

the Taylor-Hoeffding series and can be approximated

by bootstrap. Thus, the jackknife, bootstrap and Taylor

expansion approaches all converge to the proposed

estimator. We develop risk bounds for the proposed

estimator under proper moment conditions and a central limit theorem under a second moment condition

even for expansions of order higher than two.

We apply this new method to several smooth and


non-smooth problems under minimum moment constraints.

Joint work with Fan Zhou and Ping Li.
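
For orientation, the one-dimensional sketch below shows the classical second-order (delta-method-type) bias correction for estimating f(mu) from an i.i.d. sample, a low-order special case of the kind of Taylor-based improvement the talk develops; it is not the proposed Taylor-Hoeffding estimator.

import numpy as np

def plugin_and_corrected(x, f, f2):
    # Plug-in f(xbar) has bias roughly f''(mu) * sigma^2 / (2n);
    # the corrected estimator subtracts an estimate of that term.
    n = len(x)
    xbar = x.mean()
    s2 = x.var(ddof=1)
    return f(xbar), f(xbar) - 0.5 * f2(xbar) * s2 / n

rng = np.random.default_rng(2)
f, f2 = np.exp, np.exp                     # f(mu) = exp(mu), so f'' = exp as well
mu, n, reps = 1.0, 50, 20000
plug, corr = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.normal(mu, 1.0, n)
    plug[r], corr[r] = plugin_and_corrected(x, f, f2)

print("true value    :", np.exp(mu))
print("plug-in bias  :", plug.mean() - np.exp(mu))
print("corrected bias:", corr.mean() - np.exp(mu))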

Navigating the Societal Landscape of Generative

AI: Opportunities and Challenges from a Statistical Perspective

Weijie Su

University of Pennsylvania

Abstract: Generative AI, particularly large language

models, has rapidly emerged as a transformative innovation in data science and machine learning. As

these technologies increasingly influence human decision-making processes, they raise important societal

questions that demand careful consideration. In this

talk, we explore three key concerns from a statistical

viewpoint. First, we discuss the challenge of creating

fair AI systems that adequately represent and serve

minority groups, ensuring equitable outcomes across

diverse populations. Second, we delve into the complex task of reliably combating misinformation by

developing robust watermarking techniques for text

generated by large language models, aiming to maintain the integrity of information in the public sphere.

Third, we examine the intricate issue of using potentially copyrighted data to train AI models, navigating

the balance between leveraging valuable resources

and respecting intellectual property rights. Throughout

this talk, we will not only tackle these pressing challenges posed by generative AI but also highlight the

substantial opportunities they present for the field of

statistics to make meaningful contributions to the

responsible development of generative AI.

Network Regression and Supervised Centrality

Estimation

Haipeng Shen

The University of Hong Kong

Abstract: The centrality in a network is often used to

measure nodes’ importance and model network effects

on a certain outcome. Empirical studies widely adopt

a two-stage procedure, which first estimates the centrality from the observed noisy network and then infers the network effect from the estimated centrality,

even though this practice lacks theoretical justification. We

propose a unified modeling framework, under which

we first prove the shortcomings of the two-stage procedure, including the inconsistency of the centrality

estimation and the invalidity of the network effect

inference. Furthermore, we propose a supervised centrality estimation methodology, which aims to simultaneously estimate both centrality and network effect.

The advantages in both regards are proved theoretically and demonstrated numerically via extensive

simulations and a case study in predicting currency

risk premiums from the global trade network.
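
A toy Python sketch of the conventional two-stage pipeline analyzed in the talk (estimate centrality from a noisy network, then regress the outcome on it) is given below; the network, outcome model, and variable names are illustrative, and the proposed supervised centrality estimator is not implemented here.

import numpy as np

rng = np.random.default_rng(3)
n = 200
upper = np.triu((rng.random((n, n)) < 0.05).astype(float), 1)
true_a = upper + upper.T                                  # latent network
noise = np.triu((rng.random((n, n)) < 0.02).astype(float), 1)
obs_a = np.clip(true_a + noise + noise.T, 0, 1)           # noisily observed network

def eigenvector_centrality(a, iters=200):
    # Power iteration for the leading eigenvector of the adjacency matrix.
    v = np.ones(a.shape[0])
    for _ in range(iters):
        v = a @ v
        v /= np.linalg.norm(v)
    return v

c_true = eigenvector_centrality(true_a)
y = 2.0 * c_true + rng.normal(0, 0.05, n)                 # outcome driven by true centrality

# Stage 2: regress the outcome on the centrality estimated from the noisy network.
c_hat = eigenvector_centrality(obs_a)
design = np.column_stack([np.ones(n), c_hat])
beta = np.linalg.lstsq(design, y, rcond=None)[0]
print("two-stage network-effect estimate:", beta[1], "(true effect 2.0)")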

Invited Session IS022: Frontier of Statistics Machine Learning

An Adaptive Transfer Learning Framework for

Functional Classification

Yang Bai

Shanghai University of Finance and Economics

Abstract: In this paper, we study the transfer learning

problem in functional classification, aiming to improve the classification accuracy of the target data by

leveraging information from related source datasets.

To facilitate transfer learning, we propose a novel

transferability function tailored for classification

problems, enabling a more accurate evaluation of the

similarity between source and target dataset distributions. Interestingly, we find that a source dataset can

offer more substantial benefits under certain conditions than another dataset with an identical distribution to the target dataset. This observation renders the

commonly-used debiasing step in the parameter-based

transfer learning algorithm unnecessary for the classification problem under some circumstances. In particular, we propose two adaptive transfer learning algorithms based on the functional Distance Weighted

Discrimination (DWD) classifier for scenarios with

and without prior knowledge regarding informative

sources. Furthermore, we establish the upper bound

on the excess risk of the proposed classifiers, making the statistical gain from transfer learning mathematically provable. Simulation studies are conducted to

thoroughly examine the finite-sample performance of


the proposed algorithms. Finally, we apply the proposed method to Beijing air-quality data and significantly improve the prediction of the PM2.5 level

of a target station by effectively incorporating information from source datasets.

Modeling Spatio-Temporal Extremes with Conditional Variational Autoencoders

Likun Zhang

University of Missouri

Abstract: Extreme weather events are widely studied

in the fields of agriculture, climatology, ecology, and

hydrology, to name a few. Enhanced scientific understanding of the spatio-temporal dynamics of extreme

events could significantly improve policy formulation

and decision-making within these domains. We formulate a novel approach to model spatio-temporal

extremes by conditioning on a time series (e.g., the El

Niño-Southern Oscillation (ENSO) index) via a conditional variational autoencoder (extreme-CVAE). The prominent alignment of extremal dependences showcases the model's ability to serve as a spatio-temporal

extreme emulator. Along with a decoding path, a

convolutional neural network was built to investigate

the relationship between the time series dynamics and

parameters within the latent space, thereby inheriting

the intrinsic temporal dependence structures. An extensive simulation validated the effectiveness and time

efficiency of the model. We conducted an analysis of

the monthly precipitation data which adequately

demonstrates both the time efficiency and model performance of our approach in real-world scenarios.

Joint work with Xiaoyu Ma and Christopher K.

Wikle.

Bayesian Biclustering and Its Application in Education Data Analysis

Weining Shen

University of California, Irvine

Abstract: We propose a novel nonparametric Bayesian item response theory model to estimate clusters at

the question level while simultaneously allowing for

heterogeneity at the examinee level under each question cluster, characterized by the mixture of Binomial

distributions. We present some theoretical results that

guarantee the identifiability of the proposed model and show that the model can correctly identify

question-level clusters asymptotically. We also provide a tractable sampling algorithm to obtain valid

posterior samples from the proposed model. Compared to the existing methods, the model manages to

reveal the multi-dimensionality of the examinee's

proficiency level in handling different types of questions parsimoniously by imposing a nested clustering

structure. The proposed model is evaluated via a series of simulations as well as applied to an English

proficiency assessment data set. This data analysis

example nicely illustrates how the model can be used

by test makers to distinguish different types of students and aid in the design of future tests.

Topological Analysis of Seizure-Induced Alterations in the Causal Pathways of Effective Brain

Connectivity

Anass El Yaagoubi

King Abdullah University of Science & Technology

Abstract: Traditional Topological Data Analysis

(TDA) methods, such as Persistent Homology (PH),

rely on distance measures (e.g., cross-correlation,

coherence, partial correlation, and partial coherence)

that are symmetric by definition. This approach successfully examined the topological patterns in functional brain connectivity in neuroscience. However, it

overlooked the directional dynamics crucial for understanding effective brain connectivity. This paper

proposes the Causality-Based Topological Ranking

(CBTR) method, which integrates Causal Inference

(CI) to assess effective brain connectivity with Hodge

Decomposition (HD) to rank brain regions based on

their mutual influence. The CBTR method demonstrated its ability to identify causal pathways in simulated multivariate time series data accurately. By applying the CBTR method to seizure-related EEG signals, our approach pinpoints key brain regions impacted by seizures, offering deeper insights into the

brain's hierarchical structure and the dynamics of

causal pathways during these events.

Joint work with Moo K. Chung and Hernando Ombao.

Invited Session IS036: New Advances in Complex

Data Analyses

Federated Learning of Robust Individualized Decision Rules with Application to Heterogeneous

Multi-Hospital Sepsis Population

Lu Tang

University of Pittsburgh

Abstract: This paper introduces a new objective

function and federated learning (FL) algorithm for

deriving individualized decision rules (IDRs) that are

robust to distributional uncertainty in heterogeneous

data sources. It is motivated by the need to uniformly

improve decision-making across multiple hospitals of

a single health system. Traditional approaches assume

that data are sampled from a single population of interest. With multiple hospitals that vary in patient

populations and treatments, an IDR that is effective in

one hospital may not be as effective in another. Due to

distributional heterogeneity, the performance achieved

by a globally optimal IDR varies greatly across sites,

preventing it from being safely applied to unseen sites.

Furthermore, data from various hospitals cannot be

pooled due to privacy concerns. To address these

challenges, we developed an FL framework to learn

IDRs from distributed data. The proposed framework

introduces a conditional maximin objective to enhance

individual outcomes across sites, ensuring robustness

against site variations. The proposed method effectively improves the generalizability of IDRs for the

management of sepsis in a hospital network.

Fixed and Random Covariance Regression Analyses

Tao Zou

The Australian National University

Abstract: Covariance regression analysis is an approach to linking the covariance of responses to a set

of explanatory variables X, where X can be a vector,

matrix, or tensor. Most of the literature on this topic

focuses on the "Fixed-X" setting and treats X as nonrandom. By contrast, treating the explanatory variables X as random, namely the "Random-X" setting, is often

more realistic in practice. This article aims to fill this

gap in the literature on the estimation and model assessment theory for Random-X covariance regression

models. Specifically, we construct a new theoretical

framework for studying the covariance estimators

under the Random-X setting, and we demonstrate that

the quasi-maximum likelihood estimator and the

weighted least squares estimator are both consistent

and asymptotically normal. In addition, we develop

pioneering work on the model assessment theory of

covariance regression. In particular, we obtain the

bias-variance decompositions for the expected test

errors under both the Fixed-X and Random-X settings.

We show that moving from a Fixed-X to a Random-X

setting can increase both the bias and the variance in

expected test errors. Subsequently, we propose estimators of the expected test errors under the Fixed-X

and Random- X settings, which can be used to assess

the performance of the competing covariance regression models. The proposed estimation and model assessment approaches are illustrated via extensive simulation experiments and an empirical study of stock

returns in the US market.

Latent Subgroup Analysis with Observational Data

Xinzhou Guo

The Hong Kong University of Science and Technology

Abstract: When working with candidate subgroups

defined by latent memberships, we must analyze the

latent subgroups in a valid and computationally feasible way. The classical one-stage framework, which

models the joint likelihood of all variables, may not be

feasible with observational data when there are many

potential confounders. The two-stage framework,

which estimates the latent class model and performs

subgroup analysis with estimated subgroup memberships, can accommodate potential confounders but

may suffer from bias due to the misclassification of

latent subgroup memberships. In this paper, we investigate the maximum misclassification rate that a valid

two-stage framework can tolerate in the presence of

potential confounders and, building on spectral clustering,

propose a two-stage approach to achieve the desired


misclassification rate and estimate and infer latent

subgroup effects consistently with observational data

in broad practical scenarios. The proposed method can

accommodate high dimensional potential confounders

and is computationally efficient. We demonstrate the

merit of the proposed method through simulation and

real data analysis.

Joint work with Yuanhui Luo, Yuqi Gu.

Invited Session IS037: New Statistical Methods for

Causal Inference and Hidden Factor Learning

Online Statistical Inference for Low-Rank Tensor

Learning

Wei Sun

Purdue University

Abstract: The study of online decision-making problems that leverage contextual information has drawn

notable attention due to their significant applications

in fields ranging from healthcare to autonomous systems. In modern applications, such context format can

be rich and can often be formulated as higher-order

tensors. Moreover, while existing online decision

algorithms mainly focus on reward maximization, less

attention has been paid to statistical inference. To fill

in these gaps, in this work, we consider an online

decision-making problem with high-order contextual

information where the true model parameters have a

low-rank structure. We propose a fully online procedure to conduct statistical inference with adaptively

collected data. The low-rank structure of the model

parameter and the adaptive nature of the data collection process make this task difficult: standard low-rank

estimators are biased and cannot be obtained in a sequential manner while existing inference approaches

in sequential decision-making algorithms fail to account for the low-rankness and are also biased. To

address these, we introduce a new online debiasing

procedure to simultaneously handle both sources of

bias.

Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community

Structures

Shuo Chen

University of Maryland

Abstract: Confirmatory factor analysis (CFA) is a

statistical method for identifying and confirming the

presence of latent factors among observed variables

through the analysis of their covariance structure. Compared to alternative factor models, CFA

offers interpretable common factors with enhanced

specificity and a more adaptable approach to modeling covariance structures. However, the application of

CFA has been limited by the requirement for prior

knowledge about "non-zero loadings" and by the lack

of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory

factor analysis (SCFA) model that attempts to alleviate these limitations. SCFA automatically specifies

"non-zero loadings" by learning the network structure

of the large covariance matrix of observed variables,

and then offers closed-form estimators for factor

loadings, factor scores, covariances between common

factors, and variances between errors using the likelihood method. Therefore, SCFA is applicable to

high-throughput datasets (e.g., hundreds of thousands

of observed variables) without requiring prior

knowledge about "non-zero loadings". Through an

extensive simulation analysis benchmarking against

standard packages, SCFA exhibits superior performance in estimating model parameters with a much

reduced computational time. We illustrate its practical

application through factor analysis on a

high-dimensional RNA-seq gene expression dataset.

Joint work with Yifan Yang.

Assumption Matters: A Semiparametric Analysis

of the Average Treatment Effect on the Treated

Jiwei Zhao

University of Wisconsin-Madison

Abstract: In this paper, we consider estimation of

average treatment effect on the treated (ATT), an interpretable and relevant causal estimand to policy

makers when treatment assignment is endogenous. By

considering shadow variables that are unrelated to the

treatment assignment but related to the outcomes of interest, we establish identification of the ATT. Then


we focus on efficient estimation of the ATT by characterizing the geometric structure of the likelihood,

deriving the semiparametric efficiency bound for ATT

estimation and proposing an estimator that can

achieve this bound. We rigorously establish the theoretical results of the proposed estimator. The finite

sample performance of the proposed estimator is

studied through comprehensive simulation studies as

well as an application to our motivating study.

Mendelian Randomization Analysis with Pleiotropy-Robust Log-Linear Models for Binary Outcomes

Jinzhu Jia

Peking University

Abstract: Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causality between traits. In dealing with a binary outcome,

there are two challenging barriers on the way toward a

valid MR analysis, namely the inconsistency of the

traditional ratio estimator and the existence of horizontal pleiotropy. Recent MR methods mainly focus

on handling pleiotropy with summary statistics. Many

of them cannot be easily applied to one-sample MR.

We propose two novel individual data-based methods,

respectively named random-effects and fixed-effects

MR-PROLLIM, to surmount both barriers.

MR-PROLLIM adopts risk ratio (RR), which can be

further converted to odds ratio (OR), to define the

causal effect. The random-effects MR-PROLLIM

allows correlated pleiotropy and weaker instruments.

The fixed-effects MR-PROLLIM can be implemented

with only a few selected variants. We demonstrate in

this study that the random-effects MR-PROLLIM

exhibits high statistical power while yielding fewer

false-positive detections than its competitors. The

fixed-effects MR-PROLLIM is less biased than the

classical median estimator and higher powered than

the classical mode estimator. We also found (i) the

traditional ratio method tended to underestimate binary exposure effects to a large extent, and transforming

MR-PROLLIM RR to OR provided better estimates;

(ii) about 26.5% of the real trait pairs we analyzed

were detected to have significant correlated pleiotropy;

(iii) compared to random-effects MR-PROLLIM results, the pleiotropy-sensitive method showed estimated relative biases ranging from -103.7% to 178.0%

for inferred non-zero effects. MR-PROLLIM exhibits

the potential to facilitate a more rigorous and robust

MR analysis for binary outcomes.

Invited Session IS063: Recent Topics in Machine

Learning

A Statistical Analysis of an Image Classification

Problem

Juntong Chen

University of Twente

Abstract: The availability of massive image databases resulted in the development of scalable machine

learning methods such as convolutional neural networks (CNNs) for filtering and processing these data.

While the very recent theoretical work on CNNs focuses on standard nonparametric denoising problems,

the variability in image classification datasets does,

however, not originate from additive noise but from

variation of the shape and other characteristics of the

same object across different images. To address this

problem, we consider a simple supervised classification problem for object detection on grayscale images.

While from the function estimation point of view,

every pixel is a variable and large images lead to

high-dimensional function recovery tasks suffering

from the curse of dimensionality, increasing the number of pixels in our image deformation model enhances the image resolution and makes the object classification problem easier. We propose and theoretically

analyze two different procedures. The first method

estimates the image deformation by support alignment.

Under a minimal separation condition, it is shown that

perfect classification is possible. The second method

fits a CNN to the data. We derive a rate for the misclassification error depending on the sample size and

the number of pixels. Both classifiers are empirically

compared on images generated from the MNIST

handwritten digit database. The obtained results corroborate the theoretical findings.

Joint work with Sophie Langer and Johannes

Schmidt-Hieber.


Inference with Adaptively Collected Data

Koulik Khamaru

Rutgers University

Abstract: Sequential data collection has become a

widely embraced method for bolstering the efficiency

of data-gathering endeavors. However, despite its

evident advantages, this approach frequently brings

forth intricacies to the statistical inference process.

One notable challenge arises with the ordinary least

squares (OLS) estimator within adaptive linear regression models, showcasing non-normal asymptotic

behavior. This phenomenon presents hurdles for precise inference and interpretation. This presentation

addresses two pivotal aspects: a) Illuminating instances and reasons behind the inferential challenges

encountered with adaptively collected data. b) Offering potential strategies for rectifying these issues.

Joint work with Cun-Hui Zhang and Mufang Ying.

Holdout Predictive Checks for Bayesian Model

Criticism

Gemma Moran

Rutgers University

Abstract: Bayesian modeling helps applied researchers articulate assumptions about their data and develop models tailored for specific applications. Tha- nks

to good methods for approximate posterior inference,

researchers can now easily build, use, and revise

complicated Bayesian models for large and rich data. These capabilities, however, bring into focus the

problem of model criticism. Researchers need tools to

diagnose the fitness of their models, to understand

where they fall short, and to guide their revision. In

this paper we develop a new method for Bayesian

model criticism, the Holdout Predictive Check

(HPC). HPCs are built on Posterior Predictive

Checks (PPCs), a seminal method that checks a model by

assessing the posterior predictive distribution on the

observed data. However, PPCs use the data twice, both to calculate the posterior predictive and to evaluate it, which can lead to uncalibrated p-values. HPCs, in

contrast, compare the posterior predictive distribution

to a draw from the population distribution, a heldout

dataset. This method blends Bayesian modeling with

frequentist assessment. Unlike the PPC, we prove that

the HPC is properly calibrated. Empirically, we study

HPC on classical regression, a hierarchical model of

text data, and factor analysis.

Joint work with David Blei and Rajesh Ranganath.
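
A toy sketch of the holdout idea on a conjugate normal model follows; it only illustrates evaluating the posterior predictive on data not used for fitting, not the paper's general HPC construction or its calibration theory. The prior, discrepancy, and data are illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 2.0, 400)                    # observed data
y_train, y_holdout = y[:200], y[200:]

# Conjugate posterior for the mean with known sigma = 2 and a N(0, 10^2) prior.
sigma2, prior_var = 4.0, 100.0
post_var = 1.0 / (1.0 / prior_var + len(y_train) / sigma2)
post_mean = post_var * y_train.sum() / sigma2

def discrepancy(data):
    return data.mean()

# Posterior-predictive replications of a dataset the size of the holdout split.
reps = 2000
mu_draws = rng.normal(post_mean, np.sqrt(post_var), reps)
t_rep = np.array([discrepancy(rng.normal(m, np.sqrt(sigma2), len(y_holdout)))
                  for m in mu_draws])

p_value = np.mean(t_rep >= discrepancy(y_holdout))
print("holdout predictive p-value:", p_value)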

Wasserstein F-Tests for the Fréchet Regression on

the Bures-Wasserstein Manifold

Hongzhe Li

University of Pennsylvania


Abstract: This paper considers the problem of regression analysis with a random covariance matrix as the outcome and a set of Euclidean covariates, in the

framework of Fréchet regression on the Bures-Wasserstein manifold. Such regression problems

have many applications in single cell genomics and

neuroscience, where we have covariance matrices measured over a large set of samples. Fréchet regression on the Bures-Wasserstein manifold is formulated as estimating the conditional Fréchet mean, denoted by m(x). A non-asymptotic √n-rate of convergence (up to log n factors) of the estimated Fréchet mean is obtained uniformly for ‖x‖ ≲ √(log n), which is crucial for deriving the limiting null distribution and power of our proposed statistical test for the null hypothesis of no association. In addition, a central limit theorem for the point estimate m̂(x) is obtained, leading to a test

for covariate effects. The null distribution of the test

statistic is shown to converge to a weighted sum of

independent chi-squares which implies that the proposed test has the desired significance level asymptotically. Also, the power performance of the test is

demonstrated against a sequence of contiguous alternatives. Simulation results show the accuracy of the

asymptotic distributions. The proposed methods are

applied to a single cell gene expression data set that

shows how gene co-expression changes as people age.

Joint work with Haoshu Xu.
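
For reference, the metric underlying this talk is the Bures-Wasserstein distance between covariance matrices; the short sketch below computes it for two sample covariance matrices and is not the proposed regression or test. The matrices used are illustrative.

import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_sq(a, b):
    # d^2(A, B) = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}).
    root_a = sqrtm(a)
    cross = sqrtm(root_a @ b @ root_a)
    return np.trace(a) + np.trace(b) - 2.0 * np.trace(cross).real

rng = np.random.default_rng(5)
x = rng.normal(size=(100, 3))
a = np.cov(x, rowvar=False)       # a sample covariance matrix
b = a + 0.5 * np.eye(3)           # a perturbed comparison matrix
print("d_BW^2(A, B) =", bures_wasserstein_sq(a, b))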


Invited Session IS025: Independence Test and Association Analysis

Test of Conditional Independence in Factor Models via Hilbert-Schmidt Independence Criterion

Qing Cheng

Southwestern University of Finance and Economics

Abstract: This work is concerned with testing conditional independence under a factor model setting. We

propose a novel multivariate test for non-Gaussian

data based on the Hilbert-Schmidt independence criterion (HSIC). Theoretically, we investigate the convergence of our test statistic under both the null and

the alternative hypotheses and devise a bootstrap

scheme to approximate its null distribution, showing

that its consistency is justified. Methodologically, we

generalize the HSIC-based independence test approach to a situation where data follow a factor model

structure. Our test requires no nonparametric smoothing estimation of functional forms including conditional probability density functions, conditional cumulative distribution functions and conditional characteristic functions under the null or alternative, is

computationally efficient and is dimension-free in the

sense that the dimension of the conditioning variable

is allowed to be large but finite. Further extension to

nonlinear, non-Gaussian structure equation models is

also described in detail and asymptotic properties are

rigorously justified. Numerical studies demonstrate

the effectiveness of our proposed test relative to that

of several existing tests.

Joint work with Kai Xu.
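
As a reminder of the HSIC building block, the sketch below computes the plain (unconditional) HSIC statistic with Gaussian kernels and a permutation p-value; the talk's conditional test under a factor-model structure and its bootstrap calibration are different, and the bandwidth and data here are illustrative.

import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def hsic(x, y, bandwidth=1.0):
    # HSIC = tr(K H L H) / n^2 with centering matrix H = I - 11'/n.
    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n
    k, l = gaussian_gram(x, bandwidth), gaussian_gram(y, bandwidth)
    return np.trace(k @ h @ l @ h) / n ** 2

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=(n, 1))
y = x ** 2 + 0.3 * rng.normal(size=(n, 1))        # dependent but uncorrelated

stat = hsic(x, y)
perm = np.array([hsic(x, y[rng.permutation(n)]) for _ in range(200)])
print("HSIC =", stat, " permutation p-value =", np.mean(perm >= stat))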

Testing the Effects of High-dimensional Covariates

via Aggregating Cumulative Covariances

Kai Xu

Anhui Normal University

Abstract: In this article, we test for the effects of

high-dimensional covariates on the response. In many

applications, different components of covariates usually exhibit various levels of variation, which is ubiquitous in high-dimensional data. To simultaneous- ly

accommodate such heteroscedasticity and high dimensionality, we propose a novel test based on an

aggregation of the marginal cumulative covariances,

requiring no prior information on the specific form of

regression models. Our proposed test statistic is

scale-invariant, tuning-free, and convenient to implement. The asymptotic normality of the proposed

statistic is established under the null hypothesis. We

further study the asymptotic relative efficiency of our

proposed test with respect to state-of-the-art universal

tests in two different settings: one is designed for

high-dimensional linear model and the other is introduced in a completely model-free setting. A remarkable finding reveals that, thanks to the scale-invariance

property, even under the high-dimensional linear

models, our proposed test is asymptotically much

more powerful than existing competitors for the covariates with heterogeneous variances while maintaining high efficiency for the homoscedastic ones.

A Distribution Free Conditional Independence Test

with Applications to Causal Discovery

Yaowu Zhang

Shanghai University of Finance and Economics

Abstract: This paper is concerned with test of the

conditional independence. We first establish an equivalence between the conditional independence and the

mutual independence. Based on the equivalence, we

propose an index to measure the conditional dependence by quantifying the mutual dependence among the

transformed variables. The proposed index has several

appealing properties. (a) It is distribution free since

the limiting null distribution of the proposed index

does not depend on the population distributions of the

data. Hence the critical values can be tabulated by

simulations. (b) The proposed index ranges from zero

to one, and equals zero if and only if the conditional

independence holds. Thus, it has nontrivial power

under the alternative hypothesis. (c) It is robust to

outliers and heavy-tailed data since it is invariant to

conditional strictly monotone transformations. (d) It

has low computational cost since it incorporates a

simple closed-form expression and can be implemented in quadratic time. (e) It is insensitive to tuning

parameters involved in the calculation of the proposed

index. (f) The new index is applicable for multivariate


random vectors as well as for discrete data. All these

properties enable us to use the new index as a statistical inference tool for various types of data. The effectiveness of

the method is illustrated through extensive simulations

and a real application on causal discovery.

Rank-Based Indices for Testing Independence between Two High-dimensional Vectors

Yeqing Zhou

Tongji University

Abstract: To test independence between two

high-dimensional random vectors, we propose three

tests based on the rank-based indices derived from

Hoeffding’s D, Blum-Kiefer-Rosenblatt’s R and

Bergsma-Dassios-Yanagimoto’s τ*. Under the null

hypothesis of independence, we show that the distributions of the proposed test statistics converge to

normal ones if the dimensions diverge arbitrarily with

the sample size. We further derive an explicit rate of

convergence. Thanks to the monotone transformation-invariant property, these distribution-free tests can be readily applied to generally distributed random vectors, including heavy-tailed ones. We further study the

local power of the proposed tests and compare their

relative efficiencies with two classic distance covariance/correlation based tests in high-dimensional settings. We establish explicit relationships between D, R,

τ* and Pearson’s correlation for bivariate normal random variables. The relationships serve as a basis for

power comparison. Our theoretical results show that

under a Gaussian equicorrelation alternative: (i) the

proposed tests are superior to the two classic distance

covariance/correlation based tests if the components

of random vectors have very different scales; (ii) the

asymptotic efficiencies of the proposed tests based on D, τ* and R are sorted in descending order.

Joint work with Kai Xu, Liping Zhu, Runze Li.

Invited Session IS066: Semiparametric Modeling

for Complex Survival Data

Privacy Preserving Survival Analysis Based on

Federated Gradient Boosted Trees

Hong Wang

Central South University

Abstract: Data driven machine learning models have

found increasing applications in survival analysis over

the past decade. One major issue with these models is

the need of sufficient training samples. However, due

to privacy, safety and other concerns, survival data are

usually distributed across various institutions and

cannot be aggregated directly. In this research, we aim

to analyze distributed and private survival data within

a federated learning framework. Different from previous deep learning based federated models, the proposed FedSurX approach is implemented with efficient histogram-based gradient boosted survival trees.

Experiment results have shown that the proposed

approach outperforms state-of-the-art deep survival

models such as DeepSurv, DeepHit, CoxCC and

NnetSurvival by a large margin in terms of predictive

accuracy. To deal with the data heterogeneity issue, a

data augmentation component is also provided and

demonstrated via extensive simulated and real datasets.

Joint work with Xinyi Zhang.

Bayesian Estimation of Partial Functional Tobit

Censored Quantile Regression Model

Chunjie Wang

Changchun University of Technology

Abstract: The study proposes a partial functional

Tobit censored quantile regression model (PFTCQR)

to describe the relationship between response variables that are left- or right-censored and a combination

of functional and scalar predictors. The functional

principal component analysis (FPCA) and moment

method are employed to estimate the slope and covariance functions of the functional predictors. Furthermore, an efficient Markov Chain Monte Carlo

(MCMC) algorithm is developed, leveraging the location-scale mixture representation of the asymmetric

Laplace distribution (ALD), to estimate latent variables and other parameters. The performance of the

proposed methodology is evaluated through simulation studies, and the applicability of the method is

demonstrated through a study of Laryngeal carcinoma.

Joint work with Xinyuan Song.
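
For readers unfamiliar with the ALD device, the sketch below simulates asymmetric Laplace errors through their location-scale mixture representation (the trick that makes Gibbs-type MCMC convenient in Bayesian quantile regression) and checks that the p-quantile of the errors is approximately zero; the full PFTCQR model, the FPCA step, and the authors' MCMC sampler are not reproduced here.

import numpy as np

def ald_errors(p, size, seed=0):
    # ALD error at quantile level p:  eps = theta*z + tau*sqrt(z)*u,
    # with z ~ Exp(1), u ~ N(0,1), theta = (1-2p)/(p(1-p)), tau^2 = 2/(p(1-p)).
    rng = np.random.default_rng(seed)
    theta = (1.0 - 2.0 * p) / (p * (1.0 - p))
    tau = np.sqrt(2.0 / (p * (1.0 - p)))
    z = rng.exponential(1.0, size)
    u = rng.normal(0.0, 1.0, size)
    return theta * z + tau * np.sqrt(z) * u

for p in (0.25, 0.5, 0.75):
    eps = ald_errors(p, 200000, seed=int(100 * p))
    print(f"p = {p:.2f}: empirical P(eps <= 0) = {np.mean(eps <= 0):.3f}")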


Factor-Augmented Transformation Models for

Interval-Censored Failure Time Data

Shuwei Li

Guangzhou University

Abstract: Interval-censored failure time data frequently arise in various scientific studies where each

subject experiences periodical examinations for the

occurrence of the failure event of interest, and the

failure time is only known to lie in a specific

time interval. In addition, collected data may include

multiple observed variables with a certain degree of

correlation, leading to severe multicollinearity issues.

This study proposes a factor-augmented transformation model to analyze interval-censored failure

time data while reducing model dimensionality and

avoiding multicollinearity elicited by multiple correlated covariates. We provide a joint modeling framework by comprising a factor analysis model to group

multiple observed variables into a few latent factors

and a class of semiparametric transformation models

with the augmented factors to examine their and other

covariate effects on the failure event. Furthermore, we

propose a nonparametric maximum likelihood estimation approach and develop a computationally stable and reliable expectation-maximization algorithm for

its implementation. We establish the asymptotic properties of the proposed estimators and conduct simulation studies to assess the empirical performance of the

proposed method. An application to the Alzheimer's

Disease Neuroimaging Initiative study is provided. An R package, ICTransCFA, is also available for

practitioners.

Joint work with Hongxi Li, Liuquan Sun, Xinyuan

Song.

Invited Session IS070: Statistical Inference for

Biological and Medical Data

Microbial Causal Mediation Inference with Microbiome/Metagenomic Data

Huilin Li

New York University

Abstract: Emerging evidence indicates that microbiome plays a mediating role in human health and

disease. Understanding this connection could drive the

development of targeted microbiome-focused interventions. In this talk, I will present a rigorous Sparse

Microbial Causal Mediation Model (SparseMCMM)

to investigate the causal mediating role of the

high-dimensional and compositional microbiome in a

typical three-factor (treatment, microbiome and outcome) causal study design. The proposed

SparseMCMM quantifies and tests the overall mediation effect and the component-wise mediation effect

of microbiome under the counterfactual framework. Recently, we extend SparseMCMM to investigate

microbiome’s role in health disparities, depicting a

plausible path from a non-manipulable exposure (e.g.,

ethnicity or region) to the outcome through microbiome (SparseMCMM_HD).

Joint work with Chan Wang.

Adaptive Estimation in Multivariate Response

Regression with Hidden Variables

Yang Ning

Cornell University

Abstract: This paper studies the estimation of the

coefficient matrix Θ in multivariate regression with hidden variables, Y = Θ^T X + B^T Z + E, where Y is an m-dimensional response vector, X is a p-dimensional vector of observable features, Z represents a K-dimensional vector of unobserved hidden variables, possibly correlated with X, B is an unknown coefficient matrix, and E is an independent error. The number of hidden variables K is unknown, and both m and p are allowed, but

not required to grow with the sample size n. Since

only Y and X are observable, we provide necessary

conditions for the identifiability of Θ. The same set of

conditions are shown to be sufficient when the error E is homoscedastic. Our identifiability proof is

constructive and leads to a novel and computationally

efficient estimation algorithm, called HIVE. The first

step of the algorithm is to estimate the best linear

prediction of Y given X, in which the unknown coefficient matrix exhibits an additive decomposition of Θ and a dense matrix originating from the correlation between X and the hidden variable Z. Under the

row sparsity assumption on Θ, we propose to minimize a penalized least squares loss by regularizing Θ via a group-lasso penalty and regularizing the

dense matrix via a multivariate ridge penalty.

Non-asymptotic deviation bounds of the in-sample

prediction error are established. Our second step is to

estimate the row space of B by leveraging the covariance structure of the residual vector from the first step. In the last step, we remove the effect of the hidden variables by projecting Y onto the complement of the estimated row space of B. Non-asymptotic error

bounds of our final estimator are established. The

model identifiability, parameter estimation and statistical guarantees are further extended to the setting

with heteroscedastic errors.

Manifold Learning for Noisy and High-Dimensional Datasets

Xiucai Ding

University of California, Davis

Abstract: Manifold learning theory has garnered

considerable attention in the modeling of expansive

biomedical datasets, showcasing its ability to capture

data essence more effectively than traditional linear

methodologies. Nevertheless, prevalent algorithms

like Diffusion Maps (DM), Laplacian Eigenmaps (LE), t-distributed stochastic neighbor embedding (t-SNE), and locally linear embedding (LLE) are primarily designed for

low-dimensional and clean datasets, whereas contemporary biomedical datasets tend to be

high-dimensional and noisy. This presentation addresses the adaptation of these algorithms to effectively accommodate the challenges posed by high

dimensionality and noise in modern datasets, assuming

that the datasets are sampled from some smooth manifolds and corrupted by high-dimensional noise. The

keys are to properly design the graphs and carefully

choose the parameters in an adaptive fashion.
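
For concreteness, a bare-bones diffusion maps sketch on clean, low-dimensional data is shown below; the adaptive graph and bandwidth constructions for noisy, high-dimensional data that the talk is about are not implemented here, and the bandwidth and example manifold are illustrative.

import numpy as np

def diffusion_map(x, bandwidth, n_components=2):
    # Gaussian affinity graph, row-normalized into a Markov transition matrix,
    # embedded with its leading non-trivial eigenvectors.
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / bandwidth)
    p = w / w.sum(axis=1, keepdims=True)
    evals, evecs = np.linalg.eig(p)
    order = np.argsort(-evals.real)
    idx = order[1:n_components + 1]        # skip the trivial constant eigenvector
    return evecs[:, idx].real * evals[idx].real

theta = np.linspace(0, 2 * np.pi, 300, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
embedding = diffusion_map(circle, bandwidth=0.1)
print("embedding shape:", embedding.shape)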

Sensitivity Analysis for Quantiles of Hidden Biases

in Matched Observational Studies

Xinran Li

University of Chicago

Abstract: Causal conclusions from observational

studies may be sensitive to unmeasured confounding.

In such cases, a sensitivity analysis is often conducted,

which, in general, tries to infer the minimum amount

of hidden biases needed in order to explain away the

observed association between treatment and outcome.

If the needed bias is large, then the treatment is likely

to have significant effects. The Rosenbaum sensitivity

analysis is a modern approach for conducting sensitivity analysis in matched observational studies. It investigates what magnitude the maximum of hidden biases

from all matched sets needs to be in order to explain

away the observed association. However, such a sensitivity analysis can be overly conservative and pessimistic, especially when investigators suspect that

some matched sets may have exceptionally large hidden biases. In this paper, we generalize Rosenbaum's

framework to conduct sensitivity analysis on quantiles

of hidden biases from all matched sets, which are

more robust than the maximum. Moreover, the proposed sensitivity analysis is simultaneously valid over

all quantiles of hidden biases and is thus a free lunch

added to the conventional sensitivity analysis. The

proposed approach works for general outcomes, general matched studies and general test statistics. In

addition, we demonstrate that the proposed sensitivity

analysis also works for bounded null hypotheses when

the test statistic satisfies certain properties. An R

package implementing the proposed approach is

available online.

Joint work with Dongxiao Wu.

Invited Session IS078: Statistics in Earth Science

Applications

Detection and Attribution Analysis with Estimating

Equations

Tianying Wang

Colorado State University

Abstract: Climate change detection and attribution

has played a central role in establishing the influence

of human activities on the climate. Optimal fingerprinting, a linear regression with errors-in-variables

(EIV), has been widely used in detection and attribution analyses of climate change. The method regresses

observed climate variables on the expected climate

responses to the external forcings, which are measured


with EIVs. The reliability of the method depends critically on proper point and interval estimation of the

regression coefficients. The confidence intervals constructed from the prevailing method, total least

squares (TLS), have been reported to be too narrow to

match their nominal confidence levels. We propose a

novel framework to estimate the regression coefficients based on an efficient, bias-corrected estimating

equations approach. The confidence intervals are constructed with a pseudo residual bootstrap variance

estimator that takes advantage of the available control

runs. Our regression coefficient estimator is unbiased

with a smaller variance than the TLS estimator. Our

estimation of sampling variability of the estimator has

a low bias compared to that from TLS, which is substantially negatively biased. The resulting confidence

intervals for the regression coefficients have

close-to-nominal level coverage rates, which ensures

valid inferences in detection and attribution analysis.

By applying this advanced method to HadCRUT5

observational data and CMIP6 multimodel simulations, our study reevaluates temperature detection and

attribution at global and regional levels, strengthens

the existing detection and attribution conclusions at

the global scale, and provides evidence of the effect

of anthropogenic forcings in various regions.
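
For context, the sketch below implements the standard total least squares (TLS) estimator via the SVD of the augmented data matrix, the prevailing errors-in-variables estimator that the talk improves upon; the proposed bias-corrected estimating-equations approach and its bootstrap intervals are not shown, and the simulated data are illustrative.

import numpy as np

def tls(x, y):
    # TLS slope estimates for y ~ x @ beta (no intercept), from the right
    # singular vector of [x y] associated with the smallest singular value.
    z = np.column_stack([x, y])
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    v = vt.T
    p = x.shape[1]
    return -v[:p, p] / v[p, p]

rng = np.random.default_rng(7)
n, beta_true = 500, np.array([1.0, 0.5])
x_true = rng.normal(size=(n, 2))
x_obs = x_true + 0.3 * rng.normal(size=(n, 2))     # covariates measured with error
y = x_true @ beta_true + 0.3 * rng.normal(size=n)

ols = np.linalg.lstsq(x_obs, y, rcond=None)[0]
print("OLS (attenuated):", np.round(ols, 3))
print("TLS             :", np.round(tls(x_obs, y), 3))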

Detecting Marine Heatwaves Below the Sea Surface Globally Using Physics-Informed Statistical

Learning

Furong Li

Ocean University of China

Abstract: Extreme warm water events, known as

marine heatwaves (MHWs), cause a variety of adverse

impacts on the marine ecosystem. They are occurring

ubiquitously across the global ocean. Yet monitoring

MHWs below the sea surface is still challenging due

to the sparsity of in-situ temperature observations.

Here, we propose a statistical learning method guided

by ocean dynamics and optimal prediction theory, to

detect subsurface MHWs based on the observable sea

surface temperature and sea surface height. This

physics-informed statistical learning method shows

good skills in detecting subsurface MHWs in the oceanic epipelagic zone over many parts of the global

ocean. It outperforms both the classical ordinary least

square regression and popular deep learning methods

that do not utilize ocean dynamics, with clear physical

interpretation for its outperformance. Our study provides a useful statistical learning method for near

real-time monitoring of subsurface MHWs at a global

scale and highlights the importance of incorporating

physical information for enhancing the efficiency and

interpretability of statistical learning.

Joint work with Xiang Zhang, Zhao Jing, Bohai

Zhang, Tianshi Du and Xiaohui Ma.

A Bayesian Nonstationary Model for Spatial Binary Data with Application to Characterizing Surface

Marine Heatwaves

Bohai Zhang

Beijing Normal University-Hongkong Baptist University

Abstract: Spatial binary data are ubiquitous nowadays in many disciplines, such as presence and absence of species in ecology, cloud mask/sea-ice detection in remote sensing, and incidences of diseases in epidemiology, to name a few. In this talk, we present a

spatial generalized linear model for spatial binary data,

which is driven by a latent nonstationary Gaussian

process. The non-stationarity of the latent process is

achieved by a domain partitioning method based on

random spanning trees. We give an exact Bayesian inference method for the model parameters through data augmentation and equip the proposed method with a fast computation strategy. The performance of

the proposed method is then demonstrated through

simulation studies. Last, we will apply the proposed

method to identify surface marine heatwave events.
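To fix ideas, the snippet below generates the kind of data the model targets: spatial binary responses driven by a latent Gaussian field through a probit link. It is a data-generating sketch only (stationary covariance, made-up parameters); the talk's contribution is the nonstationary partitioned prior and the exact Bayesian inference, which are not shown here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import norm

# Simulate spatial binary data from a latent Gaussian process via a probit link.
rng = np.random.default_rng(6)
coords = rng.uniform(size=(300, 2))                 # irregular spatial locations
D = cdist(coords, coords)
K = np.exp(-D / 0.2)                                # stationary exponential covariance
z = np.linalg.cholesky(K + 1e-8 * np.eye(300)) @ rng.normal(size=300)  # latent field
p = norm.cdf(-1.0 + z)                              # probit link with intercept -1
y = rng.binomial(1, p)                              # observed binary responses
print("proportion of ones:", y.mean())
```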

Global Carbon Flux Estimated Using EnKF with

In-situ and Satellite CO₂ Observations

Wu Su

Peking University

Abstract: Accurate estimation of carbon removal by

terrestrial ecosystems and oceans is crucial to the

success of global carbon mitigation initiatives. The

emergence of multi-source CO₂ observations offers

prospects for an improved assessment of carbon fluxes. However, the utility of these diverse observations

has been impeded by their heterogeneity, leading to

much variation in estimated carbon fluxes. To harvest

the diverse data types, this paper develops a Multi-observation Carbon Assimilation System (MCAS),

which simultaneously integrates both satellite and

ground-based observations. MCAS modifies the ensemble Kalman filter to apply different inflation factors to different types of observation errors, addressing the heterogeneity between satellite and in-situ data.

The carbon flux calculations derived from MCAS

have been proven to outperform those obtained from a

single source on the commonly used independent

validation datasets, demonstrating a 20% reduction in error compared to existing carbon flux products. We use MCAS to conduct ecosystem and ocean

carbon flux inversion for the period of 2016-2020,

which reveals that the 5-year average global net terrestrial and ocean sinks were 1.84 ± 0.60 and 2.74 ± 0.49 petagrams, together absorbing approximately 47% of human-caused CO₂ emissions,

which were consistent with the Global Carbon Project

estimates of 1.82 and 2.66 petagrams.

Joint work with Binghao Wang, Hanyue Chen, Lin

Zhu, Xiaogu Zheng and Song Xi Chen.
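The key algorithmic ingredient, type-specific inflation of the observation-error covariance inside an ensemble Kalman filter update, can be sketched in a few lines. Everything below (dimensions, inflation factors, observation operators) is random and illustrative, not the MCAS configuration.

```python
import numpy as np

# Minimal stochastic EnKF analysis step with different multiplicative inflation
# factors for satellite and in-situ observation errors (illustrative values).
rng = np.random.default_rng(2)
n_state, n_ens = 50, 30
ensemble = rng.normal(size=(n_state, n_ens))          # forecast ensemble of fluxes

H_sat = rng.normal(size=(10, n_state))                # satellite observation operator
H_insitu = rng.normal(size=(5, n_state))              # in-situ observation operator
H = np.vstack([H_sat, H_insitu])

# Heterogeneous observation-error covariance: a different inflation per data type.
r_sat, r_insitu = 1.0, 0.25
lam_sat, lam_insitu = 1.5, 1.0
R = np.diag(np.r_[np.full(10, lam_sat * r_sat), np.full(5, lam_insitu * r_insitu)])

obs = H @ ensemble.mean(axis=1) + rng.normal(size=15) # synthetic observations

# EnKF update with the ensemble-estimated background covariance.
X = ensemble - ensemble.mean(axis=1, keepdims=True)
P = X @ X.T / (n_ens - 1)
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)          # Kalman gain
perturbed_obs = obs[:, None] + rng.multivariate_normal(np.zeros(15), R, size=n_ens).T
analysis = ensemble + K @ (perturbed_obs - H @ ensemble)
print("analysis ensemble mean shape:", analysis.mean(axis=1).shape)
```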

Invited Session IS065: Robust Inference in

High-Dimensional Complex Data

Two-way Homogeneity Pursuit for Quantile Network Vector Autoregression

Xuening Zhu

Fudan University

Abstract: While the Vector Autoregression (VAR)

model has received extensive attention for modelling

complex time series, quantile VAR analysis remains

relatively underexplored for high-dimensional time

series data. To address this disparity, we introduce a

two-way grouped network quantile (TGNQ) autoregression model for time series collected on

large-scale networks, known for their significant heterogeneous and directional interactions among nodes.

Our proposed model simultaneously conducts node

clustering and model estimation to balance complexity

and interpretability. To account for the directional

influence among network nodes, each network node is

assigned two latent group memberships that can be

consistently estimated using our proposed estimation

procedure. Theoretical analysis demonstrates the consistency of membership and parameter estimators

even with an overspecified number of groups. With

the correct group specification, estimated parameters

are proven to be asymptotically normal, enabling valid

statistical inferences. Moreover, we propose a quantile

information criterion for consistently selecting the

number of groups. Simulation studies show promising

finite sample performance, and we apply the methodology to analyze connectedness and risk spillover

effects among Chinese A-share stocks.

Joint work with Wenyang Liu.

Bellman Conformal Inference: Calibrating Prediction Intervals for Time Series

Lihua Lei

Stanford University

Abstract: We introduce Bellman Conformal Inference

(BCI), a framework that wraps around any time series

forecasting models and provides calibrated prediction

intervals. Unlike the existing methods, BCI is able to

leverage multi-step ahead forecasts and explicitly

optimize the average interval lengths by solving a

one-dimensional stochastic control problem (SCP) at

each time step. In particular, we use the dynamic programming algorithm to find the optimal policy for the

SCP. We prove that BCI achieves long-term coverage

under arbitrary distribution shifts and temporal dependence, even with poor multi-step ahead forecasts.

We find empirically that BCI avoids uninformative

intervals that have infinite lengths and generates substantially shorter prediction intervals on volatility

forecasting problems when compared with existing

methods.

Joint work with Zitong Yang, Emmanuel Candès.

Transfer Learning for Spatial Autoregressive

Models

Wei Zhong

Xiamen University

Abstract: The spatial autoregressive (SAR) model has

been widely applied in various empirical economic

studies to characterize the spatial dependence among

subjects. However, the precision of estimating the

SAR model diminishes when the sample size of the

target data is limited. In this paper, we propose a new

transfer learning framework for the SAR model to

borrow the information from similar source data to

improve both estimation and prediction. When the

informative source data sets are known, we introduce

a two-stage algorithm, including a transferring stage

and a debiasing stage, to estimate the unknown parameters and also establish the theoretical convergence rates for the resulting estimators. If we do not

know which sources to transfer, a transferable source

detection algorithm is proposed to detect informative

source data based on a spatial residual bootstrap to

retain the necessary spatial dependence. Its detection

consistency is also derived. Simulation studies

demonstrate that using informative source data, our

transfer learning algorithm significantly enhances the

performance of the classical two-stage least squares

estimator. In the empirical application, we apply our

method to the election prediction in swing states in the

2020 U.S. presidential election, utilizing polling data

from the 2016 U.S. presidential election along with

other demographic and geographical data. The empirical results show that our method outperforms traditional estimation methods.

Joint work with Hao Zeng, Xingbai Xu.
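A stripped-down, non-spatial version of the transferring-plus-debiasing idea looks like this (ordinary regression in place of the SAR model and its two-stage least squares; sample sizes and the lasso penalty are arbitrary): pool the informative source data with the small target sample, then correct the pooled estimate with a sparse adjustment fit on the target residuals.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Hedged, simplified two-stage transfer sketch (not the SAR estimator itself).
rng = np.random.default_rng(7)
p = 20
beta_target = rng.normal(size=p)
beta_source = beta_target + 0.1 * rng.normal(size=p)    # similar, not identical

X_t = rng.normal(size=(50, p));   y_t = X_t @ beta_target + 0.5 * rng.normal(size=50)
X_s = rng.normal(size=(1000, p)); y_s = X_s @ beta_source + 0.5 * rng.normal(size=1000)

# Stage 1 (transferring): estimate on the pooled source + target data.
pooled = LinearRegression(fit_intercept=False).fit(np.vstack([X_s, X_t]),
                                                   np.r_[y_s, y_t])
# Stage 2 (debiasing): fit the target residuals with a sparse correction.
delta = Lasso(alpha=0.05, fit_intercept=False).fit(X_t, y_t - X_t @ pooled.coef_)
beta_hat = pooled.coef_ + delta.coef_
print("error of transferred estimate:", round(np.linalg.norm(beta_hat - beta_target), 3))
```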

CP Factor Model for Dynamic Tensors

Rong Chen

Rutgers University

Abstract: Observations in various applications are

frequently represented as a time series of multidimensional arrays, called tensor time series, preserving the

inherent multidimensional structure. In this paper, we

present a factor model approach, in a form similar to

tensor CP decomposition, to the analysis of

high-dimensional dynamic tensor time series. As the

loading vectors are uniquely defined but not necessarily orthogonal, it is significantly different from the

existing tensor factor models based on Tucker-type

tensor decomposition. The model structure allows for

a set of uncorrelated one-dimensional latent dynamic

factor processes, making it much more convenient to

study the underlying dynamics of the time series. A

new high order projection estimator is proposed for

such a factor model, utilizing the special structure and

the idea of the higher order orthogonal iteration procedures commonly used in Tucker-type tensor factor

model and general tensor CP decomposition procedures. Theoretical investigation provides statistical

error bounds for the proposed methods, which shows

the significant advantage of utilizing the special model structure. Simulation study is conducted to further

demonstrate the finite sample properties of the estimators. Real data application is used to illustrate the

model and its interpretations.

Invited Session IS052: Recent Advances in Sequencing and Imaging Data Analysis

Latent Subgroup Identification in Image-on-Scalar

Regression

Yajuan Si

University of Michigan

Abstract: Image-on-scalar regression has been a popular approach to modeling the association between

brain activities and scalar characteristics in neuroimaging research. The associations could be heterogeneous across individuals in the population, as indicated by recent large-scale neuroimaging studies, e.g.,

the Adolescent Brain Cognitive Development (ABCD)

study. The ABCD data can inform our understanding

of heterogeneous associations and how to leverage the

heterogeneity and tailor interventions to increase the

number of youths who benefit. It is of great interest to

identify subgroups of individuals from the population

such that: 1) within each subgroup the brain activities

have homogeneous associations with the clinical

measures; 2) across subgroups the associations are

heterogeneous; and 3) the group allocation depends on

individual characteristics. Existing image-on-scalar

regression methods and clustering methods cannot

directly achieve this goal. We propose a latent subgroup image-on-scalar regression model (LASIR) to

analyze large-scale, multi-site neuroimaging data with

diverse sociodemographics. LASIR introduces the

latent subgroup for each individual and group-specific,

spatially varying effects, with an efficient stochastic

expectation maximization algorithm for inferences.

We demonstrate that LASIR outperforms existing

alternatives for subgroup identification of brain activation patterns with functional magnetic resonance

imaging data via comprehensive simulations and applications to the ABCD study. We have released our

reproducible codes for public use with the software

package available at https://github.com/zikaiLin/lasir.

Biomarker Identification Using High Dimensional

Inference with An Application in CAR-T Cell Immunotherapy

Vicky Wu

Fred Hutchinson Cancer Center

Abstract: Chimeric antigen receptor T-cell (CAR-T)

therapy shows great efficacy for blood cancer patients,

while it’s largely unknown why some patients are not

responding, which makes it crucial to identify biomarkers to predict response in CAR-T studies. One

of the research bottlenecks is the lack of adequate

methods to identify biomarkers from high dimensional

data where many biomarkers are highly correlated and

the sample size (n) is limited. Motivated by the real

data from CAR-T study, we proposed to use a cutting-edge high dimensional inference (HDI) method to

identify biomarkers associated with different types of

clinical outcomes. HDI can deal with "large p, small n" situations and provides "de-biased" estimates of the coefficients and, importantly, allows computation of confidence intervals and p-values without refitting,

enabling easier interpretation by researchers. We have

applied this HDI approach to identify risk factors

associated with impaired hematopoietic recovery after

CD19 CAR-T cell therapy. Furthermore, we have

developed an open-source R Shiny tool HDI_Shiny

that allows researchers to upload their own correlative

data and use HDI or other statistical model to conduct

multivariable analysis, with different types of clinical

outcomes including continuous, binary, and survival.

An ongoing project is to use tree-based methods to

address this challenge. Through extensive simulations,

we compared the performance of different methods

including HDI, Stability, Stepwise regression, Random Forest (RF), XGBoost, Bayesian additive regression trees (BART), and conditional inference trees for

both continuous (e.g., CAR-T cell peak) and survival

(e.g., progression-free survival) outcomes. As expected, a larger n, a larger effect size β, and a smaller

number of true risk factors p0 yielded better performance in terms of a larger c-index. The dimensionality p has negligible impact, especially when β is large. We observed that HDI and BART outperform other methods across most scenarios, while the MCMC procedure

in BART is computationally intensive.

Functional Tensor Regression

Tongyu Li

Peking University

Abstract: Tensor regression has attracted significant

attention in statistical research. This study tackles the

challenge of handling covariates with smooth varying

structures. We introduce a novel framework, termed

functional tensor regression, which incorporates both

the tensor and functional aspects of the covariate. To

address the high dimensionality and functional continuity of the regression coefficient, we employ a low

Tucker rank decomposition along with smooth regularization for the functional mode. We develop a functional Riemannian Gauss--Newton algorithm that

demonstrates a provable quadratic convergence rate,

while the estimation error bound is based on the tensor covariate dimension. Simulations and a neuroimaging analysis illustrate the finite sample performance

of the proposed method.

Joint work with Fang Yao, Anru Zhang.

Streaming Tensor Factorization

Xiwei Tang

University of Virginia

Abstract: Multidimensional streaming data arises in

many modern applications, which have gained explosive attention in recent years. Conventional tensor

decomposition techniques usually lack the power to

handle such streaming tensor data over time. The

proposed method combines tensor-train decomposition with recursive least-squares filtering to track the

evolving tensor effectively. By minimizing a weighted

least-squares objective function, the algorithm accounts for missing values and enforces time-variation

constraints on the tensor-train cores as new data arrives. Our algorithm effectively tracks tensors from

noisy, incomplete, and high-dimensional observations

in both static and dynamic environments. Its performance is validated through multiple experiments on

both synthetic and real-world data.

Joint work with James Lee.

Invited Session IS085: Industrial Big Data and

Intelligent Statistical Analysis

工业大数据和智能化统计分析

Application of Intelligent Statistical Analysis in

Biotechnology and Medical Treatment

智能化统计分析在生技医疗上之应用

Bangchang Xie

Fu Jen Catholic University

摘要: 智能化统计分析在生技医疗领域中的应用日

益广泛。随着大数据和人工智能技术的发展,统计

分析方法已从传统的手工计算转变为依赖先进算

法和机器学习模型的智能化分析。这些技术在生物

技术和医疗领域发挥着至关重要的作用,包括疾病

预测、个性化治疗方案的制定、药物研发以及医疗

图像分析等。本文将探讨智能化统计分析的基本原

理和方法,并通过具体案例展示其在生技医疗中的

实际应用,强调其对提高医疗服务质量和推动医疗

科技进步的重要性。

Statistical Measurement and Application of Industrial Big Data

工业大数据统计测度及其应用

Yong Li

Chengdu University of Information Technology

摘要: 基于智能大数据时代对统计学的新认知下,

阐述工业大数据的基本内涵和特征,从信息物理系

统、数字孪生体、元宇宙和大模型视角,探索工业

大数据的统计测度及其应用。

Promote Sign Consistency in Cure Rate Model for

Credit Scoring

Chenlu Zheng

Fujian Police College

Abstract: Cure rate models are widely adopted in

credit scoring. Cure rate models consist of two parts:

an incidence part, which predicts the probability of default, and a latency part, which predicts when default is likely to occur. In the standard cure rate model, there

are no constraints on the relations between the coefficients in the two model parts. However, in practical

applications, the two model parts are quite related. It

is desirable that there may be some relations between

the two sets of the coefficients corresponding to the

same covariates. Existing works have considered incorporating a joint distribution or structural effect,

which is too restrictive. In this paper, we consider a

more flexible model that allows the two sets of covariates to have different distributions and magnitudes. In many practical cases, it is hard to interpret

the results when the two sets of the coefficients of the

same covariates have conflicting signs. Therefore, we

propose a sign-consistency cure rate model with a

sign-based penalty to improve interpretability. To

accommodate high-dimensional data, we adopt a

group lasso penalty for variable selection. Simulations

and a real data analysis demonstrate that the proposed

method has competitive performance compared with

alternative methods.
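One concrete way to write a sign-based penalty of this kind (illustrative notation only, not necessarily the exact form used in the paper) is, with incidence coefficients α_j and latency coefficients β_j attached to the same covariate j,

$$
P_{\text{sign}}(\alpha,\beta) \;=\; \lambda \sum_{j}\bigl(|\alpha_j| + |\beta_j| - |\alpha_j + \beta_j|\bigr),
$$

which equals zero when α_j and β_j share a sign (or one of them is zero) and is strictly positive when they conflict, so adding it to the penalized likelihood discourages conflicting signs without forcing the two coefficients to be equal. A group lasso penalty on (α_j, β_j) can then be added for variable selection, as described above.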

Industrial Intelligence and Corporate Environmental Performance: An Industry-Linked Perspective

Jinfang Tian

Shandong University of Finance and Economics

Abstract: Industrial intelligence provides new productive forces for green and high-quality development,

but objectively requires a clear judgment on whether it

can trigger the \"smart environment paradox\" in the era

of Industry 4.0. Based on the theory of industrial

linkage and the input-output technique of value-added

decomposition, this paper studies the environmental

effects of industrial intelligence at the enterprise level.

The study found that: (1) Industrial intelligence significantly improves the environmental performance of

the core enterprises through the synergy of human-machine cooperation; (2) Industrial intelligence

stimulates downstream enterprises' environmental

governance through the green demonstration effect

and affects the environmental costs of upstream enterprises through the pollution transfer effect, resulting in significant positive forward linkage effects and

negative backward linkage effects on the environmental performance of enterprises; (3) In heavily

polluting industries, industries with high supply chain

concentration, and enterprises with high levels of

digitization, industrial intelligence significantly promotes green and high-quality development of enterprises, and the environmental effects released through

forward and backward linkages are more pronounced;

(4) Further research found that policy empowerment

and investment empowerment help weaken the negative backward linkage effects. The findings of this

paper provide theoretical support and policy recommendations for both solving the "smart environment paradox" and promoting the national strategy of "developing new productive forces for new industrialization".

Joint work with Xiaoqi Ren, Yunliang Wang.

Invited Session IS076: Statistical Learning on

Multi-Source and Complicated Data

Decentralized Learning of Quantile Regression: A

Smoothing Approach with Two Bandwidths

Zhongyi Zhu

Fudan University

Abstract: Distributed estimation has attracted a significant amount of attention recently due to its advantages in computational efficiency and data privacy

preservation. In this article, we focus on quantile regression over a decentralized network. Without a coordinating central node, a decentralized network improves system stability and increases efficiency by

communicating with fewer nodes per round. However,

existing related works on decentralized quantile regression either have slow (sub-linear) convergence

speed or rely on some restrictive modelling assumptions (e.g. homogeneity of errors). We propose a novel

method for decentralized quantile regression which is

built upon the smoothed quantile loss. However, we

argue that the smoothed loss proposed in the existing

literature using a single smoothing bandwidth parameter fails to achieve fast convergence and statistical

efficiency simultaneously in the decentralized setting,

which we refer to as the speed-efficiency dilemma.

We propose a novel quadratic approximation of the

quantile loss using a big bandwidth for the Hessian

and a small bandwidth for the gradient. Our method

enjoys a linear convergence rate and has optimal statistical efficiency. Numerical experiments and real

data analysis are conducted to demonstrate the effectiveness of our method.

Joint work with Jianwei Shi and Heng Lian.
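The two-bandwidth construction can be illustrated with a single-machine Newton-type iteration for convolution-smoothed quantile regression: a small bandwidth is used where the gradient of the smoothed check loss is evaluated, and a larger one where the curvature (Hessian) is formed. The bandwidth values and the Gaussian kernel below are illustrative; the decentralized algorithm and its communication scheme are not reproduced.

```python
import numpy as np
from scipy.stats import norm

def smoothed_qr_step(X, y, beta, tau=0.5, h_grad=0.05, h_hess=0.5):
    """One Newton-type step: small bandwidth for the gradient, large for the Hessian."""
    r = y - X @ beta
    psi = tau - norm.cdf(-r / h_grad)          # derivative of the smoothed check loss
    grad = -(X.T @ psi) / len(y)
    w = norm.pdf(-r / h_hess) / h_hess         # smoothed curvature weights
    hess = (X.T * w) @ X / len(y)
    return beta - np.linalg.solve(hess, grad)

rng = np.random.default_rng(3)
n, p = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.standard_t(df=3, size=n)       # heavy-tailed errors

beta = np.linalg.lstsq(X, y, rcond=None)[0]            # OLS warm start
for _ in range(10):
    beta = smoothed_qr_step(X, y, beta)
print(np.round(beta, 2))
```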

Tuning-Free Sparse Clustering via Alternating

Hard-Thresholding

Niansheng Tang

Yunnan University

Abstract: Model-based clustering is a commonly used technique to partition heterogeneous data into

homogeneous groups. When the analysis is to be

conducted with a large number of features, analysts

face simultaneous challenges in model interpretability,

clustering accuracy, and computational efficiency.

Several Bayesian and penalization methods have been

proposed to select important features for model-based

clustering. However, the performance of those methods relies on careful algorithmic tuning, which can

be time-consuming for high-dimensional cases. In this

paper, we propose a new sparse clustering method

based on alternating hard-thresholding. The new

method is conceptually simple and tuning-free. With a

user-specified sparsity level, it efficiently detects a set

of key features by eliminating a large number of features that are less useful for clustering. Based on the

selected key features, one can readily obtain an effective clustering of the original high-dimensional data

under a general sparse covariance structure. Under

mild conditions, we show that the new method leads

to clusters with a misclassification rate consistent with

the optimal rate as if the underlying true model were

used. The promising performance of the new method

is supported by both simulated and real data examples.

Joint work with Wei Dong, Chen Xu, Jinhan Xie.
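A generic version of the alternating hard-thresholding idea, with a user-specified sparsity level s, can be written as a thin layer on top of k-means style updates. This sketch is ours, for intuition; it does not reproduce the paper's algorithm, covariance structure, or theory.

```python
import numpy as np

def sparse_kmeans_ht(X, k, s, n_iter=20, seed=0):
    """Alternate assignment/center updates with hard-thresholding of features."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # simple farthest-point initialization of the centers
    centers = np.empty((k, p))
    centers[0] = X[rng.integers(n)]
    for j in range(1, k):
        d2 = ((X[:, None, :] - centers[None, :j, :]) ** 2).sum(-1).min(axis=1)
        centers[j] = X[d2.argmax()]
    active = np.arange(p)                       # start with all features
    for _ in range(n_iter):
        # (1) assign points using the currently active features only
        d = ((X[:, active, None] - centers[:, active].T[None]) ** 2).sum(axis=1)
        labels = d.argmin(axis=1)
        # (2) update centers
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # (3) hard-threshold: keep the s features whose centers vary most
        spread = centers.var(axis=0)
        active = np.argsort(spread)[-s:]
    return labels, active

# toy data: only the first 3 of 50 features separate the two clusters
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))
X[:100, :3] += 3.0
labels, active = sparse_kmeans_ht(X, k=2, s=3)
print(sorted(active))
```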

Orthogonality Specification Testing with Complex

Survey Data

Puying Zhao

Yunnan University

Abstract: In this paper, we consider specification

analysis for linear quantile models in the context of

complex surveys. An orthogonal projection-based

nonparametric test is proposed for the correct specification of a linear conditional quantile function over a

continuum of quantile levels. The proposed test statistic not only can be used to assess the validity of

post design-based estimation inferences regarding the

effect of conditional variable on the distribution of

outcomes, but also can successfully incorporate design effects of complex survey data. We derive the

limiting distribution of the proposed test statistic under the null and alternative hypothesis. In particular,

we show that estimation of the unknown model parameters has no asymptotic impact on the proposed

test statistic. To implement the test in practice, we

propose a multiplier bootstrap procedure and establish

its validity. The performance of the proposed method

is evaluated through simulations, and the utility of the

methodology is demonstrated by a real-world example.

Joint work with Mengmeng Xu.

Communication-Efficient Distributed Subgroup

Learning

Wensheng Zhu

Northeast Normal University

Abstract: Multicentre research has become a prevalent practice in the era of big data to cope with the

heavy storage and computing burden brought by massive data, and exploring the latent clustering structure

of objects is a primary and fundamental task in many

fields. In this article, we study subgroup analysis

in the distributed scenarios, which can be applied to

areas such as cross-institutional pharmaceutical research. To achieve efficient communication and privacy-protected estimation and grouping, we propose

and investigate a Distributed Surrogate Fusion Penalized Regression (DSFPR) approach. Specifically, we

introduce a surrogate objective to the global pairwise

fusion penalized least squares approach. Our approach

consists of two stages. In the first stage, we establish a

preliminary grouping structure through local subgroup

analysis. Subsequently, based on the grouping structure obtained in the first stage, we construct the surrogate objective function. To address parallel problem- solving, we design a distributed alternating direction method of multiplier algorithm, which is costeffective in communication and does not involve the

transmission of personal information. We introduce

the sub-oracle property for estimation in local subgroup analysis and establish theoretical properties for

the final estimation under both correct and incorrect

preliminary grouping structures. Finally, simulations

and real data analysis validate the effectiveness of our

approach.

Joint work with Yining Zhou.

Special Session SS1: Bernoulli Session on Statistical Methodology & Theory

Enhancing High-Dimensional Statistical Learning

Through Random Matrix Theory: Reviving Pseudo-Inverses

Nestor Parolya

Delft University of Technology

Abstract: Modern high-dimensional statistics and

machine learning contend with vast datasets. Conventional algorithms, tailored for smaller dimensions,

falter in the face of real-world large datasets. Random

Matrix Theory (RMT) provides tools to address this

\"curse of dimensionality\", aiding in the refinement or

reconstruction of suboptimal algorithms and estimators.In this presentation, we establish a connection

between modern RMT and high-dimensional statistics

combined with machine learning, which we refer to as

'high-dimensional statistical learning'. Subsequently,

we present our recent results concerning the regularized learning and estimation of large covariance matrices using the Moore-Penrose pseudoinverse and

Tikhonov regularization combined with statistical

shrinkage techniques. Our findings contribute to constructing improved shrinkage estimators for the precision matrix, particularly in scenarios where the number of variables p is comparable to the sample size n,

resulting in p/n converging to a constant c>1 (singular

sample covariance matrix). We conclude with a real-data application in finance, demonstrating the superiority of the proposed methods over benchmarks like

non-linear shrinkage and cross-validation techniques

in machine learning.

Joint work with Taras Bodnar.
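As a toy illustration of why regularizing the pseudo-inverse matters when p/n = c > 1, the snippet below compares the Moore-Penrose pseudoinverse of a singular sample covariance with a simple Tikhonov-regularized inverse; the ridge level is hand-picked, whereas the talk's estimators choose the shrinkage optimally via random matrix theory.

```python
import numpy as np

# Compare two precision-matrix estimates in the singular regime p > n.
rng = np.random.default_rng(8)
p, n = 200, 100                        # c = p/n = 2 > 1
Sigma = np.eye(p)                      # true covariance (identity for simplicity)
X = rng.normal(size=(n, p))
S = X.T @ X / n                        # singular sample covariance

Omega_pinv = np.linalg.pinv(S)                        # Moore-Penrose pseudoinverse
alpha = 0.5                                           # illustrative ridge level
Omega_tik = np.linalg.inv(S + alpha * np.eye(p))      # Tikhonov-regularized inverse

def loss(Omega):                       # Frobenius error against the true precision
    return np.linalg.norm(Omega - np.linalg.inv(Sigma)) / p

print(f"pinv: {loss(Omega_pinv):.3f}, Tikhonov: {loss(Omega_tik):.3f}")
```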

Operations with Concentration Inequalities

Cosme Louart

The Chinese University of Hong Kong (Shenzhen)

Abstract: The award of this year's Abel Prize to

Michel Talagrand has shed new light on the importance of concentration in measure theory in the

whole field of probability. Taking up his legacy, we

will present some tools for performing operations on

concentration inequalities similar to those obtained by

Talagrand and others. More precisely, we will express

the concentration of the sum, the product and general

non-Lipschitz transformations on so-called \"concentrated vectors'', considering not only Gaussian but also

large-tailed decay. Remarkably, these operations allow

the natural appearance of parallel sum and parallel

product, originally introduced in electrical engineering.

As a simple application of these techniques, an efficient proof of the Hanson-Wright inequality will be

presented.

Asymptotic Properties of K-means and Its Modification under High Dimensional Settings

Kento Egashira

Tokyo University of Science

Abstract: While k-means has shown promise as a

methodology, its theoretical underpinnings remain

underexplored, particularly in high-dimensional contexts. In this presentation, we seek to delve into our

current understanding of k-means. Initially, we establish the asymptotic properties of k-means under mild

conditions, even amidst high-dimensional data. Then,

we explore the asymptotic properties of kernel

k-means without necessitating the specification of a

particular kernel function. This enables us to discern

the disparities between kernel k-means and conventional k-means. Leveraging these foundational insights, we consider a modification for k-means for

high dimensional data. Subsequently, numerical simulation studies are given and we discuss the performance of k-means for high dimensional data.

Joint work with Kazuyoshi Yata, Makoto Aoshima.

Invited Session IS056: Recent Advances in Statistical Network Analysis - Methodology and Applications

Autoregressive Networks: Node Heterogeneity,

Homophily, and Beyond

Binyan Jiang

The Hong Kong Polytechnic University

Abstract: Statistical modeling of network data is an

important topic in various areas. Although many real

networks are dynamic in nature, most existing statistical models and related inferences for network data

are confined to static networks, and the development

of the foundation for dynamic network models is still

in its infancy. In this talk, I will introduce an autoregressive model which directly depicts the dynamic

change of the edges over time for dynamic network

processes. In particular, I will present some recent

results when different stylized network features such

as node heterogeneity and link homophily are considered jointly with the autoregressive structure.

A Sparse Latent Space Model for Text Network

Xiaoyue Maggie Niu

The Pennsylvania State University

Abstract: Many real-world networks contain rich text

information in the edges, such as email communications and interactions between social media users. We

propose a new sparse latent space network model that

utilizes the text information in the edges and accommodates nodes' differential preferences on each text topic.

We establish a set of identifiability conditions for the

estimation of the proposed model and formulate a

projected gradient descent algorithm to estimate the

parameters. We further investigate the theoretical

properties of the proposed model and estimation algorithm. The effectiveness of our method is demonstrated through simulations and an analysis of the

Enron email dataset.

Joint work with Maoyu Zhang, Biao Cai, Dong Li,

Jingfei Zhang.

A Latent Space Model for Weighted Keyword

Co-occurrence Networks with Applications in

Knowledge Discovery in Statistics

Rui Pan

Central University of Finance and Economics

Abstract: Keywords are widely recognized as pivotal

in conveying the central idea of academic articles. In

this article, we construct a weighted and dynamic

keyword co-occurrence network and propose a latent

space model for analyzing it. Our model has two special characteristics. First, it is applicable to weighted

networks; however, most previous models were primarily designed for unweighted networks. Simply

replacing the frequency of keyword co-occurrence

with binary values would result in a significant loss of

information. Second, our model can handle the situation where network nodes evolve over time, and assess the effect of new nodes on network connectivity.

We utilize the projected gradient descent algorithm to

estimate the latent positions, and establish theoretical

properties of the estimators. In the real data application, we study the keyword co-occurrence network

within the field of statistics. We identify popular

keywords over the whole period as well as within

each time period. For keyword pairs, our model provides a new way to assess the association between

them. Finally, we observe that the interest of statisticians in the emerging research areas is gradually

growing in recent years.

Semiparametric Analysis of Directed Network

Formations

Ting Yan

Central China Normal University

Abstract: We propose a semiparametric model for

dyadic link formations in directed networks. The

model contains a set of degree parameters that measure different effects of popularity or outgoingness

across nodes, a regression parameter vector that reflects the homophily effect resulting from the nodal

attributes or pairwise covariates associated with edges,

and a set of latent random noises with unknown distributions. Our interest lies in inferring the unknown

degree parameters and homophily parameters. The

dimension of the degree parameters increases with the

number of nodes. Under the high-dimensional regime,

we develop a kernel based least squares approach to

estimate the unknown parameters. The major advantage of our estimator is that it does not encounter

the incidental parameter problem for the homophily

parameters. We prove consistency of all the resulting

estimators of the degree parameters and homophily

parameters. We establish high-dimensional central

limit theorems for the proposed estimators and provide several applications of our general theory, including testing the existence of degree heterogeneity,

testing sparse signals and recovering the support.

Simulation studies and a real data application are

conducted to illustrate the finite sample performance

of the proposed methods.

Joint work with Lianqiang Qu, Lu Chen, Yuguo

Chen.

Invited Session IS102: Statistical Measurement,

Evaluation and Decision

统计测度、评价与决策分会场

Theory, Optimization and Application of Policy

Evaluation Methods: Case Studies of DID, SCM,

RDD and PSM

政策评估方法的理论、优化与应用:以 DID、SCM、

RDD 和 PSM 为例

Facang Zhu

Zhejiang Gongshang University

Abstract: In recent years, the importance of causal

inference models in the field of policy evaluation has

been significantly increased, leading to in-depth research conducted by domestic and international

scholars, and further advancing the credibility revolution in econometrics. This article provides a review of

four major statistical methods of causal inference: the

Difference-in-Differences method, the Synthetic Control Method, Regression Discontinuity Design, and

Propensity Score Matching. Firstly, the article traces

the origin of these methods and analyzes their basic

ideas, assumptions, and principles. Secondly, it discusses the methodological innovations and developments of these methods, including improvements in

model settings, variable selection, and robustness tests.

Finally, through specific case studies, this article

demonstrates the practical application and effectiveness of these improved methods in multiple fields,

reflecting their practical value and far-reaching influence in social science research. This review not only

summarizes the latest research results of these methods, but also provides direction and reference for future research, aiming to help researchers select appropriate causal inference methods and promote in-depth

research and development in the field of policy evaluation.

Employer Brand Evaluation Method Based on

Online Emotional Analysis and LDA-Kano Model

Shouzhen Zeng

Ningbo University

Abstract: To further explore the emotional information in online reviews, this paper proposes an emotional analysis method for online reviews of employer

brands using the LDA-Kano-TOPSIS model. Firstly,

LDA topic clustering is employed to extract user

needs from online reviews of employer brands. Secondly, the identified user needs are classified by using

the Kano model, upon which an indicator structure

model for online reviews of employer brands is constructed. And then combine the comment number ratio

of each subject indicator to revise the indicator combination weight. Meanwhile, the sentiment dictionary

is used to analyze the sentiment tendency of employer

brand online reviews. Finally, the intuitionistic fuzzy TOPSIS evaluation method, combined with the modified indicator weights from the Kano model, is used to calculate the sentiment evaluation value of employer brand online reviews. The results

show that the intuitionistic fuzzy theory can well describe the emotional tendency of online reviews of

employer brands, and the constructed combined evaluation method can objectively reflect the information

of employees' job preferences, which is reasonable

and effective.

Online Commodity Recommendation Model for

Interaction between User Ratings and Intensity-Weighted Hierarchical Sentiment: A Case Study

of LYCOM

Chonghui Zhang

Zhejiang Gongshang University

Abstract: The online commodity recommendation

(OCR) model mines users’ historical behavior characteristics and recommends products that may be of

interest according to user preferences. Online reviews

are among the most important information sources for

OCR. However, the explicit and implicit emotion

words in online review texts have different structures

in the expression of multi-attribute emotions. To fully

utilize review information and improve the recommendation accuracy, we propose an OCR model that

considers the interaction of multiple attributes and

hierarchical emotions and calculates a score weighted

by emotion intensity. First, to balance the efficiency

and accuracy of information extraction while considering the coexistence of explicit and implicit expressions in online review text, a multi-attribute hierarchical emotion lexicon construction method is proposed. Second, based on the advantage of intuitionistic fuzzy sets in terms of information expression

superiority, multi-attribute review text information

expression of the affective polarity and intensity of

online review text is realized. Then, combined with

the weighted singular value decomposition and factorization machine method, we propose an OCR model for interactions between multi-attribute emotions

and scores through fusion and recombination of the

eigenvectors of users and products. Finally, LYCOM's

tourism products are used as an example to verify the

effectiveness of the proposed method.

Contributed Session CS019: Statistical Applications in Economics and Medicine

\"The Belt and Road\" Network Analysis and Spatial Spillover Effect

“一带一路”网络分析及空间溢出效应

Anqi Zhao

Chongqing Technology and Business University

Abstract: Last year marked the tenth anniversary of the Belt and Road Initiative. To showcase the policy's achievements and measure its impact on China and the countries along the route, we conduct a network analysis of import and export trade data for the 65 Belt and Road countries from 2005 to 2021, revealing features such as network density, centrality, and assortativity, and we build a spatial econometric model to measure the spillover effect of the initiative on national import volumes. The results show that since the initiative was proposed, the trade network has developed actively, trade ties between countries have expanded in both breadth and depth, and China occupies the "absolute core" position. China's import volume is influenced most strongly by Singapore and, in turn, exerts the largest spillover effects on Russia, Malaysia, Singapore, and Saudi Arabia. The study makes two contributions. First, PageRank centrality is used to measure the extent to which a country can be regarded as a "hub", and a weighted, directed assortativity coefficient is used to reflect the homophily principle. Second, we treat the spatial weight matrix as unknown and build a spatial econometric model with a time-varying mean, thereby measuring the spatial spillover effects of the Belt and Road Initiative.

Joint work with Jinhai Li and Mingjing Chen.

Network Analysis and Spatial Correlation of

Common Prosperity

共同富裕的网络分析与空间关联

Jinhai Li

Chongqing Technology and Business University

Abstract: Based on Chinese provincial panel data from 2012 to 2019, this paper first constructs an indicator system for common prosperity and then uses a gravity model to transform the indicators into a spatial association network linked by association strength, analyzing common prosperity from a network perspective. A generalized matrix regression model is built to extract the common network features of common prosperity, analyze the spatial similarity of association strength, and reveal the heterogeneity and dynamic evolution of common prosperity under the first-level indicators. The results show that common prosperity in China can be divided into five regions according to the similarity of association strength; the relative positions of the 31 provinces and municipalities did not change markedly between 2012 and 2019; and development across regions differs in the social, economic, cultural, ecological, and sharing dimensions. The contributions of this paper are as follows: the gravity model transforms panel data into a spatial association network, offering a new network perspective for studying common prosperity; instead of a traditional spatial econometric model, a generalized matrix regression model is constructed, common network features are extracted via low-rank matrix decomposition, and the 31 provinces and municipalities are partitioned into five regions by k-means clustering, in contrast to the exogenous classifications by geographic or economic distance in the existing literature; the spatial association structure of provincial common prosperity is revealed through network analysis and the classification of common network features; and the dynamics of association strength are studied over time while differences in common prosperity are identified along the social, economic, cultural, ecological, and sharing dimensions.

Statistical Measurement and Spatial Difference of

Chinese Path to Modernization Level Based on

Provincial Dimension

基于省域维度中国式现代化水平的统计测度及空

间差异

Zhiyang Yu

North Minzu University

Abstract: The report of the 20th National Congress of the Communist Party of China emphasizes advancing the great rejuvenation of the Chinese nation on all fronts through Chinese-style modernization. Constructing an evaluation indicator system for Chinese-style modernization helps to grasp its scientific meaning accurately, to understand the true overall level of Chinese-style modernization, to further promote the building of a Chinese national community, and to evaluate and improve the current work of each province, autonomous region, and municipality. Starting from the five defining features of Chinese-style modernization, this paper constructs an evaluation indicator system along five dimensions: modernization of a huge population, modernization of common prosperity for all, modernization in which material and cultural-ethical advancement are coordinated, modernization of harmony between humanity and nature, and modernization of peaceful development. The system covers 12 second-level and 40 third-level indicators. The entropy method is used to determine weights and conduct a comprehensive evaluation, the Dagum Gini coefficient and its decomposition are used to measure regional differences in the level of Chinese-style modernization and their sources, and kernel density estimation is used to analyze the dynamic evolution of the level of Chinese-style modernization.

Joint work with Shaojuan Ma.

Using the NP-UNet Model to Automatically Quantify the Non-Perfusion Area of Retinal

UWF-SS-OCTA in Patients with Ischemic Retinal

Disease

利用 NP-UNet 模型自动量化缺血性视网膜疾病患

者视网膜 UWF-SS-OCTA 的无灌注区

Yuting Liao

Sun Yat-sen University

Abstract: In the prediction and diagnosis of retinal diseases, the segmentation, identification, and detection of biomarkers in fundus images are essential for diagnosis and the assessment of medical indices. Previous studies have relied mainly on conventional OCTA images and manual annotation by physicians, which is time-consuming, labor-intensive, and demands a high level of ophthalmic expertise. This study aims to build an efficient NP-UNet model to accurately delineate the non-perfusion areas in ultra-widefield swept-source optical coherence tomography angiography (UWF-SS-OCTA) and to compute the ischemic index (ISI), assisting clinicians in evaluating non-perfusion areas in ischemic retinal diseases (IRDs). The study has three main components: 1) constructing an image dataset from enhanced inner retinal vascular layer OCTA images, extracting manual annotations with morphological and related methods, and applying data augmentation; 2) using the NP-UNet model to segment and quantify non-perfusion areas, relatively ischemic regions, and the foveal avascular zone; 3) using the post-processed segmentation results to assist physicians in later annotation, producing a more precise fusion of manual and AI annotations. Experimental results show that NP-UNet accurately segments both single target regions and combinations of multiple target regions, with good segmentation accuracy, AUC, and mIoU. The model also generalizes well to previously unseen CRVO, DR, and retinal vasculitis cases. For large non-perfusion areas its performance is comparable to manual annotation, and for small non-perfusion areas its annotation is better than manual annotation. This not only alleviates the problems of lengthy physician annotation and insufficient annotation precision in the diagnostic process, but also promotes the application and development of AI models in future medical image analysis, supporting the clinical diagnosis and treatment of IRDs.

Joint work with Yanjie Zhu, Yukang Jiang, Ting Tian

and Yan Luo.

Contributed Session CS016: Statistical Inference in

Complex Data Analysis

Penalized Sparse Covariance Regression with High

Dimensional Covariates

Yuan Gao

Peking University

Abstract: Covariance regression offers an effective

way to model the large covariance matrix with the

auxiliary similarity matrices. In this work, we propose

a sparse covariance regression (SCR) approach to

handle the potentially high-dimensional predictors

(i.e., similarity matrices). Specifically, we use the

penalization method to identify the informative predictors and estimate their associated coefficients simultaneously. We first investigate the Lasso estimator

and subsequently consider the folded concave penalized estimation methods (e.g., SCAD and MCP).

However, the theoretical analysis of the existing penalization methods is primarily based on i.i.d. data,

which is not directly applicable to our scenario. To

address this difficulty, we establish the

non-asymptotic error bounds by exploiting the spectral properties of the covariance matrix and similarity

matrices. Then, we derive the estimation error bound

for the Lasso estimator and establish the desirable

oracle property of the folded concave penalized estimator. Extensive simulation studies are conducted to

corroborate our theoretical results. We also illustrate

the usefulness of the proposed method by applying it

to a Chinese stock market dataset.

Joint work with Zhiyuan Zhang, Zhanrui Cai,

Xuening Zhu, Tao Zou, Hansheng Wang.
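A bare-bones version of the estimation idea: vectorize the sample covariance and the candidate similarity matrices, and run an L1-penalized regression to pick out the informative ones. The similarity matrices, the group-membership construction, and the penalty level below are all made up for illustration; the paper's folded concave penalties and non-i.i.d. theory are not reflected here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
p, K, n = 30, 8, 500
# Candidate similarity matrices: group-membership patterns W_k = z_k z_k^T.
Z = rng.binomial(1, 0.4, size=(K, p)).astype(float)
W = np.einsum('ki,kj->kij', Z, Z)
beta_true = np.zeros(K)
beta_true[:2] = [0.8, 0.5]                     # only two similarity matrices matter
Sigma = np.eye(p) + np.einsum('k,kij->ij', beta_true, W)

Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(Y, rowvar=False)                    # sample covariance

idx = np.triu_indices(p)                       # regress the upper triangle
X_design = np.stack([Wk[idx] for Wk in W], axis=1)
y_target = (S - np.eye(p))[idx]
fit = Lasso(alpha=0.01, fit_intercept=False).fit(X_design, y_target)
print(np.round(fit.coef_, 2))                  # informative matrices should dominate
```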

Quantile Forward Regression in High-Dimensional

Distributional Counterfactual Analysis

Hongqi Chen

Hunan University

Abstract: This paper introduces a novel quantile forward regression approach for constructing distributional artificial counterfactuals. In the context of

counterfactual analysis, where the number of control

units frequently surpasses the pre-treatment time dimension, our quantile forward regression provides an

approach to mitigate this challenge. The methodology

involves the step-wise selection of control units from

a candidate set. We establish the theoretical properties

of quantile forward regression, encompassing a bound

on its weak submodularity ratio, asymptotic convergence results, and an assessment of its asymptotic

efficacy. Through extensive Monte Carlo simulations,

we showcase the superior finite sample performance

of the quantile forward regression approach compared

to the l1-penalization approach. Our evaluation focuses on counterfactual prediction accuracy and the

selection of control units. Finally, we demonstrate the

application of quantile forward regression in an empirical study, analyzing the impact of an anti-corruption campaign on luxury watch importation.

A Tuning-Free Robust Approach for High-Dimensional Regression with Missing Covariates

Wenjun Wang

Yunnan University

Abstract: We consider a tuning-free, robust, and efficient approach to high-dimensional regression using

missing covariates. A simple and appropriate strategy imputes the missing entries of the covariates with the conditional expectation given the observed covariates. Jaeckel’s dispersion function (Jaeckel, 1972)

with Wilcoxon scores provides the loss function, and

two stages of the variable selection and parameter

estimation are performed. The L1 penalty for the

LASSO type regularized estimator is applied. A

non-convex penalty enhanced version of the second

stage is then proposed to reduce the estimation errors

and improve the efficiency. Under certain mild conditions, the pivotal gradient function for the Wilcoxon

rank loss function under the imputation data gain is

theoretically guaranteed. The L1-penalized Wilcoxon

rank approach achieves the same near-oracle rate as

the Lasso, and the second-stage enhancement, which

is an estimator of the SCAD-penalized Wilcoxon rank

approach (i.e., nonconvex penalty function), possesses

the oracle properties with a high probability. Numerically, the robustness and efficiency of the second-stage enhancement under imputation by the conditional expectation are significant, particularly for

heavy-tailed error distributions in various cases.

Parameter Identification of Random Wing Model

Based on Deep Feature Fusion

基于深度特征融合的随机机翼模型参数辨识

Xiaolong Wang

Shaanxi Normal University

Abstract: Identifying the parameters of a stochastic nonlinear dynamical system from a single sample path remains difficult when, for example, the system is driven by non-Gaussian process noise, the phase-space state is only partially observed, or the parameters vary over time. To address these difficulties, this work studies a two-degree-of-freedom airfoil model driven by Lévy colored noise and proposes a parameter estimation neural network (PENN) to identify the system parameters, including the flow speed and the geometric and structural parameters of the airfoil. The network architecture has three stages: feature extraction, feature fusion, and parameter identification. (1) An LSTM network converts a sample path into local deep-learning features; (2) the variable-length local features are fused into fixed-length global features; (3) two ResNet networks map the global features to the system parameters and to standard deviations representing estimation uncertainty. Numerical experiments show that PENN can estimate all system parameters from partially observed one- or two-dimensional time series of the airfoil's pitch or plunge, and that the learned deep features provide a good representation of the nonlinear dynamical system, capturing the key information about the system parameters. We further propose a feature-level fusion strategy that allows PENN to combine information from many short sample paths and improve identification accuracy, and a sliding-window local information fusion algorithm that allows PENN to identify all parameters of a time-varying system from a single sample path.

A Distribution Factor Clustering Method Based on

Gaussian Mixture Model

基于高斯混合模型的分布因子聚类方法

Yingqiu Zhu

University of International Business and Economics

Abstract: With the development of information technology, the data generated by human society are becoming ever larger and more complex in form, posing great challenges for cluster analysis. In a growing number of applications, observations are interrelated and hierarchically nested, making traditional clustering methods difficult to apply directly. The usual solution is to compress the observed information into low-dimensional feature vectors via feature engineering and then cluster, which inevitably loses information. To make full use of the observations, this paper represents each clustering object by its distribution function, greatly reducing the information loss. We then propose a distribution factor model based on Gaussian mixture models, which decomposes the observations of the clustering objects into two parts: common factors represented by Gaussian components, reflecting typical patterns shared across the data, and a loading matrix in which each loading vector reflects the heterogeneous characteristics of an individual. Once the loading vectors are estimated, the individuals can be clustered. The proposed method has good statistical efficiency: under certain assumptions, the clustering error rate converges to zero as the number of observed individuals diverges. Experiments on simulated data and real data on stock returns and air pollution show that the method can distinguish individuals with different characteristic patterns, solves the distribution-function clustering problem for multidimensional data, and provides decision support for practical problems such as financial risk management and differentiated air quality governance.

Joint work with Danyang Huang and Bo Zhang.
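A compact way to see the construction: fit a shared Gaussian mixture to the pooled observations (the mixture components play the role of common factors), summarize each object by the average posterior weights of its observations (its loading vector), and cluster the loading vectors. The sketch below, with invented component means and group sizes, is only meant to convey this pipeline, not the paper's estimator or its error-rate guarantees.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
n_objects, n_obs = 60, 200
means = np.array([-3.0, 0.0, 3.0])                  # three latent "factor" components
data, groups = [], []
for i in range(n_objects):
    w = np.array([0.6, 0.3, 0.1]) if i < 30 else np.array([0.1, 0.3, 0.6])
    comp = rng.choice(3, size=n_obs, p=w)
    data.append(rng.normal(loc=means[comp], scale=1.0))
    groups.append(np.full(n_obs, i))
x = np.concatenate(data).reshape(-1, 1)
g = np.concatenate(groups)

gmm = GaussianMixture(n_components=3, random_state=0).fit(x)   # common factors
resp = gmm.predict_proba(x)                                    # posterior weights
loadings = np.vstack([resp[g == i].mean(axis=0) for i in range(n_objects)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(loadings)
print(labels)                                                  # two groups of objects
```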

Contributed Session CS017: Statistical Modeling

for Complex Networks

Atypical Dynamic Network Reconfiguration and

Genetic Mechanisms in Patients with Major Depressive Disorder

Hairong Xiao

Hunan Normal University

Abstract: Background: Brain dynamics underlie

complex forms of flexible cognition or the ability to

shift between different mental modes. However, the

precise dynamic reconfiguration based on multi-layer

network analysis and the genetic mechanisms of major depressive disorder (MDD) remains unclear.

Methods: Resting-state functional magnetic resonance imaging (fMRI) data were acquired from the

REST-meta-MDD Project, including 555 patients with

MDD and 536 healthy controls (HCs). A time-varying

multi-layer network was constructed, and dynamic

modular characteristics were used to investigate the

network reconfiguration. Additionally, partial least

squares regression analysis was performed using

transcriptional data provided by the Allen Human

Brain Atlas (AHBA) to identify genes associated with

atypical dynamic network reconfiguration in MDD.

Results: In comparison to HCs, patients with

MDD exhibited lower global and local recruitment

coefficients. The local reduction was particularly evident in the salience and subcortical networks. Spatial

transcriptome correlation analysis revealed an association between gene expression profiles and atypical

dynamic network reconfiguration observed in MDD.

Further functional enrichment analyses indicated that

positively weighted reconfiguration-related genes

were primarily associated with metabolic and biosynthetic pathways. Additionally, negatively enriched

genes were predominantly related to programmed cell

death-related terms.

Conclusions: Our findings offer robust evidence

of the atypical dynamic reconfiguration in patients

with MDD from a novel perspective. These results

offer valuable insights for further exploration into the

mechanisms underlying MDD.

Joint work with Dier Tang, Chuchu Zheng, Zeyu

Yang, Wei Zhao and Shuixia Guo.

Citation Counts Prediction of Statistical Publications Based on Multi-layer Academic Networks via

Neural Network Model

Tianchen Gao

Xiamen University

Abstract: Citation counts are a crucial factor in evaluating the quality of research papers. Therefore, it is

vital to accurately predict citation counts and explore

the mechanisms underlying citations. In this study, we

focus on predicting the citation counts in the field of

statistics. We collect 55,024 academic papers published in 43 statistics journals between 2001 and 2018.

Furthermore, we collect and clean a high-quality dataset and then construct multi-layer networks from

different perspectives, including journal networks,

author citation networks, co-citation networks,

co-authorship networks, and keyword co-occurrence

networks. Additionally, we extract 77 factors for citation counts prediction, including 22 traditional and 55

network-related factors. To address the issues of zero-inflated and over-dispersed citation counts, a neural

network model is designed to achieve high prediction

accuracy. Furthermore, we adopt a

leave-one-feature-out approach to investigate the importance of these factors. The proposed neural network model achieves an MAE value of 7.352, which

outperforms other machine learning models in the

comparison. Thus, this study provides a useful guide

for researchers to predict citation counts and can be

easily extended to other research fields.

Joint work with Jingyuan Liu, Rui Pan, Hansheng

Wang.

A Community Detection Algorithm Based on

Spectral Co-clustering and Weight

Self-Adjustment in Attributed Stochastic Co-block

Models

Yuxin Zhang

University of Science and Technology of China

Abstract: Degree-corrected stochastic co-block model (DC-ScBM) is extensively used for detecting the

community structure of directed networks. In practice,

node attribute is another possible source of information one can use. To incorporate both the node

attribute and edge direction information, we develop

the spectral co-clustering and feature weight

self-adjustment algorithm (Spcc-SA) for the node

attributed DC-ScBM. By minimizing normalized cuts

(Ncut), this algorithm detects the sending and receiving communities, together with the weight of node

attributes in an iterative way. Simulation studies verify

the robustness and effectiveness of our Spcc-SA algorithm under various node attributes and network topologies. We also apply our method to real data sets

including Enron email, world trade and Weddell Sea

network.

Joint work with Jie Liu, Yang Yang.

Two-Step Estimation Procedure for Copula-Based

Parametric Regression Models for

Semi-Competing Risks Data

Qingmin Zhang

Yunnan University

Abstract: For semi-competing risks data, the

non-terminal and terminal events are always associated

and can be influenced by covariates. To evaluate the

covariate effects on the association and the two

events, we employ regression modelling for the

semi-competing risks data under the copula-based

framework. Due to complexity of copula structure, we

propose a new method that combines the novel

two-step algorithm with Bound Optimization By

Quadratic Approximation (BOBYQA) method. This

method can mitigate the influence of initial values and is more robust. Simulations validate the performance of the proposed method. We apply the proposed method to real data from the Amsterdam Cohort Study (ACS), where some improvements can be found.

Constructing a Three-Dimensional Eddy-Resolving

Hydrographic Dataset in West Pacific Ocean Using

Ensemble Kalman Filter

Binghao Wang

Peking University

Abstract: In this work, an implementation of the

ensemble Kalman filter scheme in high dimension is

developed for the purpose of assimilating observations

into a high-resolution nonlinear numerical model of

the West Pacific Ocean. In high-dimensional case, the

ensemble-evaluated covariance matrix is usually underestimated, and spurious correlations exist between

a state variable and remote observations because of a

finite ensemble size. So we consider estimating the

covariance matrix together with a multiplicative inflation factor to adjust the covariance matrix. In addition,

since different types of observations are assimilated,

including the in-situ and satellite data, we should address the heterogeneity issues by applying different

inflation factors to the observation error covariance

matrices. However, the EnKF itself is sensitive to

covariance inflation. A banded version of the covariance matrix via Cholesky decomposition is also considered.

Contributed Session CS018: Complex Data Analysis

Estimation of Non-Gaussian Factors Using Higher-Order Multi-cumulants in Weak Factor Models

Guanglin Huang

Southwestern University of Finance and Economics

Abstract: We estimate the latent factors in

high-dimensional non-Gaussian panel data using the

eigenvalue decomposition of the product between the

higher-order multi-cumulant and its transpose. The

proposed Higher order multi-cumulant Factor Analysis (HFA) approach comprises an eigenvalue ratio test

to select the number of non-Gaussian factors and uses

the eigenvector to estimate the factor loadings. Unlike

covariance-based approaches, HFA remains reliable

for estimating the non-Gaussian factors in weak factor

models with Gaussian error terms. Simulation results

confirm that HFA estimators improve the accuracy of

factor selection and estimation compared to covariance-based approaches. We illustrate the use of HFA

to detect and estimate the factors for the FRED-MD

data set and use them to forecast the monthly S&P

500 equity premium.

Joint work with Lu Wanbo and Kris Boudt.

One-Way and Two-Way Matrix Factor Models for

Matrix Sequences

Mingjing Chen

Chongqing Technology and Business University

Abstract: Matrix factor models gain much attention

since they were proposed in 2019. Since the dimensions of rows and columns can be simultaneously

reduced by the same set of factors, the models are

termed two-way matrix factor models. The two-way structure is so prevalent in the literature that

little work has been devoted to the one-way models in

which one dimension is reduced by one set of factors,

the other by another. In this paper, we combine the

one-way and two-way structures in one model and

point out the difficulties in estimating the models. The

first difficulty is that we do not know whether both

factor structures coexist. Even if we know that they do, it is hard to distinguish the roles of the factors. To solve these

problems, we propose a novel method.

Structural Dimension Reduction in Bayesian Networks

Yi Sun

Xinjiang University

Abstract: The paper introduces a novel technique

known as structural dimension reduction to address

the challenges associated with exact probability calculation in Bayesian networks. This method aims to

collapse a Bayesian network onto a minimum and

localized one while ensuring that the probability calculations between the original and reduced networks

remain consistent.

An efficient polynomial-time algorithm is devised to identify the minimum localized Bayesian

network by determining the unique d-convex hull

containing the variables of interest from the original

network. Experiments demonstrate that the probabilistic inference method based on d-convex hulls significantly improves efficiency while ensuring that the

inference results are consistent with traditional methods such as variable elimination and belief propagation algorithms.

Two-Stage Gene Selection of Prior Information

Fusion and Its Application in Omics Data Mining

融合先验信息的两阶段基因选择及其在组学数据

挖掘中的应用

Pei Wang

Henan University

Abstract: Feature selection for omics data has been widely used to identify cancer driver genes. Although a range of methods has been proposed, few existing gene-selection methods use known cancer driver genes as prior knowledge. This talk introduces two classes of two-stage gene-selection methods recently proposed by our group that fuse such prior information, together with their applications. In the first class, the information carried by the prior genes is first combined into a small number of composite variables via group LASSO, principal component analysis, or factor analysis; these composite variables are then used as responses in a series of LASSO-penalized regression models to screen the key genes. In the second class, each prior gene is first used in turn as the response in a LASSO-penalized regression model to screen candidate genes; the candidate genes are then used as covariates in a LASSO-penalized logistic regression model, which selects key genes and classifies samples simultaneously. Simulations on various cancer omics data, including transcriptomic and single-cell data, and comparisons with many existing methods show that the proposed methods effectively select cancer-informative genes, markedly improve the accuracy of gene selection and sample classification, and are robust. The methods presented in this talk can identify a broader set of genes related to a specific disease or biological process more accurately, deepening the understanding of disease onset and progression and providing a more reliable basis for diagnosis, treatment, and drug development.

Power-Enhanced Projection Test for High-Dimensional Mean Vectors

Xia Chen

Shaanxi Normal University

Abstract: The projection test has been widely researched and applied as an effective path to solve the

high-dimensional mean vector test problem. In this

paper, for the problem of compensating the computational difficulty of the power loss method in the

high-dimensional case due to the data-splitting procedure, we propose a method called power-enhanced

projection test (PPT), which adds an extra power enhancement term to the original projection test statistic

to make up for the lost power, and thus does not have

to estimate the optimal projection method multiple

times. Theoretically we show that this term is asymptotically zero under the null hypothesis and therefore

does not affect the control of Type I errors. The conclusions are verified in numerical studies.

Joint work with Yang Liu.

July 13, 16:00-17:40

Invited Session IS048: Recent Advances in Deep

Learning Theory

深度学习理论最新进展

Self-Supervised Transfer Learning

Yuling Jiao

Wuhan University

Abstract: Self-supervised transfer learning (SSTL)

has emerged as a powerful technique for learning data

representations using unlabeled data, which can subsequently be leveraged for improving performance on

downstream supervised learning tasks. In this talk, we

present a novel SSTL method that surpasses existing

approaches in achieving outstanding classification

accuracy on CIFAR-10 and CIFAR-100. We provide

theoretical insights into the transferability of learned

representations. Our analysis yields a non-asymptotic

error bound that quantifies the performance of downstream classification using the acquired representations. By understanding the underlying principles and

mechanisms, we shed light on how and why unlabeled

data can effectively transfer knowledge to enhance

downstream classification tasks.

A Statistical Theory of Regularization-Based Continual Learning

Wei Lin

Peking University

Abstract: We give a statistical analysis of regularization-based continual learning on a sequence of linear

regression tasks, with emphasis on how different regularization terms affect the model performance. We

first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously. Next, we consider a family of generalized ℓ2-regularized algorithms indexed by matrix-valued

hyperparameters, which includes the minimum norm

estimator and continual ridge regression as special

cases. As more tasks are introduced, we derive an

iterative update formula for the estimation error of

generalized ℓ2-regularized estimators, based on

which we determine the hyperparameter resulting in

the optimal algorithm. Interestingly, the choice of

hyperparameter can harmoniously balance the

trade-off between the backward and forward

knowledge transfer and adjust for distribution heterogeneity. Moreover, the estimation error of the optimal

algorithm is derived explicitly, which is of the same

order as that of the oracle estimator. In contrast, our

analysis of the minimax lower bound for the minimum

norm estimator shows its suboptimality. A byproduct

of our theoretical analysis is the equivalence between

early stopping and generalized ℓ2-regularization instead of conventional ridge regression in continual

learning, which can be of independent interest. Finally,

we conduct experiments to complement our theory.

Joint work with Xuyang Zhao, Huiyuan Wang and

Weiran Huang.
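As a worked illustration in our own notation (not necessarily the paper's parametrization): with task-\(t\) data \((X_t, y_t)\) and a positive semidefinite hyperparameter matrix \(\Lambda_t\), a generalized \(\ell_2\)-regularized continual estimator shrinks toward the previous task's estimate,

\[
\hat\beta_t \;=\; \operatorname*{arg\,min}_{\beta}\; \|y_t - X_t\beta\|_2^2 \;+\; (\beta-\hat\beta_{t-1})^{\top}\Lambda_t\,(\beta-\hat\beta_{t-1}),
\]

so that \(\Lambda_t=\lambda I\) gives continual ridge regression, while a vanishing \(\Lambda_t\) corresponds to the minimum norm estimator; allowing matrix-valued \(\Lambda_t\) is what lets the backward/forward knowledge transfer trade-off be tuned.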

Towards a Statistical Understanding of Deep Neural Network: Beyond the Neural Tangent Kernel

Theory

Qian Lin

Tsinghua University

Abstract: In understanding the generalization ability

of neural networks, two primary approaches have

emerged: the Hölder theory, which overlooks the dynamical aspects of neural networks, and the neural

tangent kernel theory, tailored for wide neural networks. We first provide a succinct review of recent

advancements in these theories, elucidating their implications. We then briefly discuss the challenges and

some possible existing solutions, such as the recent

'one-step' analysis of the dynamical properties of neural networks. If time permits, we will introduce the

'adaptive kernel approach', a potential theory for explaining the effectiveness of neural networks.

Joint work with Jianfa Lai, Weihao Lu, Haobo Zhang,

Yicheng Li, and Guhan Chen.

Deep Nonlinear Sufficient Dimension Reduction

Zhou Yu

East China Normal University

Abstract: Linear sufficient dimension reduction, as

exemplified by sliced inverse regression, has seen

substantial development in the past thirty years.

However, with the advent of more complex scenarios,

nonlinear dimension reduction has gained consider-


able interest recently. This article introduces a novel

method for nonlinear sufficient dimension reduction,

utilizing the generalized martingale difference divergence measure in conjunction with deep neural networks. The optimal solution of the proposed objective

function is shown to be unbiased at the general level

of σ-fields. Two optimization schemes, built on deep neural networks, exhibit higher efficiency and flexibility compared with the classical

eigen decomposition of linear operators. Moreover,

we systematically investigate the slow rate and fast

rate for the estimation error based on advanced

U-process theory. Remarkably, the fast rate is nearly

minimax optimal. The validity of our deep nonlinear

sufficient dimension reduction methods is demonstrated through simulations and real data analysis.
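For orientation, the scalar-response martingale difference divergence on which the generalized measure builds characterizes conditional mean independence: writing \((X',Y')\) for an independent copy of \((X,Y)\),

\[
\mathrm{MDD}(Y\mid X)^2 \;=\; -\,\mathbb{E}\bigl[(Y-\mathbb{E}Y)(Y'-\mathbb{E}Y')\,\|X-X'\|\bigr] \;\ge\; 0,
\]

with equality to zero if and only if \(\mathbb{E}(Y\mid X)=\mathbb{E}(Y)\) almost surely; the exact conventions of the generalized, multivariate version used in the talk may differ.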

Invited Session IS082: Trustworthy AI

A Network-Based Decentralization Scheme for

Recommender Systems

Xuan Bi

University of Minnesota

Abstract: Recommender systems have witnessed

significant advancements in the past decade, impacting billions of people worldwide. However, these

systems often collect vast amounts of personal data,

raising concerns about privacy. To address these issues, federated methods have emerged, allowing

models to be trained without sharing users' personal

data with a central server. Despite these advancements,

existing federated methods encounter challenges related to centralized bottlenecks and model aggregation

between users. In this study, we present a fully decentralized federated learning approach, wherein each

user's model is optimized using their own data and

gradients transferred from their neighboring models.

This ensures that personal data remains distributed

and eliminates the necessity for central server-side

aggregation or model merging steps. Empirical experiments demonstrate that our approach achieves a

significant improvement in accuracy compared to

other decentralized methods across various network

structures.
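A minimal sketch of this kind of server-free update, assuming a fixed ring of neighbors and a simple least-squares objective per user; the combination rule and all names are illustrative, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, d = 6, 8
# Ring graph: each user exchanges gradients only with two neighbors.
neighbors = {u: [(u - 1) % n_users, (u + 1) % n_users] for u in range(n_users)}

# Private data stays with each user (synthetic least-squares tasks here).
X = [rng.normal(size=(20, d)) for _ in range(n_users)]
y = [x @ rng.normal(size=d) + 0.1 * rng.normal(size=20) for x in X]
theta = [np.zeros(d) for _ in range(n_users)]

def grad(u, w):
    """Local least-squares gradient for user u at parameters w."""
    return X[u].T @ (X[u] @ w - y[u]) / len(y[u])

lr = 0.1
for _ in range(200):
    # Each user combines its own gradient with gradients received from
    # neighbors, evaluated at the neighbors' current models; there is no
    # central server and no model-merging step.
    msgs = {u: grad(u, theta[u]) for u in range(n_users)}
    for u in range(n_users):
        g = msgs[u] + sum(msgs[v] for v in neighbors[u])
        theta[u] = theta[u] - lr * g / (1 + len(neighbors[u]))
print(np.round(theta[0], 2))
```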

Is Knowledge All Large Language Models Needed

for Causal Reasoning?

Hengrui Cai

University of California, Irvine

Abstract: This paper explores the causal reasoning of

large language models (LLMs) to enhance their interpretability and reliability in advancing artificial intelligence. Despite the proficiency of LLMs in a range of

tasks, their potential for understanding causality requires further exploration. We propose a novel causal

attribution model that utilizes "do-operators" for constructing counterfactual scenarios, allowing us to systematically quantify the influence of input numerical

data and LLMs' pre-existing knowledge on their causal reasoning processes. Our newly developed experimental setup assesses LLMs' reliance on contextual

information and inherent knowledge across various

domains. Our evaluation reveals that LLMs' causal

reasoning ability depends on the context and domain-specific knowledge provided, and supports the

argument that \"knowledge is, indeed, what LLMs

principally require for sound causal reasoning\". On

the contrary, in the absence of knowledge, LLMs still

maintain a degree of causal reasoning using the available numerical data, albeit with limitations in the calculations.

On Tracking Structural Changes in Dynamic Heterogeneous Networks

Junhui Wang

The Chinese University of Hong Kong

Abstract: Dynamic networks consist of a sequence of

time-varying heterogeneous networks, and it is of

great importance to detect the structural changes.

Most existing methods focus on detecting abrupt network changes, necessitating the assumption that the

underlying network probability matrix remains constant between adjacent change points. This assumption can be overly strict in many real-life scenarios

due to their versatile network dynamics. In this talk,

we introduce a new subspace tracking method to detect network structural changes in dynamic networks,

whose network connection probabilities may still

undergo continuous changes. Particularly, two new


detection statistics are proposed to jointly detect the

network structural changes, followed by a carefully

refined detection procedure. Theoretically, we show

that the proposed method is asymptotically consistent

in terms of detecting the network structural changes,

and also establish the impossibility region in a minimax fashion. The advantage of the proposed method

is supported by extensive numerical experiments on

both synthetic networks and a series of UK politician

social networks.

PhiBE: A Physics-Informed Bellman Equation for

Continuous Time Reinforcement Learning

Yuhua Zhu

University of California, Los Angeles

Abstract: In this talk, we address the problem of continuous-time reinforcement learning in scenarios

where the dynamics follow a stochastic differential

equation. When the underlying dynamics remain unknown and we have access only to discrete-time information, how can we effectively conduct policy

evaluation? We first highlight that the commonly used

Bellman equation is not always a reliable approximation to the true value function. We then introduce

PhiBE, a PDE-based Bellman equation that offers a

more accurate approximation to the true value function, especially in scenarios where the underlying

dynamics change slowly. Moreover, we extend PhiBE

to higher orders, providing increasingly accurate approximations. Additionally, we present a model-free

algorithm to solve PhiBE when only discrete-time

trajectory data is available. Numerical experiments are

provided to validate the theoretical guarantees we

propose.
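For context, in our notation (this is the standard continuous-time setting rather than the PhiBE equation itself): with dynamics \(ds_t = b(s_t)\,dt + \sigma(s_t)\,dW_t\), reward \(r\), and discount rate \(\beta>0\), policy evaluation targets

\[
V(s) \;=\; \mathbb{E}\Bigl[\int_0^{\infty} e^{-\beta t}\, r(s_t)\,dt \,\Big|\, s_0=s\Bigr],
\]

while the commonly used discrete-time Bellman equation \(V(s)\approx r(s)\,\Delta t + e^{-\beta\Delta t}\,\mathbb{E}[V(s_{\Delta t})\mid s_0=s]\) treats the process as if it moved only every \(\Delta t\); the gap between the two is the approximation error that PhiBE is designed to reduce when only discrete-time data are available.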

Invited Session IS021: Foundation Models in

Large-Scale Biomedical Studies

RWE-GPT: A Large Scale Pretrained Foundation

Model with Negative Control Outcomes for Debiased Real-World Evidence Generation

Yong Chen

University of Pennsylvania

Abstract: Generative pre-trained transformers (GPTs),

a type of \"Foundation Model,\" have revolutionized

natural language processing (NLP) with their versatility in diverse downstream tasks. However, current

GPTs and NLPs are not yet focused on generating

causal evidence from real-world data (RWD). Here,

we present RWE-GPT, our effort in developing the

first GPT model designed

specifically for real-world evidence generation using a

wide range of causal inference frameworks. In addition to the capability of learning relationships among

variables to infer the causal quantities of interest, our

RWE-GPT is equipped with bias correction features to

account for various systematic biases within RWD,

such as selection bias and unmeasured confounding

bias. Specifically, our RWE-GPT adopts a novel pretraining strategy that incorporates the negative control

outcomes, which can be leveraged by classical statistical methods via fine-tuning. We present the theoretical guarantee for RWE-GPT, which justifies the identification of a wide range of causal estimands even if

there are unmeasured confounders.

We develop and evaluate our RWE-GPT using

data from three institutions: the University of Pennsylvania,

University of Florida, and Yale University, with the

RWD derived from more than 20 million patients. The

RWE-GPT demonstrated robust performance across a

wide range of downstream tasks for causal estimation,

even without being explicitly tailored to the cohorts

during the pre-training stage. These features of

RWE-GPT demonstrate the usefulness of ML methods

in applying foundation models for causal inference in

real-world evidence generation, with applications to

drug repositioning and counterfactual modeling.

Integrated Analysis and Mining of Large-Scale

Biomedical Data

Xueqin Wang

University of Science and Technology of China

Abstract: The proliferation of biomedical cohort data

presents unprecedented opportunities for understanding disease mechanisms, risk factors, and prognostic

markers. However, traditional research approaches

often have a limited scope, hindering the exploration

of complex inter-disease relationships and comprehensive risk factor analysis. To address these chal-


lenges, we introduce UKBFound, a robust and holistic

foundation model leveraging multimodal data from

the UK Biobank, encompassing fundamental, lifestyle,

measurement, environmental, genetic, and imaging

information. This model is designed for the individual

prediction and health risk assessment of 1,560 diseases, offering significant improvements in predictive

accuracy by accounting for multimorbidity mechanisms. Notably, UKBFound enhances risk assessment

performance in over 95.2% of the 21 major disease

categories. UKBFound uncovers intricate connections

among risk factors and disease pathways by enabling

simultaneous prediction and assessment of multiple

diseases, providing a comprehensive perspective on

health risk and multimorbidity. Future work includes

deep-diving into psychiatric, gastrointestinal, respiratory, and circulatory system diseases, developing a

pre-trained model platform for retinal image diagnosis,

supporting clinical applications, and advancing

healthcare diagnostics.

Development of A Large Language Model for

Electronic Health Record Information Extraction

Sheng Yu

Tsinghua University

Abstract: Information extraction from electronic

health records (EHR) is fundamental to enabling biomedical big data studies and all sorts of healthcare AI

applications. Conventionally, EHR information extraction has been challenging and divided into many

subtasks, such as named entity recognition, narrative

status classification, relation classification, etc., using

deep learning models or rule-based methods. Question-answering (QA)-based methods have also been

proposed, but the base models limited their capabilities. The technological revolution brought by large

language models (LLMs) has great potential to improve EHR analysis significantly. In this work, we

develop a new LLM specifically for EHR information

extraction. The first ability of the model is to generate

structured annotation for EHRs in a single run, which

includes named entity recognition, semantic classification, narrative status classification, location extraction, value and unit extraction, and purpose analysis.

The second ability is to perform information extraction via QA, which is for scenarios where the user

needs more complicated information than what structured annotation provides. We introduce how the

model is trained and provide preliminary evaluation

results.

Joint work with Hongyi Yuan, Huaiyuan Ying and

Tianxi Cai.

Invited Session IS077: Statistical Theory and

Learning

Enveloped Huber Regression

Le Zhou

Hong Kong Baptist University

Abstract: Huber regression (HR) is a popular flexible

alternative to the least squares regression when the

error follows a heavy-tailed distribution. We propose a

new method called the enveloped Huber regression

(EHR) by considering the envelope assumption that

there exists some subspace of the predictors that has

no association with the response, which is referred to

as the immaterial part. More efficient estimation is

achieved via the removal of the immaterial part. Different from the envelope least squares (ENV) model

whose estimation is based on maximum normal likelihood, the estimation of the EHR model is through

Generalized Method of Moments. The asymptotic

normality of the EHR estimator is established, and it

is shown that EHR is more efficient than HR. Moreover, EHR is more efficient than ENV when the error

distribution is heavy-tailed, while maintaining a small

efficiency loss when the error distribution is normal.

Moreover, our theory also covers the heteroscedastic

case in which the error may depend on the covariates.

The envelope dimension in EHR is a tuning parameter

to be determined by the data in practice. We further

propose a novel generalized information criterion

(GIC) for dimension selection and establish its consistency. Extensive simulation studies confirm the

messages from our theory. EHR is further illustrated

on a real dataset.

Joint work with R. Dennis Cook, Hui Zou.
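For reference, the Huber loss underlying HR, with robustification parameter \(\delta>0\), is

\[
\ell_{\delta}(r)=
\begin{cases}
\tfrac{1}{2}r^{2}, & |r|\le \delta,\\[2pt]
\delta|r|-\tfrac{1}{2}\delta^{2}, & |r|>\delta,
\end{cases}
\]

which is quadratic for small residuals and grows only linearly in the tails, giving the robustness to heavy-tailed errors mentioned above.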

Implicit Generative Prior for Bayesian Neural


Networks

Xiao Wang

Purdue University

Abstract: Predictive uncertainty quantification is

crucial for reliable decision-making in various applied

domains. Bayesian neural networks offer a powerful

framework for this task. However, defining meaningful priors and ensuring computational efficiency remain significant challenges, especially for complex

real-world applications. This paper addresses these

challenges by proposing a novel neural adaptive empirical Bayes (NA-EB) framework. NA-EB leverages

a class of implicit generative priors derived from

low-dimensional distributions. This allows for efficient handling of complex data structures and effective capture of underlying relationships in real-world

datasets. The proposed NA-EB framework combines

variational inference with a gradient ascent algorithm.

This enables simultaneous hyperparameter selection

and approximation of the posterior distribution, leading to improved computational efficiency. We establish the theoretical foundation of the framework

through posterior and classification consistency. We

demonstrate the practical applications of our framework through extensive evaluations on a variety of

tasks, including the two-spiral problem, regression, 10

UCI datasets, and image classification tasks on both

MNIST and CIFAR-10 datasets. The results of our

experiments highlight the superiority of our proposed

framework over existing methods, such as sparse variational Bayesian and generative models, in terms of

prediction accuracy and uncertainty quantification.

Joint work with Yijia Liu.

Estimation of Over-parameterized Models from an

Auto-Modeling Perspective

Chuanhai Liu

Purdue University

Abstract: From a model-building perspective, we

propose a paradigm shift for fitting

over-parameterized models. Philosophically, the

mindset is to fit models to future observations rather

than to the observed sample. Technically, given an

imputation method to generate future observations, we

fit over-parameterized models to these future observations by optimizing an approximation of the desired

expected loss function based on its sample counterpart

and an adaptive duality function. The required imputation method is also developed using the same estimation technique with an adaptive m-out-of-n bootstrap approach. We illustrate its applications with the

many-normal-means problem, n < p linear regression,

and neural network-based image classification of

MNIST digits. The numerical results demonstrate its

superior performance across these diverse applications.

While this talk is primarily expository, the corresponding paper

conducts an in-depth investigation into the theoretical

aspects of the topic. It concludes with remarks on

some open problems.

Joint work with Yiran Jiang.

Invited Session IS027: Innovative Statistical

Methods for Heterogeneous Data

Observed Best Prediction for Small Area Counts

Jiming Jiang

University of California, Davis

Abstract: We extend the observed best prediction

(OBP; Jiang, Nguyen, and Rao 2011) method to small

area estimation when the responses are counts at the

area level. We show via a simulation study that the

OBP outperforms the empirical best prediction method when the underlying model is misspecified. A

challenging problem has to do with assessing uncertainty under the potential model misspecification. We

propose a computationally oriented approach that

leads to a second-order unbiased estimator of the

mean squared prediction error of the OBP that is

computationally easy to operate. A real data example

is considered. This work is joint with Senke Chen of

the Wells Fargo Bank, USA and Thuan Nguyen of the

Oregon Health and Science University, USA.

Estimation of Low Rank High-Dimensional Multivariate Linear Models for Multi-Response Data

Wenyang Zhang

University of Macau

Abstract: In this talk, I will focus on low rank

high-dimensional multivariate linear models


(LRMLM) for high-dimensional multi-response data.

I will present an intuitively appealing estimation approach together with an implementation algorithm. I

will show the asymptotic properties of the estimation

method to justify the estimation procedure theoretically. Intensive simulation study results will be presented to demonstrate the performance of the proposed method when the sample size is finite, and a

comparison will be made with some popular methods

from the literature. I will show the proposed estimator

outperforms all of the alternative methods under various circumstances. Finally, I will apply the LRMLM

together with the proposed estimation to analyze an

environmental dataset and predict concentrations of

PM2.5 at the locations concerned. I will illustrate

how the proposed method provides more accurate

predictions than the alternative approaches.

Minimax Regret Learning for Data with Heterogeneous Sub-populations

Weibin Mo

Purdue University

Abstract: Modern complex datasets often consist of

various sub-populations. To develop robust and generalizable methods in the presence of sub-population

heterogeneity, it is important to guarantee a uniform

learning performance instead of an average one. In

many applications, prior information is often available

on which sub-population or group the data points

belong to. Given the observed groups of data, we

develop a min-max-regret (MMR) learning framework for general supervised learning, which aims to minimize the worst-group regret. Motivated by the

regret-based decision theoretic framework, the proposed MMR is distinguished from the value-based or

risk-based robust learning methods in the existing

literature. The regret criterion features several robustness and invariance properties simultaneously. In

terms of generalizability, we develop the theoretical

guarantee for the worst-case regret over a super-population of the meta data, which incorporates

the observed sub-populations, their mixtures, as well

as other unseen sub-populations that could be approximated by the observed ones. We demonstrate the

effectiveness of our method through extensive simulation studies and an application to kidney transplantation data from hundreds of transplant centers.

Joint work with Weijing Tang, Songkai Xue, Yufeng

Liu, Ji Zhu.
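In schematic form (our notation, not necessarily the paper's): with observed groups \(g=1,\dots,G\), group risk \(R_g(f)=\mathbb{E}_g[\ell(f;Z)]\), and group-wise benchmark \(R_g^{*}=\min_{f'}R_g(f')\), the MMR rule solves

\[
\hat f \;=\; \operatorname*{arg\,min}_{f}\;\max_{1\le g\le G}\;\bigl\{R_g(f)-R_g^{*}\bigr\},
\]

whereas value- or risk-based robust learning minimizes \(\max_g R_g(f)\) itself; subtracting the group-specific benchmark is what yields the robustness and invariance properties mentioned above.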

BinomialRF: Interpretable Combinatoric Efficiency of Random Forests to Identify Biomarker Interactions

Helen Zhang

University of Arizona

Abstract: Identifying critical biomarkers and their

complex interactions is pivotal in biological and genetic research, yet the vast data dimensionality presents significant computational challenges in detecting

these interactions. We introduce a novel wrapper feature selection method named binomial RF, which

leverages the widely used random forest (RF) algorithm to efficiently and scalably pinpoint significant

genes and gene groups, while providing interpretable

features. RF classifiers are favored for their adaptability, robust performance, and capability to select features in high-dimensional datasets. Nonetheless, their

practical application has been limited by their "black box" nature. Building upon the "inclusion frequency" feature ranking, binomialRF formalizes

this concept into a binomial probabilistic framework

to measure feature importance and extends to identify

multi-way nonlinear interactions among biomarkers.

Empirical evaluations, including simulations and validation studies with data from the TCGA and UCI

repositories, demonstrate that binomialRF achieves

substantial computational efficiency gains (ranging from 5- to 300-fold) without compromising the

efficacy of model selection for biomarkers and their

interactions. In clinical case studies, binomialRF has

successfully highlighted relevant pathological molecular mechanisms previously documented in literature,

delivering high precision and recall in classification

tasks, both with individual features and their interactions.
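A rough sketch of the binomial idea applied to inclusion frequencies, using synthetic data; the way the null inclusion probability p0 is calibrated here is a placeholder, not the calibration used by binomialRF.

```python
import numpy as np
from scipy.stats import binom
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n, p, n_trees = 200, 50, 500
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)  # features 0, 1 informative

rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                            random_state=0).fit(X, y)

# Count, for each feature, in how many trees it appears in at least one split.
counts = np.zeros(p, dtype=int)
for tree in rf.estimators_:
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    counts[used] += 1

# Under a no-signal null, treat every feature as equally likely to be used;
# the per-tree inclusion probability is estimated from the forest itself
# (a placeholder calibration) and each count is tested against Binomial(T, p0).
p0 = counts.mean() / n_trees
pvals = binom.sf(counts - 1, n_trees, p0)     # P(Binom(T, p0) >= count)
print(np.argsort(pvals)[:5], np.sort(pvals)[:5])
```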

Invited Session IS040: Novel Applications in Biostatistics


Kernel Ordinary Differential Equation

Xiaowu Dai

University of California, Los Angeles

Abstract: The ordinary differential equation (ODE) is

widely used in modeling biological and physical processes in science. A new reproducing kernel-based

approach is proposed for the estimation and inference

of ODE given noisy observations. The functional

forms in the ODE are not assumed to be known or restricted

to be linear or additive, and pairwise interactions are

allowed. Sparse estimation is performed to select individual functionals and construct confidence intervals for the estimated signal trajectories. The estimation optimality and selection consistency of kernel

ODE are established under both the low-dimensional

and high-dimensional settings, where the number of

unknown functionals can be smaller or larger than the

sample size. The proposal builds upon the smoothing

spline analysis of variance (SS-ANOVA) framework,

but tackles several important problems that are not yet

fully addressed, and extends the existing methods of

dynamic causal modeling.

Estimation of Individualized Combination Treatment Rule

Qi Xu

University of California, Irvine

Abstract: Individualized treatment rules (ITRs) have

been widely applied in many fields such as precision

medicine and personalized marketing. Beyond the

extensive studies on ITRs with binary or multiple

treatments, there is considerable interest in applying

combination treatments to enhance the outcome. In

this talk, I will introduce two estimation methods for individualized combination treatment rules under the

outcome regression and inverse probability weighting

framework. Under the outcome regression framework,

we propose a Double Encoder Model (DEM) which

represents the treatment effects with two parallel neural network encoders. This model enables flexible

choices of function bases of treatment effects, and

improves the estimation efficiency via the parameter-sharing feature of the neural network. Under the

inverse probability weighting framework, we target

the same problem from a multi-label classification perspective, and propose a novel non-convex loss function to replace the intractable 0-1 loss. The proposed

method is Fisher-consistent regardless of the intensity

level of interaction effects among treatments, and

computationally tractable with a difference-of-convex

algorithm. Our findings are corroborated by extensive

simulation studies and real data examples.

Joint work with Xiaoke Cao, Geping Chen, Hanqi

Zeng, Haoda Fu and Annie Qu.

Optimal Transport for Latent Integration with an

Application to Heterogeneous Neuronal Activity

Data

Annie Qu

University of California, Irvine

Abstract: Detecting dynamic patterns of task-specific

responses shared across heterogeneous datasets is an

essential and challenging problem in many scientific

applications in medical science and neuroscience. In

our motivating example of rodent electrophysiological

data, identifying the dynamical patterns in neuronal

activity associated with ongoing cognitive demands

and behavior is key to uncovering the neural mechanisms of memory. One of the greatest challenges in

investigating a cross-subject biological process is that

the systematic heterogeneity across individuals could

significantly undermine the power of existing machine

learning methods to identify the underlying biological

dynamics. In addition, many technically challenging

neurobiological experiments are conducted on only a

handful of subjects where rich longitudinal data are

available for each subject. The low sample sizes of

such experiments could further reduce the power to

detect common dynamic patterns among subjects. In

this paper, we propose a novel heterogeneous data

integration framework based on optimal transport to

extract shared patterns in complex biological processes. The key advantages of the proposed method are

that it can increase statistical power in identifying

common patterns by reducing heterogeneity unrelated

to the signal by aligning the extracted latent spatiotemporal information across subjects. Our approach is

effective even with a small number of subjects, and
