Showing new listings for Monday, 4 August 2025
- [1] arXiv:2508.00089 [pdf, other]
Title: Gradient-Boosted Pseudo-Weighting: Methods for Population Inference from Nonprobability samples
Subjects: Methodology (stat.ME)
Nonprobability samples have rapidly emerged to address time-sensitive priority topics in a variety of fields. While these data are timely, they are prone to selection bias. To mitigate selection bias, a large body of survey research literature has explored the use of propensity score (PS) adjustment methods to enhance the population representativeness of nonprobability samples, using probability-based survey samples as external references. A recent advancement, the 2-step PS-based pseudo-weighting adjustment method (2PS, Li 2024), has been shown to improve upon recent developments with respect to mean squared error. However, the effectiveness of these methods in reducing bias critically depends on the ability of the underlying propensity model to accurately reflect the true selection process, which is challenging with parametric regression. In this study, we propose a set of pseudo-weight construction methods that use gradient boosting methods (GBM) to estimate PSs in 2PS, offering greater flexibility than logistic regression-based methods. We compare the proposed GBM-based pseudo-weights with existing methods, including 2PS. The population mean estimators are evaluated via Monte Carlo simulation studies. We also evaluate the prevalence of various health outcomes, including 15-year mortality, using the 1988-1994 NHANES III as a nonprobability sample and the 1994 NHIS as the reference survey.
- [2] arXiv:2508.00110 [pdf, html, other]
Title: funOCLUST: Clustering Functional Data with Outliers
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Functional data present unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers. An extension of the OCLUST algorithm to the functional setting is proposed to address these issues. The approach leverages the OCLUST framework, creating a robust method to cluster curves and trim outliers. The methodology is evaluated on both simulated and real-world functional datasets, demonstrating strong performance in clustering and outlier identification.
- [3] arXiv:2508.00120 [pdf, html, other]
Title: AdapDISCOM: An Adaptive Sparse Regression Method for High-Dimensional Multimodal Data With Block-Wise Missingness and Measurement Errors
Abdoul O. Diakité, Claudia Moreau, Gleb Bezgin, Nikhil Bhagwat, Pedro Rosa-Neto, Jean-Baptiste Poline, Simon Girard, Amadou Barry, for the Alzheimers Disease Neuroimaging Initiative
Comments: 49 pages, 4 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Multimodal high-dimensional data are increasingly prevalent in biomedical research, yet they are often compromised by block-wise missingness and measurement errors, posing significant challenges for statistical inference and prediction. We propose AdapDISCOM, a novel adaptive direct sparse regression method that simultaneously addresses these two pervasive issues. Building on the DISCOM framework, AdapDISCOM introduces modality-specific weighting schemes to account for heterogeneity in data structures and error magnitudes across modalities. We establish the theoretical properties of AdapDISCOM, including model selection consistency and convergence rates under sub-Gaussian and heavy-tailed settings, and develop robust and computationally efficient variants (AdapDISCOM-Huber and Fast-AdapDISCOM). Extensive simulations demonstrate that AdapDISCOM consistently outperforms existing methods such as DISCOM, SCOM, and CoCoLasso, particularly under heterogeneous contamination and heavy-tailed distributions. Finally, we apply AdapDISCOM to Alzheimers Disease Neuroimaging Initiative (ADNI) data, demonstrating improved prediction of cognitive scores and reliable selection of established biomarkers, even with substantial missingness and measurement errors. AdapDISCOM provides a flexible, robust, and scalable framework for high-dimensional multimodal data analysis under realistic data imperfections.
- [4] arXiv:2508.00163 [pdf, html, other]
Title: Parametric convergence rate of some nonparametric estimators in mixtures of power series distributions
Comments: 49 pages, 7 figures
Subjects: Statistics Theory (math.ST)
We consider the problem of estimating a mixture of power series distributions with infinite support, to which belong very well-known models such as Poisson, Geometric, Logarithmic or Negative Binomial probability mass functions. We consider the nonparametric maximum likelihood estimator (NPMLE) and show that, under very mild assumptions, it converges to the true mixture distribution $\pi_0$ at a rate no slower than $(\log n)^{3/2} n^{-1/2}$ in the Hellinger distance. Recent work on minimax lower bounds suggests that the logarithmic factor in the obtained Hellinger rate of convergence cannot be improved, at least for mixtures of Poisson distributions. Furthermore, we construct two nonparametric estimators that are based on the NPMLE, the weighted least squares and hybrid estimators, and show that they converge to $\pi_0$ at the parametric rate $n^{-1/2}$ in the $\ell_p$-norm ($p \in [1, \infty]$ or $p \in [2, \infty]$). Simulations and a real data application are considered to assess the performance of all estimators we study in this paper and illustrate the practical aspect of the theory. The simulation results show that the NPMLE has the best performance in the Hellinger, $\ell_1$ and $\ell_2$ distances in all scenarios. Finally, to construct confidence intervals of the true mixture probability mass function, both the nonparametric and parametric bootstrap procedures are considered. Their performances are compared with respect to the coverage and length of the resulting intervals.
- [5] arXiv:2508.00167 [pdf, html, other]
Title: Likelihood-free Posterior Density Learning for Uncertainty Quantification in Inference Problems
Subjects: Methodology (stat.ME)
Generative models and those with computationally intractable likelihoods are widely used to describe complex systems in the natural sciences, social sciences, and engineering. Fitting these models to data requires likelihood-free inference methods that explore the parameter space without explicit likelihood evaluations, relying instead on sequential simulation, which comes at the cost of computational efficiency and extensive tuning. We develop an alternative framework called kernel-adaptive synthetic posterior estimation (KASPE) that uses deep learning to directly reconstruct the mapping between the observed data and a finite-dimensional parametric representation of the posterior distribution, trained on a large number of simulated datasets. We provide theoretical justification for KASPE and a formal connection to the likelihood-based approach of expectation propagation. Simulation experiments demonstrate KASPE's flexibility and performance relative to existing likelihood-free methods including approximate Bayesian computation in challenging inferential settings involving posteriors with heavy tails, multiple local modes, and over the parameters of a nonlinear dynamical system.
- [6] arXiv:2508.00176 [pdf, html, other]
Title: New Pilot-Study Design in Functional Data Analysis
Subjects: Methodology (stat.ME); Applications (stat.AP)
Efficient data collection is essential in applied studies where frequent measurements are costly, time-consuming, or burdensome. This challenge is especially pronounced in functional data settings, where each subject is observed at only a few time points due to practical constraints. Most existing design approaches focus on selecting optimal time points for individual subjects, typically relying on model parameters estimated from a pilot study. However, the design of the pilot study itself has received limited attention. We propose a framework for constructing pilot-study designs that support both accurate trajectory recovery and effective planning of future designs. A search algorithm is developed to generate such high-quality pilot-study designs. Simulation studies and a real data application demonstrate that our approach outperforms commonly used alternatives, highlighting its value in resource-limited settings.
- [7] arXiv:2508.00200 [pdf, other]
Title: Predicting Formula 1 Race Outcomes: Decomposing the Roles of Drivers and Constructors through Linear Modeling
Comments: 26 pages, 12 figures, 9 tables
Subjects: Applications (stat.AP)
Formula 1 performance is a combination of the car's ability and the driver's ability. While a given race or season can tell you how well a car and driver performed jointly, isolating the individual impact of the driver and constructor remains challenging. This paper extends a Regularized Adjusted Plus Minus (RAPM) methodology (Sill 2010), commonly used in basketball and hockey, to parse out individual driver and constructor impact. It employs a time-decayed ridge regression with LOESS (Jacoby 2000) smoothing to predict race results for the Hybrid Engine Era (2014 - 2024). By measuring the constructor and driver coefficients over time, we measure the relative individual impact of driver and constructor throughout the period. Results show that constructors explain 64.0% of the variance in race outcomes in the Hybrid Engine Era. Additionally, constructors have increased importance in benchmarked rank-agnostic cohorts (e.g., Top 10 points finishers) and decreased importance in qualifying. By decomposing performance into individual driver and constructor metrics, we create a robust framework for inter-constructor driver comparisons that the Formula 1 points system obfuscates. Our work enhances the understanding of driver and constructor contributions to race success, offering valuable insights for strategic decision-making in Formula 1.
- [8] arXiv:2508.00206 [pdf, html, other]
Title: The hierarchical barycenter: conditional probability simulation with structured and unobserved covariates
Subjects: Methodology (stat.ME); Optimization and Control (math.OC)
This paper presents a new method for conditional probability density simulation. The method is designed to work with unstructured data sets in which the data are not characterized by the same covariates yet share common information. Specific examples considered in the text fall into two main classes: homogeneous data characterized by samples with missing values for the covariates, and data sets divided into two or more groups characterized by covariates that only partially overlap. The methodology is based on the mathematical theory of optimal transport, extending the barycenter problem to the newly defined hierarchical barycenter problem. A new, data-driven numerical procedure for the solution of the hierarchical barycenter problem is proposed, and its advantages over the classical barycenter are illustrated on synthetic and real-world data sets.
- [9] arXiv:2508.00210 [pdf, html, other]
Title: Efficient rare event estimation for multimodal and high-dimensional system reliability via subset adaptive importance sampling
Subjects: Computation (stat.CO); Methodology (stat.ME)
Estimating rare events in complex systems is a key challenge in reliability analysis. The challenge grows in multimodal problems, where traditional methods often rely on a small set of design points and risk overlooking critical failure modes. Further, higher dimensions make the probability mass harder to capture and demand substantially larger sample sizes to estimate failures. In this work, we propose a new sampling strategy, subset adaptive importance sampling (SAIS), that combines the strengths of subset simulation and adaptive multiple importance sampling. SAIS iteratively refines a set of proposal distributions using weighted samples from previous stages, efficiently exploring complex and high-dimensional failure regions. Leveraging recent advances in adaptive importance sampling, SAIS yields low-variance estimates using fewer samples than state-of-the-art methods and achieves pronounced improvements in both accuracy and computational cost. Through a series of benchmark problems involving high-dimensional, nonlinear performance functions, and multimodal scenarios, we demonstrate that SAIS consistently outperforms competing methods in capturing diverse failure modes and estimating failure probabilities with high precision.
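As a point of reference for the abstract above, the following is a minimal sketch of plain (non-adaptive) importance sampling for a rare event, not the SAIS algorithm itself, whose details the abstract does not give. It assumes a toy problem chosen here for illustration: estimating P(X > 4) for a standard normal X, with a hypothetical proposal N(4, 1) centred on the failure threshold. The function names are illustrative only.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def rare_event_is(threshold, n_samples, seed=0):
    """Estimate P(X > threshold), X ~ N(0, 1), by sampling from N(threshold, 1).

    Each sample that lands in the failure region contributes its
    importance weight target_pdf / proposal_pdf; all others contribute 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # draw from the shifted proposal
        if x > threshold:
            total += normal_pdf(x) / normal_pdf(x, mean=threshold)
    return total / n_samples


def exact_tail(threshold):
    """Exact standard normal tail probability, for comparison."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = rare_event_is(4.0, 20000)
exact = exact_tail(4.0)
```

With the proposal centred on the threshold, roughly half the draws fall in the failure region, whereas naive Monte Carlo from N(0, 1) would see such an event only about three times in 100,000 draws. Adaptive schemes of the kind the abstract describes go further by refining the proposal over several stages.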
- [10] arXiv:2508.00216 [pdf, html, other]
Title: Predictiveness Curve Assessment under Competing Risks for Risk Prediction Models
Subjects: Methodology (stat.ME)
The predictiveness curve is a valuable tool for predictive evaluation, risk stratification, and threshold selection in a target population, given a single biomarker or a prediction model. In the presence of competing risks, regression models are often used to generate predictive risk scores or probabilistic predictions targeting the cumulative incidence function--distinct from the cumulative distribution function used in conventional predictiveness curve analyses. We propose estimation and inference procedures for the predictiveness curve with a competing risks regression model, to display the relationship between the cumulative incidence probability and the quantiles of model-based predictions. The estimation procedure combines cross-validation with a flexible regression model for tau-year event risk given the model-based risk score, with corresponding inference procedures via perturbation resampling. The proposed methods perform satisfactorily in simulation studies and are implemented through an R package. We apply the proposed methods to a cirrhosis study to depict the predictiveness curve with model-based predictions for liver-related mortality.
- [11] arXiv:2508.00223 [pdf, html, other]
Title: Structural Causal Models for Extremes: an Approach Based on Exponent Measures
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We introduce a new formulation of structural causal models for extremes, called the extremal structural causal model (eSCM). Unlike conventional structural causal models, where randomness is governed by a probability distribution, eSCMs use an exponent measure--an infinite-mass law that naturally arises in the analysis of multivariate extremes. Central to this framework are activation variables, which abstract the single-big-jump principle, along with additional randomization that enriches the class of eSCM laws. This formulation encompasses all possible laws of directed graphical models under the recently introduced notion of extremal conditional independence. We also identify an inherent asymmetry in eSCMs under natural assumptions, enabling the identifiability of causal directions, a central challenge in causal inference. Finally, we propose a method that utilizes this causal asymmetry and demonstrate its effectiveness in both simulated and real datasets.
- [12] arXiv:2508.00247 [pdf, html, other]
Title: Sinusoidal Approximation Theorem for Kolmogorov-Arnold Networks
Comments: 15 pages, 3 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
The Kolmogorov-Arnold representation theorem states that any continuous multivariable function can be exactly represented as a finite superposition of continuous single variable functions. Subsequent simplifications of this representation involve expressing these functions as parameterized sums of a smaller number of unique monotonic functions. These developments led to the proof of the universal approximation capabilities of multilayer perceptron networks with sigmoidal activations, forming the alternative theoretical direction of most modern neural networks.
Kolmogorov-Arnold Networks (KANs) have been recently proposed as an alternative to multilayer perceptrons. KANs feature learnable nonlinear activations applied directly to input values, modeled as weighted sums of basis spline functions. This approach replaces the linear transformations and sigmoidal post-activations used in traditional perceptrons. Subsequent works have explored alternatives to spline-based activations. In this work, we propose a novel KAN variant by replacing both the inner and outer functions in the Kolmogorov-Arnold representation with weighted sinusoidal functions of learnable frequencies. Inspired by simplifications introduced by Lorentz and Sprecher, we fix the phases of the sinusoidal activations to linearly spaced constant values and provide a proof of its theoretical validity. We also conduct numerical experiments to evaluate its performance on a range of multivariable functions, comparing it with fixed-frequency Fourier transform methods and multilayer perceptrons (MLPs). We show that it outperforms the fixed-frequency Fourier transform and achieves comparable performance to MLPs.
- [13] arXiv:2508.00275 [pdf, other]
Title: Factor Augmented Quantile Regression Model
Subjects: Methodology (stat.ME)
Along with the widespread adoption of high-dimensional data, traditional statistical methods face significant challenges in handling problems with high correlation of variables, heavy-tailed distribution, and coexistence of sparse and dense effects. In this paper, we propose a factor-augmented quantile regression (FAQR) framework to address these challenges simultaneously within a unified framework. The proposed FAQR combines the robustness of quantile regression and the ability of factor analysis to effectively capture dependencies among high-dimensional covariates, and also provides a framework to capture dense effects (through common factors) and sparse effects (through idiosyncratic components) of the covariates. To overcome the lack of smoothness of the quantile loss function, convolution smoothing is introduced, which not only improves computational efficiency but also eases theoretical derivation. Theoretical analysis establishes the accuracy of factor selection and consistency in parameter estimation under mild regularity conditions. Furthermore, we develop a Bootstrap-based diagnostic procedure to assess the adequacy of the factor model. Simulation experiments verify the rationality of FAQR in different noise scenarios such as normal and $t_2$ distributions.
- [14] arXiv:2508.00333 [pdf, html, other]
Title: Tensor Elliptical Graphic Model
Subjects: Methodology (stat.ME)
We address the problem of robust estimation of sparse high-dimensional tensor elliptical graphical models. Most existing research focuses on tensor graphical models under normality. To extend the tensor graphical model to heavier-tailed scenarios, we build on the fact that, up to a constant, the spatial-sign covariance matrix approximates the true covariance matrix as the dimension tends to infinity under a tensor elliptical distribution. We propose a spatial-sign-based estimator to robustly estimate the tensor elliptical graphical model; its rate matches the existing rate under normality while holding for a wider family of distributions, namely elliptical distributions. We also conduct extensive simulations and real data applications to illustrate the practical utility of the proposed methods, especially under heavy-tailed distributions.
- [15] arXiv:2508.00411 [pdf, html, other]
Title: Predictive information criterion for jump diffusion processes
Subjects: Statistics Theory (math.ST)
In this paper, we address a model selection problem for ergodic jump diffusion processes based on high-frequency samples. We evaluate the expected genuine log-likelihood function and derive an Akaike-type information criterion. In the derivation process, we also give new estimates of the transition density of jump diffusion processes.
- [16] arXiv:2508.00617 [pdf, html, other]
Title: Constructive Disintegration and Conditional Modes
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Conditioning, the central operation in Bayesian statistics, is formalised by the notion of disintegration of measures. However, due to the implicit nature of their definition, constructing disintegrations is often difficult. A folklore result in machine learning conflates the construction of a disintegration with the restriction of probability density functions onto the subset of events that are consistent with a given observation. We provide a comprehensive set of mathematical tools which can be used to construct disintegrations and apply these to find densities of disintegrations on differentiable manifolds. Using our results, we provide a disturbingly simple example in which the restricted density and the disintegration density drastically disagree. Motivated by applications in approximate Bayesian inference and Bayesian inverse problems, we further study the modes of disintegrations. We show that the recently introduced notion of a "conditional mode" does not coincide in general with the modes of the conditional measure obtained through disintegration, but rather the modes of the restricted measure. We also discuss the implications of the discrepancy between the two measures in practice, advocating for the utility of both approaches depending on the modelling context.
- [17] arXiv:2508.00696 [pdf, html, other]
Title: Online Rolling Controlled Sequential Monte Carlo
Subjects: Computation (stat.CO)
We introduce methodology for real-time inference in general-state-space hidden Markov models. Specifically, we extend recent advances in controlled sequential Monte Carlo (CSMC) methods, originally proposed for offline smoothing, to the online setting via a rolling window mechanism. Our novel online rolling controlled sequential Monte Carlo (ORCSMC) algorithm employs two particle systems to simultaneously estimate twisting functions and perform filtering, ensuring real-time adaptivity to new observations while maintaining bounded computational cost. Numerical results on linear-Gaussian, stochastic volatility, and neuroscience models demonstrate improved estimation accuracy and robustness in higher dimensions, compared to standard particle filtering approaches. The method offers a statistically efficient and practical solution for sequential and real-time inference in complex latent variable models.
- [18] arXiv:2508.00770 [pdf, other]
Title: On admissibility in post-hoc hypothesis testing
Comments: 56 pages
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
The validity of classical hypothesis testing requires the significance level $\alpha$ be fixed before any statistical analysis takes place. This is a stringent requirement. For instance, it prohibits updating $\alpha$ during (or after) an experiment due to changing concern about the cost of false positives, or to reflect unexpectedly strong evidence against the null. Perhaps most disturbingly, witnessing a p-value $p\ll\alpha$ vs $p\leq \alpha$ has no (statistical) relevance for any downstream decision-making. Following recent work of Grünwald (2024), we develop a theory of post-hoc hypothesis testing, enabling $\alpha$ to be chosen after seeing and analyzing the data. To study "good" post-hoc tests we introduce $\Gamma$-admissibility, where $\Gamma$ is a set of adversaries which map the data to a significance level. A test is $\Gamma$-admissible if, roughly speaking, there is no other test which performs at least as well and sometimes better across all adversaries in $\Gamma$. For point nulls and alternatives, we prove general properties of any $\Gamma$-admissible test for any $\Gamma$ and show that they must be based on e-values. We also classify the set of admissible tests for various specific $\Gamma$.
- [19] arXiv:2508.00824 [pdf, other]
Title: Local Poisson Deconvolution for Discrete Signals
Comments: The first two authors contributed equally
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
We analyze the statistical problem of recovering an atomic signal, modeled as a discrete uniform distribution $\mu$, from a binned Poisson convolution model. This question is motivated, among others, by super-resolution laser microscopy applications, where precise estimation of $\mu$ provides insights into spatial formations of cellular protein assemblies. Our main results quantify the local minimax risk of estimating $\mu$ for a broad class of smooth convolution kernels. This local perspective enables us to sharply quantify optimal estimation rates as a function of the clustering structure of the underlying signal. Moreover, our results are expressed under a multiscale loss function, which reveals that different parts of the underlying signal can be recovered at different rates depending on their local geometry. Overall, these results paint an optimistic perspective on the Poisson deconvolution problem, showing that accurate recovery is achievable under a much broader class of signals than suggested by existing global minimax analyses. Beyond Poisson deconvolution, our results also allow us to establish the local minimax rate of parameter estimation in Gaussian mixture models with uniform weights.
We apply our methods to experimental super-resolution microscopy data to identify the location and configuration of individual DNA origamis. In addition, we complement our findings with numerical experiments on runtime and statistical recovery that showcase the practical performance of our estimators and their trade-offs.
New submissions (showing 19 of 19 entries)
- [20] arXiv:2507.22786 (cross-list from cs.LG) [pdf, html, other]
Title: DO-EM: Density Operator Expectation Maximization
Comments: Main text: 9 pages 1 Figure. Total: 23 pages 3 Figures
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Density operators, quantum generalizations of probability distributions, are gaining prominence in machine learning due to their foundational role in quantum computing. Generative modeling based on density operator models (DOMs) is an emerging field, but existing training algorithms -- such as those for the Quantum Boltzmann Machine -- do not scale to real-world data, such as the MNIST dataset. The Expectation-Maximization algorithm has played a fundamental role in enabling scalable training of probabilistic latent variable models on real-world datasets. In this paper, we develop an Expectation-Maximization framework to learn latent variable models defined through DOMs on classical hardware, with resources comparable to those used for probabilistic models, while scaling to real-world data. However, designing such an algorithm is nontrivial due to the absence of a well-defined quantum analogue to conditional probability, which complicates the Expectation step. To overcome this, we reformulate the Expectation step as a quantum information projection (QIP) problem and show that the Petz Recovery Map provides a solution under sufficient conditions. Using this formulation, we introduce the Density Operator Expectation Maximization (DO-EM) algorithm -- an iterative Minorant-Maximization procedure that optimizes a quantum evidence lower bound. We show that the DO-EM algorithm ensures non-decreasing log-likelihood across iterations for a broad class of models. Finally, we present Quantum Interleaved Deep Boltzmann Machines (QiDBMs), a DOM that can be trained with the same resources as a DBM. When trained with DO-EM under Contrastive Divergence, a QiDBM outperforms larger classical DBMs in image generation on the MNIST dataset, achieving a 40-60% reduction in the Fréchet Inception Distance.
- [21] arXiv:2508.00040 (cross-list from cs.LG) [pdf, html, other]
Title: Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting
Subjects: Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology integrates regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to daily electricity prices. Each identified regime is subsequently modeled by an independent conditional neural process (CNP), trained to learn localized mappings from input contexts to 24-dimensional hourly price trajectories, with final predictions computed as regime-weighted mixtures of these CNP outputs. We rigorously evaluate R-NP against deep neural networks (DNN) and Lasso estimated auto-regressive (LEAR) models by integrating their forecasts into diverse battery storage optimization frameworks, including price arbitrage, risk management, grid services, and cost minimization. This operational utility assessment revealed complex performance trade-offs: LEAR often yielded superior absolute profits or lower costs, while DNN showed exceptional optimality in specific cost-minimization contexts. Recognizing that raw prediction accuracy doesn't always translate to optimal operational outcomes, we employed TOPSIS as a comprehensive multi-criteria evaluation layer. Our TOPSIS analysis identified LEAR as the top-ranked model for 2021, but crucially, our proposed R-NP model emerged as the most balanced and preferred solution for 2021, 2022 and 2023.
- [22] arXiv:2508.00180 (cross-list from cs.LG) [pdf, html, other]
Title: EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an Exponential moving average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on Language Models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training in a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
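The lag that the abstract attributes to EMA can be illustrated with a short sketch. This shows only the standard exponential moving average of iterates; the BEMA correction is not specified in the abstract and is not reproduced here. The iterate sequence `theta_t = t` and the decay `beta = 0.9` are assumptions chosen to make the lag easy to see.

```python
def ema_trajectory(values, beta):
    """Exponential moving average of a sequence, initialised at the first value."""
    avg = values[0]
    out = [avg]
    for v in values[1:]:
        # Standard EMA update: keep a fraction beta of the old average.
        avg = beta * avg + (1.0 - beta) * v
        out.append(avg)
    return out


# Hypothetical iterates that improve at a constant rate: theta_t = t.
iterates = [float(t) for t in range(200)]
smoothed = ema_trajectory(iterates, beta=0.9)

# For a linearly drifting sequence the EMA trails the latest iterate
# by beta / (1 - beta) steps once the transient has died out.
lag = iterates[-1] - smoothed[-1]
```

For this toy sequence the averaged iterate settles about `beta / (1 - beta) = 9` steps behind the current one, which is the kind of bias from old iterates that the abstract describes.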
- [23] arXiv:2508.00264 (cross-list from cs.LG) [pdf, html, other]
Title: Calibrated Language Models and How to Find Them with Label Smoothing
Comments: Accepted to the Forty-second International Conference on Machine Learning (ICML) 2025. First two authors contributed equally
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, understanding how this impacts confidence calibration for reliable model output has not been researched in full. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown as an effective method to regularize for overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV-LLMs). We posit the cause to stem from the ability to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
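For readers unfamiliar with label smoothing, the following is a minimal sketch of the standard label-smoothed cross-entropy for a single prediction. It is the textbook formulation, assumed here for illustration, and not the paper's memory-efficient kernel. The smoothing weight `epsilon` and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_cross_entropy(logits, target, epsilon):
    """Cross-entropy against the smoothed target distribution

    (1 - epsilon) * one_hot(target) + epsilon / K on every class,
    where K is the number of classes (vocabulary size).
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    nll = -log_probs[target]          # usual one-hot cross-entropy term
    uniform = -sum(log_probs) / k     # cross-entropy against the uniform distribution
    return (1.0 - epsilon) * nll + epsilon * uniform


confident = [10.0, 0.0, 0.0]
plain_loss = smoothed_cross_entropy(confident, target=0, epsilon=0.0)
smooth_loss = smoothed_cross_entropy(confident, target=0, epsilon=0.1)
```

The uniform term penalises very peaked predictions, so `smooth_loss` exceeds `plain_loss` for the confident logits above. This is the regularising effect on overconfidence that the abstract refers to.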
- [24] arXiv:2508.00286 (cross-list from cs.LG) [pdf, other]
-
Title: Toward using explainable data-driven surrogate models for treating performance-based seismic design as an inverse engineering problem
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
This study presents a methodology to treat performance-based seismic design as an inverse engineering problem, where design parameters are directly derived to achieve specific performance objectives. By implementing explainable machine learning models, this methodology directly maps design variables and performance metrics, tackling computational inefficiencies of performance-based design. The resultant machine learning model is integrated as an evaluation function into a genetic optimization algorithm to solve the inverse problem. The developed methodology is then applied to two different inventories of steel and concrete moment frames in Los Angeles and Charleston to obtain sectional properties of frame members that minimize expected annualized seismic loss in terms of repair costs. The results show high accuracy of the surrogate models (e.g., $R^2 > 90\%$) across a diverse set of building types, geometries, seismic design, and site hazard, where the optimization algorithm could identify the optimum values of members' properties for a fixed set of geometric variables, consistent with engineering principles.
- [25] arXiv:2508.00294 (cross-list from math.PR) [pdf, html, other]
-
Title: Formal Power Series Representations in Probability and Expected Utility Theory
Subjects: Probability (math.PR); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH); Logic (math.LO); Statistics Theory (math.ST)
We advance a general theory of coherent preference that surrenders restrictions embodied in orthodox doctrine. This theory enjoys the property that any preference system admits extension to a complete system of preferences, provided it satisfies a certain coherence requirement analogous to the one de Finetti advanced for his foundations of probability. Unlike de Finetti's theory, the one we set forth requires neither transitivity nor Archimedeanness nor boundedness nor continuity of preference. This theory also enjoys the property that any complete preference system meeting the standard of coherence can be represented by utility in an ordered field extension of the reals. Representability by utility is a corollary of this paper's central result, which at once extends Hölder's Theorem and strengthens Hahn's Embedding Theorem.
- [26] arXiv:2508.00542 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Assessing (im)balance in signed brain networks
Authors: Marzio Di Vece, Emanuele Agrimi, Samuele Tatullo, Tommaso Gili, Miguel Ibáñez-Berganza, Tiziano Squartini
Comments: 41 pages, 17 figures, 1 table
Subjects: Physics and Society (physics.soc-ph); Information Theory (cs.IT); Data Analysis, Statistics and Probability (physics.data-an); Medical Physics (physics.med-ph); Methodology (stat.ME)
Many complex systems - be they financial, natural or social - are composed of units - such as stocks, neurons or agents - whose joint activity can be represented as a multivariate time series. An issue of both practical and theoretical importance concerns the possibility of inferring the presence of a static relationship between any two units solely from their dynamic state. The present contribution aims at providing an answer within the frame of traditional hypothesis testing. Briefly speaking, our suggestion is that of linking any two units if they behave in a sufficiently similar way. To achieve such a goal, we project a multivariate time series onto a signed graph, by i) comparing the empirical properties of the former with those expected under a suitable benchmark and ii) linking any two units with a positive (negative) edge in case the corresponding series share a significantly large number of concordant (discordant) values. To define our benchmarks, we adopt an information-theoretic approach that is rooted in the constrained maximisation of Shannon entropy, a procedure inducing an ensemble of multivariate time series that preserves some of the empirical properties on average while randomising everything else. We showcase the possible applications of our method by addressing one of the most timely issues in the domain of neurosciences, i.e. that of determining if brain networks are frustrated or not - and, if so, to what extent. As our results suggest, this is indeed the case, the structure of the negative subgraph being more prone to inter-subject variability than the complementary, positive subgraph. At the mesoscopic level, instead, the minimisation of the Bayesian Information Criterion instantiated with the Signed Stochastic Block Model reveals that brain areas gather into modules aligning with the statistical variant of the Relaxed Balance Theory.
- [27] arXiv:2508.00545 (cross-list from cs.LG) [pdf, html, other]
-
Title: Foundations of Interpretable Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
We argue that existing definitions of interpretability are not actionable in that they fail to inform users about general, sound, and robust interpretable model design. This makes current interpretability research fundamentally ill-posed. To address this issue, we propose a definition of interpretability that is general, simple, and subsumes existing informal notions within the interpretable AI community. We show that our definition is actionable, as it directly reveals the foundational properties, underlying assumptions, principles, data structures, and architectural features necessary for designing interpretable models. Building on this, we propose a general blueprint for designing interpretable models and introduce the first open-sourced library with native support for interpretable data structures and processes.
- [28] arXiv:2508.00658 (cross-list from cs.AI) [pdf, html, other]
-
Title: Multi-Band Variable-Lag Granger Causality: A Unified Framework for Causal Time Series Inference across Frequencies
Comments: First draft
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
Understanding causal relationships in time series is fundamental to many domains, including neuroscience, economics, and behavioral science. Granger causality is one of the well-known techniques for inferring causality in time series. Typically, Granger causality frameworks have a strong fixed-lag assumption between cause and effect, which is often unrealistic in complex systems. While recent work on variable-lag Granger causality (VLGC) addresses this limitation by allowing a cause to influence an effect with different time lags at each time point, it fails to account for the fact that causal interactions may vary not only in time delay but also across frequency bands. For example, in brain signals, alpha-band activity may influence another region with a shorter delay than slower delta-band oscillations. In this work, we formalize Multi-Band Variable-Lag Granger Causality (MB-VLGC) and propose a novel framework that generalizes traditional VLGC by explicitly modeling frequency-dependent causal delays. We provide a formal definition of MB-VLGC, demonstrate its theoretical soundness, and propose an efficient inference pipeline. Extensive experiments across multiple domains demonstrate that our framework significantly outperforms existing methods on both synthetic and real-world datasets, confirming its broad applicability to any type of time series data. Code and datasets are publicly available.
Cross submissions (showing 9 of 9 entries)
- [29] arXiv:1407.0064 (replaced) [pdf, html, other]
-
Title: Zero & $N$-inflated overdispersed binomial models for sum-constrained Poisson count processes
Subjects: Methodology (stat.ME)
A frequent challenge encountered with compositional ecological data is how to interpret and model data with a high proportion of zeros and $N$'s. Such data frequently occur in ecological applications where counts of species are collected until a pre-specified total imposed (typically) by sampling cost is reached. In the bivariate count (two-species) setting we focus on in this article, zero-inflation of one species will result in $N$-inflation of the other. This can lead to species absence being attributed to an unsuitable habitat as opposed to missingness by chance. Similarly, an excess of $N$'s will lead to misleading inferences about habitat preference and abundance estimates. Our contribution is to identify that two independent zero-inflated Poisson processes subject to a sum constraint provide a novel biologically-motivated generating mechanism for the occurrence of binomial count data exhibiting zero and $N$-inflation. We identify an extension to the model to capture additional overdispersion within the data resulting in a novel zero and $N$-inflated beta-binomial model. We consider two motivating datasets, one involving a pesticide treatment for an invasive species, and a second involving the abundance of two plant species. We demonstrate that incorporation of covariates in each case enables learning about sources of zero and $N$-inflation as well as abundance. We show that the models result in improved understanding of underlying biological processes as well as improved predictive performance.
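The link between sum-constrained Poisson counts and binomial data rests on a standard fact, sketched here without zero-inflation. The symbols $X_1$, $X_2$, $\lambda_1$, $\lambda_2$ are notation assumed for illustration and need not match the paper's parameterization.

```latex
% Assumption: X_1 ~ Poisson(\lambda_1) and X_2 ~ Poisson(\lambda_2), independent.
\[
\Pr(X_1 = k \mid X_1 + X_2 = N)
  = \frac{\dfrac{e^{-\lambda_1}\lambda_1^{k}}{k!}\,
          \dfrac{e^{-\lambda_2}\lambda_2^{N-k}}{(N-k)!}}
         {\dfrac{e^{-(\lambda_1+\lambda_2)}(\lambda_1+\lambda_2)^{N}}{N!}}
  = \binom{N}{k}
    \left(\frac{\lambda_1}{\lambda_1+\lambda_2}\right)^{k}
    \left(\frac{\lambda_2}{\lambda_1+\lambda_2}\right)^{N-k},
  \qquad k = 0, 1, \dots, N .
\]
```

Under this sketch, if species 1 is structurally absent while species 2 is not, the constraint forces $X_1 = 0$ and $X_2 = N$, which gives an intuition for why zero-inflation of one species appears as $N$-inflation of the other.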
- [30] arXiv:2206.05161 (replaced) [pdf, html, other]
-
Title: Approximating optimal SMC proposal distributions in individual-based epidemic models
Subjects: Methodology (stat.ME)
Many epidemic models are naturally defined as individual-based models, in which we track the state of each individual within a susceptible population. Inference for individual-based models is challenging due to the high-dimensional state-space of such models, which increases exponentially with population size. We consider sequential Monte Carlo algorithms for inference for individual-based epidemic models where we make direct observations of the state of a sample of individuals. Standard implementations, such as the bootstrap filter or the auxiliary particle filter, are inefficient due to mismatch between the proposal distribution of the state and future observations. We develop new efficient proposal distributions that take account of future observations, leveraging the properties that (i) we can analytically calculate the optimal proposal distribution for a single individual given future observations and the future infection rate of that individual; and (ii) the dynamics of individuals are independent if we condition on their infection rates. Thus we construct estimates of the future infection rate for each individual, and then use an independent proposal for the state of each individual given this estimate. Empirical results show an order-of-magnitude improvement in efficiency of the sequential Monte Carlo sampler for both SIS and SEIR models.
- [31] arXiv:2302.09510 (replaced) [pdf, other]
-
Title: Smooth Backfitting for Additive Hazard Rates
Subjects: Statistics Theory (math.ST)
Smooth backfitting was first introduced in an additive regression setting via a direct projection alternative to the classic backfitting method by Buja, Hastie and Tibshirani. This paper translates the original smooth backfitting concept to a survival model considering an additively structured hazard. The model allows for censoring and truncation patterns occurring in many applications such as medical studies or actuarial reserving. Our estimators are shown to be a projection of the data into the space of multivariate hazard functions with smooth additive components. Hence, our hazard estimator is the closest nonparametric additive fit even if the actual hazard rate is not additive. This is different to other additive structure estimators where it is not clear what is being estimated if the model is not true. We provide full asymptotic theory for our estimators. We propose an implementation of estimators that show good performance in practice.
- [32] arXiv:2310.13826 (replaced) [pdf, other]
-
Title: A p-value for Process Tracing and other N=1 Studies
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
We introduce a method for calculating \(p\)-values to test causal hypotheses in qualitative research à la process tracing. As in an experiment, our \(p\)-value tells us how often one would make the same or more compelling observations favoring one theory while entertaining a rival theory. We adapt Fisher's (1935) randomization-based urn model to the reality of qualitative researchers, who cannot randomize history, but can make observations about historical processes. Our test includes a method of sensitivity analysis which allows researchers to account for the possibility of observation bias, as well as a framework for representing the varying strength of individual pieces of evidence, altogether informing the robustness of qualitative causal inference. We provide simulations and replications of previously published work to illustrate how to execute our test using any type of qualitative data about events that took place within one case. This approach adds to the pluralistic turn in the use of probability theory in theory-testing process tracing by offering a simple model with provable conservatism, while relying on few assumptions the consequences of which can be directly assessed.
- [33] arXiv:2312.01046 (replaced) [pdf, html, other]
-
Title: Bagged Regularized $k$-Distances for Anomaly Detection
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from the sensitivity to the choice of the number of the nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized $k$-distances for anomaly detection (BRDAD), converting the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite sample bound of the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of the hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the incorporated bagging technique in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity. On the practical side, we conduct numerical experiments to illustrate the insensitivity of the parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Furthermore, our method achieves superior performance on real-world datasets with the introduced bagging technique compared to other approaches.
- [34] arXiv:2405.16958 (replaced) [pdf, html, other]
-
Title: Large Deviations of Gaussian Neural Networks with ReLU activation
Comments: 13 pages, 2 figures, proof simplified
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
We prove a large deviation principle for deep neural networks with Gaussian weights and at most linearly growing activation functions, such as ReLU. This generalises earlier work, in which bounded and continuous activation functions were considered. In practice, linearly growing activation functions such as ReLU are most commonly used. We furthermore simplify previous expressions for the rate function and provide a power-series expansion for the ReLU case.
- [35] arXiv:2406.15500 (replaced) [pdf, html, other]
-
Title: Pure interaction effects unseen by Random Forests
Comments: arXiv admin note: substantial text overlap with arXiv:2309.01460
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Random Forests are widely claimed to capture interactions well. However, some simple examples suggest that they perform poorly in the presence of certain pure interactions that the conventional CART criterion struggles to capture during tree construction. Motivated by this, it is argued that simple alternative partitioning schemes used in the tree growing procedure can enhance identification of these interactions. In a simulation study these variants are compared to conventional Random Forests and Extremely Randomized Trees. The results validate that the modifications considered enhance the model's fitting ability in scenarios where pure interactions play a crucial role. Finally, the methods are applied to real datasets.
- [36] arXiv:2407.13971 (replaced) [pdf, html, other]
-
Title: Dimension-reduced Reconstruction Map Learning for Parameter Estimation in Likelihood-Free Inference Problems
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
Many application areas rely on models that can be readily simulated but lack a closed-form likelihood, or an accurate approximation under arbitrary parameter values. Existing parameter estimation approaches in this setting are generally approximate. Recent work on using neural network models to reconstruct the mapping from the data space to the parameters from a set of synthetic parameter-data pairs suffers from the curse of dimensionality, resulting in inaccurate estimation as the data size grows. We propose a dimension-reduced approach to likelihood-free estimation which combines the ideas of reconstruction map estimation with dimension-reduction approaches based on subject-specific knowledge. We examine the properties of reconstruction map estimation with and without dimension reduction and explore the trade-off between approximation error due to information loss from reducing the data dimension and approximation error. Numerical examples show that the proposed approach compares favorably with reconstruction map estimation, approximate Bayesian computation, and synthetic likelihood estimation.
- [37] arXiv:2408.08177 (replaced) [pdf, html, other]
-
Title: Localized Sparse Principal Component Analysis of Multivariate Time Series in Frequency Domain
Comments: 63 pages, 6 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Principal component analysis has been a main tool in multivariate analysis for estimating a low dimensional linear subspace that explains most of the variability in the data. However, in high-dimensional regimes, naive estimates of the principal loadings are inconsistent and difficult to interpret. In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process, particularly if the principal components are interpretable in that they are sparse in coordinates and localized in frequency bands. In this paper, we introduce a formulation and consistent estimation procedure for interpretable principal component analysis for high-dimensional time series in the frequency domain. An efficient frequency-sequential algorithm is developed to compute sparse-localized estimates of the low-dimensional principal subspaces of the signal process. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a study of first episode psychosis.
- [38] arXiv:2408.13392 (replaced) [pdf, html, other]
-
Title: A Multivariate Space-Time Dynamic Model for Characterizing the Atmospheric Impacts Following the Mt Pinatubo Eruption
Subjects: Applications (stat.AP)
The June 1991 Mt. Pinatubo eruption resulted in a massive increase of sulfate aerosols in the atmosphere, absorbing radiation and leading to global changes in surface and stratospheric temperatures. A volcanic eruption of this magnitude serves as a natural analog for stratospheric aerosol injection, a proposed solar radiation modification method to combat a warming climate. The impacts of such an event are multifaceted and region-specific. Our goal is to characterize the multivariate and dynamic nature of the atmospheric impacts following the Mt. Pinatubo eruption. We developed a multivariate space-time dynamic linear model to understand the full extent of the spatially- and temporally-varying impacts. Specifically, spatial variation is modeled using a flexible set of basis functions for which the basis coefficients are allowed to vary in time through a vector autoregressive (VAR) structure. This novel model is cast in a Dynamic Linear Model (DLM) framework and estimated via a customized MCMC approach. We demonstrate how the model quantifies the relationships between key atmospheric parameters prior to and following the Mt. Pinatubo eruption with reanalysis data from MERRA-2 and highlight when such a model is advantageous over univariate models.
- [39] arXiv:2408.13414 (replaced) [pdf, html, other]
-
Title: Selecting fitted models under epistemic uncertainty using a stochastic process on quantile functions
Comments: v3: Updated title. Added comparison matrix to existing model comparison methods. Expanded presentation of calibration experiments, including possible strategies to improve relevance of comparisons. Two new Result figures (8 & 9). Two additional Supplementary figures (2 & 4). 39 pages, 13 figures, 1 table. An online version of this paper is available at this http URL
Subjects: Methodology (stat.ME)
Fitting models to data is an important part of the practice of science. Advances in machine learning have made it possible to fit more -- and more complex -- models, but have also exacerbated a problem: when multiple models fit the data equally well, which one(s) should we pick? The answer depends entirely on the modelling goal. In the scientific context, the essential goal is _replicability_: if a model works well to describe one experiment, it should continue to do so when that experiment is replicated tomorrow, or in another laboratory. The selection criterion must therefore be robust to the variations inherent to the replication process. In this work we develop a nonparametric method for estimating uncertainty on a model's empirical risk when replications are non-stationary, thus ensuring that a model is only rejected when another is _reproducibly_ better. We illustrate the method with two examples: one a more classical setting, where the models are structurally distinct, and a machine learning-inspired setting, where they differ only in the value of their parameters. We show how, in this context of replicability or "epistemic uncertainty", it compares favourably to existing model selection criteria, and has more satisfactory behaviour with large experimental datasets.
- [40] arXiv:2409.01908 (replaced) [pdf, html, other]
-
Title: Bayesian CART models for aggregate claim modeling
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP); Machine Learning (stat.ML)
This paper proposes three types of Bayesian CART (or BCART) models for aggregate claim amount, namely, frequency-severity models, sequential models and joint models. We propose a general framework for the BCART models applicable to data with multivariate responses, which is particularly useful for the joint BCART models with a bivariate response: the number of claims and aggregate claim amount. To facilitate frequency-severity modeling, we investigate BCART models for the right-skewed and heavy-tailed claim severity data by using various distributions. We discover that the Weibull distribution is superior to gamma and lognormal distributions, due to its ability to capture different tail characteristics in tree models. Additionally, we find that sequential BCART models and joint BCART models, which incorporate dependence between the number of claims and average severity, are beneficial and thus preferable to the frequency-severity BCART models in which independence is assumed. The effectiveness of these models' performance is illustrated by carefully designed simulations and real insurance data.
- [41] arXiv:2409.13060 (replaced) [pdf, html, other]
-
Title: Forecasting Causal Effects of Future Interventions: Confounding and Transportability Issues
Subjects: Methodology (stat.ME)
Recent developments in causal inference allow us to transport a causal effect of a time-fixed treatment from a randomized trial to a target population across space but within the same time frame. In contrast to transportability across space, transporting causal effects across time or forecasting causal effects of future interventions is more challenging due to time-varying confounders and time-varying effect modifiers. In this article, we seek to formally clarify the causal estimands for forecasting causal effects over time and the structural assumptions required to identify these estimands. Specifically, we develop a set of novel nonparametric identification formulas--g-computation formulas--for these causal estimands, and lay out the conditions required to accurately forecast causal effects from a past observed sample to a future population in a future time window. Our overarching objective is to leverage the modern causal inference theory to provide a theoretical framework for investigating whether the effects seen in a past sample would carry over to a new future population. Throughout the article, a working example addressing the effect of public policies or social events on COVID-related deaths is considered to contextualize the developments of analytical results.
- [42] arXiv:2411.12479 (replaced) [pdf, html, other]
-
Title: Graph-based Square-Root Estimation for Sparse Linear Regression
Subjects: Methodology (stat.ME); Computation (stat.CO)
Sparse linear regression is one of the classic problems in the field of statistics, which has deep connections and high intersections with optimization, computation, and machine learning. To address the effective handling of high-dimensional data, the diversity of real noise, and the challenges in estimating the standard deviation of the noise, we propose a novel and general graph-based square-root estimation (GSRE) model for sparse linear regression. Specifically, we use a square-root loss function to encourage the estimators to be independent of the unknown standard deviation of the error terms and design a sparse regularization term by using the graphical structure among predictors in a node-by-node form. Based on the predictor graphs with special structure, we highlight the generality by showing that the model in this paper is equivalent to several classic regression models. Theoretically, we also analyze the finite sample bounds, asymptotic normality and model selection consistency of the GSRE method without relying on the standard deviation of error terms. In terms of computation, we employ the fast and efficient alternating direction method of multipliers. Finally, based on a large number of simulated and real data with various types of noise, we demonstrate the performance advantages of the proposed method in estimation, prediction and model selection.
- [43] arXiv:2411.14983 (replaced) [pdf, html, other]
-
Title: Large sample scaling analysis of the Zig-Zag algorithm for Bayesian inference
Comments: 50 pages, 7 figures, 1 table
Subjects: Computation (stat.CO)
Piecewise deterministic Markov processes provide scalable methods for sampling from the posterior distributions in big data settings by admitting principled sub-sampling strategies that do not bias the output. An important example is the Zig-Zag process of [Ann. Stats. 47 (2019) 1288 - 1320] where clever sub-sampling has been shown to produce an essentially independent sample at a cost that does not scale with the size of the data. However, sub-sampling also leads to slower convergence and poor mixing of the process, a behaviour which questions the promised scalability of the algorithm. We provide a large sample scaling analysis of the Zig-Zag process and its sub-sampling versions in settings of parametric Bayesian inference. In the transient phase of the algorithm, we show that the Zig-Zag trajectories are well approximated by the solution to a system of ODEs. These ODEs possess a drift in the direction of decreasing KL-divergence between the assumed model and the true distribution and are explicitly characterized in the paper. In the stationary phase, we give weak convergence results for different versions of the Zig-Zag process. Based on our results, we estimate that for large data sets of size n, using suitable control variates with sub-sampling in Zig-Zag, the algorithm costs O(1) to obtain an essentially independent sample; a computational speed-up of O(n) over the canonical version of Zig-Zag and other traditional MCMC methods.
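As background on the sampler being analysed, here is a minimal sketch of the canonical one-dimensional Zig-Zag process, with no sub-sampling and no control variates, targeting a standard normal. The target, the function names, and the run length are assumptions for illustration, not the paper's setting. For this target the velocity flips at rate max(0, θx), so the next event time can be drawn exactly by inverting the integrated rate.

```python
import math
import random

def next_event_time(x, theta, rng):
    """Exact time to the next velocity flip for a standard normal target.

    Along the path x + theta * t the rate is max(0, theta * x + t); solving
    integrated_rate(tau) = E with E ~ Exp(1) gives the closed form below.
    """
    a = theta * x
    e = rng.expovariate(1.0)
    return -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * e)

def zigzag_moments(total_time=20000.0, seed=1):
    """Time-averaged first and second moments along a 1D Zig-Zag trajectory."""
    rng = random.Random(seed)
    x, theta, t = 0.0, 1.0, 0.0
    m1 = m2 = 0.0
    while t < total_time:
        # Truncate the last segment so the trajectory ends at total_time.
        tau = min(next_event_time(x, theta, rng), total_time - t)
        x_new = x + theta * tau
        # Exact integrals of x and x^2 over the linear segment.
        m1 += 0.5 * (x + x_new) * tau
        m2 += (x * x + x * x_new + x_new * x_new) / 3.0 * tau
        x, t = x_new, t + tau
        theta = -theta  # flip the velocity at the event
    return m1 / total_time, m2 / total_time

if __name__ == "__main__":
    mean, second = zigzag_moments()
    print(mean, second)  # should be near 0 and 1 for a standard normal target
```

Averages are taken along the continuous trajectory rather than at the event points, because the event positions alone are not distributed according to the target.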
- [44] arXiv:2501.08945 (replaced) [pdf, html, other]
-
Title: COADVISE: Covariate Adjustment with Variable Selection in Randomized Controlled Trials
Subjects: Methodology (stat.ME)
Adjusting for covariates in randomized controlled trials can enhance the credibility and efficiency of treatment effect estimation. However, handling numerous covariates and their complex (non-linear) transformations poses a challenge. Motivated by the case study of the Best Apnea Interventions for Research (BestAIR) trial data from the National Sleep Research Resource (NSRR), where the number of covariates (p=114) is comparable to the sample size (N=196), we propose a principled Covariate Adjustment with Variable Selection (COADVISE) framework. COADVISE enables variable selection for covariates most relevant to the outcome while accommodating both linear and nonlinear adjustments. This framework ensures consistent estimates with improved efficiency over unadjusted estimators and provides robust variance estimation, even under outcome model misspecification. We demonstrate efficiency gains through theoretical analysis, extensive simulations, and a re-analysis of the BestAIR trial data to compare alternative variable selection strategies, offering cautionary recommendations. A user-friendly R package, Coadvise, is available to facilitate practical implementation.
- [45] arXiv:2502.07699 (replaced) [pdf, html, other]
-
Title: Sharp Anti-Concentration Inequalities for Extremum Statistics via Copulas
Comments: 24 pages, 2 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR)
We derive sharp upper and lower bounds for the pointwise concentration function of the maximum statistic of $d$ identically distributed real-valued random variables. Our first main result places no restrictions either on the common marginal law of the samples or on the copula describing their joint distribution. We show that, in general, strictly sublinear dependence of the concentration function on the dimension $d$ is not possible. We then introduce a new class of copulas, namely those with a convex diagonal section, and demonstrate that restricting to this class yields a sharper upper bound on the concentration function. This allows us to establish several new dimension-independent and poly-logarithmic-in-$d$ anti-concentration inequalities for a variety of marginal distributions under mild dependence assumptions. Our theory improves upon the best known results in certain special cases. Applications to high-dimensional statistical inference are presented, including a specific example pertaining to Gaussian mixture approximations for factor models, for which our main results lead to superior distributional guarantees.
- [46] arXiv:2503.08971 (replaced) [pdf, html, other]
-
Title: Data-Driven Adjustment for Multiple Treatments
Comments: 26 pages, 11 figures
Subjects: Methodology (stat.ME)
Covariate adjustment is one method of causal effect identification in non-experimental settings. Prior research provides routes for finding appropriate adjustment sets, but much of this research assumes knowledge of the underlying causal graph. In this paper, we present two routes for finding adjustment sets that do not require knowledge of a graph -- and instead rely on dependencies and independencies in the data directly. We consider a setting where the adjustment set is unaffected by treatment or outcome. The first route shows how to extend prior research in this area using a concept known as c-equivalence. Our second route provides sufficient criteria for finding adjustment sets in the setting of multiple treatments.
- [47] arXiv:2504.21647 (replaced) [pdf, other]
-
Title: Conditional independence testing with a single realization of a multivariate nonstationary nonlinear time series
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Identifying relationships among stochastic processes is a core objective in many fields, such as economics. While the standard toolkit for multivariate time series analysis has many advantages, it can be difficult to capture nonlinear dynamics using linear vector autoregressive models. This difficulty has motivated the development of methods for causal discovery and variable selection for nonlinear time series, which routinely employ tests for conditional independence. In this paper, we introduce the first framework for conditional independence testing that works with a single realization of a nonstationary nonlinear process. We also show how our framework can be used to test for independence. The key technical ingredients of our framework are time-varying nonlinear regression, estimation of local long-run covariance matrices of products of error processes, and a distribution-uniform strong Gaussian approximation.
- [48] arXiv:2504.21688 (replaced) [pdf, html, other]
-
Title: Assessing Racial Disparities in Healthcare Expenditures via Mediator Distribution Shifts
Subjects: Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Racial disparities in healthcare expenditures are well-documented, yet the underlying drivers remain complex and require further investigation. This study develops a framework for decomposing such disparities through shifts in the distributions of mediating variables, rather than treating race itself as a manipulable exposure. We define disparities as differences in covariate-adjusted outcome distributions across racial groups, and decompose the total disparity into two components: one attributable to differences in mediator distributions, and another residual component that would remain even after equalizing these distributions. Using data from the Medical Expenditure Panel Survey, we examine the extent to which expenditure disparities would persist or be reduced if mediators such as socioeconomic status, insurance access, health behaviors, or health status were equalized across racial groups. To ensure valid inference, we derive asymptotically linear estimators based on influence-function techniques and flexible machine learning tools, including super learners and a two-part model designed for the zero-inflated, right-skewed nature of expenditure data.
- [49] arXiv:2505.00292 (replaced) [pdf, html, other]
-
Title: Conformal changepoint localization
Subjects: Statistics Theory (math.ST); Signal Processing (eess.SP); Methodology (stat.ME)
Changepoint localization is the problem of estimating the index at which a change occurred in the data generating distribution of an ordered list of data, or declaring that no change occurred. We present the broadly applicable CONCH (CONformal CHangepoint localization) algorithm, which uses a matrix of conformal p-values to produce a confidence interval for a (single) changepoint under the mild assumption that the pre-change and post-change distributions are each exchangeable. We exemplify the CONCH algorithm on a variety of synthetic and real-world datasets, including using black-box pre-trained classifiers to detect changes in sequences of images, text, and accelerometer data.
- [50] arXiv:2505.10498 (replaced) [pdf, html, other]
-
Title: Batched Nonparametric Bandits via k-Nearest Neighbor UCB
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
We study sequential decision-making in batched nonparametric contextual bandits, where actions are selected over a finite horizon divided into a small number of batches. Motivated by constraints in domains such as medicine and marketing -- where online feedback is limited -- we propose a nonparametric algorithm that combines adaptive k-nearest neighbor (k-NN) regression with the upper confidence bound (UCB) principle. Our method, BaNk-UCB, is fully nonparametric, adapts to the context dimension, and is simple to implement. Unlike prior work relying on parametric or binning-based estimators, BaNk-UCB uses local geometry to estimate rewards and adaptively balances exploration and exploitation. We provide near-optimal regret guarantees under standard Lipschitz smoothness and margin assumptions, using a theoretically motivated batch schedule that balances regret across batches and achieves minimax-optimal rates. Empirical evaluations on synthetic and real-world datasets demonstrate that BaNk-UCB consistently outperforms binning-based baselines.
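To make the k-NN plus UCB idea concrete, here is a minimal Python sketch of an optimistic per-arm index built from nearest-neighbor reward estimates. It is not the BaNk-UCB algorithm itself: the fixed `k`, the form of the two bonus terms, the `lipschitz` constant, and the function names `knn_ucb_index` and `choose_arm` are all illustrative assumptions, and the paper's adaptive choice of k and its batch schedule are not reproduced. In a batched setting, the per-arm histories would be refreshed only at batch boundaries.

```python
import math

def knn_ucb_index(context, history, k, horizon, lipschitz=1.0):
    """Optimistic reward estimate for one arm at `context` (illustrative only).

    history: list of (past_context, reward) pairs for this arm,
    with contexts given as tuples of floats of equal length.
    """
    if len(history) < k:
        return math.inf  # too little data: force exploration of this arm
    # Sort past observations by Euclidean distance to the query context.
    by_distance = sorted(history, key=lambda pair: math.dist(pair[0], context))
    neighbors = by_distance[:k]
    mean_reward = sum(reward for _, reward in neighbors) / k
    radius = math.dist(neighbors[-1][0], context)  # distance to the k-th neighbor
    variance_bonus = math.sqrt(math.log(horizon) / k)  # assumed form of the bonus
    bias_bonus = lipschitz * radius  # smoothness-based allowance for local bias
    return mean_reward + variance_bonus + bias_bonus

def choose_arm(context, histories, k, horizon):
    """Pick the arm whose k-NN UCB index is largest (first arm wins ties)."""
    scores = [knn_ucb_index(context, h, k, horizon) for h in histories]
    return max(range(len(histories)), key=scores.__getitem__)

# Toy data: arm 0 paid reward 1 near the query context, arm 1 paid reward 0.
histories = [
    [((0.0,), 1.0), ((0.1,), 1.0)],
    [((0.0,), 0.0), ((0.1,), 0.0)],
]
best = choose_arm((0.05,), histories, k=2, horizon=100)
```

With these toy histories both arms receive the same bonuses, so the arm with the higher local mean reward is selected; an arm with fewer than `k` observations would be chosen first because its index is infinite.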
- [51] arXiv:2507.19413 (replaced) [pdf, html, other]
-
Title: Riesz representers for the rest of us
Subjects: Statistics Theory (math.ST)
The application of semiparametric efficient estimators, particularly those that leverage machine learning, is rapidly expanding within epidemiology and causal inference. This literature is increasingly invoking the Riesz representation theorem and Riesz regression. This paper aims to introduce the Riesz representation theorem to an epidemiologic audience, explaining what it is and why it's useful, using step-by-step worked examples.
- [52] arXiv:2507.21982 (replaced) [pdf, html, other]
-
Title: Preconditioned Discrete-HAMS: A Second-order Irreversible Discrete Sampler
Comments: arXiv admin note: text overlap with arXiv:2507.09807
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Gradient-based Markov Chain Monte Carlo methods have recently received much attention for sampling discrete distributions, with notable examples such as Norm Constrained Gradient (NCG), Auxiliary Variable Gradient (AVG), and Discrete Hamiltonian Assisted Metropolis Sampling (DHAMS). In this work, we propose the Preconditioned Discrete-HAMS (PDHAMS) algorithm, which extends DHAMS by incorporating a second-order, quadratic approximation of the potential function, and uses a Gaussian integral trick to avoid directly sampling a pairwise Markov random field. The PDHAMS sampler not only satisfies generalized detailed balance, hence enabling irreversible sampling, but also has a rejection-free property for a target distribution with a quadratic potential function. In various numerical experiments, PDHAMS algorithms consistently yield superior performance compared with other methods.
- [53] arXiv:2012.12762 (replaced) [pdf, html, other]
-
Title: Strong Laws of Large Numbers for Generalizations of Fréchet Mean Sets
Subjects: Probability (math.PR); Statistics Theory (math.ST)
A Fréchet mean of a random variable $Y$ with values in a metric space $(\mathcal Q, d)$ is an element of the metric space that minimizes $q \mapsto \mathbb E[d(Y,q)^2]$. This minimizer may be non-unique. We study strong laws of large numbers for sets of generalized Fréchet means. The following generalizations are considered: the minimizers of $\mathbb E[d(Y, q)^\alpha]$ for $\alpha > 0$, the minimizers of $\mathbb E[H(d(Y, q))]$ for integrals $H$ of non-decreasing functions, and the minimizers of $\mathbb E[\mathfrak c(Y, q)]$ for a quite unrestricted class of cost functions $\mathfrak c$. We show convergence of empirical versions of these sets in outer limit and in one-sided Hausdorff distance. The derived results require only minimal assumptions.
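The definition above, and the possibility of non-unique minimizers, can be illustrated with a small Python sketch that computes an empirical Fréchet mean set over a finite candidate grid. The choice of the unit circle with arc-length distance, the four-point grid, the tolerance `tol`, and the function names are assumptions made for illustration; none of this is taken from the paper.

```python
import math

def circle_distance(a, b):
    """Arc-length (geodesic) distance between two angles on the unit circle."""
    diff = abs(a - b) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)

def empirical_frechet_mean_set(sample, candidates, dist, alpha=2.0, tol=1e-9):
    """Return every candidate q minimizing (1/n) * sum_i dist(y_i, q) ** alpha."""
    def objective(q):
        return sum(dist(y, q) ** alpha for y in sample) / len(sample)
    values = [objective(q) for q in candidates]
    best = min(values)
    # Keep all near-ties: the minimizer is a set, not necessarily a single point.
    return [q for q, v in zip(candidates, values) if v <= best + tol]

# Two antipodal observations on the circle, and a coarse grid of candidates.
sample = [0.0, math.pi]
candidates = [j * math.pi / 2 for j in range(4)]  # 0, pi/2, pi, 3*pi/2

mean_set = empirical_frechet_mean_set(sample, candidates, circle_distance)
median_set = empirical_frechet_mean_set(sample, candidates, circle_distance, alpha=1.0)
```

On this grid the squared-distance objective is minimized by the two candidates halfway between the observations, while for `alpha=1.0` all four candidates attain the same objective value, which is why results of this kind are stated for sets of minimizers.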
- [54] arXiv:2403.10711 (replaced) [pdf, other]
-
Title: Gaussian universality for approximately polynomial functions of high-dimensional data
Comments: Fixed a missing m in the upper bound; added a necessary and sufficient condition for asymptotic normality
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We establish an invariance principle for polynomial functions of $n$ independent, high-dimensional random vectors, and also show that the obtained rates are nearly optimal. Both the dimension of the vectors and the degree of the polynomial are permitted to grow with $n$. Specifically, we obtain a finite sample upper bound for the error of approximation by a polynomial of Gaussians, measured in Kolmogorov distance, and extend it to functions that are approximately polynomial in a mean squared error sense. We give a corresponding lower bound that shows the invariance principle holds up to polynomial degree $o(\log n)$. The proof is constructive and adapts an asymmetrisation argument due to V. V. Senatov. We also give a necessary and sufficient condition for asymptotic normality via the fourth moment phenomenon of Nualart and Peccati. As applications, we obtain a higher-order delta method with possibly non-Gaussian limits, and generalise a number of known results on high-dimensional and infinite-order U-statistics, and on fluctuations of subgraph counts.
- [55] arXiv:2502.17077 (replaced) [pdf, html, other]
-
Title: A comparative analysis of rank aggregation methods for the partial label ranking problem
Comments: Full version of the paper accepted at ECAI 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The label ranking problem is a supervised learning scenario in which the learner predicts a total order of the class labels for a given input instance. Recently, research has increasingly focused on the partial label ranking problem, a generalization of the label ranking problem that allows ties in the predicted orders. So far, most existing learning approaches for the partial label ranking problem rely on approximation algorithms for rank aggregation in the final prediction step. This paper explores several alternative aggregation methods for this critical step, including scoring-based and non-parametric probabilistic-based rank aggregation approaches. To enhance their suitability for the more general partial label ranking problem, the investigated methods are extended to increase the likelihood of producing ties. Experimental evaluations on standard benchmarks demonstrate that scoring-based variants consistently outperform the current state-of-the-art method in handling incomplete information. In contrast, non-parametric probabilistic-based variants fail to achieve competitive performance.
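As a rough illustration of scoring-based aggregation that can output ties, the following Python sketch sums Borda-style points over a set of rankings and then merges labels whose totals are close. The Borda scoring, the `tie_margin` rule, and the function names are assumptions chosen for the example; the abstract does not specify which scoring methods or tie-producing extensions the paper evaluates.

```python
from collections import defaultdict

def borda_scores(rankings):
    """Sum Borda points per label.

    Each ranking is a list of buckets (lists of tied labels), best bucket first.
    Every ranking is assumed to contain the same set of labels.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n_labels = sum(len(bucket) for bucket in ranking)
        position = 0
        for bucket in ranking:
            # Tied labels share the average of the points for the positions they occupy.
            points = [n_labels - 1 - (position + i) for i in range(len(bucket))]
            shared = sum(points) / len(bucket)
            for label in bucket:
                scores[label] += shared
            position += len(bucket)
    return dict(scores)

def aggregate_with_ties(rankings, tie_margin=0.0):
    """Order labels by total score; a label joins the current bucket when its
    score is within `tie_margin` of the bucket's last label (ties can chain)."""
    scores = borda_scores(rankings)
    ordered = sorted(scores, key=lambda label: (-scores[label], label))
    buckets = []
    for label in ordered:
        if buckets and scores[buckets[-1][-1]] - scores[label] <= tie_margin:
            buckets[-1].append(label)
        else:
            buckets.append([label])
    return buckets

rankings = [
    [["a"], ["b"], ["c"]],
    [["b"], ["a"], ["c"]],
    [["a", "b"], ["c"]],
]
consensus = aggregate_with_ties(rankings)
```

Here labels `a` and `b` end up with equal totals and share the top bucket. Raising `tie_margin` makes ties more likely, which is one simple way to bias an aggregation rule toward partial rankings.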
- [56] arXiv:2503.08984 (replaced) [pdf, html, other]
-
Title: "All-Something-Nothing" Phase Transitions in Planted k-Factor Recovery
Comments: 35 pages, 5 figures. Accepted for presentation at the 2025 Conference on Learning Theory, Lyon, France
Subjects: Probability (math.PR); Statistics Theory (math.ST)
This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erdős--Rényi random graph $G(n,\lambda/n)$. We uncover an interesting "all-something-nothing" phase transition. Specifically, we show that as the average degree $\lambda$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery ("all" phase) to partial recovery ("something" phase). Moreover, as $\lambda$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the "nothing" phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik who established that for certain sufficiently dense graphs, the problem undergoes an "all-or-nothing" phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $\lambda < 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $\Theta(n)$ many small trees of size $\Theta(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $\Theta(\sqrt{n\log n})$ suffices.
- [57] arXiv:2506.23068 (replaced) [pdf, html, other]
-
Title: Curious Causality-Seeking Agents Learn Meta Causal World
Comments: 33 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. As a result, even subtle shifts in policy or environment states can alter the observed causal mechanisms. In this work, we introduce the Meta-Causal Graph as a world model, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state, which lies in the latent state space. Building on this representation, we introduce a Causality-Seeking Agent whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships through a curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.
- [58] arXiv:2507.20980 (replaced) [pdf, html, other]
-
Title: LargeMvC-Net: Anchor-based Deep Unfolding Network for Large-scale Multi-view Clustering
Comments: 10 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.