Feed - arxlens

Learning operators on labelled conditional distributions with applications to mean field control of non exchangeable systems

math.OC math.PR stat.ML Samy Mekkaoui, Huyên Pham, Xavier Warin · Mar 23, 2026

This paper develops a neural operator framework for approximating mappings defined on constrained Wasserstein spaces $\mathcal{M}_\lambda$, consisting of probability measures on $I \times \mathbb{R}^d$ with prescribed marginal $\lambda$ on the label space $I$. The core contribution is the DeepONetCyl architecture, which combines cylindrical moment approximations $\Phi_J(\mu) = (\langle \varphi_1, \mu \rangle, \ldots, \langle \varphi_J, \mu \rangle)$ with a DeepONet-type branch–trunk structure to preserve the marginal constraint. This enables learning of heterogeneous (non-exchangeable) mean-field control problems where agent interactions depend on labels, extending prior neural methods beyond the exchangeable case.

We study the approximation of operators acting on probability measures on a product space with a prescribed marginal. Let $I$ be a label space endowed with a reference measure $\lambda$, and define $\mathcal{M}_\lambda$ as the set of probability measures on $I\times \mathbb{R}^d$ with first marginal $\lambda$. By disintegration, elements of $\mathcal{M}_\lambda$ correspond to families of labeled conditional distributions. Operators defined on this constrained measure space arise naturally in mean-field control problems with heterogeneous, non-exchangeable agents. Our main theoretical result establishes a universal approximation theorem for continuous operators on $\mathcal{M}_\lambda$. The proof combines cylindrical approximations of probability measures with a DeepONet-type branch-trunk neural architecture, yielding finite-dimensional representations of such operators. We further introduce a sampling strategy for generating training measures in $\mathcal{M}_\lambda$, enabling practical learning of such conditional mean-field operators. We apply the method to the numerical resolution of mean-field control problems with heterogeneous interactions, thereby extending previous neural approaches developed for homogeneous (exchangeable) systems. Numerical experiments illustrate the accuracy and computational effectiveness of the proposed framework.
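As a rough illustration of the cylindrical moment map $\Phi_J(\mu) = (\langle \varphi_1, \mu \rangle, \ldots, \langle \varphi_J, \mu \rangle)$, the sketch below evaluates it on an empirical measure built from samples $(u_i, x_i) \in I \times \mathbb{R}^d$. It assumes $I = [0, 1]$ and $d = 1$, and the test functions $\varphi_j$ are hypothetical choices made for the example; the paper's actual test functions, sampling strategy, and branch-trunk network are not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One sample is (label u in I = [0, 1], state x in R); d = 1 is an assumption.
Sample = Tuple[float, float]
TestFunction = Callable[[float, float], float]


def cylindrical_moments(
    samples: Sequence[Sample], test_functions: Sequence[TestFunction]
) -> List[float]:
    """Return (<phi_1, mu>, ..., <phi_J, mu>) for the empirical measure mu
    that puts mass 1/n on each sample (u_i, x_i)."""
    n = len(samples)
    return [sum(phi(u, x) for u, x in samples) / n for phi in test_functions]


# Hypothetical test functions, chosen only for illustration.
test_functions: List[TestFunction] = [
    lambda u, x: 1.0,                                   # total mass
    lambda u, x: x,                                     # mean state
    lambda u, x: u * x,                                 # label-weighted mean
    lambda u, x: math.cos(math.pi * u) * math.tanh(x),  # bounded mixed feature
]

samples: List[Sample] = [(0.0, 1.0), (1.0, 3.0)]
features = cylindrical_moments(samples, test_functions)
print(features)
```

For the empirical measure to approximate an element of $\mathcal{M}_\lambda$, the labels $u_i$ would need to be drawn from (or weighted according to) $\lambda$. The resulting finite-dimensional vector is the kind of input a branch network could consume.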

RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation

cs.LG math.OC Abolfazl Hashemi · Mar 23, 2026

RAMPAGE addresses discretization bias in Extragradient (EG) methods for variational inequalities by replacing the deterministic midpoint with randomized sampling. The core idea uses uniform sampling to construct an unbiased estimator of the continuous-time flow integral, while RAMPAGE+ leverages antithetic variates to eliminate first-order variance terms. This matters for training GANs and other non-conservative games where EG's $\mathcal{O}(\eta^2)$ bias causes divergence in highly nonlinear regimes.

A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+, which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable $\mathcal{O}(1/k)$ convergence guarantees for a range of problems, including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. We also introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.
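The two sampling ideas named above can be seen on a one-dimensional integral $\int_0^1 g(s)\,ds$. The sketch below is a generic illustration of uniform sampling and antithetic variates, not the RAMPAGE or RAMPAGE+ update rule, which the abstract does not spell out; the integrands `linear` and `curved` are hypothetical examples.

```python
import random
from typing import Callable

Integrand = Callable[[float], float]
Estimator = Callable[[Integrand, float], float]


def uniform_estimate(g: Integrand, u: float) -> float:
    """Single-sample estimate of the integral of g over [0, 1];
    unbiased when u is drawn uniformly from [0, 1]."""
    return g(u)


def antithetic_estimate(g: Integrand, u: float) -> float:
    """Average of g at the negatively correlated pair (u, 1 - u);
    still unbiased, and exact whenever g is affine in s."""
    return 0.5 * (g(u) + g(1.0 - u))


def sample_variance(estimator: Estimator, g: Integrand, n: int, seed: int) -> float:
    """Empirical variance of an estimator over n independent uniform draws."""
    rng = random.Random(seed)
    values = [estimator(g, rng.random()) for _ in range(n)]
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def linear(s: float) -> float:
    return 2.0 + 3.0 * s  # integral over [0, 1] is 3.5


def curved(s: float) -> float:
    return s * s  # integral over [0, 1] is 1/3


var_uniform = sample_variance(uniform_estimate, curved, 2000, seed=0)
var_antithetic = sample_variance(antithetic_estimate, curved, 2000, seed=0)
print(var_uniform, var_antithetic)
```

For an affine integrand the antithetic pair cancels the linear term exactly, so the estimate has zero variance; for a curved integrand only higher-order terms remain. This is the sense in which antithetic sampling removes first-order contributions to the variance.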

Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection

cs.LG math.OC Qixin Zhang, Wei Huang, Yan Sun et al. · Mar 23, 2026

The paper tackles partition-constrained subset selection for 'close-to-submodular' objectives (specifically $\alpha$-weakly DR-submodular and $(\gamma,\beta)$-weakly submodular functions), where existing distorted local-search methods suffer from prohibitive query complexity ($\tilde{O}(1/\epsilon^6)$) and require prior knowledge of structural parameters. The authors propose the Multinoulli Extension (ME), a continuous relaxation that learns multinoulli priors for each partition block, enabling lossless rounding without submodularity assumptions. They develop offline (Multinoulli-SCG) and online (Multinoulli-OSCG/OSGA) algorithms achieving tight approximation guarantees with $O(1/\epsilon^2)$ query complexity and $O(\sqrt{T})$ regret, respectively.

Identifying the most representative subset for a close-to-submodular objective while satisfying a predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which is not only parameter-free but also achieves the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $\alpha$-weakly DR-submodular or $(\gamma,\beta)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-\alpha})\text{OPT}-\epsilon$ or $(\frac{\gamma^{2}(1-e^{-(\beta(1-\gamma)+\gamma^2)})}{\beta(1-\gamma)+\gamma^2})\text{OPT}-\epsilon$ with only $O(1/\epsilon^{2})$ function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension (ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.
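To make the idea of a relaxation over multinoulli priors concrete, here is a minimal sketch under a simplifying assumption that is not taken from the paper: exactly one element is drawn from each partition block, independently, according to that block's categorical prior. The extension value is then the expected objective, computed here by exact enumeration. The function names, the toy coverage objective, and the block layout are all hypothetical.

```python
import itertools
from typing import Callable, Dict, FrozenSet, List, Sequence, Set


def extension_value(
    f: Callable[[FrozenSet[str]], float],
    blocks: Sequence[Sequence[str]],
    priors: Sequence[Sequence[float]],
) -> float:
    """Expected value of f(S) when S takes one element per block,
    with block k sampled from the categorical distribution priors[k]."""
    total = 0.0
    for choice in itertools.product(*[range(len(b)) for b in blocks]):
        prob = 1.0
        for k, idx in enumerate(choice):
            prob *= priors[k][idx]
        if prob > 0.0:
            subset = frozenset(blocks[k][idx] for k, idx in enumerate(choice))
            total += prob * f(subset)
    return total


# Toy coverage objective: number of distinct items covered by the chosen elements.
covers: Dict[str, Set[int]] = {"a": {1, 2}, "b": {3}, "c": {2, 3}, "d": {4}}


def coverage(subset: FrozenSet[str]) -> float:
    covered: Set[int] = set()
    for element in subset:
        covered |= covers[element]
    return float(len(covered))


blocks = [["a", "b"], ["c", "d"]]
uniform_value = extension_value(coverage, blocks, [[0.5, 0.5], [0.5, 0.5]])
one_hot_value = extension_value(coverage, blocks, [[1.0, 0.0], [1.0, 0.0]])
print(uniform_value, one_hot_value)
```

In this simplified setting the extension is an average of $f$ over feasible sets, so some feasible set always scores at least as well as the fractional point. That gives one intuition for why rounding need not lose value for an arbitrary set function; the paper's actual construction and rounding scheme may differ.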

On the Role of Batch Size in Stochastic Conditional Gradient Methods

cs.LG math.OC stat.ML Rustem Islamov, Roman Machacek, Aurelien Lucchi et al. · Mar 22, 2026

This paper studies how batch size and sequence length should scale with the total token budget in stochastic conditional gradient methods for LLM training. Under a $\mu$-Kurdyka-Łojasiewicz condition, the authors derive a BST (Batch-Sequence-Token) scaling rule $BS \asymp T^{2/3}$ that predicts three distinct regimes: noise-dominated, batch-independent optimal, and iteration-starved. The theory yields actionable guidelines for adaptive batch size scheduling and is validated on NanoGPT models up to 1B parameters.

We study the role of batch size in stochastic conditional gradient methods under a $\mu$-Kurdyka-Łojasiewicz ($\mu$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
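The arithmetic behind the scaling rule $BS \asymp T^{2/3}$ from the summary can be sketched as follows, reading $B$ as batch size, $S$ as sequence length, and $T$ as the total token budget (an interpretation inferred from the "Batch-Sequence-Token" name). The proportionality constant `c` is a hypothetical placeholder; the rule fixes only the exponent, not the constant.

```python
def tokens_per_step(total_tokens: float, c: float = 1.0) -> float:
    """Tokens processed per optimizer step, B * S, under B*S = c * T^(2/3).
    The constant c is hypothetical and problem-dependent."""
    return c * total_tokens ** (2.0 / 3.0)


def num_steps(total_tokens: float, c: float = 1.0) -> float:
    """Number of optimizer steps implied by the budget: T / (B * S),
    which grows like T^(1/3) under the rule above."""
    return total_tokens / tokens_per_step(total_tokens, c)


# Scaling the token budget by 8x should scale tokens per step by 4x
# and the number of steps by 2x.
small_budget = 1e9
large_budget = 8e9
batch_ratio = tokens_per_step(large_budget) / tokens_per_step(small_budget)
step_ratio = num_steps(large_budget) / num_steps(small_budget)
print(batch_ratio, step_ratio)
```

Under this reading, most of a larger budget goes into bigger batches or longer sequences rather than more iterations, since the step count grows only like $T^{1/3}$.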
