This paper investigates the geometric structure of converged states in LLM pretraining, asking whether models converge to a common minimizer across data sources or merely a minimizer of the summed loss. The authors hypothesize that the "closeness" of task-specific minima correlates with downstream generalization, and propose the Nexus optimizer to maximize gradient similarity as a tractable proxy for closeness. Their core finding—that identical pretraining loss can mask vastly different downstream performance depending on the implicit bias toward geometric closeness—challenges the prevailing reliance on pretraining loss as the sole evaluation metric.
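The gradient-similarity proxy mentioned above can be made concrete with a small sketch. This is not the Nexus optimizer itself, whose update rule is not described here; it only shows, under the assumption that each data source yields a flat gradient vector, how pairwise cosine similarity between per-source gradients might be measured. The function names are hypothetical.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two flat gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # convention chosen here for a vanishing gradient
    return dot / (n1 * n2)

def mean_pairwise_similarity(grads):
    """Average cosine similarity over all pairs of per-source gradients.

    Assumption: `grads` is a list of equal-length lists, one per data source.
    """
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    if not pairs:
        return 1.0  # a single source trivially agrees with itself
    total = sum(cosine_similarity(grads[i], grads[j]) for i, j in pairs)
    return total / len(pairs)
```

A value near 1 would indicate that the sources pull the parameters in the same direction, while values near 0 or below would indicate orthogonal or conflicting updates.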
Predicting how complex scenes evolve is essential for intelligent systems, yet dense video generation expends enormous compute on appearance rather than dynamics. This paper introduces Myriad, an autoregressive diffusion model that predicts future motion via sparse point trajectories, explicitly avoiding the 'visual tax' of pixel-level generation. By modeling step-wise uncertainty accumulation through flow matching and utilizing fused transformer blocks, the method achieves throughput of 2200 samples/min compared to less than 1 for video models, while matching or exceeding their predictive accuracy on motion-focused benchmarks.
Neural Computers (NCs) propose a new machine form where computation, memory, and I/O are unified inside a learned latent runtime state rather than separated as in conventional computers or external as in agents. This work instantiates early NC prototypes as video models that roll out terminal and desktop interfaces from text, pixels, and actions—showing that basic I/O alignment and short-horizon control are learnable without privileged program state. The results demonstrate early runtime primitives but also highlight that symbolic stability, routine reuse, and runtime governance remain unsolved on the long path toward the envisioned Completely Neural Computer (CNC).
This paper introduces the Vision Transformer (ViT), which applies a standard Transformer encoder directly to sequences of image patches for image classification. The core insight is that convolutional inductive biases (locality and translation equivariance) are unnecessary when models are pre-trained at sufficient scale, specifically on datasets containing 14M to 300M images. When transferred to downstream benchmarks, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to pre-train.
This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.
This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes—reward shaping, model scaling, data composition, algorithm selection, and environmental stability—to derive a practical, scale-aware recipe for training.
This paper tackles the efficiency–generalization trade-off in Continual Test-Time Adaptation (CTTA), where models must adapt online to unlabeled streams under distribution shift without source data. The core insight is that feature updates need only occur within a low-rank "golden subspace" coinciding with the row space of the classifier. To avoid costly retraining, the authors propose using the Average Gradient Outer Product (AGOP) as an online proxy for the classifier weight structure, leading to the GOLD method that projects features onto this subspace and learns a compact scaling vector. If the theoretical claims hold under realistic nonlinear settings, this could significantly reduce deployment costs for adaptive systems.
ShapDBM addresses the fragmentation problem in Decision Boundary Maps (DBMs) by transforming data into Shapley space before applying dimensionality reduction. This creates more compact decision zones that reflect model behavior rather than raw data distribution, enabling high-quality visualization of complex datasets like SVHN where traditional data-space DBMs fail.
This paper proposes a fundamental shift in evaluating probabilistic time series forecasting by replacing passive observation of historical trajectories with an interventionist "noise titration" protocol. By injecting calibrated Gaussian noise into known chaotic and stochastic dynamical systems, the authors transform forecasting into an exact distributional inference task where statistical calibration can be verified against ground-truth likelihoods. They extend the Fern architecture to output full covariance structures via SPD cone parameterization, then use the framework to expose severe failures in zero-shot foundation models under non-stationarity.
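The summary does not say which SPD parameterization the authors use. As a minimal sketch of one common choice, the block below maps an unconstrained parameter vector to a covariance matrix through a lower-triangular factor with a softplus-positive diagonal. The function names and the row-major packing order are assumptions for illustration.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); always strictly positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def params_to_spd(params, d):
    """Map d*(d+1)/2 unconstrained reals to a d x d SPD matrix.

    Assumed layout: params fill the lower triangle row by row.
    Diagonal entries pass through softplus so the factor L is invertible,
    which makes L @ L^T symmetric positive definite.
    """
    if len(params) != d * (d + 1) // 2:
        raise ValueError("expected d*(d+1)/2 parameters")
    L = [[0.0] * d for _ in range(d)]
    idx = 0
    for i in range(d):
        for j in range(i + 1):
            value = params[idx]
            idx += 1
            L[i][j] = softplus(value) if i == j else value
    # Sigma = L L^T
    return [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]
```

Because any real-valued input yields a valid covariance, a network head can emit the parameter vector directly without a constraint or projection step.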
Traditional concentration indices like the Herfindahl-Hirschman Index ($HHI = \sum_i w_i^2$) measure weight dispersion but ignore network topology, meaning two systems with identical weight distributions can exhibit different effective concentration. This paper introduces the Network Concentration Index (NCI), defined as $\psi(w,A) = \frac{w^{\top}Aw}{1-\sum_i w_i^2}$, which measures the fraction of potential weighted interconnection realized along observed network links. The framework unifies weight distributions with interaction structures, providing a theoretically grounded tool for assessing systemic risk in financial networks, supply chains, and economic production systems.
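The two formulas above can be evaluated directly. The sketch below assumes that the weights sum to 1 and that $A$ is a symmetric 0/1 adjacency matrix with a zero diagonal; the summary does not state these conventions, so treat them as assumptions. It compares a fully connected three-node network with a path network that carries the same weights.

```python
def hhi(w):
    """Herfindahl-Hirschman Index: sum of squared weights."""
    return sum(x * x for x in w)

def nci(w, A):
    """Network Concentration Index psi(w, A) = w^T A w / (1 - sum_i w_i^2).

    Assumptions: sum(w) == 1 and A is a symmetric 0/1 adjacency matrix
    with zeros on the diagonal.
    """
    n = len(w)
    quad = sum(w[i] * A[i][j] * w[j] for i in range(n) for j in range(n))
    denom = 1.0 - hhi(w)
    if denom <= 0.0:
        raise ValueError("undefined when all weight sits on one node")
    return quad / denom

weights = [0.5, 0.3, 0.2]
complete = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # every pair linked
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # links 0-1 and 1-2 only

psi_complete = nci(weights, complete)
psi_path = nci(weights, path)
```

Under these assumptions the complete network gives a value of approximately 1, since $w^{\top}Aw = 1 - \sum_i w_i^2$ when every pair is linked, while the path network gives a smaller value even though both share the same HHI.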
MIHT tackles Time Series Classification (TSC) with variable-length, multivariate data—common in sensor and healthcare applications. The core idea combines Multiple Instance Learning (MIL) with Hoeffding Trees (incremental decision trees) to represent series as overlapping subseries bags and iteratively optimize which $k$ consecutive subseries are most discriminative. The approach promises both handling of unequal-length inputs and interpretability via a single tree structure.
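The bag representation described above can be sketched as a sliding window over the series. This is an illustration only, not MIHT's actual preprocessing: the window length, the stride, and the handling of series shorter than one window are assumptions.

```python
def subseries_bag(series, window, stride=1):
    """Split one series into a bag of overlapping subseries.

    `series` is a list of time steps; each step may be a scalar or a
    list of channel values, so multivariate input works unchanged.
    Assumption: a series no longer than one window becomes a one-item bag.
    Trailing steps that do not fill a whole window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(series) <= window:
        return [list(series)]
    return [list(series[start:start + window])
            for start in range(0, len(series) - window + 1, stride)]

# Series of different lengths yield bags of different sizes,
# but every instance has the same window length.
short_bag = subseries_bag([1, 2, 3, 4], window=3)
long_bag = subseries_bag([1, 2, 3, 4, 5, 6, 7], window=3, stride=2)
```

Because every instance has a fixed length, a downstream learner never sees the unequal series lengths directly; only the bag size varies.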
Federated learning enables privacy-preserving medical AI but struggles with unreliable uncertainty estimates when clinical data is heterogeneous and imbalanced across sites. TrustFed addresses this by introducing representation-aware conformal prediction, which assigns test samples to calibration clients based on feature-space similarity and aggregates local thresholds via a soft-nearest strategy to provide finite-sample coverage guarantees without centralizing raw data. Validated on over 430,000 images across six distinct imaging modalities, the work advances federated learning from privacy-preserving training toward clinically trustworthy deployment with statistically calibrated uncertainty.
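For orientation, the sketch below shows the standard split-conformal threshold that each client could compute locally, followed by one possible reading of "soft-nearest" aggregation as a distance-weighted softmax average. The second function is an assumption about what such a strategy might look like, not TrustFed's actual rule, and no coverage guarantee is claimed for it.

```python
import math

def conformal_threshold(scores, alpha):
    """Standard split-conformal quantile of calibration nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]

def soft_nearest_threshold(distances, thresholds, temperature=1.0):
    """Hypothetical aggregation: softmax over negative feature distances.

    `distances[c]` is the test sample's distance to client c in feature
    space; closer clients receive larger weight on their local threshold.
    """
    weights = [math.exp(-d / temperature) for d in distances]
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, thresholds)) / total
```

With a small temperature the aggregate approaches the nearest client's threshold; with a large one it approaches a plain average over clients.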
Parallel decoding promises faster text generation than autoregressive models but historically sacrifices quality due to simplified conditional independence assumptions. This paper introduces Gumbel Distillation, which leverages the Gumbel-Max trick to create a deterministic mapping from latent noise to teacher outputs, effectively providing the parallel student a blueprint for joint token distributions. By conditioning on Gumbel noise rather than relying on naive factorization, the method narrows the quality-efficiency gap, delivering substantial improvements across masked diffusion and multi-token prediction architectures.
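The Gumbel-Max trick that the method builds on is easy to state in code: adding independent Gumbel(0, 1) noise to the logits and taking the argmax yields an exact sample from the softmax distribution, and the result is a deterministic function of the noise. The sketch below shows only this trick, not the distillation procedure; the helper names are hypothetical.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) variate via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 to avoid log(0)
    return -math.log(-math.log(u))

def gumbel_max(logits, noise):
    """Deterministic map from (logits, noise) to a token index."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)

def sample_category(logits, rng):
    """Sample from softmax(logits) by drawing fresh Gumbel noise."""
    noise = [sample_gumbel(rng) for _ in logits]
    return gumbel_max(logits, noise), noise
```

Replaying the same noise vector always reproduces the same token, which is the property that lets the noise serve as a conditioning signal linking a teacher's sampled output to a student's input.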
The paper tackles the computational bottleneck of radiative transfer models (RTMs) for hyperspectral image (HSI) generation by proposing a VAE-based emulation framework that learns latent representations conditioned on biophysical parameters. It introduces both pixel-to-pixel (P2P) and fully convolutional (FC-VAE) variants, trained via either direct one-step mapping or a two-step pretraining strategy that decouples representation learning from parameter-to-latent interpolation. The work is significant for remote sensing applications as it provides empirical evidence that optimal emulator architecture depends critically on whether the target data is simulated (where P2P excels) or real-world imagery (where FC-VAE-pre dominates), and demonstrates that emulated data preserves downstream utility for parameter retrieval tasks.
Next app prediction struggles when user intent shifts rapidly and historical profiles are sparse. MISApp tackles this with multi-hop session graphs that decompose transitions into 1-, 2-, and 3-hop structural ranges. It uses LightGCN for lightweight propagation and a Transformer encoder-decoder to model intent evolution without requiring static user profiles, aiming for robust cold-start performance.
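One way to picture the multi-hop decomposition is to count, for each hop distance k, how often one app is followed by another exactly k positions later in a session. This reading of "k-hop" is an assumption, and the sketch covers only edge extraction, not the LightGCN or Transformer components.

```python
from collections import Counter

def hop_transitions(session, max_hop=3):
    """Count directed app-to-app transitions at hop distances 1..max_hop.

    Assumption: a k-hop edge links the app at position i to the app at
    position i + k within the same session.
    """
    edges = {k: Counter() for k in range(1, max_hop + 1)}
    for k in range(1, max_hop + 1):
        for i in range(len(session) - k):
            edges[k][(session[i], session[i + k])] += 1
    return edges

# Hypothetical session of app launches, in order.
graphs = hop_transitions(["mail", "maps", "mail", "chat"])
```

Each of the three edge sets could then serve as a separate adjacency structure for graph propagation, so that short-range and longer-range usage patterns stay distinguishable.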
AnimalCLAP addresses zero-shot species recognition from vocalizations—a critical challenge for biodiversity monitoring when training data is scarce for rare species. The core idea is to inject hierarchical taxonomic knowledge (class, order, family, genus, species) into audio-text contrastive learning via multiple prompt templates, paired with a large dataset of 4,225 hours covering 6,823 species annotated with 22 ecological traits. This matters because it enables automated monitoring in visually occluded habitats like dense forests while inferring biological traits directly from sound.
Multi-agent applications execute tasks through multi-stage workflows where each stage is an LLM call feeding into the next. While heterogeneous clusters (mixing model sizes/families) enable better latency–performance trade-offs than homogeneous deployments, they introduce complex scheduling challenges: model selection affects both task accuracy and queue congestion. Chimera addresses this by predicting per-model confidence scores, forecasting total workflow output lengths, and estimating real-time load via in-flight token volumes to jointly optimize end-to-end latency and task performance.
This paper addresses federated learning for cross-view video understanding, where heterogeneous camera viewpoints create highly non-IID client distributions that impede generalization to unseen views. FedCVU proposes three complementary modules: VS-Norm preserves client-specific normalization statistics to handle view-dependent feature shifts; CV-Align introduces lightweight prototype-based contrastive learning to align representations across cameras; and SLA employs selective layer aggregation to reduce communication overhead by 40–45%. The work targets an important practical scenario—privacy-preserving multi-camera surveillance where centralizing raw footage is infeasible.
This paper investigates whether domain knowledge for quantum code generation should be embedded in model parameters through fine-tuning or provided at inference time via retrieval and agents. Comparing a parameter-specialized Granite-20B baseline against modern general-purpose LLMs (OpenAI, Claude, Gemini) on the Qiskit-HumanEval benchmark, the authors find that inference-time augmentation—particularly agentic execution feedback—outperforms fine-tuning by over 35 percentage points, offering a more maintainable path as quantum SDKs evolve.
Multifidelity surrogate modeling aims to leverage cheap low-fidelity simulations to improve predictions of expensive high-fidelity models when training data is scarce. This paper proposes MAGPI, a Gaussian process regression method that augments the high-fidelity input space with features derived from recursively trained low-fidelity surrogate models. The approach unifies desirable properties from cokriging and autoregressive estimators while allowing non-GP models for low-fidelity levels, achieving superior accuracy and computational efficiency.