This paper investigates the geometric structure of converged states in LLM pretraining, asking whether models converge to a common minimizer across data sources or merely a minimizer of the summed loss. The authors hypothesize that the "closeness" of task-specific minima correlates with downstream generalization, and propose the Nexus optimizer to maximize gradient similarity as a tractable proxy for closeness. Their core finding—that identical pretraining loss can mask vastly different downstream performance depending on the implicit bias toward geometric closeness—challenges the prevailing reliance on pretraining loss as the sole evaluation metric.
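The gradient-similarity proxy mentioned in this summary can be illustrated with a small sketch. It is not the Nexus optimizer itself, whose update rule is not given here. It only shows one plausible way to measure agreement between per-source gradients, assuming cosine similarity is the measure and that gradients arrive as flat lists of floats. The function names are hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here: a zero gradient agrees with nothing
    return dot / (norm1 * norm2)


def mean_pairwise_similarity(grads):
    """Average cosine similarity over all distinct pairs of per-source gradients."""
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    if not pairs:
        return 0.0
    total = sum(cosine_similarity(grads[i], grads[j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical gradients from three data sources: two agree, one is orthogonal.
per_source_grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_similarity(per_source_grads))
```

A higher value means the data sources pull the parameters in more similar directions, which is the quantity the summary says the optimizer tries to increase.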
Predicting how complex scenes evolve is essential for intelligent systems, yet dense video generation expends enormous compute on appearance rather than dynamics. This paper introduces Myriad, an autoregressive diffusion model that predicts future motion via sparse point trajectories, explicitly avoiding the 'visual tax' of pixel-level generation. By modeling step-wise uncertainty accumulation through flow matching and utilizing fused transformer blocks, the method achieves throughput of 2200 samples/min compared to less than 1 for video models, while matching or exceeding their predictive accuracy on motion-focused benchmarks.
The paper identifies a demand-side externality in AI-driven automation: when firms displace workers, they capture full cost savings but externalize the demand destruction to rivals. In competitive markets, this creates a Prisoner's Dilemma where rational firms over-automate beyond the collective optimum, generating deadweight losses for both workers and owners. The analysis shows that only a Pigouvian tax on automation can correct this failure, while UBI, capital taxes, and worker equity programs cannot.
Neural Computers (NCs) propose a new machine form where computation, memory, and I/O are unified inside a learned latent runtime state rather than separated as in conventional computers or external as in agents. This work instantiates early NC prototypes as video models that roll out terminal and desktop interfaces from text, pixels, and actions—showing that basic I/O alignment and short-horizon control are learnable without privileged program state. The results demonstrate early runtime primitives but also highlight that symbolic stability, routine reuse, and runtime governance remain unsolved on the long path toward the envisioned Completely Neural Computer (CNC).
This paper introduces the Vision Transformer (ViT), which applies a standard Transformer encoder directly to sequences of image patches for image classification. The core insight is that convolutional inductive biases (locality and translation equivariance) are unnecessary when models are pre-trained at sufficient scale, specifically on datasets containing 14M to 300M images. When transferred to downstream benchmarks, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to pre-train.
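The step of turning an image into a sequence of patches can be sketched in a few lines. The version below uses plain nested lists in height x width x channel layout, which is an assumption made for readability. It leaves out the learned linear projection and position embeddings that a real model would apply to each flattened patch.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C nested-list image into a row-major sequence of flat patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append all channels of this pixel
            patches.append(patch)
    return patches


# Toy 4 x 4 image with 3 channels; each pixel stores its own index in every channel.
toy_image = [[[r * 4 + c] * 3 for c in range(4)] for r in range(4)]
sequence = image_to_patches(toy_image, patch_size=2)
print(len(sequence), len(sequence[0]))
```

Each entry of `sequence` plays the role of one token, so a 4 x 4 image with 2 x 2 patches becomes a sequence of four tokens of length 12.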
This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.
GISTBench evaluates whether LLMs can accurately extract user interests from behavioral interaction histories in recommendation systems. Unlike traditional benchmarks that optimize for item prediction accuracy, it verifies whether predicted interests are actually grounded in engagement signals, using two novel metrics: Interest Groundedness ($IG$) and Interest Specificity ($IS$). The authors find that current LLMs struggle primarily with recall (discovering all verifiable interests) rather than hallucination, revealing critical bottlenecks in evidence counting across heterogeneous signal types.
Recursive Language Models (RLMs) tackle the long-context problem by treating prompts as external environment variables that an LLM can programmatically manipulate through a REPL. Instead of feeding long prompts directly into the neural network, RLMs use symbolic code execution to decompose, filter, and recursively invoke sub-models over prompt snippets. This allows processing inputs up to 10M+ tokens—two orders of magnitude beyond typical context windows—while maintaining strong performance on complex aggregation tasks.
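The decompose-and-recurse pattern described in this summary can be sketched as follows. This is a deliberately reduced illustration and not the RLM system: there is no REPL and no model-written code, the split rule (halving by character count) is an assumption, and `leaf_fn` is a hypothetical stand-in for a sub-model call.

```python
def recursive_query(text, leaf_fn, combine_fn, max_chars):
    """Answer a query over `text` by recursing on halves until each piece fits `max_chars`."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return leaf_fn(text)  # piece is small enough for one sub-model call
    mid = len(text) // 2
    left = recursive_query(text[:mid], leaf_fn, combine_fn, max_chars)
    right = recursive_query(text[mid:], leaf_fn, combine_fn, max_chars)
    return combine_fn(left, right)


def count_errors(snippet):
    """Hypothetical stand-in for a sub-model: counts a marker character in the snippet."""
    return snippet.count("E")


# A long "prompt" that never has to fit into a single call.
long_prompt = ("ok " * 500) + ("E " * 7) + ("ok " * 500)
print(recursive_query(long_prompt, count_errors, lambda a, b: a + b, max_chars=200))
```

Halving by character position can cut a record in two, which is harmless for this single-character count but would matter for real queries. A practical version would split on record boundaries instead.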
Dynamic visual effects like explosions require complex temporal reasoning that is difficult to capture in text prompts. P-Flow introduces a training-free framework that treats prompts as optimization variables, using vision-language models to iteratively refine descriptions based on discrepancies between generated and reference videos. The method combines flow-matching noise inversion with lightweight historical context to achieve model-agnostic customization without fine-tuning.
This paper addresses weakly-supervised video scene graph generation (WS-VSGG), where models must parse videos into structured relational triplets using only sparse unlocalized annotations without bounding boxes. The core insight is that off-the-shelf object detectors indiscriminately detect all visible objects, overwhelming relation models with noisy non-interactive pairs, while fully-supervised detectors implicitly filter relationally irrelevant objects. To bridge this gap, the authors propose a three-component framework: Relation-Aware Matching (RAM) refines pseudo-labels via vision-language grounding, Pair Affinity Learning and Scoring (PALS) learns to distinguish interactive from non-interactive pairs, and Pair Affinity Modulation (PAM) gates attention based on affinity scores. This substantially narrows the gap to full supervision while reducing annotation costs.
This paper tackles test-time adaptation (TTA) for large multimodal 3D vision-language models under distribution shifts. The core idea is BayesMM, which models both textual and geometric features as Gaussian distributions and fuses them via Bayesian model averaging. Unlike cache-based methods that store discrete samples, this approach claims to avoid progressive information loss and heuristic hyperparameter tuning while maintaining training-free operation.
GeoFusion-CAD tackles the scalability bottleneck in parametric CAD generation, where Transformer-based methods struggle with long command sequences due to quadratic attention costs. The authors propose an end-to-end diffusion framework that encodes CAD programs as hierarchical trees and processes them with G-Mamba blocks—geometry-conditioned state-space models that achieve linear complexity $\mathcal{O}(Ld)$ while capturing geometric and topological dependencies. This enables scaling to sequences of up to 240 commands while maintaining high geometric fidelity.
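The linear $\mathcal{O}(Ld)$ cost quoted for state-space processing can be made concrete with a minimal diagonal recurrence. The sketch below is not a G-Mamba block: it has no geometry conditioning and no input-dependent parameters. It assumes one scalar state per channel with fixed coefficients `decay`, `in_gain`, and `out_gain`, all hypothetical names.

```python
def diagonal_ssm_scan(inputs, decay, in_gain, out_gain):
    """Run h_t = decay * h_{t-1} + in_gain * x_t and y_t = out_gain * h_t per channel.

    `inputs` is a list of L vectors of length d. The loop does constant work per
    (step, channel) pair, so the total cost is O(L * d).
    """
    dim = len(decay)
    state = [0.0] * dim
    outputs = []
    for x in inputs:  # L steps
        state = [decay[k] * state[k] + in_gain[k] * x[k] for k in range(dim)]  # d updates
        outputs.append([out_gain[k] * state[k] for k in range(dim)])
    return outputs


# Toy sequence of three commands with a single feature channel.
print(diagonal_ssm_scan([[1.0], [1.0], [1.0]], decay=[0.5], in_gain=[1.0], out_gain=[1.0]))
```

Doubling the sequence length doubles the work here, whereas full self-attention over the same sequence would compare every pair of positions.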
Multiple Instance Learning (MIL) for gigapixel pathology images relies on a single linear layer to transform general patch features into task-specific representations before aggregation. This paper identifies this linear layer as a critical yet overlooked bottleneck and proposes Mammoth, a parameter-efficient mixture-of-experts module that replaces it with multi-headed soft routing to specialized low-rank experts. By routing morphologically similar patches to distinct expert slots, Mammoth achieves superior performance without increasing model size, demonstrating that the feature transformation step matters more than the choice of aggregation function.
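To show what replacing a single linear layer with softly routed low-rank experts might look like, here is a minimal sketch. It is not the Mammoth module. It assumes a single head, a per-patch softmax gate, and experts of the form up-projection times down-projection, and all names and weights are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def low_rank_expert(x, down, up):
    """Rank-r map: project to r dimensions with `down`, then back out with `up`."""
    return matvec(up, matvec(down, x))


def soft_routed_transform(x, router, experts):
    """Blend the outputs of all experts using softmax gate weights for this patch."""
    gates = softmax(matvec(router, x))
    outputs = [low_rank_expert(x, down, up) for down, up in experts]
    dim = len(outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]


# Two rank-1 experts on a 2-d patch feature; a zero router gives equal gate weights.
experts = [([[1.0, 0.0]], [[1.0], [0.0]]), ([[0.0, 1.0]], [[0.0], [1.0]])]
print(soft_routed_transform([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], experts))
```

Each rank-r expert stores 2·d·r weights instead of d·d, which is how a mixture of this kind can add specialization without growing the parameter count.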
Single-image 3D reconstruction is fundamentally ill-posed: one view admits many valid 3D explanations, especially under occlusion and structural variation. This paper tackles the problem by learning an adaptive part-whole hierarchy rather than fixed-decomposition or monolithic representations. The core idea is a slot-based architecture where an image-conditioned gating mechanism predicts which latent structural slots to activate per instance, coupled with a class-agnostic prototype bank that aligns active slots to shared geometric priors via soft attention. This eliminates the need for user-specified part counts while encouraging cross-category reuse of recurring structural patterns like legs or handles.
Pre-trained vision encoders excel at 2D recognition but lack 3D spatial awareness. SpatialBoost addresses this by converting dense 3D spatial information from 2D images into linguistic expressions, then injecting them into frozen vision encoders via LLM-based training with a novel dual-channel attention mechanism. The framework improves performance on spatial tasks (depth estimation, robot control) while maintaining or enhancing general vision capabilities (ImageNet classification), suggesting language serves as an effective supervision signal for geometric understanding.
Neyman–Pearson multiclass classification (NPMC) handles asymmetric error costs by constraining class-specific misclassification rates, yet existing methods fail when training labels are corrupted. This paper proposes an empirical likelihood (EL) framework that recovers true class proportions and posterior probabilities from noisy labels via an exponential tilting density ratio model, enabling valid error control without prior knowledge of the noise transition matrix. The approach combines semiparametric estimation theory with a practical EM algorithm, yielding classifiers that satisfy NP oracle inequalities asymptotically.
Hand-object interaction (HOI) video generation is currently split between pose-only synthesis, static appearance generation, and motion methods requiring ground-truth first frames. This paper introduces PAM, a three-stage Pose–Appearance–Motion engine that generates high-resolution HOI videos from only initial/target poses and object geometry, achieving true sim-to-real transfer. The system combines GraspXL for pose trajectory generation, Flux for appearance synthesis with multimodal ControlNet conditioning, and CogVideoX for motion generation, producing 480×720 videos while lowering FVD (where lower is better) from 38.83 to 29.13 on DexYCB relative to prior work.
Video-LLMs struggle with high computational costs from massive visual token volumes (e.g., 6,272 tokens for a 32-frame video). This paper challenges the standard two-stage spatiotemporal compression paradigm—which assumes spatial and temporal redundancy are separable—by reformulating compression as a global allocation problem. The authors propose a unified selection mechanism combining attention weights and semantic similarity to identify high-contribution, low-redundancy tokens, plus a text-aware merging module for secondary compression inside the LLM. The result is a training-free, plug-and-play method that retains ~90% performance with only 2% of tokens.
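The idea of picking high-contribution, low-redundancy tokens from one global pool can be sketched with a greedy rule. This is an assumed selection rule, not the paper's: tokens are visited in order of attention weight and skipped when they are too similar to one already kept. The threshold `max_similarity` and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_tokens(features, attention, budget, max_similarity=0.9):
    """Greedily keep up to `budget` tokens: highest attention first, near-duplicates skipped."""
    order = sorted(range(len(features)), key=lambda i: attention[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) >= budget:
            break
        if all(cosine(features[i], features[j]) <= max_similarity for j in kept):
            kept.append(i)
    return sorted(kept)  # restore original token order


# Tokens 0 and 1 are near-duplicates, so only the higher-attention one survives.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
attention = [0.9, 0.8, 0.3, 0.1]
print(select_tokens(features, attention, budget=2))
```

Because all tokens from all frames compete in a single pool, the budget is not split into separate spatial and temporal stages, which is the contrast the summary draws with two-stage compression.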
This paper benchmarks Vision Transformer backbones (ViT-B, ViT-L, ViT-H) within a Local pattern Self-Supervised Auxiliary Task (L-SSAT) framework. The core idea fuses Local Directional Pattern (LDP) texture descriptors with RGB inputs via Masked Autoencoder reconstruction as an auxiliary task to primary face classification. The study addresses whether a unified backbone exists across diverse face analysis tasks including deepfake detection (FaceForensics++), attribute prediction (CelebA), and emotion recognition (AffectNet).
Direct Preference Optimization (DPO) for Vision-Language Models suffers from Likelihood Displacement, where optimization collapses the probabilities of both chosen and rejected responses, causing models to abandon visual evidence for language priors. This paper proposes Asymmetric Constrained Preference Optimization (ACPO), which applies dynamic, length-aware scaling exclusively to the rejected reward term, preserving the chosen distribution as a stable anchor while selectively suppressing incorrect outputs.
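The asymmetry described here can be shown by comparing it against the standard DPO loss. In the sketch below the scaling factor is applied only to the rejected reward term, as the summary states. How ACPO computes that factor from response length is not given here, so `rejected_scale` is left as a plain input, and the function name is hypothetical.

```python
import math


def asymmetric_preference_loss(logp_chosen, ref_logp_chosen,
                               logp_rejected, ref_logp_rejected,
                               beta=0.1, rejected_scale=1.0):
    """DPO-style loss with a scale on the rejected reward only.

    With rejected_scale == 1.0 this is the standard DPO loss
    -log(sigmoid(r_chosen - r_rejected)), where r = beta * (log pi - log pi_ref).
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - rejected_scale * reward_rejected  # chosen term left unscaled
    # Stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Same log-probabilities, with and without extra weight on the rejected term.
print(asymmetric_preference_loss(-1.0, -1.0, -2.0, -1.0))
print(asymmetric_preference_loss(-1.0, -1.0, -2.0, -1.0, rejected_scale=2.0))
```

Scaling only the rejected term changes how strongly the rejected response is pushed down while leaving the chosen term's contribution unchanged, which is one way to read the "stable anchor" described above.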