Predicting how complex scenes evolve is essential for intelligent systems, yet dense video generation expends enormous compute on appearance rather than dynamics. This paper introduces Myriad, an autoregressive diffusion model that predicts future motion via sparse point trajectories, explicitly avoiding the 'visual tax' of pixel-level generation. By modeling step-wise uncertainty accumulation through flow matching and utilizing fused transformer blocks, the method achieves throughput of 2200 samples/min compared to less than 1 for video models, while matching or exceeding their predictive accuracy on motion-focused benchmarks.
Neural Computers (NCs) propose a new machine form where computation, memory, and I/O are unified inside a learned latent runtime state rather than separated as in conventional computers or external as in agents. This work instantiates early NC prototypes as video models that roll out terminal and desktop interfaces from text, pixels, and actions—showing that basic I/O alignment and short-horizon control are learnable without privileged program state. The results demonstrate early runtime primitives but also highlight that symbolic stability, routine reuse, and runtime governance remain unsolved on the long path toward the envisioned Completely Neural Computer (CNC).
This paper introduces the Vision Transformer (ViT), which applies a standard Transformer encoder directly to sequences of image patches for image classification. The core insight is that convolutional inductive biases (locality and translation equivariance) are unnecessary when models are pre-trained at sufficient scale, specifically on datasets containing 14M to 300M images. When transferred to downstream benchmarks, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to pre-train.
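To make the "sequence of image patches" idea concrete, here is a minimal sketch in plain Python of how an image might be cut into flattened patch vectors. The function name `patchify`, the toy image, and the patch size are assumptions for illustration; the learned linear projection, class token, and position embeddings that a full ViT would add are left out.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order: rows, then columns, then channels.
            vec = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            tokens.append(vec)
    return tokens


# Toy 4x4 image with 3 channels; each pixel value encodes its position.
toy = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)  # 4 tokens, each of length 2 * 2 * 3 = 12
```

Each entry of `tokens` plays the role of one input token; in this sketch the sequence length is (H / patch) * (W / patch).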
This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.
GISTBench evaluates whether LLMs can accurately extract user interests from behavioral interaction histories in recommendation systems. Unlike traditional benchmarks that optimize for item prediction accuracy, it verifies whether predicted interests are actually grounded in engagement signals, using two novel metrics: Interest Groundedness ($IG$) and Interest Specificity ($IS$). The authors find that current LLMs struggle primarily with recall (discovering all verifiable interests) rather than hallucination, revealing critical bottlenecks in evidence counting across heterogeneous signal types.
Recursive Language Models (RLMs) tackle the long-context problem by treating prompts as external environment variables that an LLM can programmatically manipulate through a REPL. Instead of feeding long prompts directly into the neural network, RLMs use symbolic code execution to decompose, filter, and recursively invoke sub-models over prompt snippets. This allows processing inputs up to 10M+ tokens—two orders of magnitude beyond typical context windows—while maintaining strong performance on complex aggregation tasks.
UniMotion addresses the fragmentation in human motion modeling by unifying motion, text, and RGB understanding/generation within a single 1.5B parameter architecture. Unlike prior work relying on discrete tokenization or handling only partial modality subsets, it treats motion as a continuous first-class modality via a Cross-Modal Aligned Motion VAE (CMA-VAE). The framework introduces Dual-Posterior KL Alignment to distill visual semantics into motion representations without requiring images at inference, and Latent Reconstruction Alignment to bootstrap the motion pathway through dense self-supervision before sparse text calibration.
ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.
3D-Layout-R1 tackles language-guided 3D spatial editing by training LLMs/VLMs to perform structured reasoning over explicit scene graphs. Instead of free-form chains-of-thought, the model outputs JSON graph edits that iteratively transform object poses and relations, combined with GRPO-based RL using dense 3D IoU and collision-aware rewards. This approach yields measurable gains in layout accuracy while maintaining interpretability across sorting, spatial alignment, and room-editing tasks.
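As a rough illustration of what a dense 3D IoU signal with a collision term could look like, the sketch below scores axis-aligned boxes. The box encoding, the function names, and the `collision_penalty` weight are all assumptions; the summary does not state the exact reward, and real scenes may need oriented boxes.

```python
from typing import Sequence, Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax). This encoding is an assumption.
Box = Tuple[float, float, float, float, float, float]


def volume(box: Box) -> float:
    """Volume of a box; degenerate boxes count as zero."""
    return max(0.0, box[3] - box[0]) * max(0.0, box[4] - box[1]) * max(0.0, box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def layout_reward(pred: Box, target: Box, others: Sequence[Box],
                  collision_penalty: float = 0.5) -> float:
    """Hypothetical reward: IoU with the target minus a penalty per colliding object."""
    collisions = sum(1 for other in others if iou_3d(pred, other) > 0.0)
    return iou_3d(pred, target) - collision_penalty * collisions


unit = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted = (1.0, 1.0, 1.0, 3.0, 3.0, 3.0)
overlap = iou_3d(unit, shifted)  # intersection 1, union 15
```

Because IoU varies continuously with the predicted pose, a reward of this shape gives partial credit to near-misses, which is the usual motivation for calling it dense.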
TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.
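The marker idea can be sketched as follows, assuming per-word durations are already known. The function name, the fixed `interval`, and the choice to print the actual elapsed time are illustrative assumptions; only the `<15.0 seconds>` token format comes from the summary, and the paper's insertion rule may differ.

```python
from typing import List, Sequence


def insert_time_markers(words: Sequence[str], durations: Sequence[float],
                        interval: float = 5.0) -> List[str]:
    """Interleave words with elapsed-time markers roughly every `interval` seconds."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    out: List[str] = []
    elapsed = 0.0
    next_mark = interval
    for word, seconds in zip(words, durations):
        out.append(word)
        elapsed += seconds
        if elapsed >= next_mark:
            # Emit the elapsed time so far, in the marker format quoted in the summary.
            out.append(f"<{elapsed:.1f} seconds>")
            # Skip any thresholds already passed so each marker is emitted once.
            while next_mark <= elapsed:
                next_mark += interval
    return out


marked = insert_time_markers(["turn", "left", "ahead"], [3.0, 3.0, 5.0])
```

A model trained on sequences like `marked` would see how much time has passed while it generates, which is the kind of temporal awareness the summary describes.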
This paper proposes a bold interdisciplinary bridge between holographic string dualities and artificial intelligence, hypothesizing that AI tasks such as language modeling can be viewed as particle trajectory prediction on graphs admitting a holographically dual "string" description. Drawing on the AdS/CFT correspondence, the authors conjecture that word metrics on $S_n$ Cayley graphs correspond to areas under lattice paths in dual planar polygons, verified computationally via their CayleyPy library.
This paper tackles the visual perception gap in automated text layout generation. While existing Multimodal Large Language Models (MLLMs) generate layout code (SVG/JSON) to render text on images, they operate blind to the actual rendered output, producing layouts with overlapping text, poor contrast, or misalignment. The authors propose Visual Feedback Layout Model (VFLM), which closes the loop by rendering generated SVGs and feeding the visual results back to the model for iterative reflection and refinement. The framework uses a two-stage pipeline—cold-start supervised fine-tuning followed by reinforcement learning with GRPO—and introduces a specialized layout reward model trained on fine-grained quality hierarchies. A surprising finding is that simple outcome-based rewards outperform complex process-oriented rewards that explicitly encode step-wise incentives.
Large language models have historically lagged behind specialized encoder-decoder MT systems, but their superior context modeling makes them natural candidates for document-level translation. This paper tackles two key obstacles: the scarcity of high-quality document-level parallel corpora and the tendency of LLMs to hallucinate or omit content. The authors propose a two-stage fine-tuning framework. Synthetic document-level data is first generated from summarization corpora via LLM augmentation and filtered using sacreBLEU, COMET, and LaBSE cosine similarity; models are then trained on sentence-level data before being adapted to the filtered document corpus.
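The filtering step might look like the sketch below, which assumes the three scores have already been computed for each synthetic pair. The `Candidate` fields, the thresholds, and the rule that all three checks must pass are hypothetical; the summary names the metrics but not how they are combined.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    doc_id: str
    bleu: float                 # assumed precomputed sacreBLEU score
    comet: float                # assumed precomputed COMET score
    src_vec: Sequence[float]    # assumed precomputed LaBSE embedding of the source
    tgt_vec: Sequence[float]    # assumed precomputed LaBSE embedding of the translation


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep(c: Candidate, min_bleu: float = 20.0, min_comet: float = 0.8,
         min_cos: float = 0.75) -> bool:
    """Hypothetical rule: a pair survives only if it clears all three thresholds."""
    return (c.bleu >= min_bleu and c.comet >= min_comet
            and cosine(c.src_vec, c.tgt_vec) >= min_cos)


pool: List[Candidate] = [
    Candidate("good", 35.0, 0.86, [1.0, 0.0], [1.0, 0.0]),
    Candidate("drifted", 35.0, 0.86, [1.0, 0.0], [0.0, 1.0]),  # embeddings disagree
    Candidate("low_bleu", 5.0, 0.86, [1.0, 0.0], [1.0, 0.0]),
]
kept = [c.doc_id for c in pool if keep(c)]
```

Requiring agreement across a lexical metric, a learned metric, and an embedding similarity is one plausible way to screen out hallucinated or truncated synthetic translations.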
Autonomous vehicles struggle with adverse weather perception. This paper proposes LRC-WeatherNet, a lightweight fusion network combining LiDAR, RADAR, and camera via early BEV fusion and mid-level gating to classify weather conditions in real time. The approach achieves $86.66\%$ accuracy on the MSU-4S dataset with $7.13\,\mathrm{ms}$ inference, demonstrating that adaptive multi-modal fusion outperforms unimodal baselines, though dataset limitations restrict generalization to rare weather events.
This paper distinguishes different forms of reasoning by the structural properties they demand from underlying representational systems. The core insight is that deduction requires four specific properties (operability, consistency, structural preservation, and compositionality) that cannot be secured through mere statistical scaling. This has significant implications for AI systems and cognitive science, providing a principled boundary between reasoning that can rely on associative approximations and reasoning that requires structural guarantees.
MARCUS tackles the bottleneck of human interpretation in cardiovascular diagnosis by creating an agentic, multimodal vision-language model that jointly reasons over raw ECG signals, echocardiogram videos, and cardiac MRI. The core innovation is a hierarchical architecture where modality-specific expert encoders feed into an orchestrating agent that synthesizes findings while resisting 'mirage reasoning'—the tendency of VLMs to confabulate explanations without actually processing the image. Trained on 13.5 million clinical images and 1.6 million expert-curated Q&A pairs, MARCUS aims to bridge the gap between single-task diagnostic AI and interactive clinical reasoning.
This paper attacks the expensive problem of annotating NLP test sets by importing Active Testing (AT) from computer vision into language tasks. Given a labeling budget $B$, the goal is to select a subset $X_A$ that minimizes the estimation error $|M(X_F) - M(X_A)|$ between full and sampled test-set metrics, potentially cutting annotation costs by up to 95% while keeping prediction error under 1%. The core mechanism couples importance-weighted unbiased estimators with acquisition strategies (including a novel Agreement strategy based on attention-head disagreement) and an adaptive stopping criterion that removes the need to pre-specify the budget.
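The importance-weighting idea can be illustrated with the textbook with-replacement estimator below. This is a sketch, not necessarily the estimator the paper uses: the function name, the proposal weights, and the with-replacement sampling are assumptions, and `values[i]` stands for the per-example metric that would only become known once example `i` is labeled.

```python
import random
from typing import Sequence


def importance_weighted_mean(values: Sequence[float], proposal: Sequence[float],
                             budget: int, rng: random.Random) -> float:
    """Estimate the mean of `values` from `budget` draws made with probability
    proportional to `proposal`, reweighting each draw so the estimate is unbiased."""
    n = len(values)
    if any(w <= 0 for w in proposal):
        raise ValueError("every example needs a positive sampling weight")
    total = sum(proposal)
    picked = rng.choices(range(n), weights=proposal, k=budget)
    # Each draw is divided by n * q_i, where q_i is its sampling probability.
    return sum(values[i] / (n * proposal[i] / total) for i in picked) / budget


losses = [1.0, 2.0, 3.0, 4.0]
# Sampling in proportion to the loss itself is the ideal (zero-variance) proposal:
# every reweighted draw equals the true mean, 2.5.
estimate = importance_weighted_mean(losses, proposal=losses, budget=2,
                                    rng=random.Random(0))
```

In practice the true losses are unknown before labeling, so an acquisition strategy supplies a surrogate for `proposal`; the closer the surrogate tracks the real per-example metric, the fewer labels are needed for a given error.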
SecureBreak introduces a response-level safety dataset designed to detect harmful LLM outputs that bypass alignment mechanisms. Unlike existing benchmarks that classify prompts, this work focuses on binary classification of generated responses (safe vs. unsafe) across 3,059 samples from multiple model families including Llama, Qwen, Gemma, and Mistral. The core value proposition is providing a 'last-line defense' layer for post-generation filtering and supervisory signals to guide security re-alignment, addressing the growing threat of jailbreak attacks.
This paper argues that frictionless AI interfaces pose a systemic risk of "cognitive agency surrender"—the habitual abdication of human reasoning to algorithmic systems. Drawing on cognitive psychology, the authors theorize "Scaffolded Cognitive Friction" as a defense: intentionally injecting epistemic tension via Multi-Agent Systems (MAS) that expose structured disagreements (computational Devil's Advocates) to force System 2 activation. The work positions itself as a bridge between HCI, cognitive science, and AI governance.
Foundation models for Earth observation risk learning spurious correlations when pretraining with random masking. This paper proposes SpecTM (Spectral Targeted Masking), which deterministically masks pigment-sensitive spectral bands (phycocyanin, chlorophyll-a, red-edge) to enforce physics-based cross-spectral learning. Validated on microcystin concentration prediction using NASA PACE hyperspectral imagery over Lake Erie, the method achieves $R^2=0.695$ (current week) and $R^2=0.620$ (8-day-ahead), showing strong label efficiency but limited geographic validation.
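A deterministic, band-targeted mask could be sketched as below. The wavelength windows are rough illustrative values chosen around the commonly cited absorption features of each pigment, not the bands used in the paper, and the function names are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

# Illustrative wavelength windows in nanometres; assumed, not taken from the paper.
TARGET_WINDOWS_NM: Dict[str, Tuple[float, float]] = {
    "phycocyanin": (610.0, 630.0),
    "chlorophyll_a": (660.0, 680.0),
    "red_edge": (700.0, 720.0),
}


def targeted_mask(wavelengths: Sequence[float]) -> List[bool]:
    """True for every band whose centre wavelength falls inside a target window."""
    return [
        any(lo <= w <= hi for lo, hi in TARGET_WINDOWS_NM.values())
        for w in wavelengths
    ]


def apply_mask(spectrum: Sequence[float], mask: Sequence[bool],
               fill: float = 0.0) -> List[float]:
    """Replace masked bands with a fill value; the model must reconstruct them."""
    return [fill if hidden else value for value, hidden in zip(spectrum, mask)]


bands = [550.0, 620.0, 670.0, 710.0, 800.0]
mask = targeted_mask(bands)
masked = apply_mask([0.1, 0.2, 0.3, 0.4, 0.5], mask)
```

Unlike random masking, the same bands are hidden for every sample, so the pretraining task always asks the model to infer the pigment-sensitive bands from the rest of the spectrum.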