This paper addresses a critical gap in embodied AI: while Multimodal Large Language Models (MLLMs) excel at reactive, short-horizon planning, they fail at the kind of "mental navigation" seen in biological agents, that is, the ability to construct global cognitive maps from long egocentric videos and simulate paths before acting. To tackle this, the authors introduce Video2Mental, a benchmark that requires models to generate structured hierarchical cognitive maps from videos longer than five minutes and then produce landmark-grounded navigation plans, which are validated in the Habitat simulator. They also propose NavMind, a Qwen3-VL-based model trained via difficulty-stratified progressive supervised fine-tuning to internalize these structured representations.
This paper tackles the inefficiency of Interleaved-Modal Chain-of-Thought (ICoT) reasoning: current methods statically insert visual tokens after every reasoning step, which wastes compute on redundant image embeddings and relies on semantically fragmented patches. DaP-ICoT introduces a confidence-aware gating mechanism that pulls in visual context only when model certainty drops below a threshold, combined with SAM2-based object segmentation that supplies coherent visual thoughts instead of fragmented patches.
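The gating rule in the DaP-ICoT summary can be illustrated with a minimal sketch. The confidence measure (maximum softmax probability), the threshold of 0.6, and the function names are assumptions chosen for illustration; the summary does not specify how certainty is computed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def needs_visual_context(logits, threshold=0.6):
    """Return True when the top-token probability falls below the threshold.

    Assumption: confidence is the maximum softmax probability. The paper
    may use a different certainty measure.
    """
    return max(softmax(logits)) < threshold

confident_step = [5.0, 0.1, 0.2]  # one dominant next-token logit
uncertain_step = [1.0, 0.9, 1.1]  # near-uniform logits

print(needs_visual_context(confident_step))
print(needs_visual_context(uncertain_step))
```

Under this rule, visual tokens would be fetched only for the second step, which is the compute saving the summary attributes to gating.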
Standard Transformers apply fixed-depth computation regardless of problem difficulty, limiting their ability to solve tasks requiring variable-depth reasoning like multi-hop traversal or nested logic. This paper proposes a depth-recurrent Transformer that iteratively applies a shared-weight block in latent space—enabling 'vertical Chain-of-Thought' where models trade recurrence steps for deeper reasoning without consuming context window. The work demonstrates strong compositional generalization on three synthetic tasks and offers a mechanistic alternative to horizontal token-generation paradigms.
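A toy analogy may help show why shared-weight recurrence suits multi-hop traversal. The sketch below is not a Transformer: `shared_block` is a hypothetical stand-in for the shared layer, and the graph is invented for illustration. It shows only the structural idea that one reusable step, applied more times, reaches further without adding parameters or emitting tokens.

```python
def shared_block(state, edges):
    """One 'layer': advance every tracked node by one hop along `edges`."""
    return [edges[node] for node in state]

def depth_recurrent(state, edges, steps):
    """Apply the same block `steps` times; no new parameters per step."""
    for _ in range(steps):
        state = shared_block(state, edges)
    return state

# Hypothetical chain graph: a -> b -> c -> d, with d pointing to itself.
edges = {"a": "b", "b": "c", "c": "d", "d": "d"}

print(depth_recurrent(["a"], edges, steps=1))
print(depth_recurrent(["a"], edges, steps=3))
```

A fixed-depth model corresponds to a hard-coded `steps`, whereas the depth-recurrent design lets `steps` grow with the number of hops a query needs.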
This work addresses zero-shot detection of AI-generated images by measuring how Vision Foundation Model (VFM) representations respond to structured high-frequency perturbations. The core idea is that synthetic images contain characteristic frequency biases, causing their embeddings to shift differently than real images when high-frequency noise is applied to local patches. The method achieves strong detection accuracy while requiring only a single Fourier transform and one forward pass, making it one to two orders of magnitude faster than comparable training-free approaches.
Long-context LLM inference hits a memory wall: each decode step scans the entire KV cache, incurring $O(n)$ memory traffic that faster arithmetic cannot relieve. PRISM proposes a thin-film lithium niobate photonic accelerator that performs the block-selection similarity search in $O(1)$ optical latency using a broadcast-and-weight architecture, eliminating the $O(n)$ scan entirely. The work claims a $16\times$–$32\times$ traffic reduction at 64K–128K tokens and a four-order-of-magnitude energy advantage over GPU baselines, achieved by matching three photonic hardware capabilities (passive query broadcast, quasi-static microring weights, and low-precision rank output) to the selection task.
MetaCompress addresses token reduction for multi-turn VQA in Large Vision-Language Models, where future questions are unpredictable and may target any image region. The paper proposes a learning-based prompt-agnostic compression module trained via KL divergence minimization between original and compressed outputs, demonstrating that heuristic attention-based pruning is suboptimal for this scenario. The method achieves strong efficiency-accuracy trade-offs across five LVLM architectures while training on only ~20k samples.
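The training signal named in the MetaCompress summary, KL divergence between original and compressed outputs, can be sketched for discrete distributions as follows. The example distributions and the direction KL(original || compressed) are assumptions; the summary does not state which direction the paper minimizes.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical next-token distributions over a three-token vocabulary.
original_output = [0.7, 0.2, 0.1]    # model with all visual tokens
compressed_output = [0.5, 0.3, 0.2]  # model with compressed visual tokens

print(kl_divergence(original_output, compressed_output))
```

Because the loss compares output distributions without reference to any particular question, it matches the prompt-agnostic setting the summary describes.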
This paper tackles the domain generalization problem in image deraining, where models trained on synthetic data fail catastrophically on out-of-distribution (OOD) real-world scenarios. The authors propose a three-stage pipeline—Superpixel Generation, Resolution-adaptive Fusion, and Pseudo-label Re-Synthesis—that adapts source-domain models to target domains using only unpaired rain-free images, eliminating the need for costly paired rainy data collection.
This paper examines representation genesis—the transition from non-representational physical systems to those with content-manipulable states. It argues that major frameworks in philosophy of mind (Language of Thought, teleosemantics, predictive processing, enactivism, and genetic phenomenology) share a 'Representation Presupposition' structure that prevents them from explaining this first acquisition without circularity. With large language models now achieving high cognitive performance without clear genesis events, the absence of a satisfactory theory becomes urgent.
Retrieval-Augmented Generation (RAG) systems mitigate large language model hallucinations by integrating external knowledge bases, yet this multi-module architecture introduces complex security vulnerabilities spanning data poisoning, membership inference, and adversarial manipulation. This survey systematically maps threats across the RAG pipeline—vector database construction, retrieval, and generation—and categorizes corresponding defenses from input-side access control to output-side privacy preservation. As a comprehensive review of 152 papers, it aims to unify the analysis of threat models, defense mechanisms, and evaluation benchmarks to foster trustworthy RAG deployments in sensitive domains.
AgenticRec attacks a key gap in LLM-based recommenders: existing agents rely on frozen reasoning chains and cannot learn from ranking feedback to refine tool use. The paper proposes a two-stage training framework that combines ReAct-style tool invocation with list-wise Group Relative Policy Optimization (GRPO) and Progressive Preference Refinement (PPR) for hard-negative mining. The work matters because it demonstrates that end-to-end reinforcement learning can align multi-step tool use with ranking objectives, moving beyond prompt-engineered agent workflows.
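The group-relative normalization at the core of GRPO can be sketched as below. The reward values are invented, and how AgenticRec derives a list-wise ranking reward is not specified in the summary. Using the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical ranking rewards for three sampled trajectories of one query.
group_rewards = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Trajectories that rank better than their group's average receive positive advantages, so no separate value network is needed to form the baseline.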
This paper studies generalization error bounds for Transformer models using offset Rademacher complexity. The core idea is to derive sharp excess risk bounds that achieve optimal $O(1/n)$ convergence rates, improving upon the standard $O(1/\sqrt{n})$ rate, for single-head, multi-head, and multi-layer architectures, with explicit dependence on matrix ranks and parameter norms. The authors further extend these results to unbounded sub-Gaussian and heavy-tailed input distributions, broadening the applicability beyond standard boundedness assumptions.
High-fidelity CFD for vehicle aerodynamic drag is bottlenecked not by solver wall time but by workflow friction—CAD cleanup, meshing retries, and queue contention. This paper proposes a contract-centric blueprint where self-evolving coding agents search over executable surrogate programs (not static models) to predict drag coefficient $C_d$ under industrial constraints. The system combines Famou-Agent-style evaluator feedback with population-based island evolution and hard evaluation contracts that enforce leakage prevention, deterministic replay, and resource budgets, aiming for a screen-and-escalate deployment where uncertain cases trigger automatic fallback to high-fidelity CFD.
Rule-State Inference (RSI) addresses compliance monitoring in domains like taxation where authoritative rules are known a priori but observations are partial, noisy, or strategically distorted. The paper proposes a Bayesian framework that inverts the standard ML paradigm: instead of learning rules from data, RSI encodes regulatory rules as structured priors and infers latent rule states (activation, compliance rate, parametric drift) via posterior inference. This enables zero-shot compliance assessment without labeled training data—a critical capability for low-resource environments where non-compliance labels are scarce or legally sensitive.
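The idea of encoding a rule as a prior and updating it from partial observations can be illustrated with a conjugate Beta-Binomial update on a compliance rate. This is a simplified sketch under stated assumptions: the Beta prior, the pseudo-counts, and the function name are illustrative and not taken from the paper, which may use a richer model covering activation and parametric drift as well.

```python
def posterior_compliance(prior_a, prior_b, compliant, total):
    """Beta-Binomial update: return (a, b, posterior mean) for a compliance rate.

    prior_a / prior_b are pseudo-counts expressing the rule-derived prior
    (assumed here, not specified by the source).
    """
    a = prior_a + compliant
    b = prior_b + (total - compliant)
    return a, b, a / (a + b)

# Hypothetical case: the rule implies roughly 80% compliance (Beta(8, 2)),
# but only 3 of 10 audited filings comply.
a, b, rate = posterior_compliance(prior_a=8, prior_b=2, compliant=3, total=10)
print(a, b, rate)
```

No labeled training set is involved: the prior comes from the rule and the update uses only the observations at hand, which is the sense in which the summary calls the assessment zero-shot.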
MIST addresses the challenge of generating high-quality SQL test cases for Database Management Systems using lightweight Large Language Models. The framework combines a feature-guided synthesis stage that leverages hierarchical documentation structures with error feedback, and a Monte Carlo Tree Search-based mutation stage to overcome coverage plateaus. This two-pronged approach aims to achieve high code coverage in resource-constrained industrial environments where only small LLMs can be deployed locally.
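Monte Carlo Tree Search needs a rule for choosing which branch to expand next. A common choice is UCB1, sketched below. Whether MIST uses UCB1 is an assumption, and the mutation names and reward values are hypothetical.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """children: list of (name, total_reward, visits); return the best name."""
    return max(children, key=lambda ch: ucb1(ch[1], ch[2], parent_visits))[0]

# Hypothetical mutation operators with coverage-gain rewards.
children = [("mutate_join", 3.0, 5), ("mutate_where", 1.0, 1)]
print(select_child(children, parent_visits=6))
```

The exploration bonus favors rarely tried mutations, which is one way a search can escape the coverage plateaus the summary mentions.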
This paper addresses robotic reaching in cluttered, unseen environments using a hybrid rigid-soft continuum manipulator. The core idea is a real-time pipeline that combines multi-view RGB reconstruction (Mast3r), open-world object detection (YOLO-World), shape-aware RRT* planning with asymmetric collision constraints, and a learned controller trained on pose-to-actuation data. If validated at scale, this could enable robots to navigate dense foliage or disaster debris where rigid arms fail and pure soft arms lack reach.
The paper critiques the institutionalization of LLM benchmarks as "Silicon Bureaucracy" and "AI Test-Oriented Education", arguing that high scores often conflate exam-oriented competence with genuine generalization because of data contamination. It proposes an audit framework built on a router-worker setup: clean-control routers transmit each question in full, while noisy routers delete, rewrite, and perturb question content before aggregation. For clean benchmarks, noisy aggregation should not systematically exceed the clean baseline; persistent above-baseline gains suggest contamination-related memory activation. The core finding, that 10 of 12 models exceed clean baselines under multi-router noisy conditions, challenges the interpretability of raw benchmark scores.
DiT-Flow tackles multi-condition speech enhancement (noise, reverberation, codec compression) by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because most SE models fail when deployed on real-world audio with compound distortions unseen during training.
CAID tackles long-horizon software engineering tasks where single agents struggle with accuracy and wall-clock time. The core idea is Centralized Asynchronous Isolated Delegation: a manager decomposes tasks into dependency graphs and delegates to multiple engineer agents working in isolated git worktrees, integrating progress via branch-and-merge. The system improves accuracy by 26.7% absolute on PaperBench and 14.3% on Commit0, demonstrating that structured coordination grounded in SWE primitives outperforms simply scaling single-agent iteration budgets.
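The decomposition of a task into a dependency graph can be sketched with the standard-library `graphlib` module. The task names are hypothetical, and grouping tasks into synchronous waves is a simplification: the summary describes CAID's delegation as asynchronous, and this sketch does not model git worktrees or merging.

```python
from graphlib import TopologicalSorter

def delegation_waves(dependencies):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Hypothetical task graph: each key depends on the tasks in its set.
deps = {
    "models": set(),
    "api": {"models"},
    "cli": {"api"},
    "tests": {"api", "models"},
}
print(delegation_waves(deps))
```

Each wave lists tasks a manager could hand to separate engineer agents at the same time, since none of them waits on another task in the same wave.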
AdaRubric addresses the static-rubric bottleneck in LLM-as-Judge evaluation by dynamically generating task-specific evaluation dimensions from task descriptions. It scores agent trajectories step by step with confidence-weighted per-dimension feedback and filters preference pairs using the DimensionAwareFilter, a mechanism the paper describes as provably necessary to prevent high-scoring dimensions from masking failures. The approach achieves a Pearson correlation of $r=0.79$ with human judgments and yields substantial downstream gains: +6.8 to +8.5 percentage points in DPO task success and a +6.6 pp advantage at 5K PPO steps, reflecting faster convergence.
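The masking problem that motivates a dimension-aware filter can be shown with a small example. The per-dimension floor of 3.0, the dimension names, and both helper functions are hypothetical; the summary does not describe how DimensionAwareFilter is actually implemented.

```python
def mean_score(scores):
    """Average over all rubric dimensions."""
    return sum(scores.values()) / len(scores)

def passes_dimension_filter(scores, floor=3.0):
    """Accept a trajectory only if every dimension meets the floor (assumed rule)."""
    return min(scores.values()) >= floor

# Hypothetical trajectory on a 1 to 5 scale: strong on two dimensions,
# failing on one.
trajectory = {"correctness": 5.0, "efficiency": 5.0, "safety": 1.0}

print(mean_score(trajectory))
print(passes_dimension_filter(trajectory))
```

An average-only filter would accept this trajectory because two high scores pull the mean above the floor, whereas a per-dimension check rejects it for the failing dimension.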
The paper tackles instability in multi-agent reinforcement learning for LLM reasoning, where noisy, heavy-tailed rewards break standard GRPO batch-mean normalization. It proposes DACR, a structured Answer-Critique-Rewrite protocol with cross-improvement rewards, and ARE, a robust estimator that replaces empirical means with a Median-of-Means variant using adaptive losses. Experiments on mathematical reasoning and aerial vision-language navigation demonstrate improved accuracy and training stability under synthetic noise contamination.
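The robustness argument can be illustrated with a plain Median-of-Means estimator. This is the textbook version, not the paper's ARE variant with adaptive losses. The reward values and the strided block split are assumptions for illustration.

```python
from statistics import mean, median

def median_of_means(values, num_blocks):
    """Split values into blocks, average each block, return the median average."""
    if not 1 <= num_blocks <= len(values):
        raise ValueError("num_blocks must be between 1 and len(values)")
    # Strided split: block i takes every num_blocks-th element starting at i.
    blocks = [values[i::num_blocks] for i in range(num_blocks)]
    return median([mean(block) for block in blocks])

# Hypothetical rewards with one heavy-tailed outlier.
rewards = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 50.0]

print(mean(rewards))
print(median_of_means(rewards, num_blocks=3))
```

The single outlier drags the empirical mean far from the typical reward, while it corrupts only one block average and so leaves the median of block averages close to 1.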