RAG improves factual reliability but doesn't eliminate hallucinations. The paper reveals a mechanistic paradox: induction heads that copy correct answers from context simultaneously trigger entropy neurons that suppress confidence, causing entropy-based uncertainty signals to fail. INTRYGUE gates predictive entropy using induction head activation (SinkRate) to correct this inflation, offering a training-free method for reliable RAG hallucination detection.
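The gating idea can be illustrated with a minimal sketch. The summary does not give the paper's formula, so the multiplicative gate and the `induction_activation` score in [0, 1] below are assumptions made for illustration only.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gated_entropy(probs, induction_activation):
    """Down-weight entropy when induction heads are active.

    `induction_activation` is a hypothetical score in [0, 1]. The
    multiplicative gate is an assumption, not the paper's formula.
    """
    clipped = min(max(induction_activation, 0.0), 1.0)
    gate = 1.0 - clipped
    return gate * predictive_entropy(probs)
```

With this form, a token copied from context under strong induction-head activation contributes little uncertainty even if its raw entropy is high, which is the direction of correction the summary describes.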
This paper applies the classic Abernathy-Utterback (A-U) innovation diffusion model to generative AI's environmental impact. The authors argue that alarmist predictions about GAI's carbon footprint often ignore how innovation diffusion drives process optimization and efficiency gains. They forecast that the GAI industry is transitioning from the 'fluid' A-U:1 phase to the 'transitional' A-U:2 phase, where dominant designs will emerge. The paper predicts two main business models: large generalist platforms serving mass audiences, and smaller specialized models targeting specific use cases. Their core argument is that GAI 'will never be green, but its impact may not be as problematic as is sometimes claimed' depending on which business model dominates.
Concrete mix design requires balancing competing objectives of mechanical strength and sustainability. BOxCrete introduces a Gaussian Process regression framework trained on 533 strength measurements from 123 unique mixtures to predict compressive strength evolution over curing time and optimize mixes for embodied carbon using multi-objective Bayesian Optimization. The work addresses a critical gap in the literature by providing an open-source alternative to proprietary industrial datasets and models.
This paper addresses computational barriers for Brazilian Portuguese question answering by systematically evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on BERTimbau models using the SQuAD-BR dataset. The authors test LoRA, DoRA, QLoRA, and QDoRA across Base (110M) and Large (335M) variants, demonstrating that LoRA achieves 95.8% of full fine-tuning performance while reducing training time by 73.5%. A key finding is that PEFT methods require substantially higher learning rates ($2\times 10^{-4}$) than standard BERT fine-tuning to achieve optimal results, with quantization resilience favoring larger models.
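To see why LoRA is cheap to train, it helps to count parameters. LoRA replaces the update to a d_out x d_in weight matrix with two low-rank factors, so the trainable count is rank * (d_in + d_out). The sketch below assumes a BERT-Base hidden size of 768 and a hypothetical rank of 8; the rank the authors used is not stated in this summary.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


hidden = 768  # BERT-Base hidden size
rank = 8      # hypothetical LoRA rank, chosen for illustration
ratio = lora_trainable_params(hidden, hidden, rank) / full_params(hidden, hidden)
```

For one square projection matrix this gives 12,288 trainable parameters against 589,824 frozen ones, or about 2% of the dense matrix.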
This paper tackles the problem of measuring dialectal bias in LLMs for Bengali, a low-resource language with nine major regional variants. The authors propose a two-phase framework combining RAG-based translation to create dialectal benchmarks with an RLAIF-inspired evaluation protocol that uses CoT-first reasoning and multi-judge validation. They expose the catastrophic failure of traditional metrics like BLEU and WER for agglutinative dialectal Bengali, showing that LLM-as-judge better predicts human quality assessments.
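A small sketch of word error rate shows why exact-match metrics can penalize dialectal variation harshly: any word that differs in form counts as a full error, even when it carries the same meaning. The tokens used here are made-up placeholders, not Bengali data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / max(len(ref), 1)


# Placeholder tokens: one suffix differs, yet half the words count as wrong.
score = word_error_rate("root-a root-b", "root-a root-c")
```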
CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.
Large Language Models often inherit societal biases that manifest as stereotyped associations across demographic groups. This paper proposes CatRAG, a dual-mechanism debiasing framework that combines a category-theoretic functor-guided projection—collapsing protected-attribute directions in embedding space via spectral decomposition—with diversity-aware Retrieval-Augmented Generation to ground inference in balanced evidence. Evaluated on the BBQ benchmark across Llama-3, GPT-OSS, and Gemma-3, the method claims to reduce bias scores from ~60% to near zero while improving accuracy by up to 40% over base models.
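The projection step can be illustrated in its simplest form: removing the component of an embedding that lies along a single protected-attribute direction. This is a deliberately reduced sketch. The paper's functor-guided, spectral construction is not reproduced here, and the function name is hypothetical.

```python
def project_out(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(vec)  # nothing to remove for a zero direction
    coeff = sum(v * d for v, d in zip(vec, direction)) / norm_sq
    return [v - coeff * d for v, d in zip(vec, direction)]


# Toy 2-D embedding with the protected direction along the first axis.
debiased = project_out([3.0, 4.0], [1.0, 0.0])
```

After projection the embedding is orthogonal to the chosen direction, so a linear probe along that direction can no longer separate the groups.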
The paper tackles the inefficiency of homogeneous compute allocation in multi-task supervised fine-tuning (SFT), where fast-learning tasks overfit while slow ones remain under-trained. The authors propose mSFT, an iterative algorithm that dynamically excludes overfitting sub-datasets and reverts to optimal checkpoints. Their approach consistently outperforms baselines across 6 models and 10 benchmarks, sometimes reducing compute while improving accuracy.
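The exclusion logic might be sketched as follows. This is a toy under strong assumptions: validation-loss curves are precomputed and independent of the training mixture, and "reverting" is represented only by recording each task's best step. The function name, the `patience` parameter, and the data are hypothetical.

```python
def msft_schedule(val_loss_curves, patience=1):
    """Return (best step per task, tasks still active at the end).

    `val_loss_curves` maps a task name to its per-step validation losses.
    A task is excluded once its loss has failed to improve for `patience`
    steps; its best step stands in for the checkpoint to revert to.
    """
    best = {}
    active = set(val_loss_curves)
    num_steps = min(len(curve) for curve in val_loss_curves.values())
    for step in range(num_steps):
        for task in sorted(active):  # sorted() copies, so discard is safe
            loss = val_loss_curves[task][step]
            best_step, best_loss = best.get(task, (step, float("inf")))
            if loss < best_loss:
                best[task] = (step, loss)
            elif step - best_step >= patience:
                active.discard(task)  # overfitting: stop training on it
    return {task: step for task, (step, _) in best.items()}, active


curves = {"fast": [1.0, 0.5, 0.6, 0.7], "slow": [1.0, 0.9, 0.8, 0.7]}
best_steps, still_active = msft_schedule(curves)
```

In this toy run the fast-learning task is dropped after its loss turns upward, while the slow task keeps receiving compute.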
Unified-MAS tackles a critical failure mode in automatic Multi-Agent Systems (Auto-MAS): severe performance degradation in knowledge-intensive domains like healthcare and law, where general-purpose reasoning nodes fall short. The core innovation decouples granular node implementation from topological orchestration through an offline two-stage pipeline that synthesizes domain-specific agent nodes via external knowledge retrieval and refines them using a perplexity-guided reward signal. This paradigm matters because it aims to bring general-purpose Auto-MAS to expert-level performance without costly manual engineering of domain-specific agents.
AgentHER tackles the data waste problem in LLM agent training by adapting Hindsight Experience Replay (HER) from RL to natural-language trajectories. The core insight is that failed trajectories—typically 60–75% of collected data—often represent valid demonstrations for achievable alternative goals. The paper proposes a four-stage pipeline with multi-judge verification that converts discarded failures into SFT and DPO training data, yielding +7.1–11.7 pp gains over success-only fine-tuning across four model families on WebArena and ToolBench.
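The relabeling idea at the heart of hindsight replay can be sketched in a few lines. The `Trajectory` fields and the `relabel` helper are hypothetical, and the multi-judge verification stage the summary mentions is omitted.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    goal: str       # the instruction the agent was given
    steps: list     # actions or tool calls taken
    achieved: str   # what the agent actually accomplished, if anything
    success: bool   # whether `achieved` satisfies `goal`


def relabel(traj):
    """Turn a failed trajectory into a demonstration for what it achieved."""
    if traj.success or not traj.achieved:
        return traj  # nothing to relabel
    return Trajectory(
        goal=traj.achieved,
        steps=traj.steps,
        achieved=traj.achieved,
        success=True,
    )


failed = Trajectory("book a flight", ["search", "open"], "found flight prices", False)
reused = relabel(failed)
```

The relabeled trajectory keeps the original steps but pairs them with a goal they do satisfy, which is what makes it usable as supervised training data.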
This paper proposes a conceptual framework for AI-driven scientific discovery that treats autonomous virtual laboratories as the particles of a particle swarm optimization (PSO) system. Each virtual lab, comprising LLM-based agents for planning, experimentation, and review, operates as an independent research unit that interacts with others through citation-analogous voting mechanisms. The central idea is to simulate the emergent dynamics of real scientific communities (exploration-exploitation balance, paradigm formation, natural selection of ideas) without a central coordinator. The work matters because current single-agent systems like The AI Scientist may lack the diversity and error-correction mechanisms that make human science robust.
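For reference, the classic numeric PSO update that the framework borrows its metaphor from looks like the sketch below. How a virtual lab's research direction would be encoded as a position vector is not specified in this summary, so the numeric encoding and the coefficient defaults are assumptions.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             inertia=0.7, cognitive=1.5, social=1.5, rng=random):
    """One standard PSO update for a single particle."""
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        # Inertia keeps exploring; the two pulls exploit known good points.
        nv = inertia * v + cognitive * r1 * (pb - x) + social * r2 * (gb - x)
        new_velocity.append(nv)
        new_position.append(x + nv)
    return new_position, new_velocity
```

In the analogy, the pull toward `personal_best` plays the role of a lab's own track record, and the pull toward `global_best` plays the role of community consensus formed through citation-like votes.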
This paper addresses the fundamental problem that correlational sentiment analysis cannot distinguish genuine economic associations from spurious statistical artifacts in financial markets. The core contribution is a refutation-validated framework for aspect-based sentiment analysis that combines net-ratio sentiment scoring with four robustness tests—placebo, random common cause, subset stability, and bootstrap validation—to filter false discoveries in high-dimensional sentiment-return analysis. This matters because investment strategies built on spurious correlations can lead to systematic losses, and regulators increasingly demand explainable AI systems with auditable validation.
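Two of the ingredients can be sketched compactly: a net-ratio score and a placebo test. The net-ratio definition below, (positive - negative) / (positive + negative), is a common convention and is assumed rather than taken from the paper. The placebo test is a plain permutation check, and all function names are hypothetical.

```python
import random


def net_ratio(pos, neg):
    """Net sentiment in [-1, 1]; 0.0 when there are no mentions."""
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def pearson(xs, ys):
    """Pearson correlation; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def placebo_pvalue(sentiment, returns, trials=1000, seed=0):
    """Share of shuffled-sentiment placebos with |corr| >= the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(sentiment, returns))
    shuffled = list(sentiment)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(shuffled, returns)) >= observed:
            hits += 1
    return hits / trials
```

A sentiment-return link that survives only as often as a shuffled placebo does is a candidate false discovery, which is the kind of result such a filter is meant to reject.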
This paper introduces "silent commitment failure" — a phenomenon where instruction-tuned language models produce confident, incorrect outputs with no detectable pre-commitment warning signal — and proposes "governability" as a measurable property for AI agent safety. The core claim is that 2 of 3 instruction-following models evaluated exhibit zero-warning failure modes, with profound implications for autonomous agent deployment. The work distinguishes itself from hallucination studies by focusing on detectability before commitment rather than correctness of output, and presents empirical evidence that conflict-detection signals (the "authority band") are geometric properties fixed at pretraining rather than injectable through fine-tuning.
This paper tackles SAR (Synthetic Aperture Radar) automatic target recognition under coherent speckle noise. It proposes FSCE, a framework combining frequency-domain wavelet decomposition with spatial multi-scale convolutions in a shallow feature enhancement module (DSAF), guided by online knowledge distillation from a ResNet101 teacher. The work matters because SAR imagery suffers from unique multiplicative noise that obscures target features, yet the claimed improvements appear marginal on saturated benchmarks.
SafePilot addresses a critical gap in deploying Large Language Models (LLMs) for cyber-physical systems (CPS): LLM "hallucinations" can generate plausible-sounding but unsafe plans that violate safety constraints or temporal requirements. The authors propose a hierarchical neuro-symbolic framework that combines LLM planning with formal verification—using First-Order Logic (FOL) for attribute-based constraints and Linear Temporal Logic (LTL) for temporal constraints—to ensure plans satisfy specifications before execution.
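A toy version of the verification step is sketched below. It checks a finite plan trace against an attribute constraint and two temporal operators. Real LTL is defined over infinite traces, so the finite-trace semantics here is a simplification, and the state fields `speed` and `at_goal` are hypothetical.

```python
def always(pred, trace):
    """G pred: the predicate holds in every state of the trace."""
    return all(pred(state) for state in trace)


def eventually(pred, trace):
    """F pred: the predicate holds in at least one state of the trace."""
    return any(pred(state) for state in trace)


def until(p, q, trace):
    """p U q: p holds in every state before the first state where q holds."""
    for state in trace:
        if q(state):
            return True
        if not p(state):
            return False
    return False  # q never held on this finite trace


def plan_is_safe(trace, speed_limit):
    """Accept a plan only if it never speeds and eventually reaches the goal."""
    def within_limit(state):
        return state["speed"] <= speed_limit

    def at_goal(state):
        return state["at_goal"]

    return always(within_limit, trace) and eventually(at_goal, trace)
```

In a pipeline of this shape, a plan that fails the check would be sent back to the LLM for revision rather than executed.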
This paper proposes Riemannian Foundation Model (RFM), a vision for unifying graph learning through Riemannian geometry rather than GNN message-passing or LLM serialization. The authors argue that graphs are discrete analogs of manifolds, and that concepts like vector bundles, curvature, and parallel transport provide the proper toolkit for universal graph modeling—enabling both structural inference and generation in a way that current Euclidean GNNs and tokenized LLMs cannot achieve.
Existing adversarial-example-based fingerprinting schemes rely on empirical heuristics to set the fingerprint-to-boundary distance, risking violations of either robustness or uniqueness. This paper proposes AnaFP, an analytical approach that derives theoretical lower and upper bounds on the stretch factor $\tau$ controlling this distance, so that $\tau_{\text{lower}} < \tau < \tau_{\text{upper}}$. By formalizing robustness and uniqueness constraints and employing surrogate model pools with quantile-based relaxation, AnaFP generates fingerprints with guaranteed properties, validated across CNNs, MLPs, and GNNs.
This paper tackles the lack of shared formalism for comparing hierarchical memory systems in language agents. It proposes a unifying theory based on three operators: extraction (α) that maps raw data to atomic units, coarsening (C = (π, ρ)) that partitions and summarizes units, and traversal (τ) that selects content under a token budget. The core insight is the self-sufficiency spectrum of representatives ρ, which constrains viable retrieval strategies—an observation the authors call the coarsening-traversal (C–T) coupling.
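The three operators can be made concrete with a deliberately naive sketch. Here extraction splits on periods, coarsening groups fixed-size runs and uses each group's first unit as its representative, and traversal greedily fills a whitespace-token budget. All of these concrete choices are assumptions made for illustration.

```python
def extract(raw):
    """Extraction (alpha): map raw text to atomic units (here, sentences)."""
    return [part.strip() for part in raw.split(".") if part.strip()]


def coarsen(units, group_size=2):
    """Coarsening C = (pi, rho): partition units and pick a representative.

    pi is fixed-size chunking; rho takes the first unit of each group.
    """
    groups = [units[i:i + group_size] for i in range(0, len(units), group_size)]
    return [(group[0], group) for group in groups]


def traverse(coarse, budget):
    """Traversal (tau): select representatives under a token budget."""
    chosen, used = [], 0
    for representative, _members in coarse:
        cost = len(representative.split())
        if used + cost > budget:
            break
        chosen.append(representative)
        used += cost
    return chosen


units = extract("a b. c d e. f. g h.")
selected = traverse(coarsen(units), budget=3)
```

Because the representative here is just one member of its group, it is not self-sufficient: answering a question about the other members would require descending into the group. That dependence of the traversal strategy on the representative is the kind of coupling the summary describes.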
This paper addresses the challenge of efficient failure management in LLM-based Multi-Agent Systems (MASs). Existing approaches rely on expensive per-trace reasoning with large judge LLMs, which is slow and unstable. The core contribution is EAGER, a framework that uses unsupervised reasoning-scoped contrastive learning to encode intra-agent and inter-agent dynamics into embeddings, enabling real-time step-wise failure detection and reflexive mitigation guided by historical patterns rather than costly LLM inference.
The paper addresses multi-UAV coordination under intermittent communications by proposing STA-MADRL, a multi-agent deep reinforcement learning (MADRL) framework enhanced with spatio-temporal attention. It combines delay-penalized rewards, which incentivize information exchange, with a prediction module that recovers missing state data using temporal and spatial attention mechanisms. The authors claim a 75% throughput improvement over communication-limited baselines and near-ideal performance without requiring real-time global state sharing.