Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
199 papers in cs.AI
Trending mixes fresh papers with community signal.
0
cs.LGcs.AI Bulent Haznedar, Levent Karacan · Mar 23, 2026

FISformer proposes replacing the dot-product self-attention in Transformers with a Sugeno-type Fuzzy Inference System (FIS) for time series forecasting. Instead of computing query-key similarities, the model fuzzifies tokens using learnable Gaussian membership functions, applies fuzzy rules, and defuzzifies to produce interaction weights. The paper suggests this approach captures uncertainty and nonlinearity better than standard attention, reporting state-of-the-art results on benchmarks like ETT, ECL, and Weather.

Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.
0
cs.CRcs.AIcs.MM Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee · Mar 23, 2026

This paper exposes a critical vulnerability in Multimodal Large Language Models (MLLMs): safety alignment fails when harmful intent is embedded in structured visual narratives. The authors introduce ComicJailbreak, a benchmark of 1,167 three-panel comics where panels 1–2 establish narrative context and panel 3 contains a blank speech bubble filled with a paraphrased harmful goal. The model is prompted to "complete the comic" by generating the fourth panel. Across 15 state-of-the-art MLLMs, comic-based attacks achieve ensemble success rates exceeding 90% on Gemini-family models and 85%+ on most open-source models—substantially outperforming plain-text and random-image baselines. The work also reveals that existing defenses (AdaShield, Attack as Defense) trigger severe over-refusal on benign prompts, and that automated safety judges are unreliable on sensitive-but-benign content.

Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. Then, with the existing defense methodologies, we show that these methods are effective against the harmful comics, they will induce a high refusal rate when prompted with benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.
0
cs.CVcs.AI Pawe{\l} Borsukiewicz, Daniele Lunghi, Melissa Tessa et al. · Mar 23, 2026

Adversarial Camouflage proposes a wearable privacy defense against facial recognition by optimizing simple face paint patterns (stripes or chevrons) to adversarially minimize embedding similarities across multiple recognition models. The core idea is to restrict the attack space to low-dimensional, user-reproducible geometric parameters (color, angle, width) that can be painted onto semantically valid facial regions, enabling protesters and privacy-conscious individuals to evade automated surveillance without specialized equipment.

While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce \textit{Adversarial Camouflage} as a novel solution for protecting users' privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.
0
cs.CVcs.AIcs.LG Peter Fasogbon, Ugurcan Budak, Patrice Rondao Alface et al. · Mar 23, 2026

This paper tackles camera-agnostic pruning of 3D Gaussian splats for standardized interchange settings like MPEG I-3DGS, where training images, camera parameters, and gradients are unavailable. The authors propose BetaDescPrune, a one-shot post-training method that computes Hybrid Splat Feature Histogram (HSFH) descriptors to capture local geometric and appearance consistency, then models pruning decisions via Beta-distributed evidence with uncertainty-aware confidence scoring. The core insight is that reliable splat importance can be inferred from intrinsic neighborhood structure alone without rendering supervision.

The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.
0
cs.AI Caio Azevedo, Stefano Sabatini, Sascha Hornauer et al. · Mar 23, 2026

Multi-agent trajectory prediction requires models to understand complex future interactions between agents. This paper proposes braid prediction, an auxiliary task where models classify the crossing relationships (below/over/no\_crossing) between every pair of agents using shared mode embeddings from a DETR-style decoder. By training jointly to predict these topological braid labels alongside trajectories, the model gains future-interaction awareness with negligible inference overhead.

To safely operate, an autonomous vehicle must know the future behavior of a potentially high number of interacting agents around it, a task often posed as multi-agent trajectory prediction. Many previous attempts to model social interactions and solve the joint prediction task either add extensive computational requirements or rely on heuristics to label multi-agent behavior types. Braid theory, in contrast, provides a powerful exact descriptor of multi-agent behavior by projecting future trajectories into braids that express how trajectories cross with each other over time; a braid then corresponds to a specific mode of coordination between the multiple agents in the future. In past work, braids have been used lightly to reason about interacting agents and restrict the attention window of predicted agents. We show that leveraging more fully the expressivity of the braid representation and using it to condition the trajectories themselves leads to even further gains in joint prediction performance, with negligible added complexity either in training or at inference time. We do so by proposing a novel auxiliary task, braid prediction, done in parallel with the trajectory prediction task. By classifying edges between agents into their correct crossing types in the braid representation, the braid prediction task is able to imbue the model with improved social awareness, which is reflected in joint predictions that more closely adhere to the actual multi-agent behavior. This simple auxiliary task allowed us to obtain significant improvements in joint metrics on three separate datasets. We show how the braid prediction task infuses the model with future intention awareness, leading to more accurate joint predictions. Code is available at github.com/caiocj1/traj-pred-braid-theory.
0
cs.CVcs.AI Shuxian Zhao, Jie Gui, Baosheng Yu et al. · Mar 23, 2026

SteelDefectX introduces a vision-language dataset for steel defect detection that aggregates 7,778 images from four existing sources with novel coarse-to-fine textual annotations—ranging from class-level defect descriptions to sample-level attributes (shape, size, depth, position, contrast) generated via GPT-4o. The paper establishes a four-task benchmark showing that rich textual supervision improves cross-material transfer, though it reveals a tension where fine-grained annotations unexpectedly hurt few-shot performance.

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.
0
cs.AI Hunmin Do, Taejun Yoon, Kiyong Jung · Mar 23, 2026

This paper tackles the challenge of multi-party consensus-building in travel planning, where agents must negotiate conflicting subjective preferences rather than converge on objective truths. MIND (Multi-agent Inference for Negotiation Dialogue) introduces a Theory-of-Mind-inspired framework where agents infer hidden preference intensities (willingness scores $w \in [1,10]$) from linguistic cues and dynamically adjust their tone between warmth and toughness. The work matters because it extends Multi-Agent Debate (MAD) from factual domains to social coordination problems requiring compromise.

While Multi-Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi-agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus-building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, effectively prioritizing high-stakes constraints. Furthermore, qualitative evaluations via LLM-as-a-Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.
0
cs.CLcs.AI Pengfei Cao, Mingxuan Yang, Yubo Chen et al. · Mar 23, 2026

This paper introduces SemEval-2026 Task 12, Abductive Event Reasoning (AER), a shared task requiring systems to identify the most plausible direct cause of a target event from noisy multi-document evidence. The task is cast as an evidence-grounded multiple-choice benchmark with multiple correct answers allowed, capturing challenges like distributed evidence, indirect background factors, and semantically related distractors. With 122 participants and 518 submissions, it represents a significant community effort to benchmark real-world causal reasoning in long-context settings.

Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER).\footnote{The task data is available at https://github.com/sooo66/semeval2026-task12-dataset.git} The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
0
cs.CVcs.AI Sashuai Zhou, Qiang Zhou, Junpeng Ma et al. · Mar 23, 2026

SpatialReward addresses the persistent problem of spatial inconsistencies in text-to-image generation, where models produce globally plausible images with incorrect object positioning and relationships. The paper proposes a three-stage verifiable reward model that decomposes free-form prompts into structured constraints, verifies object attributes via expert detectors, and employs vision-language chain-of-thought reasoning to assess complex spatial layouts. Integrated into Flow-GRPO reinforcement learning for Stable Diffusion and FLUX, the approach significantly improves spatial consistency while maintaining overall image quality.

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present \textbf{SpatialReward}, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a \emph{Prompt Decomposer} extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce \textbf{SpatRelBench}, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
0
cs.AI Naoshi Uchihira · Mar 23, 2026

This paper proposes the GenAI SECI model, an update to Nonaka & Takeuchi's classic SECI framework for knowledge creation, designed to leverage generative AI for managing workplace ("Gen-Ba") tacit knowledge. The central innovation is "Digital Fragmented Knowledge"—partial, fragmentary knowledge stored in cyberspace that generative AI aggregates, structures, and recommends to amplify human understanding without requiring full externalization into explicit knowledge. This addresses the urgent problem of transferring expert tacit knowledge as Japan's workforce ages, going beyond conventional KM systems that struggle with the effort required to formalize knowledge.

The emergence of generative AI is bringing about a significant transformation in knowledge management. Generative AI has the potential to address the limitations of conventional knowledge management systems, and it is increasingly being deployed in real-world settings with promising results. Related research is also expanding rapidly. However, much of this work focuses on research and practice related to the management of explicit knowledge. While fragmentary efforts have been made regarding the management of tacit knowledge using generative AI, the modeling and systematization that handle both tacit and explicit knowledge in an integrated manner remain insufficient. In this paper, we propose the "GenAI SECI" model as an updated version of the knowledge creation process (SECI) model, redesigned to leverage the capabilities of generative AI. A defining feature of the "GenAI SECI" model is the introduction of "Digital Fragmented Knowledge", a new concept that integrates explicit and tacit knowledge within cyberspace. Furthermore, a concrete system architecture for the proposed model is presented, along with a comparison with prior research models that share a similar problem awareness and objectives.
0
cs.CVcs.AIcs.LG Jesper B. Christensen, Ciaran Bench, Spencer A. Thomas et al. · Mar 23, 2026

Ctrl-A addresses automated data augmentation by framing it as a control problem, dynamically adjusting per-operation augmentation strengths via a feedback loop that balances training and validation loss ratios. The method introduces Relative Operation Response (ROR) curves to individually tune transformation distributions without manual initialization or expensive search phases. While it achieves competitive results on CIFAR and SVHN benchmarks with minimal computational overhead (~10% vs. TrivialAugment), the evaluation relies on a modified training setup with extended epochs, raising questions about separability of algorithmic gains from training protocol changes.

We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
0
cs.CVcs.AI Yuyang You, Yongzhi Li, Jiahui Li et al. · Mar 23, 2026

Video diffusion models suffer from prohibitive inference costs, but standard image distillation techniques like DMD cause severe oversaturation and temporal collapse when naively extended to video. This work introduces a video-specific distillation framework featuring an adaptive regression loss that dynamically reweights real-data supervision to prevent color artifacts, a temporal variance regularizer to combat static output, and an inference-time frame interpolation module that halves sequence length during high-noise steps to accelerate generation. Applied to Wan2.1, the method enables stable 4-step synthesis with state-of-the-art VBench scores.

Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
0
cs.LGcs.AIcs.CL Xinyan Wang, Xiaogeng Liu, Chaowei Xiao · Mar 23, 2026

ROM tackles overthinking in Large Reasoning Models, where models generate redundant reasoning after reaching correct answers. The core idea is a lightweight streaming detector—an 8.13M parameter head attached to late-layer hidden states of a frozen LLM—that predicts overthinking probability token-by-token and triggers early stopping. It matters because it promises 47% token reduction without full model retraining. We find the method empirically effective but note concerns regarding data scaling limits and labeling costs.

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
0
cs.AI Shuying Chen, Sen Cui, Zhong Cao · Mar 23, 2026

This paper presents Oph-Guid-RAG, a multimodal retrieval-augmented generation system for ophthalmic clinical decision support. Unlike conventional text-based RAG systems, it retrieves full guideline page images using ColQwen2.5, preserving tables, flowcharts, and layout information without OCR errors. The system introduces a controllable retrieval framework with routing and filtering to selectively introduce external evidence, evaluated on a specialized ophthalmology subset extracted from HealthBench.

In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining visionbased retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications,while pointing out that further work is needed to be more complete.
0
cs.HCcs.AI Yuta Tsuchiya, Yukino Baba · Mar 23, 2026

As users increasingly consult multiple large language models for decision support, a critical question arises: does increasing the number of AI advisors improve accuracy or amplify harmful conformity pressures? This paper investigates how panel size, within-panel consensus, and human-likeness of presentation shape human reliance and decision accuracy across three prediction tasks (income, recidivism, and dating). Through two crowdsourced experiments with 348 participants, the authors reveal a surprising non-monotonic relationship: three AI advisors improve accuracy over a single advisor, but five provide no additional benefit, while unanimous consensus fosters overreliance and wide disagreement creates confusion.

Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.
0
cs.AI Mohammad Asadi, Tahoura Nedaee, Jack W. O'Sullivan et al. · Mar 23, 2026

The paper proposes CEBaG, a deterministic hallucination detection method for medical Visual Question Answering that eliminates the need for costly stochastic sampling. By combining token-level predictive variance with visual evidence magnitude derived from log-probabilities, the method detects when models generate responses that contradict input images. This approach achieves superior detection accuracy while reducing computational cost from 20+ generations to just three forward passes, addressing a critical safety bottleneck in clinical AI deployment.

Multimodal large language models (MLLMs) have shown strong potential for medical Visual Question Answering (VQA), yet they remain prone to hallucinations, defined as generating responses that contradict the input image, posing serious risks in clinical settings. Current hallucination detection methods, such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE), require 10 to 20 stochastic generations per sample together with an external natural language inference model for semantic clustering, making them computationally expensive and difficult to deploy in practice. We observe that hallucinated responses exhibit a distinctive signature directly in the model's own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. Based on this observation, we propose Confidence-Evidence Bayesian Gain (CEBaG), a deterministic hallucination detection method that requires no stochastic sampling, no external models, and no task-specific hyperparameters. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference. Evaluated across four medical MLLMs and three VQA benchmarks (16 experimental settings), CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained. The code will be made available upon acceptance.
0
cs.HCcs.AIcs.CL David M. Markowitz · Mar 23, 2026

Dyadic is a web-based platform for studying human-human and human-AI conversations through text or voice-based interaction. It attempts to solve the methodological gap in conversation research by providing turnkey tools for experimental manipulation, live monitoring, and in-situ survey delivery during ongoing chats. The core value proposition is lowering barriers to entry for researchers studying dyadic interaction processes without requiring programming expertise.

Conversation is ubiquitous in social life, but the empirical study of this interactive process has been thwarted by tools that are insufficiently modular and unadaptive to researcher needs. To relieve many constraints in conversation research, the current tutorial presents an overview and introduction to a new tool, Dyadic (https://www.chatdyadic.com/), a web-based platform for studying human-human and human-AI conversations using text-based or voice-based chats. Dyadic is distinct from other platforms by offering studies with multiple modalities, AI suggestions (e.g., in human-human studies, AI can suggest responses to a participant), live monitoring (e.g., researchers can evaluate, in real time, chats between communicators), and survey deployment (e.g., Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to humans for in situ evaluations of the interaction), among other consequential features. No coding is required to operate Dyadic directly, and integrations with existing survey platforms are offered.
0
cs.CVcs.AI Duy D. Nguyen, Phat T. Tran-Truong · Mar 23, 2026

SegMaFormer proposes a hybrid encoder for 3D medical image segmentation that places Mamba state-space layers in early high-resolution stages (for linear-complexity sequence mixing) and self-attention only in deeper low-resolution stages (where quadratic cost is manageable). The goal is to reduce the prohibitive compute of full 3D attention while preserving global context. With just 2M parameters and 15 GFLOPs, the authors claim competitive results on BraTS, Synapse, and ACDC benchmarks against models up to 75\times larger.

The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
0
cs.CVcs.AI Yunzhuo Sun, Xinyue Liu, Yanyang Li et al. · Mar 23, 2026

This paper addresses video moment retrieval (VMR) for complex multi-verb queries by proposing a two-stage framework that generates auxiliary short videos via text-to-video diffusion (CogVideoX) as temporal motion priors, then processes them through a linear-time Mamba network. The approach tackles the limitation of static image augmentations—which miss motion dynamics—while avoiding the quadratic complexity of Transformer-based methods on long untrimmed videos. The framework achieves state-of-the-art results on TVR with particular strength on multi-verb queries, though its effectiveness depends heavily on external video generation quality.

Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
0
cs.LGcs.AI Juan Sebastian Rojas, Chi-Guhn Lee · Mar 23, 2026

This paper identifies a subtle but important distinction between two interpretations of the TD error in reinforcement learning: the explicit form (bootstrapped target minus prediction) commonly used in deep RL, and the implicit form (difference between temporally successive predictions) from the original Sutton (1988) formulation. While equivalent in tabular settings, the authors demonstrate that increasingly nonlinear architectures cause these to diverge significantly, with profound implications for average-reward and differential RL algorithms.

The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.