Feed - arxlens


When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound

cs.CV Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam et al. · Mar 22, 2026

Medical Vision-Language Models (Med-VLMs) for ultrasound analysis are vulnerable to subtle prompt variations that mimic real clinical communication patterns. This paper proposes a black-box attack framework using an LLM to generate minimal, clinically plausible text edits guided by Monte Carlo Tree Search (MCTS), requiring no access to the target model's weights or gradients. The study reveals that small adversarial rewrites can drastically degrade diagnostic QA accuracy—raising critical safety concerns for deploying such systems in point-of-care settings where prompt variability is inherent.

Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.


Retrieving Climate Change Disinformation by Narrative

cs.CL Max Upravitelev, Veronika Solopova, Charlott Jakob et al. · Mar 23, 2026

This paper reframes climate disinformation detection from classification to retrieval, treating narrative core messages as queries to rank corpus texts without fixed taxonomies. The authors propose SpecFi, which generates hypothetical documents using community summaries from graph-based detection (NodeRAG) as few-shot examples. The approach achieves a MAP of 0.505 on CARDS and demonstrates robustness to high narrative variance that cripples standard baselines.

Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative's core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.
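The MAP figures quoted above are mean average precision scores. As a rough illustration of how such a score is computed for a retrieval task of this shape, here is a minimal sketch of the standard textbook definition. The function names and the toy document IDs are hypothetical, and this is not the authors' evaluation code.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Averages precision@k over the ranks k at which a relevant document
    appears, normalised by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], relevance: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all queries (narratives)."""
    scores = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: two narrative queries with hypothetical document IDs.
rankings = {"narrative_1": ["a", "b", "c", "d"], "narrative_2": ["x", "y"]}
relevance = {"narrative_1": {"a", "c"}, "narrative_2": {"y"}}
print(mean_average_precision(rankings, relevance))
```

Under this definition, a relative loss such as "BM25 loses 63.4% of MAP" would compare the MAP on high-variance narratives against the MAP on low-variance ones.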


Text-Image Conditioned 3D Generation

cs.CV Jiazhong Cen, Jiemin Fang, Sikuang Li et al. · Mar 22, 2026

The paper addresses a fundamental limitation in 3D generation: image-conditioned models suffer from viewpoint bias and hallucinate unobserved regions, while text-conditioned models lack precise visual fidelity. The authors propose Text–Image Conditioned 3D Generation, a task requiring joint reasoning over visual exemplars and textual descriptions, and introduce TIGON—a minimalist dual-branch baseline that fuses separate image- and text-conditioned DiT backbones via zero-initialized cross-modal bridges and simple prediction averaging. This matters because it offers users more flexible control by combining pixel-aligned appearance cues with high-level semantic guidance.

High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page


Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding

cs.CL cs.HC Taara Kumar, Kokil Jaidka · Mar 22, 2026

This paper investigates how users decode emotions in text-based communication through electronic nonverbal cues (eNVCs)—orthographic signals like elongation, punctuation, and emojis that approximate paralinguistic features. The authors propose a taxonomy grounded in nonverbal communication theory (kinesics and paralinguistics) and test it across three complementary studies: a content analysis developing a regex detection toolkit, a within-subjects experiment manipulating eNVC presence and sarcasm ($n=513$), and focus groups exploring interpretive strategies. The work identifies sarcasm as a critical boundary condition where eNVCs fail to aid interpretation and provides an open-source Python/R package for automated cue detection.

As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.


Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation

cs.CV Jingnan Luo, Mingqi Gao, Jun Liu et al. · Mar 23, 2026

This paper addresses video reasoning segmentation—segmenting objects in videos based on complex human instructions—by proposing TrajSeg, a unified framework built on Multimodal Large Language Models (MLLMs). The core innovation is bidirectional text-trajectory alignment, where the model learns both text-to-trajectory grounding and trajectory-to-text captioning, alongside a Frame-level Content Integration (FCI) module and a unified mask decoder that eliminates the need for separate key-frame and tracking models. The work matters because it simplifies training pipelines and aims to improve trajectory perception in dynamic video contexts.

The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.


Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

cs.CL cs.IR Hang Gao, Dimitris N. Metaxas · Mar 22, 2026

This paper identifies "semantic shift"—the intrinsic evolution of meaning within a text—as the root cause of embedding pathologies like anisotropy and length-induced collapse. The authors argue that pooling-based aggregation forces "semantic smoothing," where diverse sentences compromise into a diluted representation. They formalize semantic shift as the product of local evolution and global dispersion ($\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$), showing through controlled concatenation experiments that it predicts embedding concentration and retrieval degradation better than text length alone. The work reframes geometric pathologies not as inherent model defects but as consequences of content structure interacting with pooling mechanics.

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
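To make the product form $\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$ more concrete, the sketch below computes one possible instantiation over a list of sentence embeddings. The feed entry does not give the paper's exact definitions, so both terms here are assumptions: Local is taken as the mean cosine distance between consecutive sentence embeddings, and Disp as the mean cosine distance from each sentence embedding to the mean-pooled vector. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of the sentence embeddings (the pooled vector)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_evolution(vectors: List[Vector]) -> float:
    """ASSUMED definition: mean cosine distance between consecutive sentences."""
    steps = [1.0 - cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(steps) / len(steps) if steps else 0.0


def global_dispersion(vectors: List[Vector]) -> float:
    """ASSUMED definition: mean cosine distance from each sentence to the pooled vector."""
    pooled = mean_pool(vectors)
    return sum(1.0 - cosine(v, pooled) for v in vectors) / len(vectors)


def semantic_shift(vectors: List[Vector]) -> float:
    """Product of the two assumed terms, mirroring Shift = Local * Disp."""
    return local_evolution(vectors) * global_dispersion(vectors)


# Toy 2-D "embeddings": near-duplicate sentences vs. unrelated sentences.
similar = [[1.0, 0.0], [1.0, 0.1]]
diverse = [[1.0, 0.0], [0.0, 1.0]]
print(semantic_shift(similar), semantic_shift(diverse))
```

In the toy case, the pooled vector of the two orthogonal sentences sits at 45 degrees from both, which illustrates the smoothing effect described in the abstract: the more the constituent sentences diverge, the further the pooled vector moves from each of them.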


3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

cs.CV cs.AI Haoyu Zhen, Xiaolong Li, Yilin Zhao et al. · Mar 23, 2026

3D-Layout-R1 tackles language-guided 3D spatial editing by training LLMs/VLMs to perform structured reasoning over explicit scene graphs. Instead of free-form chains-of-thought, the model outputs JSON graph edits that iteratively transform object poses and relations, combined with GRPO-based RL using dense 3D IoU and collision-aware rewards. This approach yields measurable gains in layout accuracy while maintaining interpretability across sorting, spatial alignment, and room-editing tasks.

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
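The abstract reports layout quality in terms of IoU and center-distance error. As a rough illustration of what these two quantities measure, here is a minimal sketch for axis-aligned 3D boxes. The box format (min corner followed by max corner) and the function names are assumptions for illustration; the paper's benchmark may use oriented boxes or a different convention.

```python
import math
from typing import Sequence

# ASSUMED box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    return (
        max(0.0, box[3] - box[0])
        * max(0.0, box[4] - box[1])
        * max(0.0, box[5] - box[2])
    )


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    # Overlap length along each axis, clamped at zero when boxes are disjoint.
    overlap = [max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i])) for i in range(3)]
    intersection = overlap[0] * overlap[1] * overlap[2]
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centers."""
    center_a = [(a[i] + a[i + 3]) / 2 for i in range(3)]
    center_b = [(b[i] + b[i + 3]) / 2 for i in range(3)]
    return math.dist(center_a, center_b)


# Toy example: a predicted unit cube shifted 0.5 along x from the target.
target = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
predicted = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)
print(iou_3d(target, predicted), center_distance(target, predicted))
```

A dense reward of the kind mentioned in the summary could be built from per-object scores like these, although the exact reward shaping is not described in this entry.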


Generalization Limits of In-Context Operator Networks for Higher-Order Partial Differential Equations

cs.LG cs.NA math.NA Jamie Mahowald, Tan Bui-Thanh · Mar 23, 2026

This paper extends In-Context Operator Networks (ICONs)—which learn PDE solution operators via in-context learning without retraining—to higher-order and higher-dimensional PDEs. The authors test on 19 problem types including the heat equation and 3D linear PDEs, finding that while point-wise accuracy degrades for complex OOD problems, the model retains qualitative solution behavior.

We investigate the generalization capabilities of In-Context Operator Networks (ICONs), a new class of operator networks that build on the principles of in-context learning, for higher-order partial differential equations. We extend previous work by expanding the type and scope of differential equations handled by the foundation model. We demonstrate that while processing complex inputs requires some new computational methods, the underlying machine learning techniques are largely consistent with simpler cases. Our implementation shows that although point-wise accuracy degrades for higher-order problems like the heat equation, the model retains qualitative accuracy in capturing solution dynamics and overall behavior. This demonstrates the model's ability to extrapolate fundamental solution characteristics to problems outside its training regime.


HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit

cs.SD cs.LG eess.AS Khushiyant, Param Thakkar · Mar 22, 2026

This paper studies the coupling between three design axes in audio representation learning: input frontend (raw waveform vs. spectrogram), backbone architecture (Mamba vs. attention), and sequence length. The authors introduce HELIX, a minimal hybrid architecture with five bidirectional Mamba layers and one attention bottleneck at matched 8.3M parameter capacity. The key finding is that these choices are not independent: raw waveforms help with Mamba but not attention, attention hurts on short environmental sounds but becomes critical at 30,000 tokens (5 minutes), where pure attention fails with OOM errors and HELIX closes an 11.5-point gap over pure Mamba on speaker identification.

Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.


Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

cs.CL Abdul-Salem Beibitkhan · Mar 22, 2026

This paper investigates whether cross-lingual transfer (CLT)—prompting models to translate queries to English, reason in English, then translate answers back—can bridge the performance gap for low-resource languages. The authors benchmark eight LLMs across 2,000 responses in Kazakh and Mongolian, finding that CLT selectively benefits bilingual models (+2.2–4.3pp) but not English-first architectures, while revealing a concerning "fluency illusion" where models appear fluent in LRLs while producing less accurate content.

We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.


Model selection in hybrid quantum neural networks with applications to quantum transformer architectures

quant-ph cs.LG Harsh Wadhwa, Rahul Bhowmick, Naipunnya Raj et al. · Mar 23, 2026

Quantum machine learning model selection currently lacks principled guidelines, forcing practitioners to train numerous expensive configurations. This paper introduces QBET (Quantum Bias-Expressivity Toolbox), an unsupervised pre-screening framework that evaluates hybrid quantum-classical transformers using LZ-complexity-based Simplicity Bias (AUC) and Expressivity metrics without gradient descent. The core idea is that architectures with higher AUC (stronger bias toward simple Boolean functions) correlate with better downstream task performance, offering a filter to identify promising quantum attention variants before committing to full training on NISQ devices.

Quantum machine learning models generally lack principled design guidelines, often requiring full resource-intensive training across numerous choices of encodings, quantum circuit designs and initialization strategies to find an effective configuration. To address this challenge, we develop the Quantum Bias-Expressivity Toolbox ($\texttt{QBET}$), a framework for evaluating quantum, classical, and hybrid transformer architectures. In this toolbox, we introduce lean metrics for Simplicity Bias ($\texttt{SB}$) and Expressivity ($\texttt{EXP}$), for comparing across various models, and extend the analysis of $\texttt{SB}$ to generative and multiclass-classification tasks. We show that $\texttt{QBET}$ enables efficient pre-screening of promising model variants, obviating the need to execute complete training pipelines. In evaluations on transformer-based classification and generative tasks we employ a total of $18$ qubits for embeddings ($6$ qubits each for query, key, and value). We identify scenarios in which quantum self-attention variants surpass their classical counterparts by ranking the respective models according to the $\texttt{SB}$ metric and comparing their relative performance.


CounterScene: Counterfactual Causal Reasoning in Generative World Models for Safety-Critical Closed-Loop Evaluation

cs.RO cs.CV Bowen Jing, Ruiyang Hao, Weitao Zhou et al. · Mar 22, 2026

Existing safety-critical scenario generation methods force collisions through brute-force perturbations, destroying trajectory realism. CounterScene reframes this as a counterfactual inference problem within diffusion-based BEV world models: given a safe scene, identify the single agent whose behavioral change would maximally increase collision risk, then minimally intervene on that agent alone via structured diffusion guidance. This targets the realism-adversarial trade-off by allowing danger to emerge through natural interaction propagation rather than global trajectory distortion.

Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism-adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.


Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

cs.CL Navya Mehrotra, Adam Visokay, Kristina Gligorić · Mar 22, 2026

LLM annotations encode some human perspectives better than others, especially in subjective tasks where demographic background shapes judgments. This paper introduces Perspective-Driven Inference (PDI), a statistical framework that treats the distribution of group-specific annotations as a vector estimand $\theta^* = (\theta^*_{g_1}, \dots, \theta^*_{g_K})$ and adaptively allocates limited human labels to groups where LLM proxies are least reliable. The core contribution is an error-predictor-driven sampling rule that improves estimation accuracy for harder-to-model demographics while maintaining valid frequentist coverage.

Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.


AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing

cs.LG Peter Pak, Amir Barati Farimani · Mar 23, 2026

AdditiveLLM2 is a domain-adapted multi-modal LLM for additive manufacturing built by fine-tuning Gemma 3 12B on ~50 million tokens from open-access AM journal articles. The work addresses the challenge of specializing general LLMs for technical domains without consuming context window space (as with RAG) or requiring massive datasets. Using domain adaptive pretraining (DAPT) for both text and vision plus visual instruction tuning (VIT), the authors demonstrate that even relatively small curated datasets can yield domain expertise exceeding 90% accuracy on AM knowledge tasks.

This work presents AdditiveLLM2, a multi-modal, domain-adapted large language model built upon the instruction-tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark, which consists of additive manufacturing domain-specific tasks compiled from published resources. AdditiveLLM2 exhibits proficiency in both language and vision based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain adaptive pretraining and instruction tuning strategy outlines an accessible method for specializing large language models to a domain such as additive manufacturing.


A Generalised Exponentiated Gradient Approach to Enhance Fairness in Binary and Multi-class Classification Tasks

cs.LG stat.ML Maryam Boubekraoui, Giordano d'Aloisio, Antinisca Di Marco · Mar 22, 2026

While most bias mitigation research targets binary classification, multi-class fairness remains under-explored. This paper proposes Generalised Exponentiated Gradient (GEG), an in-processing method that extends the Exponentiated Gradient framework to multi-class settings and enables simultaneous optimization of multiple fairness constraints via positive-label moment conditions. Evaluated on ten datasets against six baselines, GEG achieves fairness improvements up to 92% with moderate accuracy trade-offs, filling a critical gap in fair machine learning toolboxes.

The widespread use of AI and ML models in sensitive areas raises significant concerns about fairness. While the research community has introduced various methods for bias mitigation in binary classification tasks, the issue remains under-explored in multi-class classification settings. To address this limitation, in this paper, we first formulate the problem of fair learning in multi-class classification as a multi-objective problem between effectiveness (i.e., prediction correctness) and multiple linear fairness constraints. Next, we propose a Generalised Exponentiated Gradient (GEG) algorithm to solve this task. GEG is an in-processing algorithm that enhances fairness in binary and multi-class classification settings under multiple fairness definitions. We conduct an extensive empirical evaluation of GEG against six baselines across seven multi-class and three binary datasets, using four widely adopted effectiveness metrics and three fairness definitions. GEG outperforms existing baselines, with fairness improvements of up to 92% and a decrease in accuracy of up to 14%.


Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting

cs.CV Yuntian Bo, Yazhou Zhu, Piotr Koniusz et al. · Mar 22, 2026

Few-shot medical image segmentation (FSMIS) aims to segment anatomical structures with minimal annotations, but Segment Anything Model (SAM) based approaches suffer from over-segmentation due to ambiguous medical boundaries. This paper reformulates SAM-based FSMIS as a background-centric prompt localization task, proposing FoB (Focus on Background) to generate precise background prompts that constrain SAM’s predictions. By modeling contextual dependencies and ring-like structural priors, the method achieves state-of-the-art performance across CT, MRI, and dermatoscopic imaging while maintaining strong cross-domain generalization.

Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM's over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at https://github.com/primebo1/FoB_SAM.

SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

cs.CV Zhixiang Lu, Shijie Xu, Kaicheng Yan et al. · Mar 22, 2026

Multimodal skin cancer diagnosis with vision-language models faces a trilemma of computational cost, data scarcity, and black-box opacity. SkinCLIP-VL tackles this via a "frozen perception, adaptive reasoning" architecture that keeps CLIP frozen, adapts a quantized Qwen2.5-VL via LoRA, and introduces the Consistency-aware Focal Alignment (CFA) Loss to jointly handle class imbalance, cross-modal alignment, and calibration. The paper matters because it couples strong empirical performance with a clinician validation study, aiming to bridge the gap between AI accuracy and clinical trust.

The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
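The focal re-weighting component mentioned above can be illustrated with the standard focal loss, which down-weights examples the model already classifies confidently so that rare classes contribute more to training. The sketch below covers only that one ingredient. It is not the paper's CFA Loss, whose alignment and calibration terms are not specified in the abstract. The function name and the default `gamma` and `alpha` values are illustrative assumptions.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Standard focal loss for one sample (hypothetical helper, not from the paper).

    probs  : list of predicted class probabilities that sum to 1
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha  : scalar class weight
    """
    # Clamp to avoid log(0) when the true class gets zero probability.
    p_t = max(probs[target], 1e-12)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction is penalized far less than an uncertain one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.6, 0.4], target=0)
```

With `gamma=0` the expression reduces to ordinary cross-entropy, which is a quick sanity check on the re-weighting factor.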

GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

cs.CV Zifeng Zhu, Jiaming Han, Jiaxiang Zhao et al. · Mar 22, 2026

GIDE addresses a key challenge in image editing: applying training-free editing techniques to Diffusion Large Language Models (DLLMs). Unlike continuous diffusion models where DDIM inversion is well-established, DLLMs use discrete tokenization that prevents direct application of standard noise inversion. GIDE introduces a three-stage framework (grounding, inversion, refinement) that enables precise localized editing via points, boxes, or text prompts while preserving background content. The significance lies in bridging discrete token spaces with high-fidelity inversion without additional training.

While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.

DGRNet: Disagreement-Guided Refinement for Uncertainty-Aware Brain Tumor Segmentation

cs.CV Bahram Mohammadi, Yanqiu Wu, Vu Minh Hieu Phan et al. · Mar 22, 2026

DGRNet addresses two critical gaps in brain tumor segmentation: reliable uncertainty quantification and under-utilization of radiology reports. The core idea transforms prediction disagreement among multiple lightweight view-specific adapters into an active signal that guides targeted refinement in ambiguous regions, integrated with clinical text conditioning. This approach achieves state-of-the-art accuracy on the TextBraTS benchmark while providing clinically meaningful uncertainty estimates calibrated to actual errors.

Accurate brain tumor segmentation from MRI scans is critical for diagnosis and treatment planning. Despite the strong performance of recent deep learning approaches, two fundamental limitations remain: (1) the lack of reliable uncertainty quantification in single-model predictions, which is essential for clinical deployment because the level of uncertainty may impact treatment decision-making, and (2) the under-utilization of rich information in radiology reports that can guide segmentation in ambiguous regions. In this paper, we propose the Disagreement-Guided Refinement Network (DGRNet), a novel framework that addresses both limitations through multi-view disagreement-based uncertainty estimation and text-conditioned refinement. DGRNet generates diverse predictions via four lightweight view-specific adapters attached to a shared encoder-decoder, enabling efficient uncertainty quantification within a single forward pass. We then build disagreement maps to identify regions of high segmentation uncertainty, which are selectively refined according to clinical reports. Moreover, we introduce a diversity-preserving training strategy that combines pairwise similarity penalties and gradient isolation to prevent view collapse. Experimental results on the TextBraTS dataset show that DGRNet improves on state-of-the-art segmentation accuracy by 2.4% in Dice and 11% in HD95, the two main metrics, while providing meaningful uncertainty estimates.
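One way to picture a disagreement map is as the per-pixel variance of the foreground probabilities predicted by the view-specific heads: pixels where the views agree get a value near zero, and pixels where they split get a high value and are flagged for refinement. The sketch below is a minimal illustration under that assumption. The abstract does not say which disagreement measure DGRNet uses, and the function names, the flattened pixel layout, and the `threshold` value are all hypothetical.

```python
def disagreement_map(view_probs):
    """Per-pixel variance across views (assumed measure, not the paper's).

    view_probs : list of K lists, each holding one foreground probability
                 per pixel for a single view (image flattened to 1-D).
    """
    n_views = len(view_probs)
    n_pixels = len(view_probs[0])
    variances = []
    for i in range(n_pixels):
        values = [view[i] for view in view_probs]
        mean = sum(values) / n_views
        # Population variance of the K predictions at pixel i.
        variances.append(sum((v - mean) ** 2 for v in values) / n_views)
    return variances

def refinement_mask(dmap, threshold=0.05):
    """Flag pixels whose disagreement exceeds a (hypothetical) threshold."""
    return [d > threshold for d in dmap]

# Four views, three pixels: the views split only on the middle pixel.
views = [
    [0.9, 0.1, 0.5],
    [0.9, 0.9, 0.5],
    [0.9, 0.1, 0.5],
    [0.9, 0.9, 0.5],
]
mask = refinement_mask(disagreement_map(views))
```

Only the flagged pixels would then be passed to a refinement step, which keeps the extra computation confined to ambiguous regions.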

Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection

cs.CL Heidi Campana Piva, Shaina Ashraf, Maziar Kianimoghadam Jouneghani et al. · Mar 22, 2026

The paper proposes the Conspiracy Frame, a semiotic and frame-semantic representation of conspiratorial narratives with five elements (plan, secret, in-group, out-group, call-to-action), and introduces Con.Fra., a span-annotated Telegram corpus. The core hypothesis is that injecting FrameNet-derived semantic frames into LLM prompts will improve conspiracy detection and explainability. Results show that while frame-guided prompting achieves comparable classification scores to few-shot learning, it does not consistently outperform it, though it reveals interesting abstract semantic patterns.

Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame semantics and semiotics, which spawned the Conspiracy Frames (Con.Fra.) dataset: a corpus of Telegram messages annotated at span level. The Conspiracy Frame and the Con.Fra. dataset contribute to a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to a clear increase in performance, it has potential; the mapping of annotated spans to FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically and semiotically aware detection of conspiratorial narratives.
