Your paper timeline
Scroll through AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
186 papers in cs.CV
Trending mixes fresh papers with community signal.
cs.CV Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam et al. · Mar 22, 2026

Medical Vision-Language Models (Med-VLMs) for ultrasound analysis are vulnerable to subtle prompt variations that mimic real clinical communication patterns. This paper proposes a black-box attack framework using an LLM to generate minimal, clinically plausible text edits guided by Monte Carlo Tree Search (MCTS), requiring no access to the target model's weights or gradients. The study reveals that small adversarial rewrites can drastically degrade diagnostic QA accuracy—raising critical safety concerns for deploying such systems in point-of-care settings where prompt variability is inherent.

Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.
cs.CV Jiazhong Cen, Jiemin Fang, Sikuang Li et al. · Mar 22, 2026

The paper addresses a fundamental limitation in 3D generation: image-conditioned models suffer from viewpoint bias and hallucinate unobserved regions, while text-conditioned models lack precise visual fidelity. The authors propose Text–Image Conditioned 3D Generation, a task requiring joint reasoning over visual exemplars and textual descriptions, and introduce TIGON—a minimalist dual-branch baseline that fuses separate image- and text-conditioned DiT backbones via zero-initialized cross-modal bridges and simple prediction averaging. This matters because it offers users more flexible control by combining pixel-aligned appearance cues with high-level semantic guidance.

High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page
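The late fusion described above can be sketched as a weighted average of two prediction vectors. This is only an illustration: the function name `late_fuse`, the flat-list representation, and the `image_weight` parameter are assumptions, not taken from the paper or from TIGON.

```python
def late_fuse(image_pred, text_pred, image_weight=0.5):
    """Blend two same-length prediction vectors element-wise.

    image_weight=0.5 gives plain averaging; the weight is a hypothetical
    knob added for illustration.
    """
    if len(image_pred) != len(text_pred):
        raise ValueError("predictions must have the same length")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    text_weight = 1.0 - image_weight
    return [image_weight * a + text_weight * b
            for a, b in zip(image_pred, text_pred)]


if __name__ == "__main__":
    # Toy predictions from an image-conditioned and a text-conditioned branch.
    print(late_fuse([1.0, 0.0], [0.0, 1.0]))
```

With the default weight both branches contribute equally, which matches the "simple prediction averaging" wording in the summary.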
cs.CV Jingnan Luo, Mingqi Gao, Jun Liu et al. · Mar 23, 2026

This paper addresses video reasoning segmentation—segmenting objects in videos based on complex human instructions—by proposing TrajSeg, a unified framework built on Multimodal Large Language Models (MLLMs). The core innovation is bidirectional text-trajectory alignment, where the model learns both text-to-trajectory grounding and trajectory-to-text captioning, alongside a Frame-level Content Integration (FCI) module and a unified mask decoder that eliminates the need for separate key-frame and tracking models. The work matters because it simplifies training pipelines and aims to improve trajectory perception in dynamic video contexts.

The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.
cs.CV cs.AI Haoyu Zhen, Xiaolong Li, Yilin Zhao et al. · Mar 23, 2026

3D-Layout-R1 tackles language-guided 3D spatial editing by training LLMs/VLMs to perform structured reasoning over explicit scene graphs. Instead of free-form chains-of-thought, the model outputs JSON graph edits that iteratively transform object poses and relations, combined with GRPO-based RL using dense 3D IoU and collision-aware rewards. This approach yields measurable gains in layout accuracy while maintaining interpretability across sorting, spatial alignment, and room-editing tasks.

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
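To make the two reported metrics concrete, here is a minimal sketch of 3D IoU and center distance. It assumes axis-aligned boxes given as `(min_corner, max_corner)` tuples; the abstract does not say how boxes are parameterized (oriented boxes would need a different computation), and the function names are illustrative.

```python
import math


def box_volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner)."""
    lo, hi = box
    volume = 1.0
    for a, b in zip(lo, hi):
        volume *= max(0.0, b - a)
    return volume


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap
    union = box_volume(box_a) + box_volume(box_b) - inter
    return inter / union if union > 0 else 0.0


def center_distance(box_a, box_b):
    """Euclidean distance between the two box centers."""
    center_a = [(lo + hi) / 2 for lo, hi in zip(*box_a)]
    center_b = [(lo + hi) / 2 for lo, hi in zip(*box_b)]
    return math.dist(center_a, center_b)


if __name__ == "__main__":
    predicted = ((0, 0, 0), (2, 2, 2))
    target = ((1, 0, 0), (3, 2, 2))
    print(iou_3d(predicted, target), center_distance(predicted, target))
```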
cs.RO cs.CV Bowen Jing, Ruiyang Hao, Weitao Zhou et al. · Mar 22, 2026

Existing safety-critical scenario generation methods force collisions through brute-force perturbations, destroying trajectory realism. CounterScene reframes this as a counterfactual inference problem within diffusion-based BEV world models: given a safe scene, identify the single agent whose behavioral change would maximally increase collision risk, then minimally intervene on that agent alone via structured diffusion guidance. This targets the realism-adversarial trade-off by allowing danger to emerge through natural interaction propagation rather than global trajectory distortion.

Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism-adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.
cs.CV Yuntian Bo, Yazhou Zhu, Piotr Koniusz et al. · Mar 22, 2026

Few-shot medical image segmentation (FSMIS) aims to segment anatomical structures with minimal annotations, but Segment Anything Model (SAM) based approaches suffer from over-segmentation due to ambiguous medical boundaries. This paper reformulates SAM-based FSMIS as a background-centric prompt localization task, proposing FoB (Focus on Background) to generate precise background prompts that constrain SAM’s predictions. By modeling contextual dependencies and ring-like structural priors, the method achieves state-of-the-art performance across CT, MRI, and dermatoscopic imaging while maintaining strong cross-domain generalization.

Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM's over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at https://github.com/primebo1/FoB_SAM.
cs.CV Zhixiang Lu, Shijie Xu, Kaicheng Yan et al. · Mar 22, 2026

Multimodal skin cancer diagnosis with vision-language models faces a trilemma of computational cost, data scarcity, and black-box opacity. SkinCLIP-VL tackles this via a "frozen perception, adaptive reasoning" architecture that keeps CLIP frozen, adapts a quantized Qwen2.5-VL via LoRA, and introduces the Consistency-aware Focal Alignment (CFA) Loss to jointly handle class imbalance, cross-modal alignment, and calibration. The paper matters because it couples strong empirical performance with a clinician validation study, aiming to bridge the gap between AI accuracy and clinical trust.

The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
cs.CV Zifeng Zhu, Jiaming Han, Jiaxiang Zhao et al. · Mar 22, 2026

GIDE addresses a key challenge in image editing: applying training-free editing techniques to Diffusion Large Language Models (DLLMs). Unlike continuous diffusion models where DDIM inversion is well-established, DLLMs use discrete tokenization that prevents direct application of standard noise inversion. GIDE introduces a three-stage framework (grounding, inversion, refinement) that enables precise localized editing via points, boxes, or text prompts while preserving background content. The significance lies in bridging discrete token spaces with high-fidelity inversion without additional training.

While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.
cs.CV Bahram Mohammadi, Yanqiu Wu, Vu Minh Hieu Phan et al. · Mar 22, 2026

DGRNet addresses two critical gaps in brain tumor segmentation: reliable uncertainty quantification and under-utilization of radiology reports. The core idea transforms prediction disagreement among multiple lightweight view-specific adapters into an active signal that guides targeted refinement in ambiguous regions, integrated with clinical text conditioning. This approach achieves state-of-the-art accuracy on the TextBraTS benchmark while providing clinically meaningful uncertainty estimates calibrated to actual errors.

Accurate brain tumor segmentation from MRI scans is critical for diagnosis and treatment planning. Despite the strong performance of recent deep learning approaches, two fundamental limitations remain: (1) the lack of reliable uncertainty quantification in single-model predictions, which is essential for clinical deployment because the level of uncertainty may impact treatment decision-making, and (2) the under-utilization of rich information in radiology reports that can guide segmentation in ambiguous regions. In this paper, we propose the Disagreement-Guided Refinement Network (DGRNet), a novel framework that addresses both limitations through multi-view disagreement-based uncertainty estimation and text-conditioned refinement. DGRNet generates diverse predictions via four lightweight view-specific adapters attached to a shared encoder-decoder, enabling efficient uncertainty quantification within a single forward pass. Afterward, we build disagreement maps to identify regions of high segmentation uncertainty, which are then selectively refined according to clinical reports. Moreover, we introduce a diversity-preserving training strategy that combines pairwise similarity penalties and gradient isolation to prevent view collapse. The experimental results on the TextBraTS dataset show that DGRNet improves on state-of-the-art segmentation accuracy by 2.4% and 11% on the main metrics, Dice and HD95, respectively, while providing meaningful uncertainty estimates.
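One simple way to turn several view-specific predictions into a disagreement map is per-pixel variance. The sketch below assumes that choice; the abstract does not state which disagreement measure DGRNet uses, and the function names and the `threshold` value are hypothetical.

```python
from statistics import pvariance


def disagreement_map(view_probs):
    """Per-pixel population variance across views.

    view_probs: list of V flat lists, each holding N foreground
    probabilities (one list per view-specific adapter).
    """
    return [pvariance(pixel_values) for pixel_values in zip(*view_probs)]


def uncertain_pixels(view_probs, threshold=0.05):
    """Indices of pixels whose disagreement exceeds the threshold."""
    return [index
            for index, score in enumerate(disagreement_map(view_probs))
            if score > threshold]


if __name__ == "__main__":
    # Four views, three pixels; only the middle pixel is contested.
    views = [
        [0.9, 0.1, 0.5],
        [0.9, 0.9, 0.5],
        [0.9, 0.1, 0.5],
        [0.9, 0.9, 0.5],
    ]
    print(uncertain_pixels(views))
```

Pixels flagged this way would be the candidates for the text-conditioned refinement step, while pixels where all views agree are left alone.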
cs.CV Jiatong Xia, Lingqiao Liu · Mar 22, 2026

The paper presents a training-free pipeline for reconstructing instance-aware 3D scenes from 10-20 unposed RGB images and rendering novel views using diffusion. It combines MV-DUSt3R for geometry, SAM for 2D segmentation with warping-based cross-view unification, and the See3D diffusion model for inpainting holes in point-cloud projections. The system enables object-level editing by manipulating the point cloud directly, avoiding per-scene optimization.

We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: https://jiatongxia.github.io/TID3R/
cs.NI cs.CV cs.MM Aizierjiang Aiersilan, Zhangfei Yang · Mar 22, 2026

OrbitStream addresses adaptive 360° video streaming for teleoperation by proposing a training-free framework that combines semantic scene understanding with robust control theory. It formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem where semantic objects (pedestrians, vehicles) generate potential fields that "attract" user gaze with task-relevant mass, while a Saturation-Based Proportional-Derivative (PD) Controller handles bitrate adaptation. This offers an interpretable, zero-shot alternative to black-box Deep Reinforcement Learning methods for safety-critical systems where deployment constraints prohibit lengthy training.

Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (~98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
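The abstract does not give OrbitStream's control law, so the following is a generic textbook PD controller with output clamping, shown only to illustrate what "saturation-based PD buffer regulation" can look like. The class name, gains, sign convention, and `limit` are all assumptions.

```python
def saturate(value, limit):
    """Clamp value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


class SaturatedPD:
    """Generic PD controller with a clamped output (illustrative only)."""

    def __init__(self, kp, kd, limit):
        self.kp = kp
        self.kd = kd
        self.limit = limit
        self.prev_error = None

    def step(self, target_buffer_s, buffer_s, dt):
        """Return a clamped control signal.

        Positive output means the buffer is above target, so there is
        room to raise the bitrate; negative means back off.
        """
        error = buffer_s - target_buffer_s
        if self.prev_error is None:
            derivative = 0.0  # no history on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return saturate(self.kp * error + self.kd * derivative, self.limit)


if __name__ == "__main__":
    controller = SaturatedPD(kp=0.5, kd=0.1, limit=1.0)
    for buffer_level in (12.0, 9.0, 10.0):
        print(controller.step(target_buffer_s=10.0, buffer_s=buffer_level, dt=1.0))
```

Clamping keeps a sudden bandwidth swing from producing an extreme bitrate jump, which is the usual motivation for adding saturation to a PD loop.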
cs.CV Binesh Sadanandan, Vahid Behzadan · Mar 22, 2026

Medical vision-language models (VLMs) are increasingly evaluated for consistency—the invariance of predictions under paraphrased prompts—as a proxy for clinical reliability. This paper demonstrates that consistency alone is a fundamentally flawed safety metric because models can achieve perfect consistency by learning text shortcuts while completely ignoring the input image. The authors introduce a four-quadrant per-sample taxonomy that jointly evaluates consistency and image reliance, revealing that models optimized for low flip rates often shift samples into a 'Dangerous' quadrant where predictions are stable, accurate, and confident yet unchanged when the image is removed. Their findings expose a critical deployment trap: standard evaluation pipelines risk preferentially selecting models that appear reliable while being decision-invariant to visual evidence.

Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
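The four-quadrant taxonomy above maps directly onto two boolean checks per sample. In this sketch, image reliance is tested by comparing the first prompt's prediction with the text-only prediction; that comparison rule and the function name are assumptions, since the abstract only defines the quadrants themselves.

```python
def classify_sample(paraphrase_preds, text_only_pred):
    """Assign a sample to Ideal, Fragile, Dangerous, or Worst.

    paraphrase_preds: predictions with the image present, one per
        paraphrased prompt (first entry = original prompt).
    text_only_pred: prediction for the original prompt with the
        image removed.
    """
    if not paraphrase_preds:
        raise ValueError("need at least one prediction")
    # Consistent: every paraphrase yields the same prediction.
    consistent = len(set(paraphrase_preds)) == 1
    # Image-reliant: removing the image changes the prediction.
    image_reliant = paraphrase_preds[0] != text_only_pred
    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"
    if consistent:
        return "Dangerous"
    return "Worst"


if __name__ == "__main__":
    print(classify_sample(["effusion", "effusion"], "effusion"))
```

The text-only prediction is the single extra forward pass the authors recommend adding to consistency checks.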
cs.CV cs.LG Kelly Cui, Nikhil Prakash, Ayush Raina et al. · Mar 23, 2026

This paper investigates how vision-language models (VLMs) perform spatial reasoning—the binding of objects to spatial relations. It reveals that VLMs rely on two concurrent mechanisms: a dominant one where the vision encoder encodes object layout globally across visual tokens (extending into background regions), and a secondary one where the language model backbone forms ordering representations over object tokens. The finding that enhancing these vision-derived spatial representations improves performance without fine-tuning challenges the prevailing focus on LM backbones and highlights the critical role of vision encoders in multimodal reasoning.

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.
cs.CV cs.CL Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing et al. · Mar 23, 2026

This paper introduces BHDD, the first public benchmark dataset for handwritten Burmese digits. Myanmar script's distinctive circular letterforms—originally developed for writing on palm leaves—create recognition challenges distinct from Latin digits, with pairs like 0 and 1 differing only by whether a circle is closed. The authors release 87,561 verified images (28×28 grayscale, MNIST-compatible format) from over 150 contributors, with writer-independent train/test splits and baseline models reaching up to 99.83% accuracy.

We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD
cs.CV cs.LG Mohamed A Mabrok · Mar 22, 2026

HamVision proposes using damped harmonic oscillator dynamics as a structured inductive bias for medical image analysis. The core idea is that phase-space decomposition yields three representations: position $q$ (features), momentum $p$ (gradients), and energy $H = \frac{1}{2}|z|^2$ (saliency). These serve both segmentation and classification tasks without modifying the shared bottleneck. This physics-constrained approach aims to replace generic learned transformations with interpretable, dynamics-based feature extraction across diverse medical imaging modalities.

We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position $q$ (feature content), momentum $p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior > boundary > exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.
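As a scalar illustration of the phase-space quantities named above, the sketch below integrates a single damped harmonic oscillator and records its energy. It assumes $z = q + ip$, so that $|z|^2 = q^2 + p^2$, and uses the textbook equations $\dot{q} = p$, $\dot{p} = -\omega^2 q - \gamma p$; the abstract does not define $z$ or the exact dynamics, and this is not the HamVision architecture.

```python
def simulate_oscillator(q0, p0, omega=1.0, gamma=0.5, dt=0.01, steps=1000):
    """Integrate a damped oscillator with semi-implicit Euler.

    Returns the final position, final momentum, and the energy
    H = 0.5 * (q^2 + p^2) recorded at every step (assumes z = q + i p).
    """
    q, p = q0, p0
    energies = [0.5 * (q * q + p * p)]
    for _ in range(steps):
        # Update momentum first, then position with the new momentum.
        p += dt * (-(omega ** 2) * q - gamma * p)
        q += dt * p
        energies.append(0.5 * (q * q + p * p))
    return q, p, energies


if __name__ == "__main__":
    q, p, energies = simulate_oscillator(q0=1.0, p0=0.0)
    print(f"initial energy {energies[0]:.4f}, final energy {energies[-1]:.4f}")
```

With damping switched on, the energy shrinks over time, which is the kind of intrinsic signal the paper treats as a parameter-free saliency map.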
cs.CV Qifan Li, Xingyu Zhou, Jinhua Zhang et al. · Mar 22, 2026

This paper addresses a subtle but critical issue in latent diffusion models (LDMs): VAE tokenizers tend to collapse latent variance toward zero to minimize reconstruction error, creating overly compact manifolds that are brittle against sampling perturbations. The authors propose a Variance Expansion (VE) loss that adaptively counteracts this collapse via an inverse-variance term $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$, allowing the latent space to absorb stochastic diffusion noise while maintaining reconstruction fidelity. The work achieves state-of-the-art FID 1.18 on ImageNet 256$\times$256 and provides both theoretical grounding and empirical validation across multiple architectures.

Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. We introduce a Variance Expansion loss that counteracts variance collapse. The adversarial interplay between reconstruction and variance expansion yields an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
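
The inverse-variance term quoted in the summary, $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$, can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the text does not say how $\sigma^2$ is obtained, so the sketch uses the empirical per-dimension variance over a batch as a stand-in, and the function name and default `delta` are hypothetical.

```python
from statistics import pvariance


def variance_expansion_loss(latents, delta=1e-4):
    """Mean of 1 / (sigma^2 + delta) over latent dimensions.

    latents: list of latent vectors (one per sample), all the same length.
    delta:   small constant keeping the term finite when variance is 0.
    sigma^2 is ASSUMED to be the empirical batch variance per dimension.
    """
    dims = list(zip(*latents))                # regroup values by dimension
    variances = [pvariance(d) for d in dims]  # population variance per dim
    return sum(1.0 / (v + delta) for v in variances) / len(variances)


# A nearly collapsed latent batch versus a well-spread one.
collapsed = [[0.00, 0.01], [0.01, 0.00], [0.00, 0.00]]
spread = [[-1.0, 0.5], [0.0, -0.5], [1.0, 1.5]]
loss_collapsed = variance_expansion_loss(collapsed)
loss_spread = variance_expansion_loss(spread)
```

The term grows sharply as variance approaches zero and flattens once variance is large, which is consistent with the "adaptive" behavior the summary describes: the penalty pushes hardest exactly where collapse is worst.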
cs.CV cs.AI Junrong Guo, Shancheng Fang, Yadong Qu et al. · Mar 23, 2026

This paper tackles the visual perception gap in automated text layout generation. While existing Multimodal Large Language Models (MLLMs) generate layout code (SVG/JSON) to render text on images, they operate blind to the actual rendered output, producing layouts with overlapping text, poor contrast, or misalignment. The authors propose Visual Feedback Layout Model (VFLM), which closes the loop by rendering generated SVGs and feeding the visual results back to the model for iterative reflection and refinement. The framework uses a two-stage pipeline—cold-start supervised fine-tuning followed by reinforcement learning with GRPO—and introduces a specialized layout reward model trained on fine-grained quality hierarchies. A surprising finding is that simple outcome-based rewards outperform complex process-oriented rewards that explicitly encode step-wise incentives.

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM performs adaptive reflective generation: it uses visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This capability is trained through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
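
For readers unfamiliar with GRPO, the core idea is that each sampled rollout is scored relative to the other rollouts for the same prompt. The sketch below shows that group normalization with outcome-only rewards (one scalar per finished layout). The reward values and `eps` are illustrative assumptions, and this is not taken from the VFLM code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of rollouts.

    rewards: one scalar per rollout, scored only on the final output.
    eps:     guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical layouts for the same prompt, scored by a reward model.
advantages = group_relative_advantages([0.2, 0.8, 0.5])
```

Rollouts above the group mean get a positive advantage and those below get a negative one, so no per-step reward is needed to rank them.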
cs.CV Bahram Mohammadi, Ta Duc Huy, Afrouz Sheikholeslami et al. · Mar 22, 2026

Brain tumor segmentation from MRI scans faces challenges because the three target sub-regions—Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET)—have ambiguous visual boundaries. This paper proposes TextCSP, a hierarchical framework that integrates radiological reports by replacing the standard single global text embedding with sub-region-aware prompts and a soft cascade decoder that enforces the anatomical hierarchy $ET \subset TC \subset WT$. The method builds on the TextBraTS baseline and achieves modest gains on its paired MRI-text dataset.

Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts $WT \to TC \to ET$ in a coarse-to-fine manner consistent with their anatomical containment hierarchy; (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; (3) text-semantic channel modulators that convert these representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements over state-of-the-art methods across all sub-regions, with gains of 1.7% in Dice and 6% in HD95, the two main metrics.
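
One simple way a cascade can respect the containment hierarchy $ET \subset TC \subset WT$ is to treat each inner region as conditional on its parent and multiply probabilities. The sketch below illustrates only that idea for a single voxel. The abstract does not spell out the decoder, so the multiplicative form and all names here are assumptions, and text modulation is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_cascade(logit_wt, logit_tc, logit_et):
    """Per-voxel probabilities with p_ET <= p_TC <= p_WT by construction.

    Each inner logit is read as a conditional probability given its
    parent region (an ASSUMED formulation, not the paper's decoder).
    """
    p_wt = sigmoid(logit_wt)
    p_tc = p_wt * sigmoid(logit_tc)  # TC can only be as likely as WT
    p_et = p_tc * sigmoid(logit_et)  # ET can only be as likely as TC
    return p_wt, p_tc, p_et


# Even with confident inner logits, a weak WT score caps TC and ET.
p_wt, p_tc, p_et = soft_cascade(-1.0, 3.0, 5.0)
```

Because every factor lies in (0, 1), the ordering holds for any logits, so the hierarchy is enforced softly rather than by hard masking.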
cs.LG cs.CV Alois Bachmann · Mar 23, 2026

dynActivation addresses the rigidity of fixed activation functions by introducing per-layer trainable scalars that interpolate between a base nonlinearity and a linear path. The method adds only two parameters per layer ($\alpha_i$ and $\beta_i$) via $f_i(x) = \text{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, allowing adaptive nonlinearity allocation across depth. Results show strong vision benchmarks (+14% on CIFAR-10), robustness to extreme depth scaling (95%+ accuracy on 75-layer MNIST), and faster convergence (24% AUC reduction), though LLM perplexity gains vanish in long-run training.

This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path. $\mathrm{BaseAct}(x)$ can be any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU (a $7.40$ percentage-point advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
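
The activation formula is short enough to write out directly. The sketch below evaluates $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$ on scalars with Mish as the base function. In a real network $\alpha_i$ and $\beta_i$ would be trainable per-layer parameters; here they are plain arguments, and the function names are hypothetical.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


def dyn_activation(x, alpha, beta, base_act=mish):
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x."""
    return base_act(x) * (alpha - beta) + beta * x


# alpha=1, beta=0 recovers the base activation unchanged.
as_base = dyn_activation(0.7, alpha=1.0, beta=0.0)
# alpha == beta removes the nonlinear term, leaving the linear path beta * x.
as_linear = dyn_activation(-2.0, alpha=1.0, beta=1.0)
# Intermediate values blend the two.
blended = dyn_activation(-2.0, alpha=1.0, beta=0.5)
```

The two limiting cases show why the abstract describes the scalars as interpolating between a nonlinearity and a linear path: a layer that learns $\alpha_i \approx \beta_i$ has effectively linearized itself.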
cs.CV cs.AI Nour Alhuda Albashir, Lars Pernickel, Danial Hamoud et al. · Mar 23, 2026

Autonomous vehicles struggle with adverse weather perception. This paper proposes LRC-WeatherNet, a lightweight fusion network combining LiDAR, RADAR, and camera via early BEV fusion and mid-level gating to classify weather conditions in real-time. The approach achieves $86.66\%$ accuracy on the MSU-4S dataset with $7.13\,\mathrm{ms}$ inference, demonstrating that adaptive multi-modal fusion outperforms unimodal baselines, though dataset limitations restrict generalization to rare weather events.

Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, each also suffers distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird's Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code at https://github.com/nouralhudaalbashir/LRC-WeatherNet.
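
The mid-level gating idea, weighting each modality's features by an estimate of its current reliability, can be illustrated with a small sketch. The softmax gate, the scalar gate scores, and all names below are assumptions made for illustration. The released LRC-WeatherNet code may implement gating differently.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_logits):
    """Weighted sum of per-modality feature vectors.

    features:    dict modality -> feature vector (equal lengths).
    gate_logits: dict modality -> scalar reliability score. Hypothetical:
                 a trained network would predict these from the features.
    Returns (fused vector, dict of gate weights).
    """
    names = sorted(features)
    weights = softmax([gate_logits[n] for n in names])
    dim = len(features[names[0]])
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


feats = {"camera": [1.0, 0.0], "lidar": [0.0, 1.0], "radar": [1.0, 1.0]}
# In fog, a gate might favor RADAR over camera and LiDAR.
fused, gates = gated_fusion(feats, {"camera": -1.0, "lidar": 0.0, "radar": 2.0})
```

With equal gate scores the result is a plain average, so the gate only changes the output when the modalities are judged unequally reliable.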