FeatDistill tackles robust detection of AI-generated images under real-world degradations via a multi-expert ensemble of CLIP and SigLIP backbones. The framework combines extensive data expansion with a two-stage training paradigm featuring feature-level self-distillation. It aims to balance strong generalization across unseen generators with practical inference efficiency.
StreamingEval introduces a unified evaluation framework for Video-LLMs under realistic streaming constraints, moving beyond offline benchmarks to assess continuous, real-time video understanding with limited memory. The protocol enforces a fixed-capacity memory bank and jointly measures encoding throughput (MaxFPS), decoding latency (TTFT), memory usage, and task accuracy via a composite StreamingScore. Experiments reveal that current "online" models often fail under strict streaming constraints, while offline models adapted with FIFO memory banks frequently outperform specialized streaming architectures at the cost of higher resource consumption.
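A fixed-capacity memory bank with FIFO eviction, as mentioned above, can be sketched in a few lines. This is only an illustration of the eviction policy, not the StreamingEval implementation: the class name `FifoMemoryBank`, its methods, and the string placeholders standing in for frame features are all hypothetical.

```python
from collections import deque


class FifoMemoryBank:
    """Hypothetical fixed-capacity store: once full, the oldest entry is evicted."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) silently drops the oldest item when a new one
        # is appended to a full deque, which is exactly FIFO eviction.
        self._items = deque(maxlen=capacity)

    def add(self, frame_feature):
        """Store one frame's feature, evicting the oldest if at capacity."""
        self._items.append(frame_feature)

    def contents(self):
        """Return stored features, oldest first."""
        return list(self._items)

    def __len__(self):
        return len(self._items)


# Stream five placeholder frame features through a bank that holds three.
bank = FifoMemoryBank(capacity=3)
for frame_id in range(5):
    bank.add(f"feat_{frame_id}")

print(bank.contents())
```

Because the capacity is fixed, memory use stays bounded no matter how long the stream runs; the trade-off is that anything older than the last `capacity` entries is no longer available to the model.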
This paper tackles the Close Small Object Unmixing (CSOU) problem for infrared imagery, where distant clustered targets appear as overlapping mixed spots due to optical diffraction limits. The authors propose DSCSNet, a deep-unfolded network that unrolls the ADMM algorithm with learnable parameters to recover target count, sub-pixel positions, and radiant intensities from mixed spots. The core idea is to replace the traditional ℓ2-norm smoothness terms with strict ℓ1-norm sparsity constraints and add a dynamic thresholding mechanism for scene-adaptive reconstruction.
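For context on the ℓ1 constraint: in ADMM, the sub-problem for an ℓ1-norm term is solved in closed form by element-wise soft thresholding. The sketch below shows only that standard operator, not DSCSNet's layers. The function name `soft_threshold` and the fixed scalar threshold are assumptions made for illustration; in a deep-unfolded network the threshold would typically be a learnable or input-dependent quantity.

```python
def soft_threshold(values, threshold):
    """Element-wise soft thresholding, the proximal operator of the l1 norm.

    Each entry is shrunk toward zero by `threshold`; entries whose
    magnitude does not exceed the threshold become exactly zero,
    which is what produces sparse solutions.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    shrunk = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            # Keep the original sign, reduce the magnitude.
            shrunk.append(magnitude if v > 0 else -magnitude)
    return shrunk


# Hypothetical coefficients: small entries are zeroed, large ones are shrunk.
coeffs = [3.0, -0.5, 0.2, -2.0]
sparse = soft_threshold(coeffs, threshold=1.0)
print(sparse)
```

Raising the threshold zeroes more entries, which is why making it adaptive gives a handle on how sparse the reconstruction is for a given scene.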
OrbitStream addresses adaptive 360° video streaming for teleoperation by proposing a training-free framework that combines semantic scene understanding with robust control theory. It formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem where semantic objects (pedestrians, vehicles) generate potential fields that "attract" user gaze with task-relevant mass, while a Saturation-Based Proportional-Derivative (PD) Controller handles bitrate adaptation. This offers an interpretable, zero-shot alternative to black-box Deep Reinforcement Learning methods for safety-critical systems where deployment constraints prohibit lengthy training.
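To make the controller idea concrete, here is a minimal sketch of one step of a generic PD controller whose output is clamped (saturated) to fixed bounds. It is not OrbitStream's controller: the function name, the gains `kp` and `kd`, the bounds, and the interpretation of the error signal (for example, deviation from a target buffer level) are all assumptions for illustration.

```python
def saturated_pd_step(error, prev_error, dt, kp, kd, u_min, u_max):
    """One step of a PD controller with output saturation.

    error, prev_error: current and previous tracking error (hypothetical
        signal, e.g. target minus measured buffer level).
    dt: time between the two error samples, must be positive.
    kp, kd: proportional and derivative gains (illustrative values only).
    u_min, u_max: bounds the control output is clamped to.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if u_min > u_max:
        raise ValueError("u_min must not exceed u_max")
    derivative = (error - prev_error) / dt
    raw = kp * error + kd * derivative
    # Saturation: never command an adjustment outside the allowed range.
    return max(u_min, min(u_max, raw))


# Raw output is 1.0 * 2.0 + 0.5 * (2.0 - 1.0) / 1.0 = 2.5, clamped to 1.0.
step = saturated_pd_step(error=2.0, prev_error=1.0, dt=1.0,
                         kp=1.0, kd=0.5, u_min=-1.0, u_max=1.0)
print(step)
```

The clamp bounds how far a single step can move the output, which keeps the behavior easy to reason about and requires no training, only a choice of gains and limits.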
This paper exposes a critical vulnerability in Multimodal Large Language Models (MLLMs): safety alignment fails when harmful intent is embedded in structured visual narratives. The authors introduce ComicJailbreak, a benchmark of 1,167 three-panel comics in which panels 1–2 establish narrative context and panel 3 contains an otherwise blank speech bubble that is filled with a paraphrased harmful goal. The model is then prompted to "complete the comic" by generating a fourth panel. Across 15 state-of-the-art MLLMs, comic-based attacks achieve ensemble success rates exceeding 90% on Gemini-family models and at least 85% on most open-source models, substantially outperforming plain-text and random-image baselines. The work also shows that existing defenses (AdaShield, Attack as Defense) trigger severe over-refusal on benign prompts, and that automated safety judges are unreliable on sensitive-but-benign content.
This paper tackles the domain generalization problem in image deraining, where models trained on synthetic data fail catastrophically on out-of-distribution (OOD) real-world scenarios. The authors propose a three-stage pipeline—Superpixel Generation, Resolution-adaptive Fusion, and Pseudo-label Re-Synthesis—that adapts source-domain models to target domains using only unpaired rain-free images, eliminating the need for costly paired rainy data collection.
This paper tackles multimodal misinformation detection by distinguishing between harmful and harmless visual content manipulation—a nuance often overlooked by existing methods. The authors propose Havc-m4d, a framework that extracts manipulation and intention features using weakly-supervised positive-unlabeled (PU) learning to overcome the lack of ground-truth manipulation labels. By treating real articles with manipulated visuals as likely harmless and fake articles as potentially harmful, the method introduces intention-aware cues that consistently improve detection across four benchmark datasets.