This paper tackles parameter-efficient multi-task learning (PEFT-MTL), where the challenge is to share parameters across tasks without interference while maintaining the efficiency of methods like LoRA. The core idea is Free Sinewich: it modulates a shared low-rank convolutional adapter (Sine-AWB) using task-specific sinusoidal frequencies generated by a lightweight Clock Net, achieving task specialization without duplicating parameters. This frequency-switching mechanism is inspired by biological oscillatory multiplexing and aims to decorrelate task weights while boosting effective rank.
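The frequency-switching idea can be illustrated with a toy sketch. It is not the paper's Sine-AWB or Clock Net: the 2x2 rank-1 matrix, the helper names, and the two frequencies are all assumptions chosen for illustration. It only shows that passing one shared low-rank matrix through an elementwise sine at different frequencies yields distinct, higher-rank task weights without adding any shared parameters.

```python
import math

def outer(u, v):
    """Rank-1 matrix u v^T, standing in for a shared low-rank update."""
    return [[ui * vj for vj in v] for ui in u]

def sine_modulate(matrix, omega):
    """Apply sin(omega * w) to every entry; omega is the task frequency."""
    return [[math.sin(omega * w) for w in row] for row in matrix]

def det2(m):
    """Determinant of a 2x2 matrix; zero means the rank is below 2."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Shared parameters (hypothetical values): [[1, 3], [2, 6]] has rank 1.
shared = outer([1.0, 2.0], [1.0, 3.0])

# Two hypothetical task frequencies reuse the same shared matrix.
task_a = sine_modulate(shared, omega=1.0)
task_b = sine_modulate(shared, omega=0.5)

print("det(shared) =", det2(shared))
print("det(task_a) =", det2(task_a))
print("det(task_b) =", det2(task_b))
```

In this toy case the shared matrix is singular while both modulated matrices are not, which is the rank-boosting effect the summary refers to. The only per-task quantity is the scalar frequency.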
SSAM tackles the problem of merging independently trained multimodal large language models (e.g., vision-language and audio-language specialists) into a single model capable of processing arbitrary modality combinations without any paired multimodal training data. The core idea is to project language-specific parameter updates (task vectors) onto a shared low-rank subspace identified via SVD, thereby aligning consistent update directions while filtering conflicting ones before merging. This is significant because it offers a training-free alternative to expensive joint multimodal training, achieving state-of-the-art results on four benchmarks.
JANUS addresses jailbreaking of text-to-image models by reframing the discrete prompt search as optimization over a structured distribution. The framework mixes two Gaussian-anchored prompt distributions—one around the target harmful prompt and one around a sanitized 'clean' version—and uses policy gradient on a single scalar mixing parameter $\alpha$ to maximize end-to-end reward. This avoids both proxy-loss optimization and costly LLM-based generators, achieving substantial efficiency gains while exposing weaknesses in current safety pipelines.
This paper addresses vision-only UAV navigation in GNSS-denied environments by moving beyond the standard "matching-to-tile" (M2T) paradigm. Instead of retrieving discrete satellite tiles, the proposed Bearing-UAV method jointly regresses continuous position and heading from four neighboring satellite tiles and a UAV view patch, enabling sub-tile localization accuracy while maintaining a lightweight model. The work also introduces Bearing-UAV-90K, a multi-city dataset with heading annotations designed for unaligned cross-view scenarios.
This paper introduces ChronoCon, a self-supervised method that repurposes Rank-N-Contrast learning to use temporal ordering of longitudinal medical scans instead of expert severity labels. By assuming monotonic progression in irreversible diseases, the method learns progression-aware representations from routinely archived clinical metadata. The core finding is that under few-shot scenarios—using labels from only 5 patients—the model achieves an ICC of 86% for disease severity prediction on rheumatoid arthritis radiographs, potentially reducing reliance on costly expert annotations.
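A minimal sketch of a Rank-N-Contrast style loss driven by acquisition time may help make the idea concrete. It assumes all scans come from one patient, uses negative Euclidean distance as the similarity, and treats the visit time as the ranking signal. The function name and the toy values are illustrative assumptions, not the paper's implementation.

```python
import math

def rank_n_contrast(embeddings, times, tau=1.0):
    """Rank-N-Contrast style loss with visit times as the ordering signal.

    For an anchor i and a partner j, the comparison set holds every other
    scan k whose time gap to i is at least as large as the gap between i
    and j. The loss pushes j to be closer to i than those scans are.
    """
    n = len(embeddings)

    def sim(a, b):
        # Negative Euclidean distance as similarity (an assumption).
        return -math.dist(embeddings[a], embeddings[b])

    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gap = abs(times[i] - times[j])
            denom = sum(
                math.exp(sim(i, k) / tau)
                for k in range(n)
                if k != i and abs(times[i] - times[k]) >= gap
            )
            total += math.log(denom) - sim(i, j) / tau
            pairs += 1
    return total / pairs

times = [0, 1, 2, 3]  # hypothetical visit indices for one patient
ordered = rank_n_contrast([[0.0], [1.0], [2.0], [3.0]], times)
shuffled = rank_n_contrast([[2.0], [0.0], [3.0], [1.0]], times)
print(ordered, shuffled)
```

Embeddings that follow the visit order receive a lower loss than embeddings that scramble it, so minimizing the loss encourages a representation that is monotone in time. No severity label is used anywhere.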
This paper tackles the instability of Group Relative Policy Optimization (GRPO) when applied to video generation. The core problem is that converting deterministic ODE samplers to SDE for exploration injects excess noise in high-noise regimes, causing off-manifold drift that degrades rollout quality and destabilizes reward updates. SAGE-GRPO introduces a precise SDE with logarithmic curvature correction to keep exploration closer to the flow trajectory, plus a Dual Trust Region mechanism combining periodic moving anchors with stepwise KL constraints to prevent long-horizon drift. The method is evaluated on HunyuanVideo1.5 using VideoAlign rewards, showing improvements over DanceGRPO, FlowGRPO, and CPS.
Hyperbolic Vision-Language Models (VLMs) improve hierarchical structure preservation over Euclidean counterparts, yet existing approaches treat all part-whole relationships as equally informative. This paper proposes UNCHA (UNcertainty-guided Compositional Hyperbolic Alignment), which leverages the hyperbolic radius as an uncertainty measure to quantify the varying semantic representativeness of image parts to the whole scene. By incorporating this uncertainty into adaptive temperature scaling for contrastive learning and an entropy-regularized entailment loss, UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and fine-grained compositional benchmarks, demonstrating that modeling heterogeneous part-whole strength is critical for complex multi-object understanding.
Adversarial Camouflage proposes a wearable privacy defense against facial recognition by optimizing simple face paint patterns (stripes or chevrons) to adversarially minimize embedding similarities across multiple recognition models. The core idea is to restrict the attack space to low-dimensional, user-reproducible geometric parameters (color, angle, width) that can be painted onto semantically valid facial regions, enabling protesters and privacy-conscious individuals to evade automated surveillance without specialized equipment.
This paper tackles camera-agnostic pruning of 3D Gaussian splats for standardized interchange settings like MPEG I-3DGS, where training images, camera parameters, and gradients are unavailable. The authors propose BetaDescPrune, a one-shot post-training method that computes Hybrid Splat Feature Histogram (HSFH) descriptors to capture local geometric and appearance consistency, then models pruning decisions via Beta-distributed evidence with uncertainty-aware confidence scoring. The core insight is that reliable splat importance can be inferred from intrinsic neighborhood structure alone without rendering supervision.
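One way to read "Beta-distributed evidence with uncertainty-aware confidence scoring" is sketched below. The evidence counts, the uniform Beta(1, 1) prior, and the lower-bound score are assumptions made for illustration; the summary does not specify how BetaDescPrune derives evidence from its HSFH descriptors.

```python
import math

def beta_confidence(keep_evidence, prune_evidence, k=1.0):
    """Return (mean, lower_bound) of a Beta posterior over 'keep this splat'.

    keep_evidence and prune_evidence are hypothetical non-negative scores,
    for example counts of neighbors that agree or disagree with the splat.
    """
    alpha = 1.0 + keep_evidence   # uniform Beta(1, 1) prior (assumption)
    beta = 1.0 + prune_evidence
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    # Penalize uncertain estimates: little evidence lowers the score.
    return mean, mean - k * math.sqrt(variance)

weak_mean, weak_score = beta_confidence(3, 1)
strong_mean, strong_score = beta_confidence(30, 10)
print(weak_score, strong_score)
```

Both splats have the same 3:1 evidence ratio, but the one backed by more evidence gets a higher score because its posterior is narrower. A pruning rule could then threshold this score instead of the raw mean.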
SteelDefectX introduces a vision-language dataset for steel defect detection that aggregates 7,778 images from four existing sources with novel coarse-to-fine textual annotations—ranging from class-level defect descriptions to sample-level attributes (shape, size, depth, position, contrast) generated via GPT-4o. The paper establishes a four-task benchmark showing that rich textual supervision improves cross-material transfer, though it reveals a tension where fine-grained annotations unexpectedly hurt few-shot performance.
SpatialReward addresses the persistent problem of spatial inconsistencies in text-to-image generation, where models produce globally plausible images with incorrect object positioning and relationships. The paper proposes a three-stage verifiable reward model that decomposes free-form prompts into structured constraints, verifies object attributes via expert detectors, and employs vision-language chain-of-thought reasoning to assess complex spatial layouts. Integrated into Flow-GRPO reinforcement learning for Stable Diffusion and FLUX, the approach significantly improves spatial consistency while maintaining overall image quality.
Ctrl-A addresses automated data augmentation by framing it as a control problem, dynamically adjusting per-operation augmentation strengths via a feedback loop that balances training and validation loss ratios. The method introduces Relative Operation Response (ROR) curves to individually tune transformation distributions without manual initialization or expensive search phases. While it achieves competitive results on CIFAR and SVHN benchmarks with minimal computational overhead (~10% vs. TrivialAugment), the evaluation relies on a modified training setup with extended epochs, raising questions about separability of algorithmic gains from training protocol changes.
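The feedback loop can be pictured as a simple proportional controller. The sketch below is a generic stand-in and not the paper's update rule or its ROR curves: the target ratio, the gain, the clamping range, and the function name are all assumed for illustration.

```python
def update_strength(strength, train_loss, val_loss, target_ratio=1.0, gain=0.1):
    """Nudge one operation's augmentation strength toward a loss-ratio target.

    When the training loss is low relative to the validation loss, the
    ratio falls below the target and the strength is increased. When the
    training loss is comparatively high, the strength is decreased.
    """
    ratio = train_loss / val_loss
    error = target_ratio - ratio
    return min(1.0, max(0.0, strength + gain * error))

# Hypothetical per-operation strengths, each updated independently.
strengths = {"rotate": 0.5, "shear": 0.98}
strengths = {
    name: update_strength(value, train_loss=0.2, val_loss=0.4)
    for name, value in strengths.items()
}
print(strengths)
```

Because each operation keeps its own strength, the same loop can raise one transformation while leaving another saturated at the upper bound.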
Video diffusion models suffer from prohibitive inference costs, but standard image distillation techniques like DMD cause severe oversaturation and temporal collapse when naively extended to video. This work introduces a video-specific distillation framework featuring an adaptive regression loss that dynamically reweights real-data supervision to prevent color artifacts, a temporal variance regularizer to combat static output, and an inference-time frame interpolation module that halves sequence length during high-noise steps to accelerate generation. Applied to Wan2.1, the method enables stable 4-step synthesis with state-of-the-art VBench scores.
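A temporal variance regularizer can be sketched as a hinge penalty on per-pixel variance across frames. The exact form used in the paper is not given in the summary, so the floor value, the hinge shape, and the flat-list frame representation below are assumptions.

```python
def temporal_variance(frames):
    """Mean over pixels of the variance of each pixel across time.

    frames is a list of equally sized flat lists of pixel values.
    """
    n_frames = len(frames)
    n_pixels = len(frames[0])
    total = 0.0
    for p in range(n_pixels):
        values = [frame[p] for frame in frames]
        mean = sum(values) / n_frames
        total += sum((v - mean) ** 2 for v in values) / n_frames
    return total / n_pixels

def variance_penalty(frames, floor=0.01):
    """Positive only when the clip moves less than the assumed floor."""
    return max(0.0, floor - temporal_variance(frames))

static_clip = [[0.5, 0.5] for _ in range(4)]
moving_clip = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(variance_penalty(static_clip), variance_penalty(moving_clip))
```

A frozen clip is penalized while a clip with motion is not, which is the behavior needed to discourage static output during distillation.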
SegMaFormer proposes a hybrid encoder for 3D medical image segmentation that places Mamba state-space layers in early high-resolution stages (for linear-complexity sequence mixing) and self-attention only in deeper low-resolution stages (where quadratic cost is manageable). The goal is to reduce the prohibitive compute of full 3D attention while preserving global context. With just 2M parameters and 15 GFLOPs, the authors claim competitive results on BraTS, Synapse, and ACDC benchmarks against models up to 75× larger.
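The motivation for the stage placement follows from token counts. The short calculation below assumes a 128x128x128 input, a patch stride of 4, and a halving of each spatial dimension per stage; these sizes are illustrative and not taken from the paper.

```python
def stage_tokens(volume=(128, 128, 128), patch=4, stages=4):
    """Token count at each encoder stage of a hypothetical 3D hierarchy."""
    d, h, w = (side // patch for side in volume)
    counts = []
    for _ in range(stages):
        counts.append(d * h * w)
        d, h, w = d // 2, h // 2, w // 2  # downsample between stages
    return counts

for stage, n in enumerate(stage_tokens(), start=1):
    # Linear mixing scales with n; self-attention scales with n * n.
    print(f"stage {stage}: tokens={n}, pairwise interactions={n * n}")
```

Under these assumptions the first stage has 32768 tokens and over a billion pairwise interactions, while the last stage has 64 tokens and 4096 interactions, so attention is affordable only in the deeper stages.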
This paper addresses video moment retrieval (VMR) for complex multi-verb queries by proposing a two-stage framework that generates auxiliary short videos via text-to-video diffusion (CogVideoX) as temporal motion priors, then processes them through a linear-time Mamba network. The approach tackles the limitation of static image augmentations—which miss motion dynamics—while avoiding the quadratic complexity of Transformer-based methods on long untrimmed videos. The framework achieves state-of-the-art results on TVR with particular strength on multi-verb queries, though its effectiveness depends heavily on external video generation quality.
SPECTRE-G2 tackles epistemic uncertainty in safety-critical systems by detecting 'unknown unknowns'—inputs that violate the structural assumptions of the training distribution. Unlike prior work that relies on single signals (confidence, density, or reconstruction error), this paper proposes a multi-expert architecture combining eight complementary signals from a dual-backbone network. The core idea is that diverse structural anomalies require diverse detection mechanisms. The method achieves strong empirical results across synthetic causal, tabular, image, and RL environments, though some baseline implementations appear problematic.
WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.
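A content-aware caching decision might look like the sketch below. It is a simplified illustration under stated assumptions: the drift measure, the way motion shrinks the threshold, and every parameter name are hypothetical, and the paper's feature blending step is not modeled.

```python
def weighted_drift(prev_features, curr_features, saliency):
    """Saliency-weighted mean absolute change between two feature vectors."""
    weighted = sum(
        s * abs(c - p)
        for p, c, s in zip(prev_features, curr_features, saliency)
    )
    return weighted / sum(saliency)

def should_recompute(drift, motion, base_threshold=0.1, sensitivity=1.0):
    """Recompute when drift exceeds a threshold that tightens with motion."""
    threshold = base_threshold / (1.0 + sensitivity * motion)
    return drift > threshold

drift = weighted_drift([0.0, 0.0], [1.0, 0.0], saliency=[3.0, 1.0])
print(drift, should_recompute(0.08, motion=0.0), should_recompute(0.08, motion=1.0))
```

The same drift value is tolerated for a slow scene and rejected for a fast one, which is how a motion-adaptive threshold avoids reusing stale features during rapid movement.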
SHAPE addresses unsupervised domain adaptation for medical image segmentation, where models trained on one imaging modality (e.g., MRI) degrade sharply when applied to another (e.g., CT). The core innovation shifts the paradigm from pixel-level correctness to global anatomical plausibility through a DINOv3 foundation model, a Hierarchical Feature Modulation (HFM) module for class-aware alignment, and a Hypergraph Plausibility Estimation (HPE) pipeline that validates pseudo-labels using higher-order anatomical relationships. This matters for deploying robust clinical segmentation models across diverse imaging environments without costly manual re-annotation.
Traditional latent diffusion models require staging—first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.
Personalized image generation with diffusion models relies on Low-Rank Adaptation (LoRA) to fine-tune models efficiently, but current practice uses a fixed rank across all layers regardless of subject complexity. This paper proposes LoRA2, which learns adaptive ranks per LoRA component via a variational framework that imposes an importance ordering over rank indices using a discretized exponential distribution. The method achieves better subject fidelity and prompt alignment while using significantly less memory than high-rank baselines, addressing the combinatorial explosion of searching $S K^L$ architectural configurations.
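One plausible reading of an importance ordering induced by a discretized exponential distribution is a truncated geometric prior over the rank, whose tail probabilities form a soft mask over rank indices. The sketch below follows that reading only. The rate parameter, the truncation at a maximum rank, and both function names are assumptions, and the paper's variational objective is not reproduced.

```python
import math

def rank_pmf(rate, max_rank):
    """Discretized exponential over ranks 1..max_rank, renormalized."""
    weights = [math.exp(-rate * k) for k in range(1, max_rank + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def rank_mask(rate, max_rank):
    """P(rank >= k) for k = 1..max_rank, a non-increasing soft mask."""
    mask, tail = [], 1.0
    for p in rank_pmf(rate, max_rank):
        mask.append(tail)
        tail -= p
    return mask

low_rank = rank_mask(rate=2.0, max_rank=8)   # mass concentrated on small ranks
high_rank = rank_mask(rate=0.1, max_rank=8)  # mass spread over many ranks
print(sum(low_rank), sum(high_rank))          # expected rank under each rate
```

Because the mask never increases with the index, earlier rank components are always kept at least as strongly as later ones, and a single learnable rate per component controls its effective rank.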