This paper reframes climate disinformation detection from classification to retrieval, treating narrative core messages as queries to rank corpus texts without fixed taxonomies. The authors propose SpecFi, which generates hypothetical documents using community summaries from graph-based detection (NodeRAG) as few-shot examples. The approach achieves a MAP of 0.505 on CARDS and is robust to the high narrative variance that cripples standard baselines.
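For reference, the MAP figure quoted above is the mean, over narrative queries, of the average precision of each ranked list. The sketch below shows that standard metric only; the function names and the toy rankings are hypothetical and are not taken from the paper or from any CARDS evaluation code.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Sums precision@k at every rank k where a relevant document appears,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranked, qrels.get(query, set()))
        for query, ranked in rankings.items()
    )
    return total / len(rankings)


# Toy data (hypothetical): two narrative queries, three ranked texts each.
toy_rankings = {"narrative_a": ["d1", "d2", "d3"], "narrative_b": ["d4", "d5", "d6"]}
toy_qrels = {"narrative_a": {"d1", "d3"}, "narrative_b": {"d5"}}
toy_map = mean_average_precision(toy_rankings, toy_qrels)
```

In the toy data, the first query has relevant texts at ranks 1 and 3 and the second at rank 2, so the two average precisions are 5/6 and 1/2.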
The paper addresses a fundamental limitation in 3D generation: image-conditioned models suffer from viewpoint bias and hallucinate unobserved regions, while text-conditioned models lack precise visual fidelity. The authors propose Text–Image Conditioned 3D Generation, a task requiring joint reasoning over visual exemplars and textual descriptions, and introduce TIGON—a minimalist dual-branch baseline that fuses separate image- and text-conditioned DiT backbones via zero-initialized cross-modal bridges and simple prediction averaging. This matters because it offers users more flexible control by combining pixel-aligned appearance cues with high-level semantic guidance.
This paper investigates how users decode emotions in text-based communication through electronic nonverbal cues (eNVCs)—orthographic signals like elongation, punctuation, and emojis that approximate paralinguistic features. The authors propose a taxonomy grounded in nonverbal communication theory (kinesics and paralinguistics) and test it across three complementary studies: a content analysis developing a regex detection toolkit, a within-subjects experiment manipulating eNVC presence and sarcasm ($n=513$), and focus groups exploring interpretive strategies. The work identifies sarcasm as a critical boundary condition where eNVCs fail to aid interpretation and provides an open-source Python/R package for automated cue detection.
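To make the idea of regex-based cue detection concrete, here is a minimal sketch covering four cue types. The patterns, category names, and the `detect_envcs` function are illustrative assumptions; they do not reproduce the authors' taxonomy or their released package, and Unicode emoji are deliberately left out.

```python
import re
from typing import Dict, List

# Hypothetical patterns for a few orthographic cues.
ENVC_PATTERNS = {
    # The same letter repeated three or more times, e.g. "sooo".
    "elongation": re.compile(r"([A-Za-z])\1{2,}"),
    # Runs of two or more exclamation or question marks, e.g. "!!!" or "?!".
    "repeated_punctuation": re.compile(r"[!?]{2,}"),
    # Whole words of three or more capital letters, e.g. "HAPPY".
    "all_caps_word": re.compile(r"\b[A-Z]{3,}\b"),
    # A small set of ASCII emoticons, e.g. ":)" or ";-P".
    "emoticon": re.compile(r"[:;=][-']?[)(DPp]"),
}


def detect_envcs(text: str) -> Dict[str, List[str]]:
    """Return the matched substrings for each cue type found in `text`."""
    return {
        name: [match.group(0) for match in pattern.finditer(text)]
        for name, pattern in ENVC_PATTERNS.items()
    }


cues = detect_envcs("I am sooo HAPPY!!! :)")
```

A real toolkit would need a much broader emoticon and emoji inventory, and would have to handle acronyms that match the all-caps pattern without carrying emotional emphasis.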
This paper addresses video reasoning segmentation—segmenting objects in videos based on complex human instructions—by proposing TrajSeg, a unified framework built on Multimodal Large Language Models (MLLMs). The core innovation is bidirectional text-trajectory alignment, where the model learns both text-to-trajectory grounding and trajectory-to-text captioning, alongside a Frame-level Content Integration (FCI) module and a unified mask decoder that eliminates the need for separate key-frame and tracking models. The work matters because it simplifies training pipelines and aims to improve trajectory perception in dynamic video contexts.
This paper identifies "semantic shift"—the intrinsic evolution of meaning within a text—as the root cause of embedding pathologies like anisotropy and length-induced collapse. The authors argue that pooling-based aggregation forces "semantic smoothing," where diverse sentences compromise into a diluted representation. They formalize semantic shift as the product of local evolution and global dispersion ($\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$), showing through controlled concatenation experiments that it predicts embedding concentration and retrieval degradation better than text length alone. The work reframes geometric pathologies not as inherent model defects but as consequences of content structure interacting with pooling mechanics.
3D-Layout-R1 tackles language-guided 3D spatial editing by training LLMs/VLMs to perform structured reasoning over explicit scene graphs. Instead of free-form chains-of-thought, the model outputs JSON graph edits that iteratively transform object poses and relations, combined with GRPO-based RL using dense 3D IoU and collision-aware rewards. This approach yields measurable gains in layout accuracy while maintaining interpretability across sorting, spatial alignment, and room-editing tasks.
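A reward that combines 3D IoU with a collision term might look like the sketch below. It assumes axis-aligned boxes stored as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the way the two terms are combined and the `collision_penalty` weight are hypothetical choices for illustration, not the paper's reward definition.

```python
from itertools import combinations
from typing import Dict, Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax). This layout is an assumption.
Box = Tuple[float, float, float, float, float, float]


def intersection_volume(a: Box, b: Box) -> float:
    """Volume of the overlap between two axis-aligned boxes (0.0 if disjoint)."""
    vol = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0
        vol *= hi - lo
    return vol


def volume(box: Box) -> float:
    """Volume of one box; degenerate boxes count as zero."""
    return (
        max(0.0, box[3] - box[0])
        * max(0.0, box[4] - box[1])
        * max(0.0, box[5] - box[2])
    )


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = intersection_volume(a, b)
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def layout_reward(
    pred: Dict[str, Box], target: Dict[str, Box], collision_penalty: float = 0.5
) -> float:
    """Mean per-object IoU minus a penalty on the fraction of overlapping predicted pairs."""
    mean_iou = sum(iou_3d(pred[n], target[n]) for n in target if n in pred) / len(target)
    pairs = list(combinations(pred.values(), 2))
    colliding = sum(1 for a, b in pairs if intersection_volume(a, b) > 0)
    collision_rate = colliding / len(pairs) if pairs else 0.0
    return mean_iou - collision_penalty * collision_rate


target = {"chair": (0, 0, 0, 1, 1, 1), "table": (2, 0, 0, 4, 1, 1)}
bad_pred = {"chair": (2, 0, 0, 3, 1, 1), "table": (2, 0, 0, 4, 1, 1)}
```

Because the IoU term is computed per object, the reward changes smoothly as each object moves toward its target pose, which is one plausible reading of "dense" in the summary above.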
This paper extends In-Context Operator Networks (ICONs)—which learn PDE solution operators via in-context learning without retraining—to higher-order and higher-dimensional PDEs. The authors test on 19 problem types including the heat equation and 3D linear PDEs, finding that while point-wise accuracy degrades for complex OOD problems, the model retains qualitative solution behavior.
This paper studies the coupling between three design axes in audio representation learning: input frontend (raw waveform vs. spectrogram), backbone architecture (Mamba vs. attention), and sequence length. The authors introduce HELIX, a minimal hybrid architecture with five bidirectional Mamba layers and one attention bottleneck at matched 8.3M parameter capacity. The key finding is that these choices are not independent. Raw waveforms help with Mamba but not with attention. Attention hurts on short environmental sounds but becomes critical at 30,000 tokens (5 minutes): at that length pure attention fails with OOM errors, while HELIX closes an 11.5-point gap over pure Mamba on speaker identification.
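A back-of-the-envelope calculation suggests why full attention struggles at this length. The sketch assumes a naive implementation that materializes the full score matrix in 32-bit floats; the helper name is hypothetical, and the paper's actual memory footprint depends on details (heads, layers, batch size, kernel choice) that are not given here.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one seq_len x seq_len attention score matrix.

    Assumes one head, one layer, one example, and no memory-efficient kernel.
    """
    return seq_len * seq_len * bytes_per_value


SEQ_LEN = 30_000          # tokens, from the summary above
DURATION_SECONDS = 5 * 60  # 5 minutes, from the summary above

tokens_per_second = SEQ_LEN / DURATION_SECONDS
score_matrix_gb = attention_matrix_bytes(SEQ_LEN) / 1e9

print(f"{tokens_per_second:.0f} tokens per second of audio")
print(f"{score_matrix_gb:.1f} GB per head, per layer, per example")
```

Under these assumptions a single score matrix already takes 3.6 GB, and that figure is multiplied by the number of heads, layers, and batch elements, whereas a state-space layer's memory grows linearly in sequence length.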
This paper investigates whether cross-lingual transfer (CLT)—prompting models to translate queries to English, reason in English, then translate answers back—can bridge the performance gap for low-resource languages. The authors benchmark eight LLMs across 2,000 responses in Kazakh and Mongolian, finding that CLT selectively benefits bilingual models (+2.2–4.3pp) but not English-first architectures, while revealing a concerning "fluency illusion" where models appear fluent in LRLs while producing less accurate content.
Quantum machine learning model selection currently lacks principled guidelines, forcing practitioners to train numerous expensive configurations. This paper introduces QBET (Quantum Bias-Expressivity Toolbox), an unsupervised pre-screening framework that evaluates hybrid quantum-classical transformers using LZ-complexity-based Simplicity Bias (AUC) and Expressivity metrics without gradient descent. The core idea is that architectures with higher AUC (stronger bias toward simple Boolean functions) correlate with better downstream task performance, offering a filter to identify promising quantum attention variants before committing to full training on NISQ devices.
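The complexity measure behind this kind of simplicity-bias analysis can be illustrated with a Lempel-Ziv (1976) phrase count over the truth table of a Boolean function. The sketch below shows only that measure; the function names are hypothetical, and it does not reproduce QBET's AUC or Expressivity computation, which the summary does not specify.

```python
from typing import Callable, Sequence


def lz76_complexity(s: str) -> int:
    """Number of phrases in the Lempel-Ziv (1976) parsing of `s`.

    Each phrase is the shortest substring starting at position i that does not
    already occur in the text preceding its own last character.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def truth_table(f: Callable[[Sequence[int]], int], n_bits: int) -> str:
    """Evaluate a Boolean function on all 2**n_bits inputs and return a bitstring."""
    outputs = []
    for x in range(2 ** n_bits):
        bits = [(x >> k) & 1 for k in range(n_bits)]
        outputs.append(str(f(bits)))
    return "".join(outputs)


constant_zero = truth_table(lambda bits: 0, 3)       # "00000000"
parity = truth_table(lambda bits: sum(bits) % 2, 3)  # "01101001"

print(lz76_complexity(constant_zero), lz76_complexity(parity))
```

A constant function compresses to very few phrases while parity needs more, which is the sense in which an architecture biased toward low-complexity truth tables is said to prefer "simple" functions.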
Existing safety-critical scenario generation methods force collisions through brute-force perturbations, destroying trajectory realism. CounterScene reframes this as a counterfactual inference problem within diffusion-based BEV world models: given a safe scene, identify the single agent whose behavioral change would maximally increase collision risk, then minimally intervene on that agent alone via structured diffusion guidance. This targets the realism-adversarial trade-off by allowing danger to emerge through natural interaction propagation rather than global trajectory distortion.
LLM annotations encode some human perspectives better than others, especially in subjective tasks where demographic background shapes judgments. This paper introduces Perspective-Driven Inference (PDI), a statistical framework that treats the distribution of group-specific annotations as a vector estimand $\theta^* = (\theta^*_{g_1}, \dots, \theta^*_{g_K})$ and adaptively allocates limited human labels to groups where LLM proxies are least reliable. The core contribution is an error-predictor-driven sampling rule that improves estimation accuracy for harder-to-model demographics while maintaining valid frequentist coverage.
AdditiveLLM2 is a domain-adapted multi-modal LLM for additive manufacturing built by fine-tuning Gemma 3 12B on ~50 million tokens from open-access AM journal articles. The work addresses the challenge of specializing general LLMs for technical domains without consuming context window space (as with RAG) or requiring massive datasets. Using domain adaptive pretraining (DAPT) for both text and vision plus visual instruction tuning (VIT), the authors demonstrate that even relatively small curated datasets can yield domain expertise exceeding 90% accuracy on AM knowledge tasks.
While most bias mitigation research targets binary classification, multi-class fairness remains under-explored. This paper proposes Generalised Exponentiated Gradient (GEG), an in-processing method that extends the Exponentiated Gradient framework to multi-class settings and enables simultaneous optimization of multiple fairness constraints via positive-label moment conditions. Evaluated on ten datasets against six baselines, GEG achieves fairness improvements up to 92% with moderate accuracy trade-offs, filling a critical gap in fair machine learning toolboxes.
Few-shot medical image segmentation (FSMIS) aims to segment anatomical structures with minimal annotations, but Segment Anything Model (SAM) based approaches suffer from over-segmentation due to ambiguous medical boundaries. This paper reformulates SAM-based FSMIS as a background-centric prompt localization task, proposing FoB (Focus on Background) to generate precise background prompts that constrain SAM’s predictions. By modeling contextual dependencies and ring-like structural priors, the method achieves state-of-the-art performance across CT, MRI, and dermatoscopic imaging while maintaining strong cross-domain generalization.
Multimodal skin cancer diagnosis with vision-language models faces a trilemma of computational cost, data scarcity, and black-box opacity. SkinCLIP-VL tackles this via a "frozen perception, adaptive reasoning" architecture that keeps CLIP frozen, adapts a quantized Qwen2.5-VL via LoRA, and introduces the Consistency-aware Focal Alignment (CFA) Loss to jointly handle class imbalance, cross-modal alignment, and calibration. The paper matters because it couples strong empirical performance with a clinician validation study, aiming to bridge the gap between AI accuracy and clinical trust.
GIDE addresses a key challenge in image editing: applying training-free editing techniques to Diffusion Large Language Models (DLLMs). Unlike continuous diffusion models where DDIM inversion is well-established, DLLMs use discrete tokenization that prevents direct application of standard noise inversion. GIDE introduces a three-stage framework (grounding, inversion, refinement) that enables precise localized editing via points, boxes, or text prompts while preserving background content. The significance lies in bridging discrete token spaces with high-fidelity inversion without additional training.
DGRNet addresses two critical gaps in brain tumor segmentation: reliable uncertainty quantification and under-utilization of radiology reports. The core idea transforms prediction disagreement among multiple lightweight view-specific adapters into an active signal that guides targeted refinement in ambiguous regions, integrated with clinical text conditioning. This approach achieves state-of-the-art accuracy on the TextBraTS benchmark while providing clinically meaningful uncertainty estimates calibrated to actual errors.
The paper proposes the Conspiracy Frame, a semiotic and frame-semantic representation of conspiratorial narratives with five elements (plan, secret, in-group, out-group, call-to-action), and introduces Con.Fra., a span-annotated Telegram corpus. The core hypothesis is that injecting FrameNet-derived semantic frames into LLM prompts will improve conspiracy detection and explainability. Results show that frame-guided prompting achieves classification scores comparable to few-shot learning but does not consistently outperform it; it does, however, reveal abstract semantic patterns in the data.
The paper presents a training-free pipeline for reconstructing instance-aware 3D scenes from 10-20 unposed RGB images and rendering novel views using diffusion. It combines MV-DUSt3R for geometry, SAM for 2D segmentation with warping-based cross-view unification, and the See3D diffusion model for inpainting holes in point-cloud projections. The system enables object-level editing by manipulating the point cloud directly, avoiding per-scene optimization.