Narrative similarity is inherently interpretive—different valid readings can yield divergent judgments, challenging benchmarks that encode single ground truths. This paper proposes embracing multiperspectivity by ensembling 31 LLM personas, ranging from literary critics to lay characters, to predict which of two stories is more similar to an anchor. The approach leverages Condorcet Jury Theorem-like dynamics to improve accuracy, achieving 0.705 on SemEval-2026 Task 4 while revealing that diverse practitioner perspectives yield better ensemble gains despite lower individual performance.
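The Condorcet-style dynamics mentioned above can be illustrated with a short calculation. The sketch below is not the paper's method: it assumes every voter is independent and equally accurate, which persona votes drawn from the same underlying LLM are unlikely to satisfy, so it shows an idealized ceiling rather than an expected result. The function name `majority_accuracy` is hypothetical.

```python
from math import comb


def majority_accuracy(n_voters: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each voter is correct with the same probability p and that
    votes are statistically independent (the classical Condorcet setting).
    """
    if n_voters < 1 or n_voters % 2 == 0:
        raise ValueError("use a positive odd number of voters to avoid ties")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    needed = n_voters // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_voters, k) * p**k * (1.0 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )
```

For example, `majority_accuracy(3, 0.6)` evaluates to 0.648, and the value keeps rising as more voters with `p > 0.5` are added. Correlated errors between personas would shrink that gain, which is consistent with the summary's point that diversity matters more than individual accuracy.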
daVinci-MagiHuman tackles joint audio-video generation using a refreshingly simple single-stream Transformer that processes text, video, and audio tokens through self-attention only, avoiding the cross-attention and fusion modules common in competing multi-stream architectures. The model achieves strong human-centric generation quality across six languages while delivering impressive inference speed: 2 seconds for a 5-second 256p video on an H100.
Soft robot simulators suffer from a sim-to-real gap that widens when optimizing morphology, because calibration parameters identified on one geometry often fail to transfer to unseen shapes. This paper proposes Residual Acceleration Field Learning (RAFL), which learns local corrective accelerations defined on quadrature elements rather than global nodal forces. By operating on deformation and velocity gradients in material space, the model becomes independent of mesh topology and discretization, enabling zero-shot generalization across geometries.
Beta-KD tackles the problem of balancing data supervision against teacher guidance when distilling multimodal large language models. The authors frame knowledge distillation as Bayesian MAP estimation with teacher-informed Gibbs priors over student activations, deriving a closed-form uncertainty-aware weighting mechanism via Laplace approximation. This eliminates manual tuning of loss weights and achieves consistent improvements across six VQA benchmarks.
This paper addresses hypertension screening from inexpensive retinal fundus images by distilling knowledge from high-fidelity brain MRI—without requiring paired acquisitions from the same patients. The proposed Clinical Graph-Mediated Distillation (CGMD) constructs a clinical similarity graph using shared biomarkers (age, labs, etc.) to bridge disjoint MRI and fundus cohorts, propagates MRI teacher embeddings over the graph to impute patient-specific targets for fundus patients, and trains a fundus student with supervised, prior, and relational distillation losses. The approach aims to capture subtle vascular signals in fundus images by leveraging MRI-derived markers of small-vessel disease.
This paper introduces Cross-Context Verification (CCV), a black-box method for detecting LLM benchmark contamination by solving the same coding problem $N$ times in isolated sessions and measuring solution diversity. The key insight is that memorized solutions are deterministic while genuine reasoning produces natural variation. The paper pairs this with Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that uses strict information restriction to prevent confirmation bias. As coding benchmarks face credibility crises from solution leakage, this work targets the urgent need to distinguish reasoning from recall in SWE-bench evaluations.
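The diversity measurement at the heart of CCV can be sketched as follows. The summary does not say which diversity metric the paper uses, so this example substitutes a simple stand-in: mean pairwise textual dissimilarity computed with the standard-library `difflib.SequenceMatcher`. The function name `solution_diversity` is hypothetical, and collecting the $N$ solutions from isolated sessions is left to the caller.

```python
from difflib import SequenceMatcher
from itertools import combinations


def solution_diversity(solutions: list[str]) -> float:
    """Mean pairwise dissimilarity over all pairs of solutions.

    Returns 0.0 when every solution is textually identical and approaches
    1.0 as the solutions share less text. This is an illustrative proxy,
    not the metric defined in the paper.
    """
    if len(solutions) < 2:
        raise ValueError("need at least two solutions to compare")
    pairs = list(combinations(solutions, 2))
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

Under this proxy, a score near zero across many independent sessions would be the signal of interest, since the summary's premise is that memorized solutions repeat while genuine reasoning varies. A character-level comparison is sensitive to formatting and naming, so a practical version would likely normalize the code first.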
This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework tackling 'under-vibrancy'—a condition of low visitor density suppressing economic activity—in declining regions like Fukui, Japan. Contrasting with overtourism literature, it integrates Google Business Profile search intent, Japan Meteorological Agency micro-climate data, edge-AI cameras, and 97,719 survey responses to forecast tourism flows and quantify economic leakage. The work promises algorithmic governance via 'dual-nudge' interventions to redirect visitors and coordinate merchant behavior, backed by claims of $R^2=0.810$ explanatory power.
Physics-informed neural networks typically enforce boundary conditions via penalty terms, leading to approximate satisfaction and training pathologies. This paper proposes a systematic method to enforce Dirichlet, Neumann, and Robin conditions exactly on curved quadrilateral domains using Theory of Functional Connections (TFC) combined with transfinite interpolation. The key innovation is handling compatibility constraints at vertices where mixed boundary conditions meet, particularly when two Neumann/Robin boundaries intersect, by decomposing the problem into a four-step procedure.
HMS-VesselNet addresses the challenge of segmenting thin peripheral retinal vessels in fundus images—a critical task for early diabetic retinopathy detection where standard overlap losses fail due to class imbalance and topological fragmentation. The paper proposes a four-scale hierarchical Attention U-Net architecture with learned fusion weights, combining Dice, binary cross-entropy, and centerline Dice ($\text{clDice}$) losses alongside hard example mining to boost sensitivity on sub-2-pixel vessels. Evaluated on 68 images from DRIVE, STARE, and CHASE_DB1 via 5-fold cross-validation and leave-one-dataset-out protocols, the model achieves $90.78\pm1.42\%$ Sensitivity, demonstrating that explicit topology preservation and targeted hard example oversampling can recover fine vascular structures missed by standard area-based losses.
Generative policies represent actions as multi-step denoising trajectories, rendering standard PPO's single-step action-space ratios mismatched to the policy structure. This paper proposes GSB-PPO, a path-space formulation inspired by Generalized Schrödinger Bridge that lifts proximal updates from terminal actions to full generation paths. The central finding is that a penalty-based objective substantially outperforms the direct clipping extension, establishing trajectory-level regularization as the preferred inductive bias for on-policy generative RL.
Vision-Language Models face escalating safety risks from adversarial jailbreak attacks that bypass alignment via manipulated visual inputs. This paper introduces NullSteer, a training-free defense that applies activation steering constrained to the null space of benign representations—mathematically guaranteeing that safe inputs remain unchanged while harmful activations are redirected toward refusal semantics. The approach aims to solve the over-refusal problem plaguing existing steering methods, offering a principled trade-off between robust safety and preserved utility.
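One way to read the null-space constraint described above is as a projection: remove from an activation everything that lies in the span of benign activations, and let the steering map act only on what remains. The sketch below follows that reading under stated assumptions. The linear form `h + W (h - Q Qᵀ h)`, the helper names `orthonormal_basis` and `steer`, and the steering matrix `W` are all illustrative choices and are not taken from the paper.

```python
from math import sqrt

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: list[Vector], tol: float = 1e-10) -> list[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the benign activations."""
    basis: list[Vector] = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def steer(h: Vector, basis: list[Vector], W: list[Vector]) -> Vector:
    """Apply steering only to the part of h outside the benign subspace.

    residual = h - Q Q^T h is zero for any h in the benign span, so such
    activations are returned unchanged. W is a hypothetical steering matrix.
    """
    residual = list(h)
    for q in basis:
        c = dot(h, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    update = [dot(row, residual) for row in W]
    return [hi + ui for hi, ui in zip(h, update)]
```

With a benign basis spanning the first two coordinates of a 3-dimensional space, an activation such as `[2.0, 3.0, 0.0]` passes through untouched, while `[1.0, 1.0, 2.0]` is shifted according to `W` applied to its out-of-subspace component. Real activations only approximately lie in a low-dimensional subspace, so a practical version would presumably need a rank cutoff when building the basis.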
Most visual counting benchmarks focus on rigid objects like crowds and vehicles, leaving fine-grained biological counting understudied. This paper introduces TPC–268, a dataset of 10,000 images spanning 268 countable plant categories across 242 species, annotated with full Linnaean taxonomies and biological organization levels. By framing plant counting as class-agnostic counting with taxonomic constraints, the authors provide a testbed for evaluating hierarchical generalization in vision models.
Training machine learning interatomic potentials (MLIPs) requires costly quantum mechanical calculations to label atomic configurations. This paper proposes using determinantal point processes (DPPs) to select diverse, informative subsets of configurations, mitigating the computational bottleneck while maintaining model accuracy. Experiments on hafnium oxide systems demonstrate that DPP-based subselection achieves competitive or superior performance compared to existing methods like k-means clustering and MaxVol, offering a probabilistic framework that naturally handles variable training set sizes.
This paper addresses mesa-optimization by defining agency as a balance between curiosity (KL divergence) and empowerment (mutual information), proposing an optimization-friendly agency function and an STEC-based metric to detect mesa-optimizers. The work claims that agency functions are convex, smooth, and exhibit logarithmic convergence—suggesting high probability of spontaneous emergence in modern models.
This paper investigates how interrogative stances function as markers of voice and power in French-language digital news. Analyzing over 1.2 million articles from 24 outlets (2023–2024) through a mixed-methods pipeline combining LLM pseudo-labeling and qualitative annotation, the authors operationalize pragmatic concepts like answerhood and dialogicity at scale. The study reveals that questions are sparse but structurally significant, predominantly serving framing functions rather than information-seeking, and centering elite actors over diffuse publics.
This paper tackles Chinese Mandarin visual speech recognition (VSR), where the tonal nature of the language and large vocabulary make lipreading more challenging than for non-tonal languages like English. Existing approaches use cascade architectures with intermediate representations like pinyin to bridge the gap, but this introduces error accumulation and increases inference latency. The core idea is a cascade-free multitask architecture that jointly learns phoneme and viseme representations during training, with on-demand activation during inference for efficiency-accuracy trade-offs. This matters because cascade-free designs could eliminate error propagation while maintaining the benefits of intermediate representations.
Prompt2Box addresses the limitation that vector embeddings of LLM prompts conflate topical similarity with specificity, making it difficult to distinguish whether a model fails at a broad topic or only at its most constrained variants. The core idea is to embed prompts into a box embedding space where the geometric volume encodes specificity—smaller boxes indicate more constraints—and containment represents entailment relations. This geometric re-framing enables more accurate hierarchical clustering and finer-grained weakness analysis across 17 different language models.
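The two geometric operations named above, volume as specificity and containment as entailment, can be made concrete with axis-aligned hard boxes. This is a minimal sketch under that assumption. Trained box models often use smoothed volumes and soft containment scores so that gradients exist, and the summary does not say which variant Prompt2Box uses. The `Box` class and the example prompts in the note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by per-dimension lower and upper corners."""

    lower: tuple[float, ...]
    upper: tuple[float, ...]

    def volume(self) -> float:
        # Product of side lengths; an inverted side counts as zero width.
        vol = 1.0
        for lo, hi in zip(self.lower, self.upper):
            vol *= max(hi - lo, 0.0)
        return vol

    def contains(self, other: "Box") -> bool:
        # True when `other` lies inside this box in every dimension.
        return all(
            a <= c and d <= b
            for a, b, c, d in zip(self.lower, self.upper, other.lower, other.upper)
        )
```

As an illustration, a broad prompt such as "write a sorting function" might map to `Box((0, 0), (4, 4))` and a constrained variant such as "write an in-place sorting function in Rust" to `Box((1, 1), (2, 3))`. The second box has the smaller volume and sits inside the first, which is the pattern the summary describes for specificity and entailment.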
Multi-focus image fusion (MFIF) combines source images from different focal planes into a single all-in-focus image. This paper targets a critical flaw in diffusion-based MFIF: defocus blur warps geometric structures, producing artifacts. The authors propose ReDiffuse, which embeds B-Conv (Fourier-series-based rotation-equivariant filters) into a U-Net diffusion backbone. By enforcing that rotations induce predictable feature transformations, the method aims to preserve edge orientation and structural consistency while reducing model size through parameter sharing.
Human annotation for subjective NLP tasks suffers from high inter-annotator disagreement. This paper introduces ReasonAlign, a protocol that exposes annotators to LLM-generated reasoning explanations (but not predicted labels) between two annotation passes. The goal is to test whether reasoning scaffolds improve annotation consistency without the anchoring bias typical of suggestion-based systems.
Whole Slide Images (WSIs) present a unique challenge for computational pathology due to their gigapixel scale and the scarcity of annotated data. This paper addresses few-shot weakly supervised WSI classification (FSWC) by proposing HIPSS, which combines parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) in the text encoder with a hierarchical textual guidance strategy for WSI representation learning. The core innovation replaces expensive cross-attention mechanisms with lightweight linear transformations $y = \gamma \cdot x + \beta$ while avoiding hard instance filtering through soft cosine-similarity-based attention refinement, achieving up to 13.8\% accuracy gains with 18.1\% fewer parameters than state-of-the-art methods.
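The linear transformation $y = \gamma \cdot x + \beta$ quoted above is small enough to write out directly. The sketch below applies one scale and one shift per feature channel to a batch of feature vectors, using plain Python lists. Where the transform sits inside the text encoder and how $\gamma$ and $\beta$ are trained are not covered by the summary and are not modeled here. The function name `scale_shift` is hypothetical.

```python
def scale_shift(
    features: list[list[float]],
    gamma: list[float],
    beta: list[float],
) -> list[list[float]]:
    """Apply y = gamma * x + beta channel-wise to each feature vector."""
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    for row in features:
        if len(row) != len(gamma):
            raise ValueError("feature width must match gamma and beta")
    return [
        [g * x + b for x, g, b in zip(row, gamma, beta)]
        for row in features
    ]
```

With `gamma` set to all ones and `beta` to all zeros the function is the identity, so the adapted layer starts out behaving like the frozen one. Each adapted layer adds only two parameters per channel, which is the source of the parameter savings relative to attention-based alternatives.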