Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
0
cs.CL Max Upravitelev, Veronika Solopova, Jing Yang et al. · Mar 23, 2026

Narrative similarity is inherently interpretive—different valid readings can yield divergent judgments, challenging benchmarks that encode single ground truths. This paper proposes embracing multiperspectivity by ensembling 31 LLM personas, ranging from literary critics to lay characters, to predict which of two stories is more similar to an anchor. The approach leverages Condorcet Jury Theorem-like dynamics to improve accuracy, achieving 0.705 on SemEval-2026 Task 4 while revealing that diverse practitioner perspectives yield better ensemble gains despite lower individual performance.

Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
0
cs.CV SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng et al. · Mar 23, 2026

daVinci-MagiHuman tackles joint audio-video generation using a refreshingly simple single-stream Transformer that processes text, video, and audio tokens through self-attention only---avoiding the cross-attention and fusion modules common in competing multi-stream architectures. The model achieves strong human-centric generation quality across six languages while delivering impressive inference speed: 2 seconds for a 5-second 256p video on an H100.

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
0
cs.ROcs.LG Dong Heon Cho, Boyuan Chen · Mar 23, 2026

Soft robot simulators suffer from a sim-to-real gap that widens when optimizing morphology, because calibration parameters identified on one geometry often fail to transfer to unseen shapes. This paper proposes Residual Acceleration Field Learning (RAFL), which learns local corrective accelerations defined on quadrature elements rather than global nodal forces. By operating on deformation and velocity gradients in material space, the model becomes independent of mesh topology and discretization, enabling zero-shot generalization across geometries.

Differentiable simulators enable gradient-based optimization of soft robots over material parameters, control, and morphology, but accurately modeling real systems remains challenging due to the sim-to-real gap. This issue becomes more pronounced when geometry is itself a design variable. System identification reduces discrepancies by fitting global material parameters to data; however, when constitutive models are misspecified or observations are sparse, identified parameters often absorb geometry-dependent effects rather than reflect intrinsic material behavior. More expressive constitutive models can improve accuracy but substantially increase computational cost, limiting practicality. We propose a residual acceleration field learning (RAFL) framework that augments a base simulator with a transferable, element-level corrective dynamics field. Operating on shared local features, the model is agnostic to global mesh topology and discretization. Trained end-to-end through a differentiable simulator using sparse marker observations, the learned residual generalizes across shapes. In both sim-to-sim and sim-to-real experiments, our method achieves consistent zero-shot improvements on unseen morphologies, while system identification frequently exhibits negative transfer. The framework also supports continual refinement, enabling simulation accuracy to accumulate during morphology optimization.
0
cs.CV Jingchen Sun, Shaobo Han, Deep Patel et al. · Mar 22, 2026

Beta-KD tackles the problem of balancing data supervision against teacher guidance when distilling multimodal large language models. The authors frame knowledge distillation as Bayesian MAP estimation with teacher-informed Gibbs priors over student activations, deriving a closed-form uncertainty-aware weighting mechanism via Laplace approximation. This eliminates manual tuning of loss weights and achieves consistent improvements across six VQA benchmarks.

Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher--student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.
0
cs.CV Dillan Imans, Phuoc-Nguyen Bui, Duc-Tai Le et al. · Mar 23, 2026

This paper addresses hypertension screening from inexpensive retinal fundus images by distilling knowledge from high-fidelity brain MRI—without requiring paired acquisitions from the same patients. The proposed Clinical Graph-Mediated Distillation (CGMD) constructs a clinical similarity graph using shared biomarkers (age, labs, etc.) to bridge disjoint MRI and fundus cohorts, propagates MRI teacher embeddings over the graph to impute patient-specific targets for fundus patients, and trains a fundus student with supervised, prior, and relational distillation losses. The approach aims to capture subtle vascular signals in fundus images by leveraging MRI-derived markers of small-vessel disease.

Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at https://github.com/DillanImans/CGMD-unpaired-distillation.
0
cs.CL Tae-Eun Song · Mar 23, 2026

This paper introduces Cross-Context Verification (CCV), a black-box method for detecting LLM benchmark contamination by solving the same coding problem $N$ times in isolated sessions and measuring solution diversity. The key insight is that memorized solutions are deterministic while genuine reasoning produces natural variation. The paper pairs this with Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that uses strict information restriction to prevent confirmation bias. As coding benchmarks face credibility crises from solution leakage, this work targets the urgent need to distinguish reasoning from recall in SWE-bench evaluations.

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p approx 0.012, r = 1.0). Key findings: (1) contamination is binary--models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result--100% sycophantic confirmation--providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
0
cs.CYcs.LG Amil Khanzada, Takuji Takemoto · Mar 23, 2026

This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework tackling 'under-vibrancy'—a condition of low visitor density suppressing economic activity—in declining regions like Fukui, Japan. Contrasting with overtourism literature, it integrates Google Business Profile search intent, Japan Meteorological Agency micro-climate data, edge-AI cameras, and 97,719 survey responses to forecast tourism flows and quantify economic leakage. The work promises algorithmic governance via 'dual-nudge' interventions to redirect visitors and coordinate merchant behavior, backed by claims of $R^2=0.810$ explanatory power.

Most research in urban informatics and tourism focuses on mitigating overtourism in dense global cities. However, for regions experiencing demographic decline and structural stagnation, the primary risk is "under-vibrancy", a condition where low visitor density suppresses economic activity and diminishes satisfaction. This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework previously validated in biological crisis management, and adapts it for regional economic flow optimization. Using high-granularity data from Japan's least-visited prefecture (Fukui), we utilize an AI-driven decision support system (DSS) to analyze two datasets: a raw Fukui spending database (90,350 records) and a regional standardized sentiment database (97,719 responses). The system achieves in-sample explanatory power of 81% (R^2 = 0.810) and out-of-sample predictive performance of 68% (R^2 = 0.683). We quantify an annual opportunity gap of 865,917 unrealized visits, equivalent to approximately 11.96 billion yen (USD 76.2 million) in lost revenue. We propose a dual-nudge governance architecture leveraging the DHDE to redistribute cross-prefectural flows and reduce economic leakage.
0
math.NAcs.LGcs.NA Suchuan Dong, Yuchuan Zhang · Mar 23, 2026

Physics-informed neural networks typically enforce boundary conditions via penalty terms, leading to approximate satisfaction and training pathologies. This paper proposes a systematic method to enforce Dirichlet, Neumann, and Robin conditions exactly on curved quadrilateral domains using Theory of Functional Connections (TFC) combined with transfinite interpolation. The key innovation is handling compatibility constraints at vertices where mixed boundary conditions meet, particularly when two Neumann/Robin boundaries intersect, by decomposing the problem into a four-step procedure.

We present a systematic method for exactly enforcing Dirichlet, Neumann, and Robin type conditions on general quadrilateral domains with arbitrary curved boundaries. Our method is built upon exact mappings between general quadrilateral domains and the standard domain, and employs a combination of TFC (theory of functional connections) constrained expressions and transfinite interpolations. When Neumann or Robin boundaries are present, especially when two Neumann (or Robin) boundaries meet at a vertex, it is critical to enforce exactly the induced compatibility constraints at the intersection, in order to enforce exactly the imposed conditions on the joining boundaries. We analyze in detail and present constructions for handling the imposed boundary conditions and the induced compatibility constraints for two types of situations: (i) when Neumann (or Robin) boundary only intersects with Dirichlet boundaries, and (ii) when two Neumann (or Robin) boundaries intersect with each other. We describe a four-step procedure to systematically formulate the general form of functions that exactly satisfy the imposed Dirichlet, Neumann, or Robin conditions on general quadrilateral domains. The method developed herein has been implemented together with the extreme learning machine (ELM) technique we have developed recently for scientific machine learning. Ample numerical experiments are presented with several linear/nonlinear stationary/dynamic problems on a variety of two-dimensional domains with complex boundary geometries. Simulation results demonstrate that the proposed method has enforced the Dirichlet, Neumann, and Robin conditions on curved domain boundaries exactly, with the numerical boundary-condition errors at the machine accuracy.
0
eess.IVcs.CV Amarnath R · Mar 23, 2026

HMS-VesselNet addresses the challenge of segmenting thin peripheral retinal vessels in fundus images—a critical task for early diabetic retinopathy detection where standard overlap losses fail due to class imbalance and topological fragmentation. The paper proposes a four-scale hierarchical Attention U-Net architecture with learned fusion weights, combining Dice, binary cross-entropy, and centerline Dice ($\text{clDice}$) losses alongside hard example mining to boost sensitivity on sub-2-pixel vessels. Evaluated on 68 images from DRIVE, STARE, and CHASE_DB1 via 5-fold cross-validation and leave-one-dataset-out protocols, the model achieves $90.78\pm1.42\%$ Sensitivity, demonstrating that explicit topology preservation and targeted hard example oversampling can recover fine vascular structures missed by standard area-based losses.

Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset. The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.
0
cs.LG Yuehu Gong, Zeyuan Wang, Yulin Chen et al. · Mar 23, 2026

Generative policies represent actions as multi-step denoising trajectories, rendering standard PPO's single-step action-space ratios mismatched to the policy structure. This paper proposes GSB-PPO, a path-space formulation inspired by Generalized Schrödinger Bridge that lifts proximal updates from terminal actions to full generation paths. The central finding is that a penalty-based objective substantially outperforms the direct clipping extension, establishing trajectory-level regularization as the preferred inductive bias for on-policy generative RL.

On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schr\"odinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
0
cs.CV Xingyu Zhu, Beier Zhu, Shuo Wang et al. · Mar 23, 2026

Vision-Language Models face escalating safety risks from adversarial jailbreak attacks that bypass alignment via manipulated visual inputs. This paper introduces NullSteer, a training-free defense that applies activation steering constrained to the null space of benign representations—mathematically guaranteeing that safe inputs remain unchanged while harmful activations are redirected toward refusal semantics. The approach aims to solve the over-refusal problem plaguing existing steering methods, offering a principled trade-off between robust safety and preserved utility.

As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
0
cs.CV Jinyu Xu, Tianqi Hu, Xiaonan Hu et al. · Mar 22, 2026

Most visual counting benchmarks focus on rigid objects like crowds and vehicles, leaving fine-grained biological counting understudied. This paper introduces TPC–268, a dataset of 10,000 images spanning 268 countable plant categories across 242 species, annotated with full Linnaean taxonomies and biological organization levels. By framing plant counting as class-agnostic counting with taxonomic constraints, the authors provide a testbed for evaluating hierarchical generalization in vision models.

Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, the fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.
0
stat.APcs.LG Joanna Zou, Youssef Marzouk · Mar 23, 2026

Training machine learning interatomic potentials (MLIPs) requires costly quantum mechanical calculations to label atomic configurations. This paper proposes using determinantal point processes (DPPs) to select diverse, informative subsets of configurations, mitigating the computational bottleneck while maintaining model accuracy. Experiments on hafnium oxide systems demonstrate that DPP-based subselection achieves competitive or superior performance compared to existing methods like k-means clustering and MaxVol, offering a probabilistic framework that naturally handles variable training set sizes.

The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.
0
cs.LG Eduard Kapelko · Mar 22, 2026

This paper addresses mesa-optimization by defining agency as a balance between curiosity (KL divergence) and empowerment (mutual information), proposing an optimization-friendly agency function and an STEC-based metric to detect mesa-optimizers. The work claims that agency functions are convex, smooth, and exhibit logarithmic convergence—suggesting high probability of spontaneous emergence in modern models.

This paper addresses the critical challenge of mesa-optimization in AI safety by providing a formal definition of agency and a framework for its analysis. Agency is conceptualized as a Continuous Representation of accumulated experience that achieves autopoiesis through a dynamic balance between curiosity (minimizing prediction error to ensure non-computability and novelty) and empowerment (maximizing the control channel's information capacity to ensure subjectivity and goal-directedness). Empirical evidence suggests that this active inference-based model successfully accounts for classical instrumental goals, such as self-preservation and resource acquisition. The analysis demonstrates that the proposed agency function is smooth and convex, possessing favorable properties for optimization. While agentic functions occupy a vanishingly small fraction of the total abstract function space, they exhibit logarithmic convergence in sparse environments. This suggests a high probability for the spontaneous emergence of agency during the training of modern, large-scale models. To quantify the degree of agency, the paper introduces a metric based on the distance between the behavioral equivalents of a given system and an "ideal" agentic function within the space of canonicalized rewards (STARC). This formalization provides a concrete apparatus for classifying and detecting mesa-optimizers by measuring their proximity to an ideal agentic objective, offering a robust tool for analyzing and identifying undesirable inner optimization in complex AI systems.
0
cs.CLcs.CY Bros Victor, Barbini Matilde, Gerard Patrick et al. · Mar 23, 2026

This paper investigates how interrogative stances function as markers of voice and power in French-language digital news. Analyzing over 1.2 million articles from 24 outlets (2023–2024) through a mixed-methods pipeline combining LLM pseudo-labeling and qualitative annotation, the authors operationalize pragmatic concepts like answerhood and dialogicity at scale. The study reveals that questions are sparse but structurally significant, predominantly serving framing functions rather than information-seeking, and centering elite actors over diffuse publics.

Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the "Politics of Questions" in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist's narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.
0
cs.CV Lei Yang, Yi He, Fei Wu et al. · Mar 23, 2026

This paper tackles Chinese Mandarin visual speech recognition (VSR),where the tonal nature of the language and large vocabulary make lipreading more challenging than for non-tonal languages like English. Existing approaches use cascade architectures with intermediate representations like pinyin to bridge the gap,but this introduces error accumulation and increases inference latency. The core idea is a cascade-free multitask architecture that jointly learns phoneme and viseme representations during training, with on-demand activation during inference for efficiency-accuracy trade-offs. This matters because cascade-free designs could eliminate error propagation while maintaining the benefits of intermediate representations.

Chinese mandarin visual speech recognition (VSR) is a task that has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs, the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.
0
cs.CL Neeladri Bhuiya, Shib Sankar Dasgupta, Andrew McCallum et al. · Mar 22, 2026

Prompt2Box addresses the limitation that vector embeddings of LLM prompts conflate topical similarity with specificity, making it difficult to distinguish whether a model fails at a broad topic or only at its most constrained variants. The core idea is to embed prompts into a box embedding space where the geometric volume encodes specificity—smaller boxes indicate more constraints—and containment represents entailment relations. This geometric re-framing enables more accurate hierarchical clustering and finer-grained weakness analysis across 17 different language models.

To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., "writing an adventure story" is more specific than "writing a story"). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9\% more LLM weaknesses than vector baselines and achieves an approximately 33\% stronger correlation between hierarchical depth and instruction specificity.
0
cs.CV Bo Li, Tingting Bao, Lingling Zhang et al. · Mar 22, 2026

Multi-focus image fusion (MFIF) combines source images from different focal planes into a single all-in-focus image. This paper targets a critical flaw in diffusion-based MFIF: defocus blur warps geometric structures, producing artifacts. The authors propose ReDiffuse, which embeds B-Conv (Fourier-series-based rotation-equivariant filters) into a U-Net diffusion backbone. By enforcing that rotations induce predictable feature transformations, the method aims to preserve edge orientation and structural consistency while reducing model size through parameter sharing.

Diffusion models have achieved impressive performance on multi-focus image fusion (MFIF). However, a key challenge in applying diffusion models to the ill-posed MFIF problem is that defocus blur can make common symmetric geometric structures (e.g., textures and edges) appear warped and deformed, often leading to unexpected artifacts in the fused images. Therefore, embedding rotation equivariance into diffusion networks is essential, as it enables the fusion results to faithfully preserve the original orientation and structural consistency of geometric patterns underlying the input images. Motivated by this, we propose ReDiffuse, a rotation-equivariant diffusion model for MFIF. Specifically, we carefully construct the basic diffusion architectures to achieve end-to-end rotation equivariance. We also provide a rigorous theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the validity of embedding equivariance structures. ReDiffuse is comprehensively evaluated against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, and Road-MF). Results demonstrate that ReDiffuse achieves competitive performance, with improvements of 0.28-6.64\% across six evaluation metrics. The code is available at https://github.com/MorvanLi/ReDiffuse.
0
cs.CL Smitha Muthya Sudheendra, Jaideep Srivastava · Mar 22, 2026

Human annotation for subjective NLP tasks suffers from high inter-annotator disagreement. This paper introduces ReasonAlign, a protocol that exposes annotators to LLM-generated reasoning explanations (but not predicted labels) between two annotation passes. The goal is to test whether reasoning scaffolds improve annotation consistency without the anchoring bias typical of suggestion-based systems.

Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
0
cs.CV Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge · Mar 23, 2026

Whole Slide Images (WSIs) present a unique challenge for computational pathology due to their gigapixel scale and the scarcity of annotated data. This paper addresses few-shot weakly supervised WSI classification (FSWC) by proposing HIPSS, which combines parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) in the text encoder with a hierarchical textual guidance strategy for WSI representation learning. The core innovation replaces expensive cross-attention mechanisms with lightweight linear transformations $y = \gamma \cdot x + \beta$ while avoiding hard instance filtering through soft cosine-similarity-based attention refinement, achieving up to 13.8\% accuracy gains with 18.1\% fewer parameters than state-of-the-art methods.

Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter efficient prompt tuning method by scaling and shifting features in text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements up-to 10.9%, 7.8%, and 13.8% respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.