Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
11 papers in cs.IR
cs.IR, cs.CV · Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang et al. · Mar 23, 2026

This paper addresses interactive text-to-image retrieval (I-TIR) where diffusion models generate visual proxies from dialogue, but static additive fusion of text and generated images introduces harmful noise. The core idea is ADaFuSE, a lightweight plug-in module combining adaptive gating (to dynamically weight modalities per instance) with a semantic-aware mixture-of-experts branch (to capture fine-grained cross-modal cues). The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.

Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. In a thorough evaluation on four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
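To make the contrast between static addition and per-query gating concrete, here is a minimal sketch in plain Python. It is not ADaFuSE itself: the scalar gate, the `gate_weights` vector, and `gate_bias` are hypothetical stand-ins for learned parameters, and the mixture-of-experts branch is left out entirely.

```python
import math

def additive_fusion(text_vec, image_vec):
    """Static fusion: element-wise sum, the baseline the abstract criticizes."""
    return [t + v for t, v in zip(text_vec, image_vec)]

def gated_fusion(text_vec, image_vec, gate_weights, gate_bias=0.0):
    """Per-query gate g in (0, 1) computed from both views.

    gate_weights is a hypothetical learned vector over the concatenated
    [text_vec, image_vec] input; it is an assumption, not the paper's design.
    """
    concat = list(text_vec) + list(image_vec)
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    # g near 1 trusts the text view; g near 0 trusts the generated image.
    fused = [g * t + (1.0 - g) * v for t, v in zip(text_vec, image_vec)]
    return fused, g

text = [1.0, 0.0]
noisy_image = [0.0, 1.0]
fused, g = gated_fusion(text, noisy_image, [0.0, 0.0, 0.0, 0.0], gate_bias=10.0)
```

With a strongly positive gate logit, the fused vector stays close to the text embedding, which is the behavior one would want when the generated image is mostly noise. Additive fusion has no such escape hatch.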
cs.CL, cs.IR · Hang Gao, Dimitris N. Metaxas · Mar 22, 2026

This paper identifies "semantic shift"—the intrinsic evolution of meaning within a text—as the root cause of embedding pathologies like anisotropy and length-induced collapse. The authors argue that pooling-based aggregation forces "semantic smoothing," where diverse sentences compromise into a diluted representation. They formalize semantic shift as the product of local evolution and global dispersion ($\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$), showing through controlled concatenation experiments that it predicts embedding concentration and retrieval degradation better than text length alone. The work reframes geometric pathologies not as inherent model defects but as consequences of content structure interacting with pooling mechanics.

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
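A rough sketch of how a product-form shift score could be computed over sentence embeddings. The feed only gives the product structure $\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$; the concrete definitions below (mean cosine distance between consecutive sentences for the local term, mean cosine distance to the mean-pooled centroid for the dispersion term) are assumptions made for illustration and may differ from the paper's.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def mean_pool(vectors):
    """Average the sentence embeddings dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_evolution(vectors):
    """Assumed Local(k): mean cosine distance between consecutive sentences."""
    if len(vectors) < 2:
        return 0.0
    steps = [1.0 - cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(steps) / len(steps)

def global_dispersion(vectors):
    """Assumed Disp(k): mean cosine distance from each sentence to the centroid."""
    centroid = mean_pool(vectors)
    return sum(1.0 - cosine(v, centroid) for v in vectors) / len(vectors)

def semantic_shift(vectors):
    """Product form from the summary: Shift = Local * Disp."""
    return local_evolution(vectors) * global_dispersion(vectors)

on_topic = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
drifting = [[1.0, 0.0], [0.0, 1.0]]
```

In this toy setup, `on_topic` scores zero while `drifting` scores above zero, and the pooled vector of `drifting` sits away from both of its sentences, which is the smoothing effect the abstract describes.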
cs.IR, cs.CL · Simon Lupart, Maxime Louis, Thibault Formal et al. · Mar 23, 2026

Code retrieval currently relies on dense embeddings, but this paper proposes SPLADE-Code, the first large-scale learned sparse retrieval (LSR) family for code search (600M–8B parameters). The authors address unique challenges including subword fragmentation, semantic gaps between natural language and code, and latency issues from long code documents. Their lightweight single-stage training achieves 75.4 nDCG@10 on MTEB Code under 1B parameters (state-of-the-art for that size) and 79.0 with 8B parameters, while enabling sub-millisecond retrieval via inverted indices.

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
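For readers unfamiliar with why sparse representations allow fast lookup, the following is a small sketch of inverted-index scoring over term-weight dictionaries. The weights, token names, and document IDs are invented for illustration; in a learned sparse retriever they would come from the model, including expansion tokens that do not appear in the raw query.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each token to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        for token, weight in weights.items():
            if weight > 0:
                index[token].append((doc_id, weight))
    return index

def search(index, query_weights, top_k=3):
    """Score documents by the dot product of sparse query and document vectors.

    Only posting lists for query tokens are touched, so cost scales with
    query sparsity rather than collection size.
    """
    scores = defaultdict(float)
    for token, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical sparse vectors for two code snippets.
docs = {
    "read_file": {"open": 1.2, "file": 1.5, "read": 1.0},
    "sort_list": {"sort": 1.4, "list": 1.1},
}
# Query "load a file": "read" and "open" play the role of expansion tokens.
query = {"file": 1.0, "read": 0.6, "open": 0.4}
results = search(build_inverted_index(docs), query)
```

Here the expansion tokens add to the score of `read_file` even though the raw query never says "read" or "open", which is the lexical-to-semantic bridging the abstract attributes to learned expansion.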
cs.CL, cs.IR · Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan · Mar 22, 2026

This paper addresses cross-lingual knowledge graph fusion, where heterogeneous KGs in different languages must be unified without expensive manually-curated seed alignments. The core idea is to use Large Language Models as a universal semantic bridge by linearizing graph triplets into natural language sequences and sequentially agglomerating multiple graphs. This matters because it promises zero-shot alignment capability for low-resource languages where traditional embedding-based methods fail due to lack of training data.

Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
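The linearize-then-agglomerate loop can be sketched as follows. Everything here is illustrative: the prompt wording is hypothetical, the space-separated rendering of a triplet is one reading of "[head] [relation] [tail]", and `union_merge` is a trivial stand-in for the LLM call, which in the paper would align entities and relations rather than just concatenate.

```python
def linearize(triplets):
    """Render (head, relation, tail) triplets as plain-text lines."""
    return [f"{h} {r} {t}" for h, r, t in triplets]

def build_fusion_prompt(fused, candidate):
    """Assemble a prompt pairing the running fused graph with a new graph.

    The instruction text is a hypothetical example, not the paper's prompt.
    """
    return (
        "Fused graph so far:\n" + "\n".join(linearize(fused))
        + "\n\nNew candidate graph:\n" + "\n".join(linearize(candidate))
        + "\n\nList entities and relations that refer to the same thing."
    )

def agglomerate(graphs, merge_fn):
    """Fold graphs one at a time into a running fused graph."""
    fused = list(graphs[0])
    for candidate in graphs[1:]:
        fused = merge_fn(fused, candidate)
    return fused

def union_merge(fused, candidate):
    """Stand-in for the LLM step: keeps distinct triplets, aligns nothing."""
    return fused + [t for t in candidate if t not in fused]

english = [("Paris", "capital of", "France")]
french = [("Paris", "capitale de", "France")]
fused_graph = agglomerate([english, french], union_merge)
prompt = build_fusion_prompt(english, french)
```

Swapping `union_merge` for a function that sends `build_fusion_prompt(...)` to a model and parses its answer would give the sequential fusion the abstract describes; how that parsing is done is not specified in the feed.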
cs.IR, cs.LG · Ounnaci Iddir, Ahmed-ouamer Rachid, Tai Dinh · Mar 22, 2026

This paper addresses personalized information retrieval for XML documents by representing users, queries, and documents as weighted concept vectors derived from a domain ontology. The core idea is a hierarchical weighting scheme that favors specific (deeper) ontology concepts combined with a dynamic profile update mechanism that reinforces concepts based on user interactions. The work targets the limitation of traditional keyword-based systems that return identical results regardless of user knowledge or preferences.

This paper addresses the challenge of improving information retrieval from semi-structured eXtensible Markup Language (XML) documents. Traditional information retrieval systems (IRS) often overlook user-specific needs and return identical results for the same query, despite differences in users' knowledge, preferences, and objectives. We integrate external semantic resources, namely a domain ontology and user profiles, into the retrieval process. Documents, queries, and user profiles are represented as vectors of weighted concepts. The ontology applies a concept-weighting mechanism that emphasizes highly specific concepts, as lower-level nodes in the hierarchy provide more precise and targeted information. Relevance is assessed using semantic similarity measures that capture conceptual relationships beyond keyword matching, enabling personalized and fine-grained matching among user profiles, queries, and documents. Experimental results show that combining ontologies with user profiles improves retrieval effectiveness, achieving higher precision and recall than keyword-based approaches. Overall, the proposed framework enhances the relevance and adaptability of XML search results, supporting more user-centered retrieval.
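A small sketch of the ingredients named above: depth-weighted concept vectors, a similarity measure, and interaction-driven profile reinforcement. The specific weighting (count scaled by depth over maximum depth), the use of cosine similarity, and the additive update with a `rate` parameter are assumptions chosen for simplicity; the abstract does not give the exact formulas.

```python
import math

def concept_vector(concept_counts, depth, max_depth):
    """Weight each concept by its count, scaled up for deeper (more specific) nodes."""
    return {c: n * depth[c] / max_depth for c, n in concept_counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def reinforce(profile, doc_vec, rate=0.1):
    """Return a new profile nudged toward the concepts of a clicked document."""
    updated = dict(profile)
    for concept, weight in doc_vec.items():
        updated[concept] = updated.get(concept, 0.0) + rate * weight
    return updated

# Hypothetical three-level ontology: vehicle > car > sedan.
depth = {"vehicle": 1, "car": 2, "sedan": 3}
doc = concept_vector({"vehicle": 1, "sedan": 1}, depth, max_depth=3)
profile = {"car": 1.0}
profile_after_click = reinforce(profile, doc)
```

After the simulated click, the profile's similarity to the document rises from zero, so the same query would rank that document differently for this user than for one with an untouched profile.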
cs.IR, cs.AI, cs.GT · Yanchen Jiang, Zhe Feng, Christopher P. Mah et al. · Mar 23, 2026

Generative recommender systems like TIGER excel at semantic retrieval but ignore the economic realities of monetization via sponsored content. This paper proposes GEM-Rec, a unified framework that augments semantic IDs with control tokens (<ORG>, <AD>) to factorize slot allocation from item generation, and introduces Bid-Aware Decoding to inject real-time auction bids into inference. The work bridges the gap between generative recommendation and computational advertising, offering theoretical guarantees like allocative monotonicity while allowing dynamic trade-offs between user relevance and platform revenue.

Generative recommender systems using semantic IDs, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted and competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad's likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.
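One simple way to picture bid injection at inference time is an additive bonus on item logits before the softmax. The sketch below uses a bonus of `lam * log(1 + bid)`; that functional form and the `lam` trade-off knob are assumptions for illustration, not the paper's decoding rule, and the control tokens are not modelled. It does show why monotonicity can hold without retraining: raising one item's bid raises only that item's adjusted score.

```python
import math

def bid_aware_probs(logits, bids, lam=1.0):
    """Softmax over model logits plus a bid bonus for items that carry a bid.

    logits: item -> model score; bids: item -> non-negative bid.
    lam controls the relevance/revenue trade-off (hypothetical parameter).
    """
    adjusted = {
        item: score + lam * math.log1p(bids.get(item, 0.0))
        for item, score in logits.items()
    }
    top = max(adjusted.values())  # subtract the max for numerical stability
    exps = {item: math.exp(score - top) for item, score in adjusted.items()}
    total = sum(exps.values())
    return {item: e / total for item, e in exps.items()}

logits = {"ad_a": 0.0, "ad_b": 0.0, "organic": 1.0}
low = bid_aware_probs(logits, {"ad_a": 1.0})
high = bid_aware_probs(logits, {"ad_a": 5.0})
```

Comparing `low["ad_a"]` with `high["ad_a"]` shows the probability of the ad rising with its bid while the model weights stay fixed. Setting `lam=0` recovers purely relevance-driven decoding.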
cs.IR, cs.AI · Tianyi Li, Zixuan Wang, Guidong Lei et al. · Mar 23, 2026

AgenticRec attacks a key gap in LLM-based recommenders: existing agents rely on frozen reasoning chains and cannot learn from ranking feedback to refine tool use. The paper proposes a two-stage training framework that combines ReAct-style tool invocation with list-wise Group Relative Policy Optimization (GRPO) and Progressive Preference Refinement (PPR) for hard-negative mining. The work matters because it demonstrates that end-to-end reinforcement learning can align multi-step tool use with ranking objectives, moving beyond prompt-engineered agent workflows.

Recommender agents built on Large Language Models offer a promising paradigm for recommendation. However, existing recommender agents typically suffer from a disconnect between intermediate reasoning and final ranking feedback, and are unable to capture fine-grained preferences. To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under sparse implicit feedback. Our approach makes three key contributions. First, we design a suite of recommendation-specific tools integrated into a ReAct loop to support evidence-grounded reasoning. Second, we propose theoretically unbiased List-Wise Group Relative Policy Optimization (list-wise GRPO) to maximize ranking utility, ensuring accurate credit assignment for complex tool-use trajectories. Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities. By mining hard negatives from ranking violations and applying bidirectional preference alignment, PPR minimizes the convex upper bound of pairwise ranking errors. Experiments on benchmarks confirm that AgenticRec significantly outperforms baselines, validating the necessity of unifying reasoning, tool use, and ranking optimization.
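The group-relative part of GRPO can be illustrated with a few lines: sample several trajectories for the same user, score each final ranking, and normalize rewards within the group. The sketch uses NDCG as the list-wise reward, which is an assumption; the paper's exact reward and its unbiasedness correction are not described in the feed.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled rankings for one user; 1 marks the clicked item.
sampled_rankings = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rewards = [ndcg(r) for r in sampled_rankings]
advantages = group_relative_advantages(rewards)
```

The trajectory that placed the clicked item first gets a positive advantage and the one that placed it last gets a negative one, so every reasoning and tool-use step in a trajectory is credited according to how its final list ranked relative to its siblings.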
cs.IR, cs.AI, cs.IT · Yashar Talebirad, Ali Parsaee, Csongor Y. Szepesvari et al. · Mar 23, 2026

This paper tackles the lack of shared formalism for comparing hierarchical memory systems in language agents. It proposes a unifying theory based on three operators: extraction (α) that maps raw data to atomic units, coarsening (C = (π, ρ)) that partitions and summarizes units, and traversal (τ) that selects content under a token budget. The core insight is the self-sufficiency spectrum of representatives ρ, which constrains viable retrieval strategies—an observation the authors call the coarsening-traversal (C–T) coupling.

Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction ($\alpha$) maps raw data to atomic information units; coarsening ($C = (\pi, \rho)$) partitions units and assigns a representative to each group; and traversal ($\tau$) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function $\rho$ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality.
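The three operators can be mocked up in a few functions to see how they compose. All concrete choices below are hypothetical toy instantiations: sentence splitting for $\alpha$, fixed-size groups with the first member as representative for $C = (\pi, \rho)$, and word-overlap ranking with a word-count budget for $\tau$.

```python
def extract(raw_text):
    """alpha: split raw text into atomic units (here, sentences)."""
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def coarsen(units, group_size=2):
    """C = (pi, rho): partition into fixed-size groups; representative = first unit."""
    groups = [units[i:i + group_size] for i in range(0, len(units), group_size)]
    return [{"representative": g[0], "members": g} for g in groups]

def overlap(query, text):
    """Number of lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def traverse(groups, query, budget):
    """tau: rank groups by representative, then add member units within budget."""
    ranked = sorted(
        groups,
        key=lambda g: overlap(query, g["representative"]),
        reverse=True,
    )
    selected, used = [], 0
    for group in ranked:
        for unit in group["members"]:
            cost = len(unit.split())  # word count as a crude token cost
            if used + cost <= budget:
                selected.append(unit)
                used += cost
    return selected

raw = ("The cache stores embeddings. The cache evicts old entries. "
       "Billing runs monthly. Invoices are emailed.")
context = traverse(coarsen(extract(raw)), "cache eviction policy", budget=9)
```

Because the representative here is just the first member and does not summarize its group, traversal has to descend to the members rather than return the representative itself. That dependence of the retrieval strategy on how self-sufficient $\rho$ is gives a toy instance of the coarsening-traversal coupling.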
cs.IR, cs.AI · Yubo Li, Ramayya Krishnan, Rema Padman · Mar 23, 2026

This paper introduces a scalable framework to measure institutional variation in solid-organ transplant patient education materials using retrieval-augmented generation (RAG). The authors ground 1,115 patient questions across 102 handbooks from 23 U.S. centers, then classify answer pairs into a five-label taxonomy (Absent, Consistent, Complementary, Divergent, Contradictory). The work exposes critical information gaps: 96.2% of question-handbook pairs miss relevant content, and 20.8% of non-absent pairs show clinically meaningful divergence, with reproductive health nearly absent (95.1%) across all materials.

Patient education materials for solid-organ transplantation vary substantially across U.S. centers, yet no systematic method exists to quantify this heterogeneity at scale. We introduce a framework that grounds the same patient questions in different centers' handbooks using retrieval-augmented language models and compares the resulting answers using a five-label consistency taxonomy. Applied to 102 handbooks from 23 centers and 1,115 benchmark questions, the framework quantifies heterogeneity across four dimensions: question, topic, organ, and center. We find that 20.8% of non-absent pairwise comparisons exhibit clinically meaningful divergence, concentrated in condition monitoring and lifestyle topics. Coverage gaps are even more prominent: 96.2% of question-handbook pairs miss relevant content, with reproductive health at 95.1% absence. Center-level divergence profiles are stable and interpretable, where heterogeneity reflects systematic institutional differences, likely due to patient diversity. These findings expose an information gap in transplant patient education materials, with document-grounded medical question answering highlighting opportunities for content improvement.
cs.IR, cs.AI · Xiaozhe Li, Tianyi Lyu, Siyi Yang et al. · Mar 22, 2026

Understanding collective human intent from noisy, conflicting public discourse represents a frontier AI challenge that extends beyond individual instruction-following. This paper introduces COIN-Bench, a live-updating benchmark comprising 200k+ real consumer discussions across 1,400+ products, which operationalizes an Active Probing Paradigm requiring LLMs to act as meta-analysts and reconstruct chaotic feedback into structured questionnaires. The work matters because it shifts evaluation from transactional action prediction to hierarchical consensus synthesis, testing whether models can resolve contradictions and infer latent trends from swarm-like intelligence.

Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill Collective Intent (the process of extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discussions) remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, real-world, live-updating benchmark specifically designed to evaluate LLMs on collective intent understanding within the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement a robust evaluation pipeline that combines a rule-based method with an LLM-as-the-Judge approach. This framework incorporates COIN-TREE for hierarchical cognitive structuring and retrieval-augmented verification (COIN-RAG) to ensure expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions (depth, breadth, informativeness, and correctness) reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH establishes a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.
cs.IR, cs.AI · Aarush Sinha, Rahul Seetharaman, Aman Bansal · Mar 22, 2026

This paper introduces ECI (Effective Contrastive Information), a training-free metric for evaluating hard-negative mining strategies in dense retrieval. The core idea is to leverage the logarithmic InfoNCE bound on mutual information combined with a harmonic mean of signal (hardness) and safety (margin) to predict downstream retrieval quality without expensive fine-tuning. The proposed metric addresses a real pain point in retrieval research: practitioners currently must run end-to-end ablation studies to evaluate negative sampling strategies, which is computationally wasteful.

Hard negatives play a critical role in training and fine-tuning dense retrieval models, as they are semantically similar to positive documents yet non-relevant, and correctly distinguishing them is essential for improving retrieval accuracy. However, identifying effective hard negatives typically requires extensive ablation studies involving repeated fine-tuning with different negative sampling strategies and hyperparameters, resulting in substantial computational cost. In this paper, we introduce ECI (Effective Contrastive Information), a theoretically grounded metric based on information theory and information retrieval principles that enables practitioners to assess the quality of hard negatives prior to model fine-tuning. ECI evaluates negatives by optimizing the trade-off between Information Capacity (the logarithmic bound on mutual information determined by set size) and Discriminative Efficiency, a harmonic balance of Signal Magnitude (Hardness) and Safety (Max-Margin). Unlike heuristic approaches, ECI strictly penalizes unsafe, false-positive negatives prevalent in generative methods. We evaluate ECI across hard-negative sets mined or generated using BM25, cross-encoders, and large language models. Our results demonstrate that ECI accurately predicts downstream retrieval performance, identifying that hybrid strategies (BM25+Cross-Encoder) offer the optimal balance of volume and reliability, significantly reducing the need for costly end-to-end ablation studies.
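To give a feel for how such a metric might behave, here is a speculative sketch that multiplies a log set-size term by the harmonic mean of hardness and a margin-based safety term. The abstract names these ingredients but not how they are combined or measured, so the product form, `log(1 + K)` as capacity, mean negative similarity as hardness, and the clipped positive-minus-hardest-negative margin as safety are all assumptions. Scores are assumed to be similarities in [0, 1].

```python
import math

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0.0 if both are zero."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def eci_sketch(pos_score, neg_scores):
    """Illustrative score for one query's negative set (not the paper's formula).

    pos_score: similarity of the query to its positive document.
    neg_scores: similarities of the query to each mined negative.
    """
    capacity = math.log(1 + len(neg_scores))               # grows with set size
    hardness = max(0.0, sum(neg_scores) / len(neg_scores))  # how close negatives sit
    safety = max(0.0, pos_score - max(neg_scores))          # margin to hardest negative
    return capacity * harmonic_mean(hardness, safety)

balanced = eci_sketch(0.9, [0.60, 0.50, 0.55])    # hard but safely below the positive
too_easy = eci_sketch(0.9, [0.05, 0.05, 0.05])    # safe but uninformative
unsafe = eci_sketch(0.9, [0.95, 0.50, 0.55])      # one negative outscores the positive
```

Under these assumptions the balanced set scores highest, the easy set scores low, and the set containing a likely false negative scores zero, because the harmonic mean collapses whenever either hardness or safety does.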