COINBench: Moving Beyond Individual Perspectives to Collective Intent Understanding
Understanding collective human intent from noisy, conflicting public discourse represents a frontier AI challenge that extends beyond individual instruction-following. This paper introduces COIN-Bench, a live-updating benchmark comprising 200k+ real consumer discussions across 1,400+ products, which operationalizes an Active Probing Paradigm requiring LLMs to act as meta-analysts and reconstruct chaotic feedback into structured questionnaires. The work matters because it shifts evaluation from transactional action prediction to hierarchical consensus synthesis, testing whether models can resolve contradictions and infer latent trends from swarm-like intelligence.
The paper makes a substantive contribution by formalizing collective intent understanding as hierarchical cognitive stratification rather than surface-level aggregation. The COIN-Tree taxonomy and active probing framework offer genuine methodological innovation for evaluating depth of understanding. However, the evaluation pipeline exhibits circularity: GPT-4o is used to construct the benchmark (branch extraction, node aggregation in Appendix A.4), verify correctness (COIN-RAG), and serve as the judge for expert comparisons, potentially biasing evaluation toward GPT-like architectures and undermining claims of objective "expert-level precision."
The five-level COIN-Tree taxonomy (L1 Usage Scenarios to L5 Predictive Tendency) provides a rigorous framework for distinguishing surface observation from deep causal reasoning. The empirical finding of a "depth cliff", in which all models, including GPT-o3, collapse to near-zero scores at L5 (0.07) while maintaining reasonable L1 performance (16.17), is robustly supported by Table 2 and validates the hierarchical design. The ablation study (Table 3) shows that structured COIN-Tree input improves small models (Qwen2.5-7B, +7.09 Correctness) but degrades SOTA models (GPT-5.2, -29.31 Correctness), a nuanced result suggesting that expert-level understanding requires navigating raw noise rather than relying on pre-digested structure.
The benchmark construction pipeline relies entirely on GPT-4o for ground-truth generation (extracting semantic branches, merging nodes, weighting by engagement), creating a confound when evaluating competing architectures against GPT-4o-derived "truth." The claim of evaluating "expert-level" synthesis is undermined by minimal expert validation: only 30 expert questionnaires are analyzed (Table 4), and these show GPT-5.2 trailing human experts by substantial margins across all depth levels. The "live-update" mechanism and contamination prevention strategies are asserted but technically underspecified; the paper states it "ensures real-time updates" without detailing the refresh rate, versioning, or deduplication protocols. Additionally, the Correctness metric $\text{Correctness} = \frac{\text{Aligned Inferences}}{\text{Total Inferences}}$ relies on retriever quality, yet the retrieval hyperparameters (chunk size, top-$k$) are omitted.
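To make the Correctness ratio concrete, here is a minimal sketch of how it could be computed once each inference has a verdict. The `Inference` class, its `aligned` flag, and the sample statements are hypothetical illustrations, not the paper's implementation; in the paper the verdict would come from the retrieval-augmented verification step, which is why retriever settings matter.

```python
from dataclasses import dataclass


@dataclass
class Inference:
    """One model-generated claim plus a verification verdict (hypothetical structure)."""
    text: str
    aligned: bool  # assumed to be produced by an upstream verification step


def correctness(inferences):
    """Return Aligned Inferences / Total Inferences, or 0.0 for an empty list."""
    if not inferences:
        return 0.0
    aligned = sum(1 for inf in inferences if inf.aligned)
    return aligned / len(inferences)


# Invented sample data, for illustration only.
sample = [
    Inference("Users want longer battery life", True),
    Inference("Most buyers are students", False),
    Inference("Noise cancellation drives purchases", True),
    Inference("Price is the main complaint", True),
]
score = correctness(sample)
print(f"Correctness = {score:.2f}")
```

Because each `aligned` flag depends on which evidence was retrieved, a different chunk size or top-$k$ could flip individual verdicts and shift the ratio.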
The evidence strongly supports the central claim that current LLMs struggle with deep collective intent synthesis, with proprietary models achieving only 0-1.95% on L5 (Future Tendencies) compared to 16-45% on L1 (Usage Scenarios). Comparisons to related work appropriately distinguish COIN-Bench from SocialIQA (individual social reasoning) and Shop-R1 (transactional action prediction) by emphasizing multi-source consensus extraction. However, the analysis conflates "reasoning models" (e.g., GPT-o3) with "general models" without controlling for scale or training data overlap, and the Informativeness metrics (TTR $= \frac{\text{Count(unique token)}}{\text{Count(tokens)}}$ and Distinct-$n$) may conflate lexical diversity with analytical quality, as verbosity could artificially inflate scores.
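The two Informativeness formulas can be sketched in a few lines. This is an illustration under stated assumptions: whitespace tokenization is a simplification, and Distinct-$n$ is taken here as unique $n$-grams over total $n$-grams, since the review does not give the paper's exact normalization. The example sentences are invented.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def ttr(tokens):
    """Type-token ratio: Count(unique token) / Count(tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def distinct_n(tokens, n):
    """Unique n-grams divided by total n-grams (assumed normalization)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


repetitive = tokenize("the battery is good the battery is good")
varied = tokenize("battery life tops every wish list")

for name, toks in [("repetitive", repetitive), ("varied", varied)]:
    print(name, round(ttr(toks), 3), round(distinct_n(toks, 2), 3))
```

Both functions look only at surface tokens, so the second sentence reaches the maximum score whether or not its claim is supported by the discussions. That is the gap between lexical diversity and analytical quality noted above.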
While the project page is referenced, critical experimental details are missing: inference hyperparameters (temperature, top-$p$, sampling methods) for the 20 evaluated models are not specified, nor are the exact prompts for the Active Probing stage (only template references in Figure 2). The COIN-RAG pipeline mentions dual embeddings (TF-IDF and all-MiniLM-L6-v2) but omits chunking strategy, overlap parameters, and retrieval depth ($k$). Data will be released under a restrictive academic license (Appendix A.2), which limits commercial reproducibility. The live-updating nature of the benchmark, while methodologically sound for preventing contamination, introduces temporal variance that is not characterized or controlled across the experimental runs.
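The omitted retrieval settings are not cosmetic, as a small sketch can show. The code below is a pure-Python TF-IDF retriever written for illustration, not the COIN-RAG pipeline: the function names, the smoothed IDF formula, and the `size`, `overlap`, and `k` parameters are all assumptions, and the dense all-MiniLM-L6-v2 side is left out entirely.

```python
import math
from collections import Counter


def chunk(tokens, size, overlap):
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


def tfidf_vectors(chunks):
    """Return sparse TF-IDF vectors (dicts) and the IDF table, using a smoothed IDF."""
    n = len(chunks)
    df = Counter(term for c in chunks for term in set(c))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(c).items()} for c in chunks]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_tokens, chunks, k):
    """Return indices of the k chunks most similar to the query."""
    vectors, idf = tfidf_vectors(chunks)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    ranked = sorted(range(len(chunks)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


# Invented evidence chunks, for illustration only.
evidence = [
    ["battery", "drains", "fast"],
    ["screen", "is", "bright"],
    ["battery", "swelling", "reported"],
]
top1 = retrieve(["screen", "bright"], evidence, k=1)
top2 = retrieve(["screen", "bright"], evidence, k=2)
windows = chunk("a b c d e f g h i j".split(), size=4, overlap=2)
print(top1, top2, len(windows))
```

Changing `size`, `overlap`, or `k` changes which passages a verifier sees, so results that depend on retrieved evidence are hard to reproduce without those values.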
Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill Collective Intent - the process of extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discussions - remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, real-world, live-updating benchmark specifically designed to evaluate LLMs on collective intent understanding within the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement a robust evaluation pipeline that combines a rule-based method with an LLM-as-the-Judge approach. This framework incorporates COIN-TREE for hierarchical cognitive structuring and retrieval-augmented verification (COIN-RAG) to ensure expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions - depth, breadth, informativeness, and correctness - reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH establishes a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.