MIND: Multi-agent inference for negotiation dialogue in travel planning
This paper tackles the challenge of multi-party consensus-building in travel planning, where agents must negotiate conflicting subjective preferences rather than converge on objective truths. MIND (Multi-agent Inference for Negotiation Dialogue) introduces a Theory-of-Mind-inspired framework where agents infer hidden preference intensities (willingness scores $w \in [1,10]$) from linguistic cues and dynamically adjust their tone between warmth and toughness. The work matters because it extends Multi-Agent Debate (MAD) from factual domains to social coordination problems requiring compromise.
MIND presents a creative integration of Theory of Mind principles into multi-agent negotiation, with thoughtfully designed prompts and a useful ablation disentangling tone from cognitive appraisal. However, the paper's central claim of 90.2% inference accuracy is misleadingly presented without immediate qualification that this metric allows a ±2 point margin on a 10-point scale (precise accuracy within ±1 is only 67.7%). The framework shows promise for preference aggregation but relies on a fallback mechanism that essentially abandons negotiation in favor of dictatorship by the highest-willingness agent when consensus fails (~7% of cases).
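The fallback behavior described above can be illustrated with a small sketch. This is not the paper's algorithm: the function name `pick_option`, the dictionary inputs, and the tie-breaking rule are all hypothetical, chosen only to show why a highest-willingness fallback amounts to a dictatorship once unanimity fails.

```python
def pick_option(votes: dict[str, str], willingness: dict[str, int]) -> tuple[str, bool]:
    """Return (chosen_option, used_fallback).

    Hypothetical illustration only: `votes` maps agent name -> preferred option,
    `willingness` maps agent name -> willingness score w in [1, 10].
    """
    options = set(votes.values())
    if len(options) == 1:
        # Unanimous agreement: no fallback needed.
        return next(iter(options)), False
    # Fallback: the agent with the highest willingness decides alone.
    # Ties go to the first such agent in insertion order (an arbitrary assumption).
    top_agent = max(willingness, key=willingness.get)
    return votes[top_agent], True


agreed = pick_option({"a": "beach", "b": "beach"}, {"a": 3, "b": 9})
forced = pick_option({"a": "beach", "b": "museum"}, {"a": 3, "b": 9})
print(agreed, forced)
```

Under this sketch, any outcome reached through the fallback trivially satisfies the highest-willingness agent, which is why counting such cases toward a success metric deserves scrutiny.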
The prompt engineering and experimental methodology demonstrate rigor. The verbatim disclosure of system prompts in Appendix B and negotiation traces in Appendix D enables inspection of the agent reasoning process. The ablation study (described in Section 4.3, although its table is not explicitly numbered) effectively isolates the contribution of tone injection versus strategic appraisal, showing that either component alone creates pathologies ('Stubborn Deadlocks' or 'Silent Submission') while their combination enables functional negotiation. The scalability analysis (Table 2) provides empirical evidence that the Debate Ratio degrades gracefully from 96.1% (2 agents) to 88.4% (4 agents), whereas the baseline collapses to 64.5%.
The accuracy reporting is problematic: the abstract highlights 90.2% accuracy without noting that this permits errors of up to ±2 on a 10-point scale, while the tighter ±1 threshold yields only 67.7%. The MAE of 1.27 (12.7% of the 10-point maximum, or about 14% of the 9-point range from 1 to 10) suggests moderate rather than high precision. The 'Debate Hit-Rate' metric (34.65%) and 'High-w Hit' prioritization create a potential circularity: the fallback mechanism explicitly enforces the high-w preference when negotiation fails, yet this is counted as successful deliberation in the 93.18% 'Debate Ratio.' The sample size is also a concern: only 201 negotiation scenarios and 359 inference instances support strong claims about ToM capabilities. No statistical significance tests or uncertainty estimates (p-values, confidence intervals) are reported for the performance gaps.
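To make the distinction between the ±2 and ±1 thresholds concrete, and to show the kind of uncertainty estimate the paper omits, the following is a minimal sketch. The toy scores are invented for illustration and are not the paper's data. The Wilson interval call assumes roughly 324 correct predictions out of 359, a count back-computed from the reported 90.2% and not stated in the paper.

```python
import math


def tolerance_accuracy(true_w, pred_w, tol):
    """Fraction of predictions whose absolute error is at most `tol`."""
    hits = sum(1 for t, p in zip(true_w, pred_w) if abs(t - p) <= tol)
    return hits / len(true_w)


def mean_absolute_error(true_w, pred_w):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(true_w, pred_w)) / len(true_w)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Invented toy willingness scores on the 1-10 scale (not from the paper).
true_w = [8, 3, 5, 9, 2, 6]
pred_w = [7, 5, 5, 6, 2, 8]

acc_pm2 = tolerance_accuracy(true_w, pred_w, tol=2)  # lenient threshold
acc_pm1 = tolerance_accuracy(true_w, pred_w, tol=1)  # strict threshold
mae = mean_absolute_error(true_w, pred_w)

# Assumed count: 324 of 359 is approximately the reported 90.2%.
low, high = wilson_interval(successes=324, n=359)

print(f"acc(+/-2)={acc_pm2:.3f} acc(+/-1)={acc_pm1:.3f} MAE={mae:.3f}")
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

The same predictions score very differently under the two tolerances, which is why the threshold should accompany any headline accuracy figure. The Wilson interval is only one reasonable choice; a bootstrap over scenarios would also work if instances within a scenario are correlated.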
The baseline comparison adapts MAD frameworks from Liang et al. (2024) to the travel domain, which is reasonable but creates an asymmetry: MIND is architected specifically for negotiation with built-in willingness modeling, while the baseline treats preferences as flat constraints. The comparison would be stronger with simpler baselines (e.g., random proposer selection, round-robin voting without inference). The LLM-as-a-Judge evaluation (Table 4) uses GPT-4.1 as the evaluator, which belongs to the same model family as the agents (GPT-4.1-mini). This creates potential evaluator bias toward outputs that match its own training distribution. The 'Fairness' metric (Jain's Index) remains nearly identical between Base (0.6849) and MIND (0.6838), undercutting claims of superior social welfare despite higher satisfaction scores.
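For readers unfamiliar with the fairness metric, Jain's index for allocations $x_1, \dots, x_n$ is $(\sum_i x_i)^2 / (n \sum_i x_i^2)$, ranging from $1/n$ (one party gets everything) to 1 (perfectly equal). The sketch below computes it on invented satisfaction scores. How the paper maps negotiation outcomes to per-agent values is not shown here, so the inputs are purely illustrative.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Equals 1.0 when all values are equal and 1/n when a single
    participant receives everything.
    """
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if sum_sq == 0:
        raise ValueError("Jain's index is undefined when all values are zero")
    return (total * total) / (n * sum_sq)


# Invented per-agent satisfaction scores (not from the paper).
equal_split = jain_index([4, 4, 4, 4])
winner_takes_all = jain_index([1, 0, 0, 0])
print(equal_split, winner_takes_all)
```

Because the index depends only on how evenly satisfaction is spread, two systems can have nearly identical fairness while differing in average satisfaction, which is consistent with the pattern the review notes.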
Reproducibility is strong in places but incomplete. The paper provides exceptional disclosure of prompt templates (Tables B1-B9), algorithmic pseudocode (Algorithm C10), and qualitative traces (Tables D11-D15). Hyperparameters are specified (temperature 0.4, model gpt-4.1-mini-2025-04-14). However, no code repository URL is provided, and the augmented dataset combining TravelPlanner with Stravl preferences is not publicly released. The reliance on a specific, dated model version (gpt-4.1-mini-2025-04-14) creates reproducibility concerns as API behavior shifts over time. Cost estimates ($0.02 per session) are provided, but without a breakdown of token counts. No variance estimates or random seed specifications are included.
While Multi-Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi-agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus-building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, effectively prioritizing high-stakes constraints. Furthermore, qualitative evaluations via LLM-as-a-Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.