Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs

cs.LG Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen · Mar 23, 2026
AI Review
Plain-language introduction

Multi-agent applications execute tasks through multi-stage workflows where each stage is an LLM call feeding into the next. While heterogeneous clusters (mixing model sizes/families) enable better latency–performance trade-offs than homogeneous deployments, they introduce complex scheduling challenges: model selection affects both task accuracy and queue congestion. Chimera addresses this by predicting per-model confidence scores, forecasting total workflow output lengths, and estimating real-time load via in-flight token volumes to jointly optimize end-to-end latency and task performance.
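As a rough illustration of how these signals might interact, the sketch below scores each candidate model by its predicted confidence minus a congestion penalty based on how long its in-flight tokens would take to drain. The scoring rule, the weight `lam`, and every name and number here are assumptions made for illustration, not Chimera's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str               # hypothetical model label
    confidence: float       # predicted probability this model solves the request
    inflight_tokens: int    # predicted output tokens still queued or running
    tokens_per_sec: float   # assumed sustained decode throughput


def pick_model(models, lam=0.01):
    """Return the model with the best confidence-minus-congestion score.

    lam is a hypothetical weight: score units lost per second of estimated wait.
    """
    def score(m):
        est_wait_s = m.inflight_tokens / m.tokens_per_sec
        return m.confidence - lam * est_wait_s

    return max(models, key=score)


small = ModelState("small-7b", confidence=0.62, inflight_tokens=2_000, tokens_per_sec=400.0)
large = ModelState("large-14b", confidence=0.80, inflight_tokens=12_000, tokens_per_sec=200.0)

# The larger model is more likely to be correct, but its backlog is much deeper.
choice = pick_model([small, large])
```

With these made-up numbers the congested larger model loses to the smaller one; setting `lam=0.0` ignores load and picks the larger model instead, which is the accuracy-versus-congestion tension the review describes.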

Critical review
Verdict
Bottom line

Chimera presents a coherent middleware design that couples semantic routing with workflow-aware scheduling. The system demonstrates consistent latency reductions (1.2–3.4×) and performance gains (1.6–16 percentage points) across heterogeneous model configurations on two agentic benchmarks. The approach is well motivated, and the oracle comparisons in the ablation studies provide useful upper bounds. However, the evaluation is limited to two reasoning-heavy datasets (code and math), and the baselines omit routing-specific systems such as RouteLLM that the paper itself discusses in related work, so it is unclear whether the routing component is state of the art or merely adequate.

“Chimera improves both latency and performance, achieving a 3.4× speedup and a 16.0% score increase over vLLM”
Chimera, Sec. 4.3 · Section 4.3
“Recent methods also study stronger training paradigms for routing policies... RouteLLM”
Chimera, Sec. 2 · Section 2
What holds up

The paper’s central insight, that predicting total output length at the workflow level enables better priority scheduling than request-level shortest-job-first (SJF), holds up empirically. Figure 3 shows STJF reduces queue time by an additional 15–34% over SJF. The design separation between semantic routing (quality), length prediction (work), and activity monitoring (load) is architecturally sound. The scheduling overhead claim is convincing: Table 1 shows the scheduler contributes ≤2.2% of end-to-end latency, which supports the choice of a CPU-based QRF predictor and a batched router.

“Incorporating workflow-level information through STJF yields an additional 15–34% reduction compared to SJF”
Chimera, Sec. 3.4 · Section 3.4
“the scheduler contributes at most 2.2% of end-to-end latency (and often <1%)”
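To make the distinction concrete, the sketch below orders the same pending requests two ways: by the predicted length of the current stage alone (request-level SJF), and by the predicted output remaining across the whole workflow (the workflow-level idea the review attributes to STJF). Field names and token counts are invented for illustration, and Chimera's actual priority function may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pending:
    workflow: str
    stage_tokens: int       # predicted output length of the current stage only
    remaining_tokens: int   # predicted output of all remaining stages, current included


def sjf_order(queue):
    """Request-level ordering: shortest current stage first."""
    return [p.workflow for p in sorted(queue, key=lambda p: p.stage_tokens)]


def stjf_order(queue):
    """Workflow-level ordering: least total remaining work first."""
    return [p.workflow for p in sorted(queue, key=lambda p: p.remaining_tokens)]


queue = [
    Pending("A", stage_tokens=100, remaining_tokens=2_000),  # short stage, long workflow
    Pending("B", stage_tokens=300, remaining_tokens=300),    # final stage of its workflow
    Pending("C", stage_tokens=200, remaining_tokens=900),
]

print(sjf_order(queue))   # ['A', 'C', 'B']
print(stjf_order(queue))  # ['B', 'C', 'A']
```

Under the workflow-level key, B completes its entire workflow first, whereas request-level SJF favors A's short stage even though A has the most work left overall.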
Main concerns

First, the evaluation scope is narrow: only APPS and MATH are tested, both involving ReAct-style reasoning workflows with deterministic stage counts. Generalization to open-ended, dynamic, or chat-based multi-agent workflows remains unverified. Second, the semantic router requires per-request, per-model correctness labels for training, an expensive data collection burden that is not adequately discussed. Third, the ablation in Figure 7 reveals that an oracle router substantially improves the Pareto frontier for Qwen1.5+14B, indicating the current router leaves significant performance on the table. Fourth, the paper compares against vLLM, MLFQ, and LTR but not against dedicated routing systems like RouteLLM or Martian, despite citing them as related work. Finally, the anti-starvation parameters (threshold $S$ and quantum $Q$) appear hand-tuned, with no sensitivity analysis.

“An oracle router improves the latency–performance frontier for Qwen1.5+14B, yielding high performance at low latency”
Chimera, Sec. 4.4 · Figure 7 caption
“Chimera mitigates this with an aging-based promotion mechanism controlled by two parameters: a starvation threshold $S$ and a running quantum $Q$”
Chimera, Sec. 3.7 · Section 3.7
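One plausible reading of such an aging mechanism is sketched below: a request that has waited at least `S` seconds is promoted ahead of the length-based ordering, and it loses that status once it has consumed `Q` seconds of run time while promoted. These semantics, along with all names and values, are assumptions for illustration; the paper's exact rule is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    predicted_tokens: int
    waited_s: float = 0.0        # time spent waiting since last scheduled
    promoted_run_s: float = 0.0  # run time consumed while promoted
    promoted: bool = False


def update_promotion(req, S, Q):
    """Promote after S seconds of waiting; demote after Q seconds of promoted run time."""
    if not req.promoted and req.waited_s >= S:
        req.promoted, req.promoted_run_s = True, 0.0
    elif req.promoted and req.promoted_run_s >= Q:
        req.promoted, req.waited_s = False, 0.0


def sort_key(req):
    # Promoted requests go first (longest wait first); the rest are shortest-job-first.
    if req.promoted:
        return (0, -req.waited_s)
    return (1, req.predicted_tokens)


def schedule_order(reqs, S, Q):
    for r in reqs:
        update_promotion(r, S, Q)
    return sorted(reqs, key=sort_key)


reqs = [
    Req("long", 5_000, waited_s=12.0),
    Req("short", 100, waited_s=1.0),
    Req("mid", 800, waited_s=3.0),
]
first = [r.rid for r in schedule_order(reqs, S=10.0, Q=2.0)]
reqs[0].promoted_run_s = 2.0  # pretend "long" has now used its full quantum
second = [r.rid for r in schedule_order(reqs, S=10.0, Q=2.0)]
```

In this toy run the starved long request jumps the queue once, then falls back behind shorter jobs after its quantum expires. How sensitive such behavior is to the choice of `S` and `Q` is exactly what the review says the paper leaves unexamined.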
Evidence and comparison

The evidence supports the claim that co-designing routing, load balancing, and prioritization outperforms disjoint optimizations. However, the comparisons are incomplete. While vLLM is a strong serving baseline and LTR represents length-prediction scheduling, the routing component is not benchmarked against recent academic routers (e.g., RouteLLM, RouterDC, GraphRouter) or commercial offerings (e.g., Martian). This makes it difficult to tell whether Chimera’s gains come from novel scheduling or merely from adding any reasonable router. The oracle studies are commendable but also expose gaps: length-prediction error matters most when latency slack is tight (Qwen1.5+7B), while routing error matters more for larger models (Qwen1.5+14B).

“length-prediction error matters most when latency constraints are tight; however, it has limited impact for Qwen1.5+14B, where... routing is not the primary bottleneck there”
Chimera, Sec. 4.4 · Section 4.4
Reproducibility

Implementation details are reasonably thorough: Chimera builds on open-source vLLM, uses specified GPUs (RTX A6000), and reports training hyperparameters for the router (55 epochs, learning rate 1e-5, batch size 16) and the QRF predictor's features. However, no code or data repository is linked, exact prompt templates for the ReAct workflows are not provided, and the method for generating router training labels (per-request correctness for each model) is not described. Reproducing the router would require labeling the entire training set across all candidate models, a potentially prohibitive cost that the paper does not acknowledge. The QRF hyperparameters (number of trees, max depth) are omitted.

“Training uses offline traces where each $(x_r, m)$ pair is labeled by task-specific correctness (1 if correct, 0 otherwise)”
Chimera, Sec. 3.3 · Section 3.3
“finetune the semantic router for 55 epochs, a learning rate of 1e-5, and a batch size of 16”
Chimera, App. B.1 · Table 2 caption
Abstract

Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2–2.4× and improving task performance by 8.0–9.5 percentage points on average over competitive baselines including vLLM.
