ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

cs.AI Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu · Mar 22, 2026
AI Review
Plain-language introduction

ConsRoute tackles the challenge of routing queries across cloud-edge-device LLM tiers by proposing a consistency-aware approach that uses reranker-based semantic similarity rather than scalar quality gaps. The core innovation lies in reusing device-side LLM (DLM) prefilling hidden states as query representations and applying cluster-specific adaptive thresholds learned via Bayesian optimization. This addresses the tension between response quality and latency/cost in resource-constrained mobile environments. The paper claims to retain ≥95% of cloud accuracy while cutting latency and cost by nearly 40%.

Critical review
Verdict
Bottom line

ConsRoute presents a technically sound and well-engineered solution to multi-tier LLM routing. The reuse of DLM prefilling states avoids extra encoder overhead, and the shift from scalar quality differences to semantic consistency supervision is conceptually appealing. However, the evaluation relies heavily on simulated network conditions and synthetic cost models rather than real deployment traces, and the claim of maintaining ≥95% cloud performance depends on carefully tuned utility weights that may not generalize across all workload types. The approach is suitable for settings where response semantic alignment correlates with task accuracy, but less validated for open-ended generation where correctness is ambiguous.

“achieves near-cloud performance (≥95%) while reducing end-to-end latency and inference cost by nearly 40%”
paper · Abstract
“This score captures whether the DLM response preserves the core semantics of the CLM response”
paper · IV-C2
What holds up

The design choice to reuse DLM prefilling hidden states (Equation 2) is compelling: by appending a consistency-focused prompt and extracting the EOS token representation $h_T$, ConsRoute avoids deploying separate BERT encoders or calling embedding APIs, minimizing device-side overhead. The semantic consistency supervision via reranker scores (Equation 3) addresses a genuine flaw in prior work that collapses rich textual outputs into scalar quality gaps. Figure 4 provides empirical support that reranker scores correlate better with human consistency judgments than reward models or BARTScore do. The cluster-based Bayesian optimization for thresholds (Equation 9) handles heterogeneous query types without manual per-task tuning.

“h_T := DLM(x')[EOS]”
paper · IV-B
“This two-step compression discards much of the fine-grained relational information between model outputs”
paper · IV-C1
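To make the routing path described above concrete, here is a minimal sketch of how a reused hidden state could drive a cluster-specific threshold decision. It is not the paper's implementation: the linear `predict_consistency` head, the toy 2-D vectors standing in for $h_T$, the centroids, the thresholds, and the binary device/escalate outcome are all assumptions made for illustration. In ConsRoute the thresholds would come from Bayesian optimization, and the choice between edge and cloud is not modeled here.

```python
import math

def predict_consistency(h, w, b):
    """Hypothetical linear router head: sigmoid(w . h + b), a score in (0, 1)."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def nearest_cluster(h, centroids):
    """Index of the centroid closest to h in Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(h, centroids[k]))

def route(h, w, b, centroids, thresholds):
    """Answer on-device if the predicted consistency clears the threshold
    of the query's cluster; otherwise escalate to a larger tier."""
    score = predict_consistency(h, w, b)
    k = nearest_cluster(h, centroids)
    return "device" if score >= thresholds[k] else "escalate"

# Toy 2-D stand-ins for h_T; every number below is made up for illustration.
W, B = [1.0, -1.0], 0.0
CENTROIDS = [[0.0, 0.0], [4.0, 4.0]]
THRESHOLDS = [0.5, 0.8]  # one threshold per cluster

print(route([0.5, 0.1], W, B, CENTROIDS, THRESHOLDS))  # device
print(route([4.2, 3.9], W, B, CENTROIDS, THRESHOLDS))  # escalate
```

The two queries receive similar predicted scores (about 0.60 and 0.57) but land in different clusters. This is the point of per-cluster thresholds: the same score can be good enough for one query type and not for another.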
Main concerns

First, the supervision signal assumes reranker semantic similarity is a reliable proxy for task-specific correctness, but the paper does not establish that high reranker scores always imply functional equivalence on downstream tasks. For mathematical reasoning or code generation, semantic similarity may mask logically critical differences. Second, the online adaptation mechanism (Algorithm 2) requires per-query correctness feedback $u(x)$ via an indicator function $\mathbb{I}[\text{correct}]$, which for open-ended responses (MT-Bench) necessitates an expensive LLM-as-judge. That judge introduces latency and cost that are not fully accounted for in the routing overhead analysis. Third, the 40% cost reduction claim relies on a specific cost model $\text{Cost}(\pi)$ based on activated parameters and token count; this ignores memory bandwidth bottlenecks, KV-cache eviction costs, and heterogeneous pricing of cloud APIs versus edge ownership.

“u(x) = λ_1 · I[correct] - λ_2 · Latency(x) - λ_3 · Cost(x)”
paper · IV-D
“estimate Cost(π) based on the total number of activated model parameters multiplied by the number of generated tokens”
paper · III-C
Evidence and comparison

The evaluation against RouteLLM and MixLLM demonstrates consistent Pareto improvements on RouterBench subsets (GSM8K, MMLU, HumanEval), though these baselines were primarily designed for two-tier (device-cloud) rather than three-tier hierarchies. The cross-family deployment experiment (LLaMA-3.2-3B device, Qwen3-14B edge, DeepSeek-V3 cloud) strengthens generalizability claims. However, the comparison lacks recent baselines like cascading with speculative decoding or prompt-based routing, and the network sensitivity experiments (Section V-D2) rely on simulated bandwidth/latency parameters rather than real-world mobile trace data. The ablation in Section V-E confirms that both label augmentation and dynamic thresholding contribute independently, validating the design decomposition.

“LLaMA-3.2-3B on the device, Qwen3-14B on the edge, and DeepSeek-V3 in the cloud”
paper · V-B2
“device-to-edge link was configured with: 10,000 Kbps downlink bandwidth...”
paper · V-D2
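For readers checking the "Pareto improvement" claim against their own measurements, the dominance test is simple to state in code. This is a generic sketch of Pareto dominance over (accuracy, latency, cost) triples, not code from the paper. The `OperatingPoint` type and the three example points are hypothetical and do not reproduce any reported result.

```python
from typing import NamedTuple

class OperatingPoint(NamedTuple):
    name: str
    accuracy: float  # higher is better
    latency: float   # lower is better
    cost: float      # lower is better

def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency <= b.latency
                and a.cost <= b.cost)
    strictly_better = (a.accuracy > b.accuracy or a.latency < b.latency
                       or a.cost < b.cost)
    return no_worse and strictly_better

def pareto_front(points):
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up operating points for three hypothetical routers.
points = [
    OperatingPoint("router_a", 0.80, 2.0, 10.0),
    OperatingPoint("router_b", 0.78, 2.5, 12.0),  # worse than router_a everywhere
    OperatingPoint("router_c", 0.85, 3.0, 15.0),  # more accurate but slower
]
print([p.name for p in pareto_front(points)])  # ['router_a', 'router_c']
```

A method is a Pareto improvement over a baseline only if `dominates` holds at matched operating points. A router that is more accurate but also slower, like `router_c` here, is a trade-off rather than an improvement.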
Reproducibility

Reproduction is partially blocked by missing implementation details. The paper does not provide a code repository link or mention artifact availability. Critical hyperparameters for the Bayesian optimization (number of initialization points, GP kernel choice, acquisition function budget $T_{\text{off}}$) are not specified in the main text (referenced only as 'Algorithm 2'). The exact prompt templates for explicit routing baselines (Appendix A) are not included in the provided text. While the DLM (Qwen3-1.7B) and reranker (Qwen3-Reranker-4B) are public, the cluster count $K$ determined by the 'elbow method' is not reported, nor is the sensitivity of results to this choice. The utility weights $(\lambda_1, \lambda_2, \lambda_3)$ used for the main results are not explicitly stated, complicating fair comparison.

“The number of clusters K is automatically determined using the elbow method”
paper · IV-D
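Because the paper reports neither $K$ nor how the elbow was located, it may help to see one common way the heuristic is automated. The sketch below picks the $K$ with the largest second difference in within-cluster inertia. This is an assumption about how "elbow method" might be implemented, not the authors' procedure, and the inertia values are invented.

```python
def elbow_k(inertias):
    """Pick K at the sharpest bend of the inertia curve.

    inertias[i] is the within-cluster sum of squares for K = i + 1.
    The bend at K is measured by the discrete second difference
    inertia(K-1) - 2*inertia(K) + inertia(K+1).
    """
    if len(inertias) < 3:
        raise ValueError("need inertia for at least three values of K")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Invented inertia values for K = 1..5.
print(elbow_k([100, 60, 20, 17, 15]))  # 3
```

Other elbow criteria, such as distance to the chord between the endpoints or a relative-drop cutoff, can select a different $K$ on the same curve. This is one reason the missing sensitivity analysis matters.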
Abstract

Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
