Optimizing Feature Extraction for On-device Model Inference with User Behavior Sequences

cs.LG cs.AI cs.HC Chen Gong, Zhenzhe Zheng, Yiliu Chen, Sheng Wang, Fan Wu, Guihai Chen · Mar 23, 2026
What it does
Machine learning models on mobile devices spend 61-86% of execution time extracting features from user behavior logs rather than running inference. This paper introduces AutoFeature, a graph-based engine that eliminates redundant...
Why it matters
1.33×-4.53× end-to-end latency reduction without accuracy loss.
Main concern
The paper presents a compelling systems contribution by identifying and solving the feature extraction bottleneck in on-device ML pipelines through graph abstraction and redundancy elimination. The industrial-scale evaluation across...
AI Review
Plain-language introduction

Machine learning models on mobile devices spend 61-86% of execution time extracting features from user behavior logs rather than running inference. This paper introduces AutoFeature, a graph-based engine that eliminates redundant operations across features and consecutive executions using directed acyclic graph optimization and intelligent caching. Tested across five industrial services including TikTok and e-commerce platforms, it achieves 1.33×-4.53× end-to-end latency reduction without accuracy loss.

Critical review
Verdict
Bottom line

The paper presents a compelling systems contribution by identifying and solving the feature extraction bottleneck in on-device ML pipelines through graph abstraction and redundancy elimination. The industrial-scale evaluation across diverse mobile services (search, video, e-commerce) with real user data strengthens the empirical claims, though the reliance on proprietary infrastructure limits reproducibility and the small test cohort (10 users) limits how far the results generalize.

What holds up

The graph abstraction formalism and hierarchical filtering algorithm provide solid technical foundations. The paper demonstrates empirically that Retrieve and Decode operations dominate feature extraction costs (they consume 15× more time than Filter operations and 300× more time than Compute operations), justifying the optimization targets. The knapsack formulation for caching decisions and the greedy 2-approximation policy are theoretically grounded and practically validated across varying memory budgets.

“Retrieve and Decode nodes dominate feature extraction time... these nodes consume 15× more time than Filter nodes and 300× more time than Compute nodes”
Gong et al., Sec. 3.3
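
The review mentions a knapsack formulation with a greedy 2-approximation policy but does not reproduce it. As a rough illustration only, the sketch below implements the textbook greedy 2-approximation for 0/1 knapsack: fill the budget by value density, then compare against the single most valuable item. It is not the paper's algorithm. The function name, the candidate tuples, and the idea of measuring value as "latency saved" and weight as "memory cost" are assumptions made for this example.

```python
from typing import List, Tuple

# (name, latency saved if cached, memory cost); all values are hypothetical.
Candidate = Tuple[str, float, int]


def greedy_cache_selection(
    candidates: List[Candidate], budget: int
) -> Tuple[List[str], float]:
    """Textbook greedy 2-approximation for 0/1 knapsack (illustrative only)."""
    # Drop candidates that can never fit or that have a non-positive cost.
    feasible = [c for c in candidates if 0 < c[2] <= budget]
    if not feasible:
        return [], 0.0

    # Pass 1: fill the budget in order of saved latency per unit of memory.
    ranked = sorted(feasible, key=lambda c: c[1] / c[2], reverse=True)
    chosen: List[str] = []
    used, saved = 0, 0.0
    for name, gain, cost in ranked:
        if used + cost <= budget:
            chosen.append(name)
            used += cost
            saved += gain

    # Pass 2: the single most valuable candidate. This guards against a small,
    # dense item blocking one large, valuable item.
    best = max(feasible, key=lambda c: c[1])
    if best[1] > saved:
        return [best[0]], best[1]
    return chosen, saved


if __name__ == "__main__":
    demo = [("decoded_clicks", 6.0, 5), ("decoded_views", 5.0, 5), ("decoded_carts", 4.0, 5)]
    print(greedy_cache_selection(demo, budget=10))
```

Taking the better of the two passes is what gives the factor-2 guarantee; density ordering alone can be arbitrarily bad when one large item holds most of the value.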
Main concerns

The evaluation relies on only 10 testing users observed over 2 days, which, despite Kolmogorov-Smirnov tests suggesting distributional similarity to broader populations, raises questions about statistical power and edge-case robustness. The comparison with cloud-based baselines (Decoded Log and Feature Store) sets up strawman alternatives: they increase app log size by 2.61× and 2.80× respectively, which makes them unrealistic for production rather than competitive. Additionally, the claim of being 'the first' to address this bottleneck may overlook prior work on feature stores and data pipelines that, while cloud-focused, addresses similar redundancy issues.

“10 testing users during their daily usage of the app across 2 days”
Gong et al., Sec. 4.1
“Decoded Log increases the app log size by 2.61×, and Feature Store increases it by a staggering 2.80×”
Gong et al., Table 1
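
To make the statistical-power concern concrete, the sketch below computes a two-sample Kolmogorov-Smirnov statistic in pure Python, together with the standard large-sample critical value. How the paper actually applied the test (which variables, which reference population) is not stated in this review, so the reference size of 10,000 in the usage example is a hypothetical figure, and the asymptotic critical value is only a rough guide when one sample has 10 points.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both ECDFs are right-continuous step functions, so the largest gap is
    # reached at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Asymptotic rejection threshold for sample sizes n and m."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


if __name__ == "__main__":
    # 10 test users against a hypothetical reference population of 10,000.
    print(ks_critical_value(10, 10_000))
```

Under these assumptions the threshold comes out near 0.43, so a failure to reject with 10 users only rules out fairly large distributional differences.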
Evidence and comparison

The evidence supports the core claims that feature extraction dominates latency (61-86%) and that redundancy exists across features and executions, backed by an analysis of 20+ production models. However, comparisons rely heavily on ablation studies (w/ Fusion, w/ Cache) rather than competing systems, which is acceptable given the novelty of the problem space but limits external validity. The synthetic experiments validating scalability under varying redundancy levels provide useful sensitivity analysis, though they isolate feature extraction from end-to-end latency.

“feature extraction alone accounts for 61%-86% of the total end-to-end model execution latency”
Gong et al., Sec. 2.2
“speedups grow from 7.3× and 1.0× at 0% redundancy to 336× and 21.9× at nearly 90% redundancy”
Gong et al., Fig. 21
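
The gap between extraction-only speedups and end-to-end speedups can be checked with a standard Amdahl's-law calculation. This is a sanity check added here, not an analysis the review attributes to the paper. It assumes that only the feature extraction share of latency is accelerated and that the rest of the pipeline is unchanged; the function names are illustrative.

```python
def end_to_end_speedup(extraction_fraction: float, extraction_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the extraction share is accelerated."""
    remaining = 1.0 - extraction_fraction
    return 1.0 / (remaining + extraction_fraction / extraction_speedup)


def speedup_ceiling(extraction_fraction: float) -> float:
    """Upper bound on overall speedup if extraction became free."""
    return 1.0 / (1.0 - extraction_fraction)


if __name__ == "__main__":
    # 61% and 86% are the extraction shares quoted from Sec. 2.2.
    for fraction in (0.61, 0.86):
        print(fraction, round(speedup_ceiling(fraction), 2))
```

With extraction at 61% and 86% of latency, the ceilings are about 2.6× and 7.1×, so the reported 1.33×-4.53× end-to-end range stays under the upper ceiling even though the synthetic extraction-only speedups run into the hundreds.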
Reproducibility

Reproducibility is severely limited: AutoFeature is integrated into ByteDance's proprietary ByteNN engine and mobile app SDKs, with no public code, datasets, or model configurations provided. The reliance on industrial infrastructure (TikTok, Toutiao) and undisclosed model architectures prevents independent reproduction. While the paper notes that 'AutoFeature operates without hyper-parameters,' the graph optimization algorithms, cost models, and caching policies are not specified in enough detail to be reimplemented.

“In compliance with enterprise data privacy requirements, our evaluation primarily uses ByteNN”
Gong et al., Sec. 4.1
“AutoFeature operates without hyper-parameters, eliminating the need for trial-and-error tuning”
Gong et al., Sec. 4.4
Abstract

Machine learning models are widely integrated into modern mobile apps to analyze user behaviors and deliver personalized services. Ensuring low-latency on-device model execution is critical for maintaining high-quality user experiences. While prior research has primarily focused on accelerating model inference with given input features, we identify an overlooked bottleneck in real-world on-device model execution pipelines: extracting input features from raw application logs. In this work, we explore a new direction of feature extraction optimization by analyzing and eliminating redundant extraction operations across different model features and consecutive model inferences. We then introduce AutoFeature, an automated feature extraction engine designed to accelerate on-device feature extraction process without compromising model inference accuracy. AutoFeature comprises three core designs: (1) graph abstraction to formulate the extraction workflows of different input features as one directed acyclic graph, (2) graph optimization to identify and fuse redundant operation nodes across different features within the graph; (3) efficient caching to minimize operations on overlapping raw data between consecutive model inferences. We implement a system prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video and e-commerce domains. Online evaluations show that AutoFeature reduces end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.
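
The first two designs in the abstract (one graph for all features, then fusing redundant nodes) can be pictured with a small sketch. The version below is a simplification, not AutoFeature's implementation: it treats each feature as a linear chain of operations and merges nodes only when two features share the same upstream prefix. The operation names come from the review (Retrieve, Decode, Filter, Compute); their ordering, the parameter strings, and the feature names are assumptions made for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

Op = Tuple[str, str]  # (operation name, parameter), e.g. ("Retrieve", "click")
NodeKey = Tuple[Optional[int], Op]  # (parent node id, operation)


def build_fused_graph(
    features: Dict[str, List[Op]],
) -> Tuple[Dict[NodeKey, int], Set[Tuple[int, int]], Dict[str, int]]:
    """Merge per-feature operation chains into one graph with shared prefixes."""
    nodes: Dict[NodeKey, int] = {}
    edges: Set[Tuple[int, int]] = set()
    outputs: Dict[str, int] = {}
    for feature, chain in features.items():
        parent: Optional[int] = None
        for op in chain:
            key = (parent, op)
            if key not in nodes:
                # First time this operation appears under this parent.
                nodes[key] = len(nodes)
                if parent is not None:
                    edges.add((parent, nodes[key]))
            # Otherwise reuse the existing node: this is the fusion step.
            parent = nodes[key]
        if parent is not None:
            outputs[feature] = parent
    return nodes, edges, outputs


if __name__ == "__main__":
    # Hypothetical features over click and view logs.
    demo = {
        "clicks_1h": [("Retrieve", "click"), ("Decode", "item_id"),
                      ("Filter", "last_1h"), ("Compute", "count")],
        "clicks_24h": [("Retrieve", "click"), ("Decode", "item_id"),
                       ("Filter", "last_24h"), ("Compute", "count")],
        "views_1h": [("Retrieve", "view"), ("Decode", "item_id"),
                     ("Filter", "last_1h"), ("Compute", "count")],
    }
    nodes, edges, outputs = build_fused_graph(demo)
    unfused = sum(len(chain) for chain in demo.values())
    print(unfused, len(nodes))
```

In this toy case the two click features share their Retrieve and Decode nodes, which are the expensive operations according to the review, while the view feature shares nothing because it reads a different log.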
