Optimizing Feature Extraction for On-device Model Inference with User Behavior Sequences
Machine learning models on mobile devices spend 61-86% of execution time extracting features from user behavior logs rather than running inference. This paper introduces AutoFeature, a graph-based engine that eliminates redundant operations across features and consecutive executions using directed acyclic graph optimization and intelligent caching. Tested across five industrial services including TikTok and e-commerce platforms, it achieves 1.33×-4.53× end-to-end latency reduction without accuracy loss.
The paper presents a compelling systems contribution by identifying and solving the feature extraction bottleneck in on-device ML pipelines through graph abstraction and redundancy elimination. The industrial-scale evaluation across diverse mobile services (search, video, e-commerce) with real user data strengthens the empirical claims, though the reliance on proprietary infrastructure and a small test cohort (10 users) limits broad reproducibility.
The graph abstraction formalism and hierarchical filtering algorithm provide solid technical foundations. The paper demonstrates empirically that Retrieve and Decode operations dominate feature extraction costs (15× and 300× slower than Filter and Compute respectively), justifying the optimization targets. The knapsack formulation for caching decisions and the greedy 2-approximation policy are theoretically grounded and practically validated across varying memory budgets.
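To make the caching claim concrete, here is a minimal sketch of the textbook greedy 2-approximation for 0/1 knapsack applied to a cache-selection problem. It is not the paper's policy, whose cost model is not given in this review. The names `CacheCandidate`, `saved_ms`, `size_kb`, and `greedy_cache_selection`, and the example values, are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class CacheCandidate(NamedTuple):
    name: str        # hypothetical label for an intermediate result
    saved_ms: float  # assumed latency saved per inference if cached
    size_kb: float   # assumed memory footprint when cached


def greedy_cache_selection(candidates: List[CacheCandidate],
                           budget_kb: float) -> List[CacheCandidate]:
    """Textbook greedy 2-approximation for 0/1 knapsack (illustrative only)."""
    # Items larger than the budget can never be cached.
    fitting = [c for c in candidates if 0 < c.size_kb <= budget_kb]

    # Pass 1: fill the budget in order of benefit per unit of memory.
    by_density = sorted(fitting,
                        key=lambda c: c.saved_ms / c.size_kb,
                        reverse=True)
    chosen, used = [], 0.0
    for c in by_density:
        if used + c.size_kb <= budget_kb:
            chosen.append(c)
            used += c.size_kb

    # Pass 2: compare against the single most valuable item. This guards
    # against small dense items crowding out one large high-value item.
    best_single = max(fitting, key=lambda c: c.saved_ms, default=None)
    if (best_single is not None
            and best_single.saved_ms > sum(c.saved_ms for c in chosen)):
        return [best_single]
    return chosen


# Hypothetical candidates; the numbers are made up for the example.
candidates = [
    CacheCandidate("decoded_clicks", saved_ms=10.0, size_kb=1.0),
    CacheCandidate("decoded_plays", saved_ms=90.0, size_kb=10.0),
]
selection = greedy_cache_selection(candidates, budget_kb=10.0)
print([c.name for c in selection])
```

With a 10 KB budget, the density pass alone would keep only `decoded_clicks`, so the single-item fallback returns `decoded_plays` instead. Taking the better of the two passes is what yields the factor-2 guarantee in the standard analysis.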
The evaluation relies on only 10 testing users, which, despite Kolmogorov-Smirnov tests suggesting distributional similarity to broader populations, raises questions about statistical power and edge-case robustness. The cloud-based baselines (Decoded Log and Feature Store) are strawmen: both increase storage by 2.6-2.8×, which makes them unrealistic for production rather than competitive alternatives. Additionally, the claim of being 'the first' to address this bottleneck may overlook prior work on feature stores and data pipelines that, while cloud-focused, addresses similar redundancy issues.
The evidence, drawn from an analysis of 20+ production models, supports the core claims that feature extraction dominates latency (61-86%) and that redundancy exists across features and executions. However, comparisons rely heavily on ablation studies (w/ Fusion, w/ Cache) rather than competing systems, which is acceptable given the novelty of the problem space but limits external validity. The synthetic experiments validating scalability under varying redundancy levels provide useful sensitivity analysis, though they isolate feature extraction from end-to-end latency.
Reproducibility is severely limited: AutoFeature is integrated into ByteDance's proprietary ByteNN engine and mobile app SDKs, with no public code, datasets, or model configurations provided. The reliance on industrial infrastructure (TikTok, Toutiao) and undisclosed model architectures prevents independent reproduction. While the paper notes that 'AutoFeature operates without hyper-parameters,' the graph optimization algorithms, cost models, and caching policies require implementation details not fully specified for replication.
Machine learning models are widely integrated into modern mobile apps to analyze user behaviors and deliver personalized services. Ensuring low-latency on-device model execution is critical for maintaining high-quality user experiences. While prior research has primarily focused on accelerating model inference with given input features, we identify an overlooked bottleneck in real-world on-device model execution pipelines: extracting input features from raw application logs. In this work, we explore a new direction of feature extraction optimization by analyzing and eliminating redundant extraction operations across different model features and consecutive model inferences. We then introduce AutoFeature, an automated feature extraction engine designed to accelerate the on-device feature extraction process without compromising model inference accuracy. AutoFeature comprises three core designs: (1) graph abstraction to formulate the extraction workflows of different input features as one directed acyclic graph, (2) graph optimization to identify and fuse redundant operation nodes across different features within the graph, and (3) efficient caching to minimize operations on overlapping raw data between consecutive model inferences. We implement a system prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video and e-commerce domains. Online evaluations show that AutoFeature reduces end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.
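The graph abstraction and fusion idea in the abstract can be illustrated with a minimal sketch. It assumes each feature is a linear chain of (operation, parameter) steps and merges only identical prefixes, which is plain common-subexpression sharing. AutoFeature's actual optimization is likely more involved, and the feature names, step parameters, and the function `build_fused_graph` below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (operation, parameter), e.g. ("retrieve", "click")


def build_fused_graph(features: Dict[str, List[Step]]):
    """Merge per-feature operation chains into one DAG.

    A node is identified by its step plus the node it reads from, so two
    features that apply the same step to the same input share one node.
    """
    nodes: Dict[tuple, Optional[tuple]] = {}  # node key -> parent key
    outputs: Dict[str, Optional[tuple]] = {}  # feature -> final node key
    for name, steps in features.items():
        parent = None
        for step in steps:
            key = (parent, step)           # same op on same input => same node
            nodes.setdefault(key, parent)  # created once, reused afterwards
            parent = key
        outputs[name] = parent
    return nodes, outputs


# Hypothetical feature definitions over user behavior logs.
features = {
    "clicks_1h": [("retrieve", "click"), ("decode", "item_id"),
                  ("filter", "last_1h"), ("compute", "count")],
    "clicks_24h": [("retrieve", "click"), ("decode", "item_id"),
                   ("filter", "last_24h"), ("compute", "count")],
    "avg_watch_1h": [("retrieve", "play"), ("decode", "duration"),
                     ("filter", "last_1h"), ("compute", "mean")],
}

nodes, outputs = build_fused_graph(features)
unfused_ops = sum(len(steps) for steps in features.values())
print(f"operations without fusion: {unfused_ops}, fused nodes: {len(nodes)}")
```

In this example the two click features share their Retrieve and Decode nodes, so the fused graph has 10 nodes instead of 12 separate operations. Since the review reports that Retrieve and Decode dominate extraction cost, those shared nodes are where prefix sharing of this kind would pay off.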