A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP

cs.AI · Xi Yang, Aurelie Lozano, Naoki Abe, Bhavya, Saurabh Jha, Noah Zheutlin, Rohan R. Arora, Yu Deng, Daby M. Sow · Mar 23, 2026


AI Review

Plain-language introduction

Enterprise AI agents face a fundamental dilemma: complex reasoning demands large-scale training data, yet enterprise domains offer limited, noisy trajectories and prohibit online self-play. This paper proposes Context Engineering via DT-MDP (DT-MDP-CE), a framework that abstracts LLM agent behavior into a finite Digital-Twin Markov Decision Process, learns per-step rewards via contrastive inverse RL (T-REX) from ranked offline trajectories, and deploys the resulting policy to guide context engineering—enabling performance gains without fine-tuning the base model or interacting with the environment during training.

Critical review

Verdict

Bottom line

DT-MDP-CE offers a pragmatic, lightweight alternative to fine-tuning for enterprise settings where data is scarce and online interaction is restricted. The empirical results show consistent improvements over baseline agents on Site Reliability Engineering (SRE) tasks, and the framework generalizes across two agent architectures (EoG and ReAct) and multiple LLM families. However, the evaluation is limited to a single domain (IT automation) with only six held-out test scenarios, and the reliance on an LLM-as-a-judge for both training labels and evaluation introduces potential circularity. The paper would benefit from comparisons to preference-optimization and RL fine-tuning methods, such as DPO or PPO-based approaches, to validate the necessity of the inverse RL component.

“Across both metrics, DT-MDP-CE consistently outperforms the baseline, demonstrating the effectiveness of CE guided by RL policies learned with DT-MDP.”

paper · Section 4.2.1

“All Name-type and Topology-based configurations show statistically significant improvements after correction (p<0.05), while the Name-based configurations do not reach significance despite showing numerical gains.”

paper · Section 4.2.1

What holds up

The Digital-Twin MDP abstraction is a theoretically sound strategy to tame the infinite state-action spaces of LLM agents, converting the intractable POMDP into a finite MDP amenable to offline RL. The choice of T-REX (Trajectory-ranked Reward EXtrapolation) for contrastive inverse RL is particularly appropriate for enterprise settings, as it exploits trajectory rankings rather than assuming optimal demonstrations, enabling learning from mixed-quality data. The three context engineering strategies—suggesting via prompts, pruning explorations, and prioritizing actions—provide concrete, interpretable intervention points that avoid the brittleness of fine-tuning. The use of Conservative Q-Learning (CQL) for offline policy induction is well-justified given the limited data regime.

“T-REX makes use of noisy qualitative trajectory rankings (such as pairwise preferences over demonstrations or ratings for each demonstration on a scale) to learn a reward neural network $\hat{r}_{\theta}(s,a)$ that yields higher cumulative returns for higher-ranked trajectories.”

paper · Section 2.3.1
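The ranking objective described in this quote can be illustrated with a minimal sketch. It assumes a tabular reward over a small finite set of abstract (state, action) pairs instead of the reward neural network used in the paper, and it fits that table with the pairwise Bradley-Terry loss that T-REX is built on. The state and action names, the three toy trajectories, and the function names are hypothetical and not taken from the paper.

```python
import math
import random


def trajectory_return(reward, trajectory):
    """Sum the learned per-step reward over a list of (state, action) pairs."""
    return sum(reward[sa] for sa in trajectory)


def trex_fit(ranked_pairs, n_epochs=200, lr=0.1, seed=0):
    """Fit a tabular reward from (worse, better) trajectory pairs.

    Minimizes the pairwise ranking loss
        -log( exp(R(better)) / (exp(R(worse)) + exp(R(better))) )
    by stochastic gradient descent, where R is the summed reward.
    """
    rng = random.Random(seed)
    reward = {}
    for worse, better in ranked_pairs:
        for sa in worse + better:
            reward.setdefault(sa, 0.0)
    pairs = list(ranked_pairs)
    for _ in range(n_epochs):
        rng.shuffle(pairs)
        for worse, better in pairs:
            diff = trajectory_return(reward, better) - trajectory_return(reward, worse)
            # Probability the current reward assigns to the wrong ordering.
            p_wrong = 1.0 / (1.0 + math.exp(diff))
            # Gradient step: raise rewards along the better trajectory,
            # lower them along the worse one.
            for sa in better:
                reward[sa] += lr * p_wrong
            for sa in worse:
                reward[sa] -= lr * p_wrong
    return reward


# Hypothetical abstract trajectories, ranked bad < mixed < good.
good = [("alert", "check_topology"), ("topology", "inspect_upstream"),
        ("upstream", "report_root_cause")]
mixed = [("alert", "check_topology"), ("topology", "read_random_logs"),
         ("logs", "give_up")]
bad = [("alert", "read_random_logs"), ("logs", "read_random_logs"),
       ("logs", "give_up")]

reward = trex_fit([(bad, mixed), (mixed, good), (bad, good)])
returns = {name: trajectory_return(reward, traj)
           for name, traj in [("good", good), ("mixed", mixed), ("bad", bad)]}
```

Because only the ordering of trajectories enters the loss, none of the demonstrations needs to be optimal, which is the property the review highlights for mixed-quality enterprise data.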

“Equipped with a reward function $\hat{r}(s,a)$ (e.g., learned via Contrastive IRL), we estimate the optimal policy $\pi(s,a)$ through offline reinforcement learning using approaches such as Deep Q-Network or Conservative Q-Learning.”

paper · Section 2.3.2
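To make the offline policy induction step concrete, here is a minimal tabular sketch of a conservative Q-learning update on a fixed batch of transitions. It is not the d3rlpy implementation the paper reportedly uses: the tiny three-state MDP, the reward values, and all hyperparameters are assumptions chosen for illustration. The regularizer is the standard CQL term, alpha * (logsumexp over actions of Q(s, a') minus Q(s, a) for the logged action).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def conservative_q_learning(transitions, n_states, n_actions,
                            gamma=0.9, lr=0.1, alpha=1.0, n_epochs=300):
    """transitions: list of (s, a, r, s_next, done) from an offline dataset."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_epochs):
        for s, a, r, s_next, done in transitions:
            # Ordinary Q-learning TD update on the logged action.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += lr * (target - q[s][a])
            # Gradient step on the CQL regularizer: every action in state s
            # is pushed down by its softmax weight, and the logged action
            # is pushed back up.
            probs = softmax(q[s])
            for a_prime in range(n_actions):
                q[s][a_prime] -= lr * alpha * probs[a_prime]
            q[s][a] += lr * alpha
    return q


# Hypothetical offline data: state 2 is terminal, and action 1 is never
# observed in state 1, so its value should end up pessimistic.
transitions = [
    (0, 0, 0.0, 1, False),
    (0, 1, -0.5, 2, True),
    (1, 0, 1.0, 2, True),
]
q = conservative_q_learning(transitions, n_states=3, n_actions=2)
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(3)]
```

The unseen action in state 1 receives a negative value even though nothing in the data says it is bad, which is the pessimism that makes this family of methods attractive when trajectories are scarce.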

Main concerns

The primary limitation is the narrow empirical scope: the framework is tested on only 12 training scenarios and 6 test scenarios from ITBench, all within IT automation. While the authors claim generalizability to 'enterprise environments,' the evidence supports only transfer within IT automation, to a limited Software Engineering (SWE) task (Section 4.3.2). The Name-based representations fail to reach statistical significance (p≥0.05), suggesting the method is sensitive to abstraction quality. More critically, the framework relies on an LLM-as-a-judge (Gemini-2.5-Pro) both for trajectory rankings during training and for final evaluation, creating a risk of circular validation in which the system optimizes for judge-preferred reasoning patterns rather than ground truth. The paper also omits comparisons to standard fine-tuning baselines such as PPO or DPO, so it is hard to assess whether the complexity of inverse RL and the DT-MDP abstraction is necessary compared with direct preference optimization on the raw trajectories.

“Our training set consists of agent-system interaction trajectories collected from 12 SRE diagnosis scenarios in ITBench... we collected 819 trajectories (12,079 turns).”

paper · Section 4.1

“The agent is evaluated online on six ITBench test scenarios, including one Flagd failure, one Chaos Mesh failure, and four customized failures unseen during offline training.”

paper · Section 4.2

“In addition, since inverse RL methods require quality signals over trajectories, we apply an LLM-as-a-judge to assess their quality... The judge compares the agent's outputs against the ground-truth fault propagation to evaluate aspects such as the identified root cause and fault conditions.”

paper · Section 3.1.1

Evidence and comparison

The evidence supports the claim that RL-IRL outperforms Behavior Cloning (BC) and sparse-reward RL within the tested domain, as shown in the Critical Difference analysis (Figure 4). However, the comparison to related work is incomplete. While the paper positions itself against full fine-tuning approaches (Section 5), it does not empirically compare against recent context-engineering methods such as REARANK, or against direct preference optimization (DPO), which also operates without environment interaction. The topology-based abstraction transfers better than name-based representations to the SWE setting (Section 4.3.2), but the sample size (6 test scenarios) is too small to draw robust conclusions about transfer. The bootstrap Monte Carlo procedure for Pass@3 estimation is an appropriate way to quantify variance, though the 200 resamples may be insufficient given the small number of scenarios.

“The diagram shows that the RL-IRL group consistently achieves the best average ranks, while RL-Sparse, BC, and baseline methods tend to occupy lower ranks and are often clustered together, indicating similar performance.”

paper · Section 4.2.2
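A Critical Difference diagram is built on the average rank of each method across evaluation settings. The sketch below shows only that rank computation and assumes the usual convention that the highest score gets rank 1 and tied scores share the mean of their ranks. The method labels and scores are placeholders, not values from the paper.

```python
def ranks_descending(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def average_ranks(score_table):
    """score_table: {method: [score per setting]} -> {method: mean rank}."""
    methods = list(score_table)
    n_settings = len(next(iter(score_table.values())))
    totals = {m: 0.0 for m in methods}
    for col in range(n_settings):
        column = [score_table[m][col] for m in methods]
        for m, r in zip(methods, ranks_descending(column)):
            totals[m] += r
    return {m: totals[m] / n_settings for m in methods}


# Placeholder scores for three hypothetical methods over two settings.
example = {"method_a": [0.9, 0.8], "method_b": [0.5, 0.8], "method_c": [0.1, 0.2]}
mean_ranks = average_ranks(example)
```

With as few settings as the review describes, small changes in a single scenario can reorder these averages, which is one reason the clustering of lower-ranked methods should be read cautiously.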

“Across both metrics, all DT-MDP-based variants outperform the baseline without RL, indicating effective transfer to the SWE setting.”

paper · Section 4.3.2
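The review's point about bootstrap variance for Pass@3 can be made concrete with a small sketch. The paper's exact procedure is not specified in the review, so this assumes one plausible reading: for each scenario, draw 3 runs with replacement from the recorded pass/fail outcomes, count the scenario as solved if any draw passed, average over scenarios, and repeat 200 times. The outcome data and the function name are hypothetical.

```python
import random
import statistics


def bootstrap_pass_at_k(outcomes_by_scenario, k=3, n_resamples=200, seed=0):
    """Return (mean, standard deviation) of a bootstrap Pass@k estimate.

    outcomes_by_scenario: {scenario name: [bool per recorded run]}.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        solved = 0
        for outcomes in outcomes_by_scenario.values():
            draw = rng.choices(outcomes, k=k)  # sample k runs with replacement
            solved += any(draw)
        estimates.append(solved / len(outcomes_by_scenario))
    return statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical data: six scenarios, each with one passing run out of four.
runs = {f"scenario_{i}": [True, False, False, False] for i in range(6)}
mean_pass, std_pass = bootstrap_pass_at_k(runs)
```

Reporting the second return value alongside the mean would address the missing variance information, and with only six scenarios that spread is likely to be large relative to the differences between methods.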

Reproducibility

The paper provides sufficient methodological detail to reproduce the DT-MDP construction, including the specific prompts for entity extraction (Appendix A.2) and the exact threshold hyperparameters (95th percentile for suggesting, 85th for pruning). The use of open-source libraries (d3rlpy for CQL and BC) and the public ITBench benchmark (Jha et al., 2025) facilitate reproduction. However, the code and trained policies are not released, and the small dataset size (819 trajectories from 12 scenarios) raises concerns about result stability across different random splits. The paper does not report confidence intervals or standard deviations for the main Pass@3 metrics, only noting that 'values are estimated via a bootstrap Monte Carlo procedure' without specifying the variance. The robustness analysis (Section 4.4) shows stability across threshold choices, but does not test sensitivity to the number of states in the DT-MDP abstraction or alternative IRL algorithms.

“We set the default thresholds in Strategies I and II for suggesting and pruning actions to 95 and 85 percentiles among the candidates, respectively.”

paper · Section 4.4.2
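The percentile thresholds in this quote can be sketched as follows. How the paper applies them is an assumption here: this reading suggests candidate actions whose Q-value is at or above the 95th percentile, prunes candidates below the 85th percentile, and orders the survivors by Q-value as a stand-in for the prioritizing strategy. The function names and the toy Q-values are hypothetical.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def engineer_context(q_values, suggest_pct=95, prune_pct=85):
    """q_values: {candidate action: Q-value in the current abstract state}.

    Returns (suggested, prioritized): actions to mention in the prompt, and
    the candidates that survive pruning, best first.
    """
    suggest_cut = percentile(q_values.values(), suggest_pct)
    prune_cut = percentile(q_values.values(), prune_pct)
    suggested = [a for a, q in q_values.items() if q >= suggest_cut]
    kept = [a for a, q in q_values.items() if q >= prune_cut]
    prioritized = sorted(kept, key=lambda a: -q_values[a])
    return suggested, prioritized


# Hypothetical candidates a0..a20 with Q-values 0..20.
q_values = {f"a{i}": float(i) for i in range(21)}
suggested, prioritized = engineer_context(q_values)
```

Under this reading the base model is never modified: the learned policy only changes which candidates the agent sees and in what order.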

“We randomly sample {100, 200, 300, 400} successful trajectories for training and report the initial-value score... BC is highly sensitive to the number of trajectories, while RL, especially RL-IRL, remains more robust.”

paper · Section 4.4.1

Abstract

Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) A Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent's reasoning behavior as a finite MDP; (2) A robust contrastive inverse RL method, which, armed with the DT-MDP, efficiently estimates a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2) to improve the agent's decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents sharing similar characteristics in enterprise environments.
