TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

cs.LG cs.AI Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026

AI Review

Plain-language introduction

Multi-Objective Reinforcement Learning (MORL) agents must balance competing objectives like speed versus energy consumption, yet existing Explainable RL methods fail to clarify how specific behavioral choices drive Pareto trade-offs. This paper proposes TREX, a post-hoc trajectory attribution framework that clusters agent behaviors into semantically meaningful segments and quantifies each cluster's influence on objective trade-offs by training complementary policies that exclude specific trajectory groups. The work addresses a genuine gap in explainability by moving beyond policy selection to reveal which behavioral patterns (such as "long leaps" versus "short strides") justify the agent's learned trade-off logic.

Critical review

Verdict

Bottom line

TREX presents a conceptually sound and novel contribution to Explainable Multi-Objective RL, offering the first method to quantitatively attribute Pareto trade-offs to discoverable behavioral clusters via trajectory ablation. However, the framework's practical utility is hampered by high computational costs and an unresolved fidelity gap between the expert policy and its surrogate "original policy," with Tables 1-3 showing non-negligible return disparities (e.g., in MO-HalfCheetah preference (0.25,0.75): expert $R^1=1033.4$ vs original $R^1=917.4$) that threaten the validity of attribution scores. The empirical evaluation is also limited to bi-objective MuJoCo tasks with only three preference vectors per environment, leaving scalability to higher-dimensional objective spaces unverified.

“The intuition behind this approach is that if we train two different policies - one trained by removing a set of trajectories from the training data (i.e: complementary policy) and another with the complete training data (i.e: original policy) - the shift in the distribution relative to the original policy is an indication of the importance of the missing trajectories.”

Rajapakse et al., Sec. 3.1

“The disparity observed between the expert policy returns and the original policy returns in some instances suggests that the mimicking process (training $\pi_k$ on expert trajectories) may introduce minor behavioural shifts.”

Rajapakse et al., Sec. 6.1
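To put the quoted disparity in proportion, the following minimal sketch computes the relative gap between the expert and original returns for the single example cited in the verdict. The helper name `relative_gap` is illustrative and does not come from the paper; only the two return values are taken from the review text.

```python
def relative_gap(expert_return: float, original_return: float) -> float:
    """Fraction of the expert's return that the surrogate policy fails to reach."""
    if expert_return == 0:
        raise ValueError("expert_return must be non-zero")
    return (expert_return - original_return) / abs(expert_return)


# Values quoted above: MO-HalfCheetah, preference (0.25, 0.75), first objective.
gap = relative_gap(1033.4, 917.4)
print(f"{gap:.1%}")  # roughly 11.2%
```

A gap of this size on one objective is large relative to the kind of per-cluster deviation an ablation is expected to produce, which is why the review treats surrogate fidelity as a validity concern rather than a minor caveat.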

What holds up

The methodological adaptation of trajectory attribution to multi-objective settings is technically coherent, particularly the Reward Attribution Score $RAS(c) = |w_1\Delta R^1 - w_2\Delta R^2|$ which specifically captures trade-off shifts rather than mere performance degradation. The qualitative validation effectively corroborates quantitative findings; visual inspection of cluster behaviors in MO-HalfCheetah confirms that Cluster 0 exhibits "smaller, slower hops" while Cluster 2 shows "expansive, high velocity strides," semantically aligning with the observed attribution scores that identify Cluster 0 as driving energy conservation over speed.

“We then calculate the total returns deviation $\Delta R$ in Eq.[2], which is the magnitude of the combined deviation of all the objective returns - calculated as a L2 norm distance.”

Rajapakse et al., Sec. 3.2

“Visual inspection support this hypothesis, showing that in Cluster 0, the robot takes smaller, slower hops compared to the more expansive, high velocity strides of Clusters 1 and 2.”

Rajapakse et al., Sec. 5.2
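The distinction between the two scores discussed in this section can be illustrated with a small sketch. It assumes that each $\Delta R^i$ is the plain difference between the complementary and original policy returns on objective $i$; the paper may normalise these deviations differently. The function names and all numeric inputs are hypothetical and chosen only to show how the scores behave.

```python
import math
from typing import List, Sequence


def return_deltas(original: Sequence[float], complementary: Sequence[float]) -> List[float]:
    """Per-objective deviation (assumed here to be complementary minus original)."""
    return [c - o for o, c in zip(original, complementary)]


def total_deviation(deltas: Sequence[float]) -> float:
    """L2 norm of the per-objective deviations."""
    return math.sqrt(sum(d * d for d in deltas))


def reward_attribution_score(weights: Sequence[float], deltas: Sequence[float]) -> float:
    """Bi-objective RAS(c) = |w1 * dR1 - w2 * dR2|, as stated in the review."""
    if len(weights) != 2 or len(deltas) != 2:
        raise ValueError("this sketch covers the bi-objective case only")
    (w1, w2), (d1, d2) = weights, deltas
    return abs(w1 * d1 - w2 * d2)


weights = (0.5, 0.5)  # hypothetical preference vector

# Made-up case 1: objectives move in opposite directions (a trade-off shift).
tradeoff = return_deltas(original=(100.0, -50.0), complementary=(90.0, -40.0))
# Made-up case 2: both objectives degrade by the same amount.
uniform = return_deltas(original=(100.0, -50.0), complementary=(90.0, -60.0))

print(total_deviation(tradeoff), reward_attribution_score(weights, tradeoff))
print(total_deviation(uniform), reward_attribution_score(weights, uniform))
```

Both made-up cases have the same L2 deviation, but the score is 10.0 for the opposite-direction case and 0.0 for the uniform degradation, which matches the review's point that the score targets trade-off shifts rather than overall performance loss.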

Main concerns

The framework requires training $n+1$ policies per preference setting (one original plus $n$ complementary policies), so the computational overhead scales linearly with cluster count and becomes a practical barrier for complex domains or online explanation generation. More critically, the entire attribution analysis depends on the premise that the trained "original policy" $\pi_k$ faithfully mimics the expert $\pi_E$, yet the results show consistent performance gaps between expert and original policies across all environments, raising serious questions about whether the reported $RAS(c)$ scores measure true expert behavior or artifacts of imperfect policy distillation. Additionally, the restriction to silhouette-based K-means clustering (Section 6.1) introduces instability, with the appendix revealing clusters that contain multiple distinct blobs and therefore likely conflate semantically different behaviors.

“A practical limitation of the TREX framework is the computational cost associated with the training of the original and complementary policies in the attribution analysis.”

Rajapakse et al., Sec. 6.1

“However some of the clusters have considerably large boundaries and there could be multiple less frequent behaviours within the same cluster.”

Rajapakse et al., Sec. 6.1 (Limitations)
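The $n+1$ training cost described in this section follows directly from the leave-one-cluster-out structure, which can be sketched as below. This is not the authors' implementation: `leave_one_cluster_out` and `toy_train` are hypothetical names, and the toy "training" step is a stand-in (a mean over scalar placeholders) used only to make the number of training runs visible.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # a trajectory, in whatever representation the trainer expects
P = TypeVar("P")  # a trained policy


def leave_one_cluster_out(
    trajectories: Sequence[T],
    cluster_ids: Sequence[int],
    train: Callable[[List[T]], P],
) -> Tuple[P, Dict[int, P]]:
    """Train one policy on all data, plus one policy per excluded cluster."""
    if len(trajectories) != len(cluster_ids):
        raise ValueError("each trajectory needs exactly one cluster id")
    original = train(list(trajectories))
    complementary: Dict[int, P] = {}
    for cluster in sorted(set(cluster_ids)):
        kept = [t for t, cid in zip(trajectories, cluster_ids) if cid != cluster]
        complementary[cluster] = train(kept)
    return original, complementary


# Toy stand-in for policy training: records each call, returns the data mean.
call_log: List[int] = []


def toy_train(data: List[float]) -> float:
    call_log.append(len(data))
    return sum(data) / len(data)


original, complementary = leave_one_cluster_out(
    trajectories=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # placeholder "trajectories"
    cluster_ids=[0, 0, 1, 1, 2, 2],
    train=toy_train,
)
print(len(call_log))  # 4 training runs for 3 clusters
```

With real policy training in place of `toy_train`, every additional cluster adds one full training run per preference vector, which is the linear scaling the review flags.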

Evidence and comparison

While the paper correctly differentiates TREX from existing XMORL approaches that focus on policy selection (e.g., Tamura et al.'s mismatch metrics or Osika et al.'s summaries), it provides neither direct comparative baselines nor user studies validating that TREX explanations improve human understanding relative to simpler methods. The experimental evidence supports the claim that specific behavioral clusters influence objective trade-offs: in MO-HalfCheetah, for example, removing Cluster 0 produces contrary deviations, improving energy while reducing speed. That evidence, however, is limited to bi-objective settings with only three preference vectors, so the paper's implicit claim of generalizability to higher-dimensional objective spaces remains unsupported.

“Existing XMORL approaches are scarce, and focus on providing support for policy selection by summarizing objective returns or highlighting salient states, but they offer limited insight into the underlying behaviours that drive objective trade-offs within a policy.”

Rajapakse et al., Sec. 2.5

“Experiments on multi-objective MuJoCo environments - HalfCheetah, Ant and Swimmer, demonstrate the framework's ability to isolate and quantify the specific behavioural patterns.”

Rajapakse et al., Abstract

Reproducibility

The authors provide a public GitHub repository and utilize standard MO-Gymnasium environments with the established D4MORL dataset, which facilitates independent reproduction. However, reproducibility is hindered by insufficient hyperparameter documentation for the complementary policy training (beyond referencing the PEDA framework) and the inherent stochasticity of the K-means clustering process with random initialization. The paper acknowledges that silhouette-based cluster selection can miss behavioral nuances, and the visualization in Figure 7(a) shows Cluster 2 containing multiple disconnected blobs, suggesting that reproductions may yield different cluster assignments and consequently different attribution scores unless random seeds are strictly controlled.

“The implementation of our TREX framework is provided in the github repository https://github.com/dilina-r/trex_xmorl.”

Rajapakse et al., Sec. 4 (Experimental Setup)

“For example, replacing the clustering algorithm with another or using known initial cluster centres (eg: using HIGHLIGHTS [3]) rather than random initializations could be explored.”

Rajapakse et al., Sec. 6.1
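The seed-control point raised in this section can be made concrete with a short sketch of silhouette-based selection of the cluster count under a fixed random seed. It uses scikit-learn's `KMeans` and `silhouette_score` on synthetic data from `make_blobs`. The helper `select_clusters`, the candidate range of cluster counts, and the synthetic embeddings are assumptions for illustration and are not taken from the TREX repository.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def select_clusters(embeddings, k_values, seed):
    """Return (k, silhouette, labels) for the k with the highest silhouette score."""
    best = None
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Synthetic stand-in for trajectory-segment embeddings (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=8, random_state=0)

k_a, score_a, labels_a = select_clusters(X, range(2, 7), seed=42)
k_b, score_b, labels_b = select_clusters(X, range(2, 7), seed=42)

# Same seed, same data: the selected k and the assignments should match.
print(k_a == k_b, np.array_equal(labels_a, labels_b))
```

Fixing the seed makes a single run repeatable but says nothing about robustness. Rerunning the selection with several seeds and comparing the resulting assignments would show how sensitive the downstream attribution scores are to initialization.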

Abstract

Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains, by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them. However, the "black box" nature of the RL models makes the decision process behind chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences. To address this gap, in this paper we propose TREX, a Trajectory based Explainability framework to explain Multi-objective Reinforcement Learning policies, based on trajectory attribution. TREX generates trajectories directly from the learned expert policy, across different user preferences and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters, measuring the resulting relative deviation on the observed rewards and actions compared to the original expert policy. Experiments on multi-objective MuJoCo environments - HalfCheetah, Ant and Swimmer, demonstrate the framework's ability to isolate and quantify the specific behavioural patterns.
