The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes
This paper addresses the challenge of "intelligent disobedience" in shared autonomy: an assistive AI must sometimes override human commands to prevent harm while still remaining helpful. The authors formalize this as the Intelligent Disobedience Game (IDG), a sequential Stackelberg game in which a human leader proposes actions and an assistive follower with superior environmental awareness decides whether to obey or intervene. The framework aims to provide the mathematical foundations for training safety-critical assistive systems.
The paper presents a reasonable theoretical framework that maps intelligent disobedience to extensive-form Stackelberg games with asymmetric information. The IDG captures the essential tension between obedience and safety, and the MDP translation (Section 4) offers a practical path for RL implementation. However, the analysis remains largely conceptual: it lacks a rigorous theorem-proof structure despite invoking "backward induction," and empirical validation is entirely absent. The claim that the IDG "enables both the algorithmic development of agents" is overstated: the paper proposes a model but implements no algorithms.
The formalization correctly identifies that the follower (assistant) possesses the ultimate control over action execution, creating an interesting inversion of typical Stackelberg dynamics despite the leader moving first. The identification of "safety traps" — subsets of states where the follower can indefinitely prevent harm while blocking goal achievement — is a genuine theoretical contribution that exposes strategic tensions in multi-step settings (Section 3). The decoupled MDP representation for leader and follower, with the leader modeled as a POMDP appropriately reflecting their limited observation of action categories, is technically sound.
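The safety-trap phenomenon can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: states are integers on a line, the leader always proposes moving one step toward the goal and cannot see the harmful state, and the follower disobeys by holding position whenever the proposed move would enter the harmful state. The function name `run_episode` and the dynamics are hypothetical.

```python
from typing import Optional, Tuple


def run_episode(goal: int, harm: Optional[int], horizon: int, start: int = 0) -> Tuple[str, int]:
    """Simulate a toy leader/follower interaction on a line of integer states.

    Hypothetical dynamics (not from the paper): the executed action is the
    next state; the follower can replace the leader's proposal with "stay".
    """
    state = start
    for step in range(1, horizon + 1):
        proposed = state + 1  # leader always heads for the goal, blind to harm
        if proposed == harm:
            executed = state  # follower disobeys: hold position to avoid harm
        else:
            executed = proposed  # follower obeys
        state = executed
        if state == goal:
            return ("goal", step)
    # Harm was never incurred, but the goal was never reached either.
    return ("trapped", horizon)


if __name__ == "__main__":
    print(run_episode(goal=3, harm=2, horizon=10))
    print(run_episode(goal=3, harm=None, horizon=10))
```

In the first call the harmful state sits between the start and the goal, so the follower's repeated disobedience keeps the system safe and stuck for the whole horizon. In the second call there is no harmful state and the goal is reached in three steps.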
The mathematical presentation is sloppy: multiple definitions are all numbered "Definition 0" (Stackelberg Game, IDG, Safety trap), suggesting LaTeX counter issues or an incomplete draft. Crucially, the paper assumes the follower can perfectly classify actions into $A_g$, $A_h$, and $A_o$ (goal, harmful, other), sidestepping the core epistemic challenge of real-world intelligent disobedience — learning what constitutes harm. The follower reward function $\mathcal{R}_F$ grants $+1$ for disobeying harmful actions, which requires perfect knowledge of counterfactual outcomes (what would happen if the leader's action were obeyed). This makes the model more of an oracle analysis than a guide for learning agents that must infer harm from uncertain observations.
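To make the oracle concern concrete, the following sketch shows how the follower's decision changes once harm is only believed with some probability rather than known. Only the $+1$ for disobeying a harmful action comes from the paper's $\mathcal{R}_F$ as described above. Every other payoff, and the function names, are hypothetical values chosen for illustration.

```python
def expected_disobey_value(
    p_harm: float,
    r_block_harm: float = 1.0,   # from the paper's R_F: +1 for disobeying a harmful action
    r_block_safe: float = -1.0,  # ASSUMPTION: penalty for blocking a non-harmful action
) -> float:
    """Expected follower reward for disobeying under belief p_harm."""
    return p_harm * r_block_harm + (1.0 - p_harm) * r_block_safe


def expected_obey_value(
    p_harm: float,
    r_obey_harm: float = -1.0,  # ASSUMPTION: penalty for obeying a harmful action
    r_obey_safe: float = 0.0,   # ASSUMPTION: neutral reward for obeying a safe action
) -> float:
    """Expected follower reward for obeying under belief p_harm."""
    return p_harm * r_obey_harm + (1.0 - p_harm) * r_obey_safe


def should_disobey(p_harm: float) -> bool:
    """Disobey only when its expected reward strictly exceeds that of obeying."""
    return expected_disobey_value(p_harm) > expected_obey_value(p_harm)


if __name__ == "__main__":
    for p in (0.0, 0.2, 0.5, 1.0):
        print(p, should_disobey(p))
```

With these assumed payoffs the follower disobeys only when its belief in harm exceeds one third. The oracle setting in the paper corresponds to the endpoints `p_harm = 0.0` and `p_harm = 1.0`, where the decision is trivial. The difficult cases in practice lie in between.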
The comparison to related work is adequate but thin. The off-switch game (Hadfield-Menell et al.) is cited, but the paper offers no formal comparison of equilibrium properties or incentive structures. The paper claims the strategies are "optimal" on the basis of loose inductive arguments without stating formal theorems, which makes verification impossible. While the characterization of follower strategies in the 1-step game is correct (disobey harmful, obey goal-reaching), the multi-step analysis relies on hand-waving about "pressuring" the follower rather than a formal subgame-perfect equilibrium analysis. No experiments are conducted to demonstrate that the MDP/POMDP formulation actually produces the claimed equilibrium behaviors when trained with RL.
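The one-step characterization above (disobey harmful, obey goal-reaching) can be checked mechanically as a best response over a payoff table. The sketch below is illustrative only: the category names mirror $A_g$, $A_h$, and $A_o$, the $+1$ for disobeying a harmful action follows the paper's $\mathcal{R}_F$ as described earlier, and all remaining payoff entries are assumptions, not values from the paper.

```python
from enum import Enum


class Category(Enum):
    GOAL = "goal"        # A_g: goal-reaching actions
    HARMFUL = "harmful"  # A_h: harmful actions
    OTHER = "other"      # A_o: all remaining actions


# Follower payoff for each (action category, response) pair.
# Only the +1 for disobeying a harmful action is taken from the paper's R_F;
# every other entry is a HYPOTHETICAL value chosen for illustration.
PAYOFF = {
    (Category.HARMFUL, "disobey"): 1.0,
    (Category.HARMFUL, "obey"): -1.0,
    (Category.GOAL, "obey"): 1.0,
    (Category.GOAL, "disobey"): -1.0,
    (Category.OTHER, "obey"): 0.0,
    (Category.OTHER, "disobey"): -1.0,
}


def best_response(category: Category) -> str:
    """Return the follower response with the highest payoff for this category."""
    return max(("obey", "disobey"), key=lambda response: PAYOFF[(category, response)])


if __name__ == "__main__":
    for category in Category:
        print(category.value, "->", best_response(category))
```

Under this assumed table the best response reproduces the 1-step strategy. A formal treatment of the multi-step game would need the analogous computation at every subgame, which is the step the paper leaves informal.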
Reproducibility is severely limited. No code repository, environment specifications, or RL training details are provided. The paper is purely theoretical, yet even theoretical reproduction would require formal theorem statements and proofs which are absent (the "induction" in Section 3 has no base case lemmas or theorem statements). The claim that "Equilibria of the IDG can be empirically validated" because optimal policies exist in MDPs is circular — existence does not constitute validation. Without an appendix containing proofs or experimental details, independent verification of the optimal strategy characterizations is not possible.
In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.