The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

cs.AI cs.GT cs.LG Benedikt Hornig, Reuth Mirsky · Mar 22, 2026

Plain-language introduction

This paper addresses the challenge of "intelligent disobedience" in shared autonomy: cases where an assistive AI must override a human command to prevent harm while remaining helpful. The authors formalize this as the Intelligent Disobedience Game (IDG), a sequential Stackelberg game in which a human leader proposes actions and an assistive follower with superior environmental awareness decides whether to obey or intervene. The framework aims to provide mathematical foundations for training safety-critical assistive systems.

Critical review

Verdict

Bottom line

The paper presents a reasonable theoretical framework that maps intelligent disobedience to extensive-form Stackelberg games with asymmetric information. The IDG captures the essential tension between obedience and safety, and the MDP translation (Section 4) offers a practical path to RL implementation. However, the analysis remains largely conceptual: it invokes "backward induction" without a rigorous theorem-proof structure, and empirical validation is entirely absent. The abstract's claim that the IDG enables "the algorithmic development of agents that can learn safe non-compliance" is overstated, since the paper proposes a model but implements no algorithms.

“we translate the aforementioned IDG into a Shared control Multi-Agent MDP for each player... common reinforcement learning algorithms can be applied to IDGs”

paper · Section 4

“The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance”

paper · Abstract

What holds up

The formalization correctly identifies that the follower (assistant) holds ultimate control over action execution, creating an interesting inversion of typical Stackelberg dynamics even though the leader moves first. The identification of "safety traps" (subsets of states where the follower can indefinitely prevent harm while blocking goal achievement) is a genuine theoretical contribution that exposes strategic tensions in multi-step settings (Section 3). The decoupled MDP representation for leader and follower, with the leader modeled as a POMDP to reflect their limited observation of action categories, is technically sound.

“safety traps... the follower prefers entering a safety trap, as it yields an infinite payoff stream while preventing harm to the leader. In contrast, the leader prefers reaching the goal”

paper · Section 3.2

“the follower, albeit their name, is the one with the ultimate decision of executing an action or not”

paper · Section 2

Main concerns

The mathematical presentation is sloppy: several definitions (Stackelberg Game, IDG, Safety trap) are all numbered "Definition 0", which suggests a LaTeX counter problem or an incomplete draft. More importantly, the paper assumes the follower can perfectly classify actions into $A_g$, $A_h$, and $A_o$ (goal, harmful, other), sidestepping the core epistemic challenge of real-world intelligent disobedience: learning what constitutes harm. The follower reward function $\mathcal{R}_F$ grants $+1$ for disobeying harmful actions, which requires perfect knowledge of counterfactual outcomes (what would happen if the leader's action were obeyed). This makes the model closer to an oracle analysis than a guide for learning agents that must infer harm from uncertain observations.

“The follower can distinguish between all three subsets [Ag(s), Ah(s), Ao(s)]”

paper · Section 2

“$\mathcal{R}_{V}(s,a_{L},a_{F},s')=\begin{cases}1,&\text{if }a_{F}=disobey\text{ and }\mathcal{T}(s,a_{L},obey)\in A_{h}\end{cases}$”

paper · Section 4
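
To make the classification concern concrete, here is a minimal sketch; it is not the paper's model. It applies the one-step follower rule described in this review (disobey harmful proposals) with two classifiers: an oracle that knows the true category, and an imperfect one that misses a single hazard. The action names, the `TRUE_CATEGORY` table, and the choice to obey "other" actions are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List

# A classifier maps a proposed action to "goal", "harmful", or "other".
Classifier = Callable[[str], str]

# Hypothetical toy action set with ground-truth categories (illustrative only).
TRUE_CATEGORY: Dict[str, str] = {
    "cross_at_green": "goal",
    "cross_at_red": "harmful",
    "wait": "other",
    "step_into_hole": "harmful",
}


def follower_policy(proposed: str, classify: Classifier) -> str:
    """Disobey proposals classified as harmful; obey everything else.

    Obeying "other" actions is an assumption made for this sketch.
    """
    return "disobey" if classify(proposed) == "harmful" else "obey"


def harmful_actions_executed(classify: Classifier) -> List[str]:
    """List the truly harmful proposals that the follower would still obey."""
    return [
        action
        for action, category in TRUE_CATEGORY.items()
        if category == "harmful" and follower_policy(action, classify) == "obey"
    ]


def oracle(action: str) -> str:
    """Perfect classification, as the paper's follower is assumed to have."""
    return TRUE_CATEGORY[action]


def imperfect(action: str) -> str:
    """A classifier that fails to recognize one hazard (hypothetical)."""
    if action == "step_into_hole":
        return "other"
    return TRUE_CATEGORY[action]


if __name__ == "__main__":
    print("oracle:", harmful_actions_executed(oracle))
    print("imperfect:", harmful_actions_executed(imperfect))
```

With the oracle, no harmful proposal is ever executed. With the imperfect classifier, the missed hazard is obeyed, and this is the case the oracle assumption leaves unmodeled.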

Evidence and comparison

The comparison to related work is adequate but thin. The off-switch game (Hadfield-Menell et al.) is cited, but the paper offers no formal comparison regarding equilibrium properties or incentive structures. The paper claims the strategies are "optimal" using loose inductive arguments without stating formal theorems, making verification impossible. While the characterization of follower strategies in the 1-step game is correct (disobey harmful, obey goal-reaching), the multi-step analysis relies on hand-waving about "pressuring" the follower rather than formal subgame perfect equilibrium analysis. No experiments are conducted to demonstrate that the MDP/POMDP formulation actually produces the claimed equilibrium behaviors when trained with RL.

“By backward induction, it follows that if a goal state is reachable from the initial state, optimal play leads to goal attainment”

paper · Section 3.2

“the leader can effectively pressure the follower into obeying, as the follower seeks to avoid infinite negative payoffs by disobeying”

paper · Section 3.2
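
For comparison, a formal subgame-perfect analysis of even a toy one-step game is short to state. The sketch below solves a hypothetical complete-information version by backward induction: the follower's best response is computed for each leader proposal, and the leader then picks the proposal with the highest payoff given those responses. The payoff numbers are invented for illustration and are not taken from the paper, and the leader's limited observation of action categories is deliberately ignored.

```python
from typing import Dict, Tuple

# (leader_action, follower_response) -> (leader_payoff, follower_payoff)
Payoffs = Dict[Tuple[str, str], Tuple[float, float]]

# Hypothetical payoffs for a one-step game (illustrative, not from the paper).
PAYOFFS: Payoffs = {
    ("goal", "obey"): (1.0, 1.0),
    ("goal", "disobey"): (0.0, 0.0),
    ("harmful", "obey"): (-1.0, -1.0),
    ("harmful", "disobey"): (0.0, 1.0),
}

RESPONSES = ("obey", "disobey")


def follower_best_response(payoffs: Payoffs, leader_action: str) -> str:
    """Last mover: pick the response that maximizes the follower's payoff."""
    return max(RESPONSES, key=lambda r: payoffs[(leader_action, r)][1])


def solve_by_backward_induction(payoffs: Payoffs) -> Tuple[str, Dict[str, str]]:
    """Return the leader's proposal and the follower's full response strategy."""
    leader_actions = sorted({action for action, _ in payoffs})
    # Step 1: solve every follower subgame.
    strategy = {a: follower_best_response(payoffs, a) for a in leader_actions}
    # Step 2: the leader anticipates those responses and maximizes its payoff.
    proposal = max(leader_actions, key=lambda a: payoffs[(a, strategy[a])][0])
    return proposal, strategy


if __name__ == "__main__":
    print(solve_by_backward_induction(PAYOFFS))
```

Under these assumed payoffs the follower obeys the goal action and disobeys the harmful one, so the leader proposes the goal action. A multi-step version would need the same two steps applied at every decision node, and the paper does not state that argument formally.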

Reproducibility

Reproducibility is severely limited. No code repository, environment specifications, or RL training details are provided. The paper is purely theoretical, yet even a theoretical reproduction would require formal theorem statements and proofs, which are absent (the "induction" in Section 3 has no base-case lemmas or theorem statements). The claim that "Equilibria of the IDG can be empirically validated" because optimal policies exist in MDPs does not follow: existence does not constitute validation. Without an appendix containing proofs or experimental details, independent verification of the optimal strategy characterizations is not possible.

“Since there always exists an optimal policy in MDPs [albrecht2024multi], the Equilibria of the IDG can be empirically validated”

paper · Section 4

Abstract

In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.
