Active Inference Agency Formalization, Metrics, and Convergence Assessments
This paper addresses mesa-optimization by defining agency as a balance between curiosity (KL divergence) and empowerment (mutual information), proposing an optimization-friendly agency function and a STARC-based metric to detect mesa-optimizers. The work claims that agency functions are convex, smooth, and exhibit logarithmic convergence, which it takes to imply a high probability of spontaneous emergence in modern models.
The paper presents an ambitious framework connecting active inference to agency and mesa-optimization. However, foundational mathematical claims are unproven or misapplied, and core theoretical results lack validation. The definition of agency as $A = \alpha L_{Curiosity} + \beta L_{Empowerment} + \gamma L_{Mesa}$ is conceptually interesting but the asserted properties (convexity, smoothness) do not follow from the cited literature. The notion of agency as a "Continuous Representation" remains vague, and the proposed detection framework for mesa-optimizers, while theoretically motivated, offers no empirical substantiation.
The framing of agency as a tension between curiosity (minimizing prediction error) and empowerment (maximizing control capacity) is well-grounded in active inference and information theory. The paper correctly identifies that $D_{KL}(p||q)$ promotes exploration and that empowerment $I(A^n; S_{t+n})$ incentivizes controllability. The connection to instrumental goals (self-preservation, resource acquisition) via this duality is empirically plausible and aligns with prior work in intrinsic motivation. The use of STARC metrics for reward comparison, when applied correctly, represents sound methodology drawn from established RL theory.
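For readers less familiar with the two quantities named above, the following is a minimal sketch of how $D_{KL}(p \| q)$ and the mutual information $I(A; S')$ are computed for small discrete distributions. The function names and the two toy channels are hypothetical and chosen only for illustration; nothing here is taken from the paper under review.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(p_a, channel):
    """I(A; S') in nats, where channel[a][s] = p(s' = s | a)."""
    n_states = len(channel[0])
    # Marginal over next states induced by the action distribution.
    p_s = [sum(p_a[a] * channel[a][s] for a in range(len(p_a)))
           for s in range(n_states)]
    total = 0.0
    for a, pa in enumerate(p_a):
        for s, psa in enumerate(channel[a]):
            if pa > 0 and psa > 0:
                total += pa * psa * math.log(psa / p_s[s])
    return total

# Hypothetical two-action, two-state channels.
deterministic = [[1.0, 0.0], [0.0, 1.0]]  # each action selects its own next state
useless = [[0.5, 0.5], [0.5, 0.5]]        # actions have no influence on the outcome
uniform = [0.5, 0.5]

print(mutual_information(uniform, deterministic))  # one bit of control, in nats
print(mutual_information(uniform, useless))        # no control
```

The first channel gives the agent full control over the next state and the second gives none, which is the sense in which mutual information between actions and future states measures controllability.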
The core mathematical claims rely on serious misapplications of results. The paper asserts that the agency function is convex and $C^\infty$ smooth, citing DeVore & Sharpley (1984) Theorem 7.1. However, this theorem concerns maximal functions in harmonic analysis, not the operation $\max_{p(a^n)}$ over finite sets of actions. The empowerment term takes a maximum of mutual information over action distributions, and a function defined through such a maximum is in general only piecewise smooth, not globally $C^\infty$. The paper then derives $\varepsilon = 10^{-360}$ by equating a lower bound on neural complexity with the number of parameters in modern networks, a dimensional inconsistency that produces nonsensical precision. The measure argument about agentic functions occupying vanishingly small volume is technically correct (zero measure in infinite-dimensional spaces), but the subsequent "$\varepsilon$-agentic" construction to recover non-zero probability lacks geometric justification for why the ideal agentic function should be special. Most critically, the paper claims STARC distance detects mesa-optimizers, but STARC measures behavioral equivalence of reward functions, not agency; no connection is established between STARC distance and the agency properties.
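The smoothness objection can be illustrated with a deliberately simple toy case: a pointwise maximum over two smooth functions already has a kink. The sketch below is an assumption-laden illustration, not a model of the paper's agency function; `candidates` and `one_sided_slopes` are hypothetical names.

```python
def one_sided_slopes(f, x, h=1e-6):
    """Finite-difference slopes of f just to the left and right of x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

# Two smooth (linear) candidate functions; their pointwise maximum is |x|.
candidates = [lambda x: x, lambda x: -x]

def pointwise_max(x):
    return max(g(x) for g in candidates)

left, right = one_sided_slopes(pointwise_max, 0.0)
print(left, right)  # the two slopes disagree at x = 0
```

The example speaks only to smoothness: $|x|$ is still convex, so it neither supports nor refutes the paper's convexity claim.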
The paper contains zero empirical experiments—no simulations, no ML training runs, no detection results for mesa-optimizers. This is a severe limitation for a paper claiming to solve detection of "undesirable inner optimization in complex AI systems." The connection to Yarotsky (2017) and Elbrächter et al. (2021) on neural complexity is misapplied: those works bound approximation rates for compositional functions, not convergence to agentic objectives. The claimed logarithmic convergence in sparse environments cites Agarwal et al. (2012), but that work requires explicit sparsity-inducing regularization ($L_1$), which the agency formulation does not include. The comparison to STARC (Skalse et al., 2024)—which I verified is a sound framework for reward comparison—fails to acknowledge that STARC requires environment dynamics for computing canonicalized rewards, rendering real-world application non-trivial. The paper does not cite or compare against empirical mesa-optimizer detection work (e.g., Hubinger et al.'s 2021 follow-up studies, or mechanistic interpretability approaches).
No code, data, or implementation details are provided. The agency function definition depends on computing empowerment $\mathfrak{E} = \max_{p(a^n)} I(A^n; S_{t+n})$, which requires access to environment transition dynamics $\tau(s,a)$ and optimizing over action sequences—computationally expensive and numerically unstable. The paper does not specify how $n$ (action horizon) is chosen, how the mutual information is estimated, or how $L_{Mesa}$ is computed for detection. The STARC-based metric requires canonicalized rewards $c(R)$ and value functions $V^\pi$, but the paper offers no algorithm for constructing these from observed behavior. The claim of $\varepsilon \approx 10^{-360}$ convergence precision is physically unachievable and suggests the author does not intend empirical verification. Reproduction would require: (1) environment specification, (2) exact hyperparameters for curiosity/empowerment weighting, (3) mesa-optimizer instantiation and ground-truth labeling, and (4) metric implementation with canonicalization oracle—none of which are provided.
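To make the computational requirements concrete, here is a minimal sketch of how $n$-step empowerment could be computed in a tiny tabular environment. It assumes a known transition table `tau[s][a]` (a distribution over next states) and uses the Blahut-Arimoto algorithm for channel capacity, which is a standard technique swapped in here for illustration; the paper does not say which estimator it intends. The `line_world` environment and all function names are hypothetical.

```python
import itertools
import math

def line_world(n_states=3):
    """Hypothetical deterministic corridor: action 0 moves left, 1 moves right."""
    tau = []
    for s in range(n_states):
        left = [0.0] * n_states
        left[max(s - 1, 0)] = 1.0
        right = [0.0] * n_states
        right[min(s + 1, n_states - 1)] = 1.0
        tau.append([left, right])
    return tau

def n_step_channel(tau, state, actions, n):
    """Rows are p(s_{t+n} | a^n) for every length-n action sequence."""
    n_states = len(tau)
    sequences = list(itertools.product(actions, repeat=n))  # |A|^n rows
    channel = []
    for seq in sequences:
        dist = [0.0] * n_states
        dist[state] = 1.0
        for a in seq:
            nxt = [0.0] * n_states
            for s, ps in enumerate(dist):
                if ps > 0:
                    for s2, p in enumerate(tau[s][a]):
                        nxt[s2] += ps * p
            dist = nxt
        channel.append(dist)
    return sequences, channel

def blahut_arimoto(channel, tol=1e-9, max_iter=10_000):
    """Approximate max over p(a^n) of I(A^n; S_{t+n}) in nats."""
    k = len(channel)
    p = [1.0 / k] * k  # start from the uniform distribution over sequences
    lower = 0.0
    for _ in range(max_iter):
        q = [sum(p[i] * channel[i][s] for i in range(k))
             for s in range(len(channel[0]))]
        c = [math.exp(sum(w * math.log(w / q[s])
                          for s, w in enumerate(row) if w > 0))
             for row in channel]
        z = sum(p[i] * c[i] for i in range(k))
        lower, upper = math.log(z), math.log(max(c))  # capacity bounds
        p = [p[i] * c[i] / z for i in range(k)]
        if upper - lower < tol:
            break
    return lower, p

def empowerment(tau, state, actions, n):
    _, channel = n_step_channel(tau, state, actions, n)
    value, _ = blahut_arimoto(channel)
    return value

tau = line_world()
print(empowerment(tau, state=1, actions=[0, 1], n=1))
print(empowerment(tau, state=0, actions=[0, 1], n=2))
```

Even in this toy setting the channel has $|A|^n$ rows, so the cost grows exponentially with the horizon $n$, and the whole computation presupposes access to `tau`. Both points bear directly on the unspecified choices listed above.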
This paper addresses the critical challenge of mesa-optimization in AI safety by providing a formal definition of agency and a framework for its analysis. Agency is conceptualized as a Continuous Representation of accumulated experience that achieves autopoiesis through a dynamic balance between curiosity (minimizing prediction error to ensure non-computability and novelty) and empowerment (maximizing the control channel's information capacity to ensure subjectivity and goal-directedness). Empirical evidence suggests that this active inference-based model successfully accounts for classical instrumental goals, such as self-preservation and resource acquisition. The analysis demonstrates that the proposed agency function is smooth and convex, possessing favorable properties for optimization. While agentic functions occupy a vanishingly small fraction of the total abstract function space, they exhibit logarithmic convergence in sparse environments. This suggests a high probability for the spontaneous emergence of agency during the training of modern, large-scale models. To quantify the degree of agency, the paper introduces a metric based on the distance between the behavioral equivalents of a given system and an "ideal" agentic function within the space of canonicalized rewards (STARC). This formalization provides a concrete apparatus for classifying and detecting mesa-optimizers by measuring their proximity to an ideal agentic objective, offering a robust tool for analyzing and identifying undesirable inner optimization in complex AI systems.