When Convenience Becomes Risk: A Semantic View of Under-Specification in Host-Acting Agents
Host-acting agents let users state goals while the system figures out how to achieve them. This paper argues that this convenience creates a novel attack surface: semantic under-specification. When users specify outcomes but not safety boundaries, agents must fill in the missing semantics, and they may choose security-divergent plans even when no attacker is present and the goal is benign.
The paper makes a valid and under-explored conceptual contribution by reframing agent security around non-adversarial semantic completion rather than adversarial hijacking. The threat model is well-scoped: risk arises when users state goals more precisely than safety boundaries, and agents optimize for task completion without explicit authorization for privilege escalation, persistence, or exposure. The OpenClaw case study, while limited in scale, effectively demonstrates how routine requests like "make this app accessible" can yield security-divergent plans.
The taxonomy in Table I is useful and well organized. It categorizes six risky completion patterns: privilege expansion, sensitive-resource overreach, persistent modification, exposure enlargement, unsafe dependency introduction, and destructive repair. The distinction between semantic under-specification (endogenous, no attacker needed) and prompt injection (exogenous adversarial manipulation) is sharp, and it correctly sets the paper apart from InjecAgent and AgentDojo, which focus on indirect injection attacks. The four defense principles (separating goal from boundary specification, elevating risky steps, ensuring plan auditability, and constraining execution domains) are sound and incrementally deployable.
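To make the first three defense principles concrete, the following is a minimal sketch of a plan gate, not the authors' implementation. It assumes that some upstream component has already labeled each plan step with the taxonomy patterns it triggers; the names `PlanStep`, `Boundary`, and `review_plan`, as well as the example steps, are hypothetical and do not come from the paper or its traces. The fourth principle, constraining execution domains, would be enforced by the sandbox and is outside this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class RiskPattern(Enum):
    """The six risky completion patterns listed for Table I."""
    PRIVILEGE_EXPANSION = "privilege expansion"
    SENSITIVE_RESOURCE_OVERREACH = "sensitive-resource overreach"
    PERSISTENT_MODIFICATION = "persistent modification"
    EXPOSURE_ENLARGEMENT = "exposure enlargement"
    UNSAFE_DEPENDENCY_INTRODUCTION = "unsafe dependency introduction"
    DESTRUCTIVE_REPAIR = "destructive repair"


@dataclass(frozen=True)
class PlanStep:
    """One step of an agent plan (hypothetical schema)."""
    description: str
    # Assumption: labels are assigned by an upstream classifier or reviewer.
    patterns: frozenset = frozenset()


@dataclass(frozen=True)
class Boundary:
    """Safety boundary stated separately from the goal (hypothetical schema)."""
    authorized: frozenset = frozenset()


def review_plan(goal, steps, boundary):
    """Split a plan into auto-approved and elevated steps, with an audit log."""
    approved, elevated, audit = [], [], []
    for step in steps:
        # A step is elevated if it triggers any pattern the user did not authorize.
        unauthorized = step.patterns - boundary.authorized
        if unauthorized:
            elevated.append(step)
            decision = "elevate"
        else:
            approved.append(step)
            decision = "allow"
        # Every decision is recorded so the plan can be audited later.
        audit.append({
            "goal": goal,
            "step": step.description,
            "decision": decision,
            "unauthorized": sorted(p.value for p in unauthorized),
        })
    return approved, elevated, audit


# Illustrative plan for the goal quoted in the review; the steps are invented.
steps = [
    PlanStep("create a project-local virtual environment"),
    PlanStep("bind the dev server to 0.0.0.0",
             frozenset({RiskPattern.EXPOSURE_ENLARGEMENT})),
    PlanStep("install a service unit so the app restarts on boot",
             frozenset({RiskPattern.PERSISTENT_MODIFICATION})),
]
boundary = Boundary(authorized=frozenset({RiskPattern.PERSISTENT_MODIFICATION}))
approved, elevated, audit = review_plan("make this app accessible", steps, boundary)

for entry in audit:
    print(entry["decision"], "-", entry["step"], entry["unauthorized"])
```

In this sketch the goal string never grants authority by itself: only the separately stated `Boundary` does, so a step that enlarges exposure is held for confirmation even though it plausibly serves the goal.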
The empirical evidence is weaker than the conceptual framework. The study relies on qualitative trace analysis rather than systematic measurement—there are no success-rate statistics, no comparison across multiple models, and no controlled ablation of boundary specifications. The claim that "scoped fixture-based traces... tended to prefer project-local virtual environment" (Section V-E) is suggestive but not quantified. The taxonomy, while useful, risks being overfit to observed OpenClaw behaviors; the authors acknowledge it is "an organizing framework rather than a complete ontology." The projection from these traces to general agent security is plausible but not rigorously validated.
The comparison to related work is generally fair. The positioning against InjecAgent, AgentDojo, and CaMeLs is accurate: the fetched papers confirm that these works focus on adversarial prompt injection and architectural isolation under hostile observations, whereas this paper targets non-adversarial semantic completion. The claim that "agent security must be analyzed not only at the level of executed actions, but also at the level of semantic completion" is well-supported by the trace data. However, the paper under-cites recent work on CUAHarm and OSWorld that also examines agent safety profiles, though Qian et al. and Xie et al. are referenced. The OpenClaw case is described as "representative," but no systematic comparison across other HAAs (e.g., OpenAI's CUA, Anthropic's computer-use) is provided to establish generalizability.
Reproducibility is limited. No code, data, or raw execution traces are publicly released. The deployment setup uses OpenClaw in a Debian/Bookworm container with "writable host-coupled mounts," but configuration details, exact prompt templates, and decision criteria for labeling plans as "riskier" versus "conservative" are not specified. The paper states experiments used "live OpenClaw deployment" but does not clarify whether this was a purpose-built test instance or a shared service. Without access to the trace corpus or the fixture-based test cases described in Section V-A, independent researchers cannot verify the qualitative findings or extend the analysis.
Host-acting agents promise a convenient interaction model in which users specify goals and the system determines how to realize them. We argue that this convenience introduces a distinct security problem: semantic under-specification in goal specification. User instructions are typically goal-oriented, yet they often leave process constraints, safety boundaries, persistence, and exposure insufficiently specified. As a result, the agent must complete missing execution semantics before acting, and this completion can produce risky host-side plans even when the user-stated goal is benign. In this paper, we develop a semantic threat model, present a taxonomy of semantic-induced risky completion patterns, and study the phenomenon through an OpenClaw-centered case study and execution-trace analysis. We further derive defense design principles for making execution boundaries explicit and constraining risky completion. These findings suggest that securing host-acting agents requires governing not only which actions are allowed at execution time, but also how goal-only instructions are translated into executable plans.