ARYA: A Physics-Constrained Composable & Deterministic World Model Architecture
ARYA presents a world model architecture using "nano models"—small specialized components orchestrated by an autonomous agent (AARA)—rather than monolithic neural networks. The system claims physics-constrained determinism, sub-20-second training cycles, and an "unfireable" safety kernel that cannot be bypassed. The authors position this as production-deployed across seven industry domains from aerospace to pharma, achieving state-of-the-art results on six of nine benchmarks with "zero neural network parameters."
This paper makes extraordinary claims about proprietary technology with insufficient verifiable evidence. While the architecture concept (composable specialized models with explicit physics constraints) is methodologically interesting, the evaluation methodology raises serious concerns. The core issue is an apples-to-oranges comparison: ARYA achieves 99.89% on CLadder and 73.30 on PhysReason using deterministic symbolic solvers, while competing against neural language models on benchmarks designed for learned reasoning. This is not a fair capability comparison; it demonstrates only that hard-coded physics equations outperform learned approximations on physics benchmarks. The paper's status as a company white paper (authors are CEO and CTO of ARYA Labs), lack of reproducibility artifacts, and reliance on unverifiable 2026 citations further undermine its scientific rigor.
The architectural philosophy of decomposing world models into composable, physics-constrained components is well-motivated. The Context Network's graph structure for maintaining state with dependency tracking, the Belief Network for uncertainty representation, and the explicit simulation-before-execution pattern align with established principles from Pearl's causal inference framework. The safety architecture—formally verifying self-modifications with Z3 before deployment—is a sound approach in principle. Section 12.3's discussion of the World of Workflows benchmark appropriately acknowledges that frontier LLMs suffer from "dynamics blindness" in enterprise environments, validating the need for explicit dynamics modeling.
The central flaw is misrepresentation of benchmark results. ARYA reports 99.89% on CLadder—beating GPT-4 by nearly 30 percentage points—but this is achieved using symbolic causal inference engines essentially equivalent to the oracle that generated CLadder's ground truth. This is not "causal reasoning" in any comparable sense to LLM evaluation; it is hard-coding the benchmark's solution method. Similarly, the claimed zero-shot protein folding success (Section 6.7) uses first-principles biophysics solvers specifically crafted for that domain, not learned generalization. The "unfireable Safety Kernel" is described architecturally but lacks empirical demonstration of resistance to adaptive attacks. The six-level autonomy framework (A1-A6) culminating in "Open-Ended Self-Improvement" is aspirational marketing language without documented instances of the system actually modifying its own architecture meaningfully. Most critically, the paper fails to report failure modes, uncertainty calibration, or cases where physics-constrained models fail—omitting the essential scientific practice of boundary characterization.
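The CLadder concern can be made concrete with a toy experiment. The sketch below is illustrative only: it is not CLadder's actual generator and not ARYA's engine, and every name in it (`make_problem`, `backdoor_ate`, `observational_diff`) is hypothetical. It assumes a benchmark whose ground-truth labels come from a closed-form backdoor adjustment on a small binary causal model, and then scores two solvers against those labels: one that applies the same formula, and a naive one that ignores confounding.

```python
import random

def make_problem(rng):
    """Random binary causal model with Z -> X, Z -> Y and X -> Y (toy assumption)."""
    return {
        "p_z": rng.random(),                          # P(Z=1)
        "p_x_given_z": [rng.random(), rng.random()],  # P(X=1 | Z=z)
        # P(Y=1 | X=x, Z=z), indexed as [x][z]
        "p_y_given_xz": [[rng.random(), rng.random()],
                         [rng.random(), rng.random()]],
    }

def backdoor_ate(p):
    """Average treatment effect via backdoor adjustment over Z."""
    pz = [1 - p["p_z"], p["p_z"]]
    y = p["p_y_given_xz"]
    return sum(pz[z] * (y[1][z] - y[0][z]) for z in (0, 1))

def observational_diff(p):
    """P(Y=1 | X=1) - P(Y=1 | X=0), which ignores the confounder Z."""
    pz = [1 - p["p_z"], p["p_z"]]

    def p_y_given_x(x):
        # P(X=x | Z=z) for z = 0, 1
        px = [p["p_x_given_z"][z] if x == 1 else 1 - p["p_x_given_z"][z]
              for z in (0, 1)]
        num = sum(pz[z] * px[z] * p["p_y_given_xz"][x][z] for z in (0, 1))
        den = sum(pz[z] * px[z] for z in (0, 1))
        return num / den

    return p_y_given_x(1) - p_y_given_x(0)

def accuracy(solver, problems, labels):
    """Fraction of problems where the solver's sign matches the label."""
    hits = sum((solver(p) > 0) == label for p, label in zip(problems, labels))
    return hits / len(problems)

rng = random.Random(0)
problems = [make_problem(rng) for _ in range(2000)]
# Ground truth for "does X increase Y?" comes from the closed-form oracle.
labels = [backdoor_ate(p) > 0 for p in problems]

symbolic_acc = accuracy(backdoor_ate, problems, labels)    # same formula as the oracle
naive_acc = accuracy(observational_diff, problems, labels)

print(f"symbolic solver: {symbolic_acc:.4f}")
print(f"naive baseline:  {naive_acc:.4f}")
```

Under these assumptions the solver that reuses the oracle's formula matches every label by construction, so its score says nothing about reasoning ability. A benchmark score only carries information about capability when the system under test does not have the ground-truth procedure built in.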
The comparison methodology is systematically biased toward ARYA's strengths. On CLadder (symbolic causal inference), PhysReason (physics equations), and AI Safety Index (rule-based evaluation), symbolic methods naturally excel. These are domains where closed-form solutions exist. On SWE-bench—requiring open-ended software engineering—ARYA scores 0.0%. The paper acknowledges this as a "boundary" rather than recognizing it as evidence that composable nano models fail at tasks requiring broad knowledge integration. The comparison to AlphaFold2 is misleading: AlphaFold2 learned protein folding from 170,000 structures; ARYA uses handcrafted biophysics solvers. The claim that ARYA achieved "AlphaFold2-level accuracy" with "zero training data" obscures that it required zero training data precisely because human experts encoded the relevant physics as solvers. This is automation, not learning. The comparison to V-JEPA 2, GPT-5.2, and Claude Opus 4.6 on video benchmarks is unverifiable as the companion paper (Dobrin & Chmiel 2026b) is self-cited and unavailable.
Reproducibility is severely compromised. No code, model weights, training data, or implementation details are provided. The "nano model" specification (Section 6.1) is referenced but absent from the paper. Hyperparameters, training procedures, and network architectures are undisclosed. The "sub-20-second training" claim lacks any breakdown of hardware, data requirements, or convergence criteria. The seven "production deployments" are described in case studies but cannot be independently verified: companies are unnamed (except NASA EXCITE), metrics are self-reported, and no third-party audits are cited. The safety validation of "zero successful bypasses across 40 attempts" is meaningless without a description of who conducted the attempts, what methods were used, and which attack classes the attempts covered. For a system making safety claims, the absence of documented red-team evaluations, adversarial testing protocols, or formal safety proofs is a critical omission.
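Even on the most charitable reading, 40 attempts is a small sample. The arithmetic below is a sketch under a strong assumption that the paper does not establish: that the 40 attempts were independent trials drawn from the attack distribution of interest. Given that assumption, zero observed bypasses in n trials supports only an upper bound on the true bypass rate p, obtained by solving (1 - p)^n = 1 - confidence. The function names are hypothetical.

```python
import math

def upper_bound_zero_events(n_trials, confidence=0.95):
    """One-sided upper bound on the event rate after 0 events in n independent trials.

    Solves (1 - p)^n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

def trials_needed(max_rate, confidence=0.95):
    """Smallest n for which 0 events in n trials bounds the rate below max_rate."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_rate))

bound_40 = upper_bound_zero_events(40)   # exact bound for the reported 40 attempts
rule_of_three = 3 / 40                   # common quick approximation, 3 / n
n_for_1pct = trials_needed(0.01)         # clean attempts needed to bound the rate below 1%

print(f"95% upper bound on bypass rate after 0/40: {bound_40:.3f}")
print(f"rule-of-three approximation:               {rule_of_three:.3f}")
print(f"clean attempts needed to bound rate < 1%:  {n_for_1pct}")
```

With these inputs, zero bypasses in 40 attempts is still compatible with a true bypass rate of roughly 7%, and bounding the rate below 1% at the same confidence would take about 300 clean attempts. Attackers who adapt to the defense violate the independence assumption, so the real uncertainty is likely larger.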
This paper presents ARYA, a composable, physics-constrained, deterministic world model architecture built on five foundational principles: nano models, composability, causal reasoning, determinism, and architectural AI safety. We demonstrate that ARYA satisfies all canonical world model requirements, including state representation, dynamic prediction, causal and physical awareness, temporal consistency, generalization, learnability, and planning and control. Unlike monolithic foundation models, the ARYA foundation model implements these capabilities through a hierarchical system-of-system-of-systems of specialized nano models, orchestrated by AARA (ARYA Autonomous Research Agent), an always-on cognitive daemon that executes a continuous sense-decide-act-learn loop. The nano model architecture provides linear scaling, sparse activation, selective untraining, and sub-20-second training cycles, resolving the traditional tension between capability and computational efficiency. A central contribution is the Unfireable Safety Kernel: an architecturally immutable safety boundary that cannot be disabled or circumvented by any system component, including its own self-improvement engine. This is not a social or ethical alignment statement; it is a technical framework ensuring human control persists as autonomy increases. Safety is an architectural constraint governing every operation, not a policy layer applied after the fact. We present formal alignment between ARYA's architecture and canonical world model requirements, and report its state-of-the-art performance across 6 of 9 competitive benchmarks head-to-head with GPT-5.2, Opus 4.6, and V-JEPA-2, all with zero neural network parameters, across seven active industry domain nodes spanning aerospace, pharma manufacturing, oil and gas, smart cities, biotech, defense, and medical devices.