Neural Computers
Neural Computers (NCs) propose a new machine form where computation, memory, and I/O are unified inside a learned latent runtime state rather than separated as in conventional computers or external as in agents. This work instantiates early NC prototypes as video models that roll out terminal and desktop interfaces from text, pixels, and actions—showing that basic I/O alignment and short-horizon control are learnable without privileged program state. The results demonstrate early runtime primitives but also highlight that symbolic stability, routine reuse, and runtime governance remain unsolved on the long path toward the envisioned Completely Neural Computer (CNC).
This is an ambitious position paper paired with early empirical prototypes. The prototypes demonstrate interface rendering and local action fidelity, yet fall short on the symbolic reasoning and long-horizon consistency required for the proposed CNC vision. The work clearly delineates the gap between current video-based NCs, which are essentially action-conditioned world models for interfaces, and the Turing-complete, universally programmable, behavior-consistent runtime the authors call a CNC. The empirical results validate that short-horizon control and I/O alignment are achievable, but the weak arithmetic-probe accuracy (4% without reprompting, 83% with it) and the lack of demonstrated routine reuse suggest the CNC roadmap remains largely aspirational.
The video-based instantiation credibly establishes that neural networks can learn to render structured interface state and respond to local action inputs with measurable fidelity. The CLI prototype achieves 54% character-level OCR accuracy and 0.54 exact-line accuracy, while the GUI prototype reaches 98.7% cursor accuracy when given explicit visual supervision. The ablation studies are thorough: they show that data quality dominates scale (110 hours of goal-directed data outperforms 1,400 hours of random exploration), that internal action-injection outperforms external conditioning (SSIM 0.863 vs 0.746), and that reprompting can bootstrap arithmetic performance from 4% to 83%, revealing the models' strength as steerable renderers even if not native reasoners.
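To make the two CLI text metrics above concrete, here is a minimal sketch of how character-level accuracy and exact-line accuracy could be computed once the rendered frames have been OCR'd into text. The definitions are assumptions for illustration: this review does not state the paper's exact formulas, and the function names `edit_distance`, `char_accuracy`, and `exact_line_accuracy` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(pred: str, ref: str) -> float:
    """ASSUMED definition: 1 - edit_distance / len(ref), clipped at 0."""
    if not ref:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, ref) / len(ref))


def exact_line_accuracy(pred: str, ref: str) -> float:
    """ASSUMED definition: fraction of reference lines reproduced verbatim
    at the same line index in the prediction."""
    ref_lines = ref.splitlines()
    pred_lines = pred.splitlines()
    if not ref_lines:
        return 1.0 if not pred_lines else 0.0
    matches = sum(
        1
        for i, line in enumerate(ref_lines)
        if i < len(pred_lines) and pred_lines[i] == line
    )
    return matches / len(ref_lines)


# Toy terminal transcript: the model misrenders one character on line 2.
reference = "$ echo hi\nhi\n$ ls\nfile.txt"
predicted = "$ echo hi\nhl\n$ ls\nfile.txt"

print(round(char_accuracy(predicted, reference), 3))
print(exact_line_accuracy(predicted, reference))  # 0.75: 3 of 4 lines match
```

Under these assumed definitions, one wrong character costs little at the character level but invalidates an entire line, which is why exact-line accuracy is the stricter of the two measures of symbolic stability.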
The central concern is the leap from impressive interface rendering to claims about future Turing-complete, self-contained computers. Current prototypes exhibit severe symbolic instability: without reprompting, the CLI model achieves only 4% on basic arithmetic probes, indicating that the latent state does not reliably encode symbolic computation. The evaluation is limited to open-loop rollouts against logged traces, so stability under closed-loop interaction and long-horizon task execution remains unverified. The paper also offers no empirical demonstration of the CNC-defining properties of routine reuse, installable capabilities, or behavior consistency. These are critical gaps, given that the paper posits these properties as the primary advantages over agents and conventional computers. Finally, the comparison to Sora2 (71% arithmetic accuracy, against 4% for the authors' CLI model) is under-explained, with only speculative hypotheses offered for the disparity.
The evidence supports the narrow claim that video models can learn interface dynamics from I/O traces, but it does not yet support the broader CNC thesis of a unified, programmable runtime. The comparison to related work is conceptually crisp: the authors clearly distinguish NCs from agents (which mediate external computers) and world models (which predict environment dynamics). The empirical gap between the current prototypes and those existing systems, however, is not quantified. The arithmetic probe results (Table 5) suggest that Sora2 may have latent capabilities the authors' model lacks, but the paper offers only unverified hypotheses for this discrepancy rather than controlled experiments isolating model scale, data, or conditioning factors.
The paper provides substantial technical detail for reproduction, including exact model architectures (Wan2.1-based with DiT stacks), data pipelines (asciinema for CLI, vhs for Clean, Dockerized environments for GUI), and training regimes (~15,000 H100 hours for CLIGen General, ~7,000 for Clean, ~23,000 GPU-hours for GUIWorld). Hyperparameters are specified (AdamW, learning rate $5 \times 10^{-5}$, weight decay $10^{-2}$, bfloat16, gradient clipping at 1.0), and the data engine construction is documented in depth. However, no code, model weights, or interactive demonstration environments have been released at the time of writing, which blocks independent verification of the CNC roadmap claims.
We propose a new frontier: Neural Computers (NCs) -- an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural Computer (CNC): the mature, general-purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether early NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. These implementations show that learned runtimes can acquire early interface primitives, especially I/O alignment and short-horizon control, while routine reuse, controlled updates, and symbolic stability remain open. We outline a roadmap toward CNCs around these challenges. If overcome, CNCs could establish a new computing paradigm beyond today's agents, world models, and conventional computers.