Can we automatize scientific discovery in the cognitive sciences?
This paper proposes automating the entire cognitive science discovery pipeline—experiment design, behavioral data simulation via foundation models, model synthesis through LLM program generation, and iterative refinement via an "interestingness" critic—to overcome the slow pace and bias of manual research. The vision is a high-throughput in-silico engine that searches vast algorithmic and experimental spaces to surface theoretically informative mechanisms for human validation.
This is a bold vision paper that articulates a plausible framework for automating cognitive science, offering a coherent synthesis of recent LLM capabilities but remaining largely speculative. The authors are admirably frank about risks, acknowledging that the central danger is "epistemic failure: producing persuasive-looking results that are scientifically hollow." As a conceptual roadmap the paper succeeds; as validation that an integrated automation system works, it is premature, as no empirical results from a working pipeline are presented.
The framework is conceptually coherent and grounded in existing component technologies such as GeCCo and the Centaur model. The discussion of representational limitations is particularly astute: the authors correctly identify that "automated discovery is fundamentally constrained by the expressivity of its underlying grammar," recognizing that no search procedure can discover what the representation forbids. The proposal to use LLMs as intelligent experiment samplers rather than fixed grammars offers a flexible approach to mitigating this bottleneck, and the four-stage decomposition aligns well with established philosophy of science.
The paper presents an ambitious vision without empirical validation of the integrated pipeline. The "interestingness" signal intended to guide discovery is vaguely defined and potentially dangerous; the authors themselves warn it could be gamed to "manufacture superficial weirdness" or drift toward "theatrical novelty." Furthermore, reliance on foundation models for synthetic behavioral data introduces severe epistemic risks, as these models may "rely on shortcuts, drift toward 'average' subjects, or fail under genuine distribution shift," potentially creating an illusion of mechanistic insight while encoding unknown biases. The risk of combinatorial explosion producing a "Library of Babel" full of scientifically hollow results remains inadequately addressed.
The paper does not present new experimental results or systematic comparisons against human-designed discoveries. It references the authors' prior work on GeCCo and Centaur as evidence for component feasibility, but these appear to be concurrent submissions rather than established foundations, and no comparison is made to demonstrate that this architecture outperforms general-purpose automated discovery tools like FunSearch or AlphaEvolve. The lack of a working demonstration is acknowledged implicitly when the authors state that "The proposed cycle is only as strong as its weakest step," recognizing that integration remains hypothetical.
As a vision paper, reproducibility is inherently limited: no code, datasets, or hyperparameters are provided for a unified system, nor could they be since the full integration appears to remain hypothetical. The authors acknowledge that "A pragmatic stance is to treat such models as accelerators for hypothesis generation rather than as ground truth," yet provide no concrete implementation details for the proposed LLM prompts, critic architecture, or the specific function used to evaluate "interestingness." Without public access to the referenced GeCCo system or the "recent iterations" of Centaur, independent verification of even the component claims is currently impossible.
The cognitive sciences aim to understand intelligence by formalizing underlying operations as computational models. Traditionally, this follows a cycle of discovery where researchers develop paradigms, collect data, and test predefined model classes. However, this manual pipeline is fundamentally constrained by the slow pace of human intervention and a search space limited by researchers' background and intuition. Here, we propose a paradigm shift toward a fully automated, in silico science of the mind that implements every stage of the discovery cycle using Large Language Models (LLMs). In this framework, experimental paradigms exploring conceptually meaningful task structures are directly sampled from an LLM. High-fidelity behavioral data are then simulated using foundation models of cognition. The tedious step of handcrafting cognitive models is replaced by LLM-based program synthesis, which performs a high-throughput search over a vast landscape of algorithmic hypotheses. Finally, the discovery loop is closed by optimizing for "interestingness", a metric of conceptual yield evaluated by an LLM-critic. By enabling a fast and scalable approach to theory development, this automated loop functions as a high-throughput in-silico discovery engine, surfacing informative experiments and mechanisms for subsequent validation in real human populations.
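To make the four-stage loop in the abstract concrete, the following is a minimal toy sketch of its control flow only, not the authors' system. Every name in it is hypothetical. Each stage is replaced by a trivial stand-in: experiments are random two-armed bandits rather than LLM-sampled paradigms, a noisy win-stay/lose-shift agent takes the place of a foundation model of cognition, a fixed dictionary of three hand-written rules takes the place of LLM program synthesis, and the "interestingness" critic is swapped for a simple margin between the best and second-best model, which is an assumption of this sketch and not the paper's LLM-critic.

```python
import random
from dataclasses import dataclass

# All names here are hypothetical. Each stub stands in for an LLM or
# foundation-model call that the paper describes only at a high level.


@dataclass
class Finding:
    experiment: dict
    model_name: str
    fit: float              # next-choice prediction accuracy in [0, 1]
    interestingness: float  # margin between best and second-best model


def sample_experiment(rng):
    """Stage 1 stand-in: a two-armed bandit with random reward probabilities."""
    return {"p_reward": (rng.random(), rng.random()), "n_trials": 200}


def simulate_behavior(experiment, rng):
    """Stage 2 stand-in: a noisy win-stay/lose-shift agent generates the data."""
    choices, rewards, choice = [], [], 0
    for _ in range(experiment["n_trials"]):
        reward = rng.random() < experiment["p_reward"][choice]
        choices.append(choice)
        rewards.append(reward)
        # Follow win-stay/lose-shift on 90% of trials, do the opposite otherwise.
        stay = reward if rng.random() < 0.9 else not reward
        choice = choice if stay else 1 - choice
    return choices, rewards


# Stage 3 stand-in: a fixed hypothesis set instead of synthesized programs.
# Each rule maps (previous choice, previous reward) to a predicted next choice.
CANDIDATE_MODELS = {
    "always_repeat": lambda c, r: c,
    "always_switch": lambda c, r: 1 - c,
    "win_stay_lose_shift": lambda c, r: c if r else 1 - c,
}


def score_model(predict, choices, rewards):
    """Fraction of trials on which the rule predicts the observed next choice."""
    hits = sum(
        predict(choices[t - 1], rewards[t - 1]) == choices[t]
        for t in range(1, len(choices))
    )
    return hits / (len(choices) - 1)


def critic(scores):
    """Stage 4 stand-in: how well the experiment separates the top two models."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]


def discovery_loop(n_iterations=20, seed=0):
    """Run the four stages repeatedly and rank findings by the critic's score."""
    rng = random.Random(seed)
    findings = []
    for _ in range(n_iterations):
        experiment = sample_experiment(rng)
        choices, rewards = simulate_behavior(experiment, rng)
        scores = {
            name: score_model(rule, choices, rewards)
            for name, rule in CANDIDATE_MODELS.items()
        }
        best = max(scores, key=scores.get)
        findings.append(Finding(experiment, best, scores[best], critic(scores)))
    return sorted(findings, key=lambda f: f.interestingness, reverse=True)


if __name__ == "__main__":
    for finding in discovery_loop()[:3]:
        print(finding.model_name, round(finding.fit, 3), round(finding.interestingness, 3))
```

Even in this toy form, the ranking depends entirely on how the critic is defined: the margin used here rewards experiments that discriminate between candidate rules, and a different stand-in would surface different experiments. This is the same sensitivity the review flags for the paper's loosely specified "interestingness" signal.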