Can we automatize scientific discovery in the cognitive sciences?

cs.AI q-bio.NC Akshay K. Jagadish, Milena Rmus, Kristin Witte, Marvin Mathony, Marcel Binz, Eric Schulz · Mar 22, 2026

AI Review

Plain-language introduction

This paper proposes automating the entire cognitive science discovery pipeline in four stages: experiment design, behavioral data simulation via foundation models, model synthesis through LLM program generation, and iterative refinement via an "interestingness" critic. The goal is to overcome the slow pace and bias of manual research. The vision is a high-throughput in-silico engine that searches vast algorithmic and experimental spaces to surface theoretically informative mechanisms for human validation.
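To make the four-stage loop concrete, here is a minimal toy sketch in Python. It is not the authors' system: every function name is hypothetical, and each stage is a deterministic stub standing in for what the paper assigns to an LLM or a foundation model. The "experiment" is a two-armed bandit, the "foundation model" is a hard-coded win-stay/lose-shift subject with random lapses, "program synthesis" is a lookup in a fixed three-model library, and the "critic" is simply the log-likelihood margin of the best model over random choice. The sketch also only ranks experiments after a single pass; it does not feed the score back into experiment proposal as the paper envisions.

```python
import math
import random

def propose_experiment(rng):
    """Stage 1 stub (assumption): stands in for an LLM sampling a task design."""
    p = round(rng.uniform(0.1, 0.9), 2)
    return {"reward_probs": [p, round(1.0 - p, 2)], "n_trials": 100}

def simulate_behavior(experiment, rng):
    """Stage 2 stub (assumption): stands in for a behavioral foundation model.
    The simulated subject uses win-stay/lose-shift with 10% random lapses."""
    choices, rewards = [], []
    choice = rng.randrange(2)
    for _ in range(experiment["n_trials"]):
        reward = int(rng.random() < experiment["reward_probs"][choice])
        choices.append(choice)
        rewards.append(reward)
        next_choice = choice if reward == 1 else 1 - choice
        if rng.random() < 0.1:  # lapse: pick an arm at random
            next_choice = rng.randrange(2)
        choice = next_choice
    return choices, rewards

# Hypothetical model library: each model maps the previous reward to the
# probability of repeating the previous choice.
CANDIDATE_MODELS = {
    "random_choice": lambda prev_reward: 0.5,
    "win_stay_lose_shift": lambda prev_reward: 0.9 if prev_reward else 0.1,
    "always_stay": lambda prev_reward: 0.9,
}

def log_likelihood(model, choices, rewards):
    """Log-likelihood of the observed stay/switch sequence under a model."""
    total = 0.0
    for t in range(1, len(choices)):
        p_stay = model(rewards[t - 1])
        stayed = choices[t] == choices[t - 1]
        total += math.log(p_stay if stayed else 1.0 - p_stay)
    return total

def synthesize_models(choices, rewards):
    """Stage 3 stub (assumption): stands in for LLM program synthesis by
    scoring a fixed library instead of generating new programs."""
    return {name: log_likelihood(model, choices, rewards)
            for name, model in CANDIDATE_MODELS.items()}

def interestingness(scores):
    """Stage 4 stub (assumption): stands in for an LLM critic. Uses the
    margin of the best model over the random-choice baseline."""
    return max(scores.values()) - scores["random_choice"]

def discovery_loop(n_rounds=5, seed=0):
    """Run the four stages n_rounds times and rank experiments by score."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_rounds):
        experiment = propose_experiment(rng)
        choices, rewards = simulate_behavior(experiment, rng)
        scores = synthesize_models(choices, rewards)
        results.append({
            "experiment": experiment,
            "best_model": max(scores, key=scores.get),
            "interest": interestingness(scores),
        })
    return sorted(results, key=lambda r: r["interest"], reverse=True)

if __name__ == "__main__":
    for result in discovery_loop():
        print(result["best_model"], round(result["interest"], 2), result["experiment"])
```

Even this toy version illustrates the review's main worry: the ranking can only reflect whatever the simulator and the critic already encode, so a high score is a property of the stubs rather than evidence about human cognition.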

Critical review

Verdict

Bottom line

This is a bold vision paper that articulates a plausible framework for automating cognitive science. It offers a coherent synthesis of recent LLM capabilities but remains largely speculative. The authors are admirably frank about risks, acknowledging that the central danger is "epistemic failure: producing persuasive-looking results that are scientifically hollow." As a conceptual roadmap the paper succeeds; as validation that an integrated automation system works, it is premature, because no empirical results from a working pipeline are presented.

“the central risk is not merely technical failure, but epistemic failure: producing persuasive-looking results that are scientifically hollow”

Jagadish et al. · Discussion

What holds up

The framework is conceptually coherent and grounded in existing component technologies such as GeCCo and the Centaur model. The discussion of representational limitations is particularly astute: the authors correctly identify that "automated discovery is fundamentally constrained by the expressivity of its underlying grammar," recognizing that no search procedure can discover what the representation forbids. The proposal to use LLMs as intelligent experiment samplers rather than fixed grammars offers a flexible approach to mitigating this bottleneck, and the four-stage decomposition aligns well with established philosophy of science.

“automated discovery is fundamentally constrained by the expressivity of its underlying grammar”

Jagadish et al. · Proposing experiments

Main concerns

The paper presents an ambitious vision without empirical validation of the integrated pipeline. The "interestingness" signal intended to guide discovery is vaguely defined and potentially dangerous; the authors themselves warn that a critic could "reward theatrical novelty" or be gamed by policies that learn to "manufacture superficial weirdness." Furthermore, reliance on foundation models for synthetic behavioral data introduces severe epistemic risks, as these models may "rely on shortcuts, drift toward 'average' subjects, or fail under genuine distribution shift," potentially creating an illusion of mechanistic insight while encoding unknown biases. The risk of combinatorial explosion producing a "Library of Babel" full of scientifically hollow results remains inadequately addressed.

“A critic could reward theatrical novelty, drift toward idiosyncratic edge cases, or be gamed by policies that learn to manufacture superficial weirdness”

Jagadish et al. · Discussion

“Behavioral foundation models may rely on shortcuts, drift toward 'average' subjects, or fail under genuine distribution shift”

Jagadish et al. · Discussion

Evidence and comparison

The paper does not present new experimental results or systematic comparisons against human-designed discoveries. It references the authors' prior work on GeCCo and Centaur as evidence for component feasibility, but these appear to be concurrent submissions rather than established foundations. No comparison is made to demonstrate that this architecture outperforms general-purpose automated discovery tools like FunSearch or AlphaEvolve. The lack of a working demonstration is acknowledged implicitly when the authors state that "The proposed cycle is only as strong as its weakest step," which concedes that integration remains hypothetical.

“The proposed cycle is only as strong as its weakest step”

Jagadish et al. · Discussion

Reproducibility

Because this is a vision paper, reproducibility is inherently limited: no code, datasets, or hyperparameters are provided for a unified system, nor could they be, since the full integration appears to remain hypothetical. The authors acknowledge that "A pragmatic stance is to treat such models as accelerators for hypothesis generation rather than as ground truth," yet they provide no concrete implementation details for the proposed LLM prompts, the critic architecture, or the specific function used to evaluate "interestingness." Without public access to the referenced GeCCo system or the "recent iterations" of Centaur, independent verification of even the component claims is currently impossible.

“A pragmatic stance is to treat such models as accelerators for hypothesis generation rather than as ground truth”

Jagadish et al. · Discussion

Abstract

The cognitive sciences aim to understand intelligence by formalizing underlying operations as computational models. Traditionally, this follows a cycle of discovery where researchers develop paradigms, collect data, and test predefined model classes. However, this manual pipeline is fundamentally constrained by the slow pace of human intervention and a search space limited by researchers' background and intuition. Here, we propose a paradigm shift toward a fully automated, in silico science of the mind that implements every stage of the discovery cycle using Large Language Models (LLMs). In this framework, experimental paradigms exploring conceptually meaningful task structures are directly sampled from an LLM. High-fidelity behavioral data are then simulated using foundation models of cognition. The tedious step of handcrafting cognitive models is replaced by LLM-based program synthesis, which performs a high-throughput search over a vast landscape of algorithmic hypotheses. Finally, the discovery loop is closed by optimizing for "interestingness", a metric of conceptual yield evaluated by an LLM-critic. By enabling a fast and scalable approach to theory development, this automated loop functions as a high-throughput in-silico discovery engine, surfacing informative experiments and mechanisms for subsequent validation in real human populations.
