Essay 10 of 12

Darwin-Gödel to ShinkaEvolve: The Case for Open-Ended AI

From the Darwin-Gödel Machine to ShinkaEvolve — the case for open-ended systems that never converge.

Series: Continual Intelligence

Anchor papers: Zhang et al. arXiv:2505.22954 · György et al. arXiv:2506.23908 · Lange et al. arXiv:2509.19349 · Han et al. arXiv:2502.19402


In 2025, a coding agent rewrote its own source code, tested each rewrite against a software engineering benchmark, and used the result to guide the next iteration. It did not adjust weights via backpropagation. It did not receive human-authored training examples. It grew an archive of increasingly capable versions of itself — each one building on the last, each tested empirically before being accepted into the lineage. By the time the experiment concluded, performance on SWE-bench had climbed from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7% (Zhang, Hu, Lu, Lange, and Clune, 2025).

This is the Darwin Gödel Machine. And the numbers alone are not what matters.

Every other system this series has examined operates inside a fixed architectural boundary: the plasticity repair methods [← A1], the world models [← A3], the GVF-driven agents [← A4], the stable RL procedures [← A7], and the frontier reasoning systems [← A9]. They learn by adjusting parameters within a representation space that was designed by humans. The DGM rewrote the boundary itself. That distinction is not a technical detail. It is the entire argument of this article.

Part IV of this series asks: what lies beyond the wall that frontier reasoning systems are hitting [← A9]? Three independent groups arrived at the same answer from different directions. A DeepMind theoretical team argues formally that statistical learning cannot reach general intelligence (György, Lattimore, Lazić, and Szepesvári, 2025). Sakana AI's engineering teams built systems that sidestep statistical learning's limits via open-ended evolutionary search (Zhang et al., 2025; Lange, Imajuku, and Cetin, 2025). An MIT-Harvard mechanistic team proposes that reward-based pretraining from scratch is the structural bridge (Han, Pari, Gershman, and Agrawal, 2025). Convergence from three distinct directions (theoretical, engineering, and mechanistic) is stronger evidence than any single line of argument alone.

This article builds the case, evaluates the evidence, and names the gaps that remain.


§1 — The Ceiling of Statistical Learning

All modern deep learning systems, including the frontier reasoning models analyzed in [← A9], are trained via statistical learning: minimize expected loss over a distribution of training examples, optimize by gradient descent, generalize to held-out examples from the same distribution. This recipe has delivered transformative results. It also has, according to György, Lattimore, Lazić, and Szepesvári (2025), a fundamental ceiling.

Their argument is specific. Sound deductive reasoning — deriving new knowledge from existing facts and rules through logically valid inference — requires exact learning, not statistical approximation. A system that achieves high accuracy on a deductive task has not necessarily learned the underlying rule; it may have learned a statistical pattern that correlates with the right answer in distribution. On out-of-distribution deductive tasks, that pattern fails. György et al. (2025) observe that even the most advanced frontier AI systems "regularly and consistently falter on easily-solvable deductive reasoning tasks," and argue that "their unsound behavior is a consequence of the statistical learning approach powering their development."

The claim is not that deep learning is useless — it is that statistical approximation has a hard theoretical limit when exact, reliable logical inference is required. Crossing that limit requires a shift from approximating rules to learning them with formal guarantees.
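The gap between approximating a rule and learning it can be made concrete with a toy example. The sketch below is not taken from György et al. (2025); it assumes a deliberately simple deductive task (deciding whether `a > b`) and uses a nearest-neighbor memorizer as a stand-in for a purely statistical learner. The task, the data ranges, and every function name are illustrative assumptions.

```python
from itertools import product

def exact_rule(a: int, b: int) -> bool:
    """The deductive rule itself: valid for every pair of integers."""
    return a > b

# Training data: pairs in [0, 20) x [0, 20); pairs whose sum is a multiple of 3 are held out.
TRAIN = {(a, b): exact_rule(a, b)
         for a, b in product(range(20), repeat=2) if (a + b) % 3 != 0}

def nearest_neighbor(a: int, b: int) -> bool:
    """Toy statistical learner: copy the label of the closest training pair."""
    closest = min(TRAIN, key=lambda p: (p[0] - a) ** 2 + (p[1] - b) ** 2)
    return TRAIN[closest]

def accuracy(predict, pairs) -> float:
    pairs = list(pairs)
    return sum(predict(a, b) == exact_rule(a, b) for a, b in pairs) / len(pairs)

# Held-out pairs from the training range versus pairs from a shifted range.
held_out = [(a, b) for a, b in product(range(20), repeat=2) if (a + b) % 3 == 0]
shifted = list(product(range(100, 120), repeat=2))

in_dist_acc = accuracy(nearest_neighbor, held_out)
ood_acc = accuracy(nearest_neighbor, shifted)
exact_acc = accuracy(exact_rule, shifted)
print(f"in-distribution: {in_dist_acc:.3f}  shifted: {ood_acc:.3f}  exact rule: {exact_acc:.3f}")
```

In this toy setting the memorizer looks competent on held-out pairs from the training range, because a correct neighbor is almost always one step away. On the shifted range every query collapses onto the same corner of the training data, so it predicts the same label everywhere and is right only when `a <= b`. The exact rule is unaffected by the shift.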

The neurosymbolic research program has been making a version of this argument for twenty years. Garcez and Lamb (2020), reviewing two decades of neural-symbolic computing, concluded that "deep learning alone is insufficient for trust, safety, interpretability, and accountability" — that symbolic knowledge representation and logical reasoning must be integrated with neural networks to produce systems capable of sound reasoning. Their diagnosis has been confirmed and extended. Colelough and Regli (2024), in a systematic PRISMA-methodology review of neuro-symbolic AI from 2020 to 2024, document the same structural gap: the integration of symbolic and sub-symbolic AI remains the defining challenge of the field's third wave, and it has not closed. The neurosymbolic evidence base provides the empirical complement to György et al.'s formal argument: twenty years of experiments confirm what the theory predicts.

Together, these three works set up the central question of this article: if gradient descent on fixed architectures cannot cross the general intelligence threshold, what can?

(see Figure 1)


Figure 1 — The Statistical-to-Exact Learning Spectrum

  ══════════════════════════════════════════════════════════════════════════════════════
   LEARNING PARADIGM SPECTRUM: FROM STATISTICAL TO EXACT
  ──────────────────────────────────────────────────────────────────────────────────────

   STATISTICAL LEARNING           HYBRID / SYMBOLIC+NEURAL         EXACT LEARNING
   ──────────────────────         ──────────────────────────        ──────────────────────
   Mechanism:                     Mechanism:                        Mechanism:
   Minimize expected loss         Neural + symbolic integration;    Formal rule induction
   over training distribution;    structured search (MCTS, CoT);   or empirically validated
   gradient descent               neuro-symbolic modules            self-modification

   Strength:                      Strength:                         Strength:
   Scales with data and           Improves robustness and           Logically sound;
   compute; generalization        interpretability; structured      distribution-invariant;
   within distribution            search reduces inference cost     generalizes exactly

   Failure mode:                  Failure mode:                     Failure mode:
   Distribution shift;            Partial — symbolic modules        Hard to scale;
   deductive breakdown;           may still approximate             provable guarantees
   GI ceiling (György et al.)     logical inference                 rare in practice

  ──────────────────────────────────────────────────────────────────────────────────────
   Systems A1–A9 in this series:  ETD, Diligent Learner             DGM (empirical
   plasticity repair, WMs,        (Shalev-Shwartz &                 validation replaces
   GVF agents, frontier RL        Shashua, 2025)                    formal proof)
  ══════════════════════════════════════════════════════════════════════════════════════

Figure 1: The learning paradigm spectrum from statistical approximation to exact rule learning. Every system studied in Articles 1–9 of this series operates in the statistical learning column. The exact-learning critique (György et al., 2025), supported by twenty years of neurosymbolic evidence (Garcez & Lamb, 2020; Colelough & Regli, 2024), predicts by formal argument that systems in this column cannot reach general intelligence. Open-ended self-modifying systems such as DGM target the right column via empirical validation rather than formal proof: a different mechanism for the same destination.


Key Takeaway: György et al. (2025) establish by formal argument that statistical learning cannot guarantee exact deductive reasoning, a necessary condition for general intelligence. This is not a claim that current AI is useless — it is a claim about what it fundamentally cannot do, regardless of scale. Every other system in this series operates below this ceiling. The neurosymbolic literature provides twenty years of empirical confirmation. The rest of this article examines what might cross it.


§2 — What Open-Ended Evolution Means

The Gödel machine, introduced by Schmidhuber (2007) and described in Zhang et al. (2025), proposed a theoretical framework for self-improving AI: a system that can modify any part of itself — including its learning algorithm — when it can formally prove the modification is beneficial. The theoretical elegance is complete. The practical limitation is fatal: for systems operating in complex, partially observable environments, formally proving that a given modification is net beneficial is almost never achievable without restrictive assumptions about the system. The original Gödel machine remained a theoretical construct.

The Darwin Gödel Machine (Zhang, Hu, Lu, Lange, and Clune, 2025) solves the proof bottleneck by replacing it. Instead of proving that a modification is beneficial, DGM tests it empirically. Each candidate modification is evaluated on a coding benchmark. Modifications that improve performance are accepted into an archive; those that do not are discarded. The archive grows over time into a lineage of increasingly capable agents, each building on the best versions that came before.

The mechanism deserves a precise statement. DGM maintains an archive of generated coding agents. From this archive, it selects an agent and issues that agent a self-improvement instruction: rewrite your own code to improve your capability. The child agent produced by this self-modification is evaluated on benchmark tasks — the same software engineering problems it will ultimately be used to solve. If the child outperforms its parent, it enters the archive and becomes eligible for further modification. Zhang et al. (2025) describe the result as "a growing tree of diverse, high-quality agents" — not a single lineage but an open-ended population that explores many different modification paths in parallel.
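The loop just described can be sketched in a few lines. This is a toy under stated assumptions, not the DGM implementation: an agent's entire codebase is collapsed into a single hypothetical `skill` number, self-modification is a random perturbation, the benchmark is a clamp to [0, 1], and parents are drawn uniformly from the archive. Only the control flow (select, self-modify, evaluate, accept or discard) follows the description above.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Agent:
    agent_id: int
    parent_id: Optional[int]
    skill: float   # hypothetical stand-in for the agent's whole codebase
    score: float   # benchmark result recorded when the agent was evaluated

def evaluate(skill: float) -> float:
    """Stand-in benchmark: a deterministic score in [0, 1]."""
    return max(0.0, min(1.0, skill))

def self_modify(parent: Agent, rng: random.Random) -> float:
    """Stand-in for 'rewrite your own code': a change that may help or hurt."""
    return parent.skill + rng.uniform(-0.10, 0.10)

def run_archive_loop(rounds: int, seed: int = 0) -> List[Agent]:
    rng = random.Random(seed)
    archive = [Agent(0, None, 0.20, evaluate(0.20))]
    for _ in range(rounds):
        parent = rng.choice(archive)        # any archived agent may be selected (assumed uniform)
        child_skill = self_modify(parent, rng)
        child_score = evaluate(child_skill)
        if child_score > parent.score:      # acceptance rule as described in the text
            archive.append(Agent(len(archive), parent.agent_id, child_skill, child_score))
    return archive

archive = run_archive_loop(rounds=200)
best = max(archive, key=lambda a: a.score)
print(f"{len(archive)} agents archived; best score {best.score:.2f}")
```

Because parents are drawn from the whole archive rather than from the current best agent, the `parent_id` links form a tree with many branches instead of a single chain, which is the structural point of the archive.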

The improvements DGM discovers are qualitatively new, not just quantitatively better. Zhang et al. (2025) report that the system automatically develops better code editing tools, improved long-context window management, and peer-review mechanisms — capabilities that were not explicitly designed into the system but emerged as solutions to bottlenecks the DGM encountered during self-modification. This is the critical empirical signal: DGM does not merely improve its score. It improves its ability to improve, discovering capabilities that future generations build upon. SWE-bench performance rose from 20.0% to 50.0%; Polyglot rose from 14.2% to 30.7% (Zhang et al., 2025). Furthermore, Zhang et al. (2025) document that the DGM significantly outperforms baselines without self-improvement or without open-ended exploration, establishing that both components — the self-modification loop and the open-ended archive — are necessary.

To place DGM in context, consider Encode–Think–Decode (ETD), introduced by Koishekenov, Lipani, and Cancedda (2025). ETD does not rewrite its architecture, but it does modify its own computation pathway: by training a model to iterate over a selected subset of reasoning-relevant layers during mid-training, ETD amplifies latent reasoning without changing the architecture, parameter count, or training data composition. Interpretability studies established that the crucial computation for reasoning is concentrated in a limited range of layers; ETD exploits this by making the model revisit those layers recursively at inference time. The gains are substantial — a relative accuracy improvement of +28.4% on GSM8K and +36% on MATH with the OLMo-2 1B Base model (Koishekenov et al., 2025); [→ A11] covers ETD's architecture and results in full. ETD represents the inner-loop end of the self-modification spectrum: not rewriting code, but redistributing internal computation. DGM is the outer-loop end: rewriting the full codebase across evolutionary generations. Together they outline a spectrum of self-modification, from within-forward-pass computation redistribution to full code rewriting with empirical selection.
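The inner-loop idea, revisiting a chosen block of layers several times without adding parameters, can be illustrated with a minimal sketch. The scalar "layers", the split points, and the function names below are hypothetical and chosen only to make the control flow visible; they are not the layers or the procedure used by Koishekenov et al. (2025).

```python
from typing import Callable, List

Layer = Callable[[float], float]

def make_layers(calls: List[str]) -> List[Layer]:
    """Six toy scalar 'layers'; each one records its name when applied."""
    def layer(name: str, fn: Layer) -> Layer:
        def apply(h: float) -> float:
            calls.append(name)
            return fn(h)
        return apply
    return [
        layer("enc0", lambda h: h + 1.0),
        layer("enc1", lambda h: h * 2.0),
        layer("think0", lambda h: 0.5 * h + 1.0),
        layer("think1", lambda h: 0.5 * h + 1.0),
        layer("dec0", lambda h: h - 1.0),
        layer("dec1", lambda h: h * 3.0),
    ]

def etd_forward(x: float, layers: List[Layer],
                think_start: int, think_end: int, iterations: int) -> float:
    """Encode once, loop over the 'think' block, decode once."""
    h = x
    for f in layers[:think_start]:            # encode
        h = f(h)
    for _ in range(iterations):               # think: reuse the same layers each pass
        for f in layers[think_start:think_end]:
            h = f(h)
    for f in layers[think_end:]:              # decode
        h = f(h)
    return h

calls: List[str] = []
layers = make_layers(calls)
single_pass = etd_forward(1.0, layers, think_start=2, think_end=4, iterations=1)
calls.clear()
three_passes = etd_forward(1.0, layers, think_start=2, think_end=4, iterations=3)
print(single_pass, three_passes, len(calls))
```

With `iterations=1` the function is an ordinary forward pass. Raising `iterations` spends more computation on the middle block while the list of layers, and therefore the parameter count, stays the same.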

(see Figure 2)


Figure 2 — The DGM Self-Modification Loop

  ══════════════════════════════════════════════════════════════════════════════════
   DGM: OPEN-ENDED SELF-MODIFICATION CYCLE
  ──────────────────────────────────────────────────────────────────────────────────

                    ┌────────────────────────────────────────────┐
                    │           ARCHIVE OF AGENTS                 │
                    │                                             │
                    │   Agent_root (20.0% SWE-bench)             │
                    │      ├── Agent_A → Agent_A1                │
                    │      │            └── Agent_A1a            │
                    │      └── Agent_B → Agent_B1 (50.0%)  ←─── │─── Selected lineage
                    │                                             │
                    │   "Growing tree of diverse, high-quality   │
                    │    agents" (Zhang et al., 2025)             │
                    └───────────────────┬────────────────────────┘
                                        │ SELECT agent from archive
                                        ↓
                    ┌────────────────────────────────────────────┐
                    │   PARENT AGENT + SELF-IMPROVEMENT TASK      │
                    │   "Rewrite your own code to improve         │
                    │    your capability"                         │
                    └───────────────────┬────────────────────────┘
                                        │ SELF-MODIFY (rewrite own code)
                                        ↓
                    ┌────────────────────────────────────────────┐
                    │   CHILD AGENT (modified codebase)           │
                    └──────────┬─────────────────────────────────┘
                               │ EVALUATE on benchmark
                               ↓
                    ┌──────────────────────────────────────────┐
                    │      BENCHMARK EVALUATION                  │
                    │  (SWE-bench, Polyglot, ...)               │
                    └───────┬──────────────────┬───────────────┘
                  IMPROVES? │                  │ IMPROVES?
                          YES ↓                ↓ NO
                    ADD TO ARCHIVE        DISCARD CHILD
                    (seed future          (parent lineage
                     evolution)            continues)

  ──────────────────────────────────────────────────────────────────────────────────
   KEY ASYMMETRY vs. STANDARD RL:
   Standard RL:  Fixed architecture θ → gradient updates → better weights
   DGM:          Mutable codebase    → code rewrites    → new agent generation
  ══════════════════════════════════════════════════════════════════════════════════

Figure 2: The DGM self-modification loop. Unlike standard RL, which adjusts weights within a fixed architecture, DGM rewrites the agent's source code — including the code responsible for future self-modification. The archive's tree structure enables open-ended exploration: many modification paths run in parallel, with empirical benchmark evaluation as the selection pressure. The emergent improvements (code editing tools, context management, peer review) were not designed in — they were discovered by the self-modification process itself.


Key Takeaway: DGM's capability gains are not produced by gradient updates on training data — they emerge from self-modification with empirical selection. The archive-plus-evaluation structure replaces formal proofs (the original Gödel machine's requirement) with empirical selection pressure, making the approach practical. The result is capability gains that are qualitatively different from gradient-based learning: the system discovers new tools and strategies that improve its ability to keep improving, not merely better performance on a fixed objective within fixed representational limits.


§3 — Evaluating the Sakana AI Scientist

To understand where current autonomous research capability actually sits — and how far DGM-style open-ended systems would need to advance to constitute genuine artificial research intelligence — it helps to examine the most visible existing attempt: Sakana AI's AI Scientist, a system designed to automate the entire research lifecycle from idea generation to paper writing to peer review simulation. Beel, Kan, and Baumgart (2025) term this ambition Artificial Research Intelligence (ARI) and provide its first systematic evaluation.

The failures Beel et al. (2025) document are specific and quantifiable. The AI Scientist's literature review relies on simplistic keyword searches rather than synthesis, producing poor novelty assessment: several generated research ideas were incorrectly classified as novel, including micro-batching for stochastic gradient descent — a technique established well before the system's training cutoff. Five out of twelve proposed experiments (42%) failed due to coding errors. Generated manuscripts contained a median of just five citations per paper, the large majority outdated: only five out of 34 citations across the evaluation were from 2020 or later. Structural errors were frequent — missing figures, repeated sections, placeholder text such as "Conclusions Here." Hallucinated numerical results appeared in several manuscripts. And the system's per-iteration adaptability was measured directly: each iteration added only 8% more characters on average to experimental code, indicating minimal generative capacity beyond small local edits.

One failure mode is particularly revealing for open-ended AI research. An experiment designed to optimize energy efficiency reported improvements in accuracy while consuming more computational resources, which contradicts its stated goal, and the AI Scientist did not detect the contradiction. A system capable of reliable self-improvement must, at minimum, correctly evaluate whether its own modifications achieve their intended objective. The AI Scientist, in its current form, lacks that self-evaluation capability.
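The missing check is simple to state in code. The sketch below is a hypothetical guard, not part of the AI Scientist or of the Beel et al. (2025) evaluation: it assumes each experiment declares one objective metric and a direction, and it flags any result in which that metric failed to move the right way, whatever happened to other metrics. The metric names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Objective:
    metric: str             # the metric the experiment set out to improve
    higher_is_better: bool

def goal_violations(objective: Objective,
                    before: Dict[str, float],
                    after: Dict[str, float]) -> List[str]:
    """Return reasons the reported results fail the stated objective (empty if none)."""
    name = objective.metric
    if name not in before or name not in after:
        return [f"objective metric '{name}' was not reported"]
    delta = after[name] - before[name]
    improved = delta > 0 if objective.higher_is_better else delta < 0
    if improved:
        return []
    # List other metrics that rose, since those are what a report may present as success.
    risen = sorted(m for m in after if m != name and m in before and after[m] > before[m])
    note = f"; other metrics that rose: {', '.join(risen)}" if risen else ""
    return [f"objective metric '{name}' did not improve (change {delta:+.1f}){note}"]

# Illustrative numbers only; they are not taken from Beel et al. (2025).
objective = Objective(metric="energy_joules", higher_is_better=False)
baseline = {"energy_joules": 100.0, "accuracy": 0.80}
reported = {"energy_joules": 130.0, "accuracy": 0.83}
violations = goal_violations(objective, baseline, reported)
print(violations)
```

The design choice is to judge a run only against the objective it declared up front, so that a gain on an unrelated metric cannot be reported as success.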

And yet: Beel et al. (2025) conclude that the AI Scientist "represents a significant leap forward in research automation" — it produces complete research artifacts, end to end, in a form that resembles legitimate scientific output. The gap between resemblance and reliability is precisely what the evaluation quantifies. The measurement is the contribution: knowing the gap exists is the prerequisite to closing it.

(see Figure 3)


Figure 3 — AI Scientist Capability Audit (Beel et al., 2025)

  ══════════════════════════════════════════════════════════════════════════════════════
   AUTONOMOUS RESEARCH: WHAT ARI REQUIRES vs. WHAT AI SCIENTIST ACHIEVES
  ──────────────────────────────────────────────────────────────────────────────────────
   DIMENSION                  WHAT ARI REQUIRES            CURRENT AI SCIENTIST
  ──────────────────────────────────────────────────────────────────────────────────────
   Literature synthesis       Deep synthesis;              Keyword search; micro-batching
                              novelty verification         classified as novel  ❌

   Experiment reliability     Robust execution;            5/12 experiments failed (42%)
                              coherent goal-tracking       Goal inversion undetected  ❌

   Self-modification depth    Substantial per-iteration    +8% characters per iteration
                              code improvement             on average  ⚠️

   Citation quality           Current, relevant            Median 5 citations;
                              synthesis                    only 5/34 from 2020+  ❌

   Output integrity           Reproducible, consistent     Hallucinated numerics;
                              manuscripts                  placeholder text; missing figs  ❌

   Artifact completeness      Complete research paper      ✅  (the genuine advance)

  ──────────────────────────────────────────────────────────────────────────────────────
   OVERALL:  ✅  Significant leap in research automation form.
             ❌  Reliability gap on every substantive dimension measured.
  ══════════════════════════════════════════════════════════════════════════════════════

Figure 3: Capability audit of the AI Scientist based on Beel, Kan, and Baumgart (2025). The system produces complete research artifacts end to end — that is the genuine advance, and it distinguishes the AI Scientist from every prior research automation tool. On every reliability dimension, however, the system falls substantially short of what ARI requires. The 42% experiment failure rate and the goal-inversion failure define the baseline that open-ended self-modifying systems must surpass to constitute genuine artificial research intelligence.


Key Takeaway: The AI Scientist evaluation (Beel et al., 2025) establishes the current state of autonomous research systems: impressive form, unreliable substance. The specific failure modes — goal inversion, shallow self-modification (+8% per iteration), hallucinated results — define the reliability gap that any credible claim to open-ended AI must close. DGM's empirical validation loop directly addresses this gap: benchmark scores are hard to fake, and evolutionary rejection of regressive modifications is precisely the goal-tracking mechanism the AI Scientist lacks.


§4 — ShinkaEvolve: Towards Open-Ended Adaptation

If DGM is the proof of concept, ShinkaEvolve (Lange, Imajuku, and Cetin, 2025) is the open-source infrastructure layer. Both systems use large language models as mutation operators in an evolutionary search loop — but ShinkaEvolve is designed to attack the critical bottleneck that makes such systems impractical at scale: sample efficiency.

Lange et al. (2025) identify the core problem with existing code evolution methods: they require thousands of samples to identify effective solutions, and they remain closed-source, limiting adoption and reproducibility. ShinkaEvolve addresses both constraints with three interlocking innovations. Parent sampling controls which prior solutions seed new mutation rounds, balancing exploration of novel directions against exploitation of known productive ones — preventing premature convergence on local optima. Code novelty rejection-sampling filters the search space by preferring candidates that differ structurally from previously evaluated solutions, maintaining population diversity without requiring large populations. A bandit-based LLM ensemble selection strategy dynamically reweights which LLMs are used as mutation operators based on their recent per-domain performance, adapting the mutation process to the current search landscape.
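A minimal sketch can make the three components concrete. None of it reproduces ShinkaEvolve's code or its exact rules. As labeled assumptions, parent sampling is shown as a softmax over fitness, novelty is measured with text similarity from Python's `difflib`, and the ensemble selector uses the standard UCB1 bandit rule. The arm names `model_a` and `model_b`, the thresholds, and the rewards are hypothetical.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List

def parent_weights(fitness: List[float], temperature: float = 0.1) -> List[float]:
    """Softmax over fitness: low temperature exploits, high temperature explores."""
    top = max(fitness)
    exps = [math.exp((f - top) / temperature) for f in fitness]
    total = sum(exps)
    return [e / total for e in exps]

def is_novel(candidate: str, evaluated: List[str], max_similarity: float = 0.95) -> bool:
    """Reject candidates that are near-duplicates of already evaluated programs."""
    return all(SequenceMatcher(None, candidate, seen).ratio() < max_similarity
               for seen in evaluated)

class UCB1Selector:
    """UCB1 bandit over mutation operators (a standard rule, swapped in for illustration)."""

    def __init__(self, arms: List[str]) -> None:
        self.counts: Dict[str, int] = {arm: 0 for arm in arms}
        self.totals: Dict[str, float] = {arm: 0.0 for arm in arms}

    def select(self) -> str:
        for arm, n in self.counts.items():
            if n == 0:
                return arm                      # try every arm once first
        total = sum(self.counts.values())

        def ucb(arm: str) -> float:
            n = self.counts[arm]
            return self.totals[arm] / n + math.sqrt(2.0 * math.log(total) / n)

        return max(self.counts, key=ucb)

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward

# Hypothetical usage: pretend mutations proposed by model_b always improve the score.
selector = UCB1Selector(["model_a", "model_b"])
for _ in range(20):
    arm = selector.select()
    selector.update(arm, 1.0 if arm == "model_b" else 0.0)
weights = parent_weights([0.1, 0.5, 0.9])
print(selector.counts, [round(w, 3) for w in weights])
```

Each piece trades exploration against exploitation at a different point in the loop: which parent to mutate, which candidates are worth the cost of evaluation, and which model proposes the next mutation.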

The empirical results demonstrate breadth as well as depth. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, outperforming AlphaEvolve on this task (Lange et al., 2025). The system also designs competitive agentic harnesses for AIME mathematical reasoning, improves ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions. The range of domains is the signal: geometric optimization, mathematical competition, competitive programming, and training-loss design. A sample-efficient evolutionary framework that delivers strong results across this range is approaching the open-ended adaptability that closed-source predecessors lacked.

ShinkaEvolve does not modify itself — it modifies candidate solution programs. The distinction from DGM is architecturally important: DGM's archive contains agents that improve their own code-writing capability; ShinkaEvolve's archive contains solution programs that improve at solving external problems. ShinkaEvolve is an open-ended problem solver; DGM is an open-ended self-improver. Both are necessary components of a complete open-ended AI stack. A system that can solve problems more efficiently (ShinkaEvolve) and a system that can improve its own problem-solving capability (DGM) are complementary, not competing.

(see Figure 4)


Figure 4 — Fixed Architecture Ceiling vs. Open-Ended Evolution Trajectory

  ══════════════════════════════════════════════════════════════════════════════════════
   PERFORMANCE TRAJECTORIES: FIXED ARCHITECTURE vs. OPEN-ENDED SELF-MODIFICATION
  ──────────────────────────────────────────────────────────────────────────────────────

  PANEL A: Standard RL / SFT on Fixed Architecture

  Performance
  ▲
  │ ───────────────────────────────────────  GI threshold (above the plateau)
  │
  │       ▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░ ← Representational plateau:
  │      ▓▓▓                               architecture limits search space;
  │    ▓▓▓▓                                gradient descent exhausts headroom
  │   ▓▓▓
  │ ▓▓▓
  │▓▓
  └──────────────────────────────────────────►  Compute / Training Steps

  PANEL B: DGM-Style Open-Ended Self-Modification

  Performance
  ▲
  │                        ░░░░░ ─── GI threshold (approached, not claimed crossed)
  │                  ░░░░░░
  │            ░░░░░░             50.0% SWE-bench (Zhang et al., 2025)
  │       ░░░░░
  │    ░░░░░                      Each generation rewrites architecture,
  │  ░░░░                         not just weights
  │ ░░░  20.0% SWE-bench (start)
  └──────────────────────────────────────────►  Compute / Modification Rounds

  ──────────────────────────────────────────────────────────────────────────────────────
  György et al. (2025) predict the Panel A ceiling by formal argument.
  Panel B is DGM's empirical trajectory in well-defined coding tasks — not claimed
  as general intelligence; the general case remains open.
  ══════════════════════════════════════════════════════════════════════════════════════

Figure 4: Fixed architecture performance plateaus versus DGM's open-ended self-modification trajectory. Panel A illustrates the ceiling that the exact-learning critique (György et al., 2025) predicts by formal argument: gradient descent on a fixed architecture exhausts representational headroom before reaching the general intelligence threshold. Panel B shows DGM's empirical trajectory — self-modification rounds replace gradient steps, and the architecture itself is the variable. The GI threshold is marked as approached but not claimed crossed; DGM's results are in well-defined coding tasks, not general intelligence. The claim is structural: open-ended self-modification does not have the same ceiling as fixed-architecture optimization.


Key Takeaway: ShinkaEvolve delivers strong results across diverse domains (geometric optimization, mathematical competition, competitive programming, and training-loss design), including a new state-of-the-art circle packing solution found with only 150 samples. Its open-source design and three-innovation efficiency stack (parent sampling, novelty rejection-sampling, bandit ensemble) make the evolutionary framework reproducible and accessible. The architectural distinction from DGM is precise: problem solver vs. self-improver. The full open-ended AI stack requires both layers.


§5 — Reward-Based Pretraining as a Bridge

The DGM and ShinkaEvolve results show that open-ended evolutionary search can produce capability gains without any gradient updates to the underlying language models. They do not, by themselves, explain why standard pretraining fails to produce those gains in the first place. Han, Pari, Gershman, and Agrawal (2025) provide the mechanistic explanation.

Their starting distinction is precise: Large Language Models exemplify Artificial Useful Intelligence (AUI) — systems capable of assisting humans in real-world tasks — but not Artificial General Intelligence (AGI), which requires adaptive and robust reasoning across novel contexts. The gap between AUI and AGI is not a matter of degree; it is a matter of structure. Han et al. (2025) demonstrate this empirically using algorithmic tasks in esoteric programming languages: when the surface form of a reasoning task changes — same underlying structure, different syntax — LLM performance drops substantially. The generalization that practitioners observe on standard benchmarks is pattern-matching to training-set statistics, not reasoning from first principles.

The core structural diagnosis is this: LLMs couple reasoning and knowledge. Because the same model learns both factual content and inferential process simultaneously via next-token prediction, reasoning becomes entangled with training-distribution syntax and semantics. The system that reasons well about Python does so partly because it has seen millions of Python programs. The same reasoning capability does not transfer to a structurally identical problem in a language with no training-set history. The reasoning is inseparable from the knowledge — and the knowledge is tied to the training distribution.

Han et al. (2025) propose three structural corrections to break this coupling. First: pretrain to reason using RL from scratch, as a replacement for the widely-used next-token prediction pretraining objective. An agent that learns to reason via reward signal before acquiring natural language knowledge develops a reasoning prior that is genuinely separable from specific knowledge content. Second: use a curriculum of synthetic tasks to bootstrap a generalizable reasoning prior in RL before transferring to natural language tasks — the curriculum provides the inductive structure that naked RL would need to rediscover from scratch. Third: use a small context window during reasoning pretraining to prevent the model from exploiting spurious token co-occurrence correlations, forcing it to reason from the problem structure rather than from contextual statistics.
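Two of these corrections, the synthetic curriculum and the small context window, are scaffolding that can be sketched independently of any particular RL algorithm. The sketch below is a hypothetical illustration, not code or settings from Han et al. (2025): the class and function names, the promotion threshold, and the history length are all assumptions, and the reward-driven policy update itself is left out.

```python
from collections import deque
from typing import Deque, List, Sequence

def truncate_context(tokens: Sequence[str], window: int) -> List[str]:
    """Keep only the most recent `window` tokens, limiting what the policy can condition on."""
    return list(tokens[-window:]) if window > 0 else []

class Curriculum:
    """Move to a harder synthetic task level once recent episodes are mostly solved."""

    def __init__(self, num_levels: int, promote_at: float = 0.9, history: int = 20) -> None:
        self.level = 0
        self.num_levels = num_levels
        self.promote_at = promote_at          # assumed threshold, not from the paper
        self.recent: Deque[bool] = deque(maxlen=history)

    def record(self, success: bool) -> None:
        """Log one episode outcome and promote if the recent success rate is high enough."""
        self.recent.append(success)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.promote_at:
            if self.level < self.num_levels - 1:
                self.level += 1
                self.recent.clear()           # start fresh statistics at the new level

# Hypothetical usage: twenty solved episodes in a row trigger one promotion.
curriculum = Curriculum(num_levels=3)
for _ in range(20):
    curriculum.record(True)
visible = truncate_context(["a", "b", "c", "d", "e", "f"], window=3)
print(curriculum.level, visible)
```

In a full training loop, `truncate_context` would be applied to each observation before the policy sees it, and `record` would be called with the outcome of each episode.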

This three-part proposal is the mechanistic bridge between the theoretical critique and the engineering demonstrations. Standard LLM pretraining maximizes prediction accuracy on a massive corpus; this is the statistical learning approach subject to the ceiling that György et al. (2025) identify formally. Reward-based pretraining is proposed to decouple reasoning from knowledge, creating a prior that can generalize in the way exact learning requires. Pairing a DGM-style self-modification outer loop with a genuinely decoupled reasoning prior is therefore a more coherent path toward open-ended general capability than running the same outer loop on a system whose reasoning is entangled with training-set statistics.

(see Figure 5)


Figure 5 — Three Independent Groups Converging on Self-Modification

  ══════════════════════════════════════════════════════════════════════════════════════
   THREE-GROUP CONVERGENCE: THE STRUCTURAL PATH TO GENERAL INTELLIGENCE
  ──────────────────────────────────────────────────────────────────────────────────────

      DEEPMIND TEAM (THEORETICAL)                  SAKANA AI (ENGINEERING)
      György, Lattimore, Lazić,                    Zhang, Hu, Lu, Lange,
      Szepesvári (2025)                            Clune (2025)
      ──────────────────────────                   ──────────────────────────────
      "Statistical learning cannot                 "DGM: 20.0% → 50.0% SWE-bench
       guarantee sound deductive                    via open-ended code self-
       reasoning — a necessary                      modification; emergent tools
       condition for GI"                            and strategies discovered"
      Provides: FORMAL LIMIT                        Provides: EMPIRICAL PROOF
                   \                                         /
                    \                                       /
                     ↘                                     ↙
                  ┌───────────────────────────────────────────┐
                  │  CONVERGENCE POINT                         │
                  │  Self-modification via search is the       │
                  │  necessary structural condition for        │
                  │  crossing the GI threshold                 │
                  └───────────────────────────────────────────┘
                                      ↑
                                     /
                                    /
      MIT-HARVARD TEAM (MECHANISTIC)
      Han, Pari, Gershman, Agrawal (2025)
      ─────────────────────────────────────────────────────
      "LLM reasoning overfits to training distribution;
       coupling of reasoning and knowledge is the structural
       cause; RL pretraining from scratch with synthetic
       curriculum creates a separable reasoning prior"
      Provides: MECHANISTIC EXPLANATION

  ──────────────────────────────────────────────────────────────────────────────────────
   Background support: Garcez & Lamb (2020); Colelough & Regli (2024)
   → 20 years of neurosymbolic evidence that symbolic + neural integration is necessary
  ══════════════════════════════════════════════════════════════════════════════════════

Figure 5: Three independent research groups converging on self-modification as the structural path beyond statistical learning's ceiling. DeepMind provides the formal theoretical limit; Sakana AI provides the empirical engineering proof; the MIT-Harvard team provides the mechanistic explanation for why the standard pretraining paradigm cannot cross the limit. Convergence from three distinct directions — theoretical, engineering, and mechanistic — is stronger evidence than any single line of argument alone. Twenty years of neurosymbolic research provides the empirical background that frames all three.


Key Takeaway: Han et al. (2025) provide the mechanistic reason standard LLMs cannot generalize across novel reasoning contexts: reasoning and knowledge are coupled by the next-token prediction pretraining objective. Reward-based pretraining from scratch, using a synthetic task curriculum with a small context window, is proposed as the structural correction that decouples them. This is the bridge that makes DGM's open-ended self-modification architecturally coherent: a system with a genuinely separable reasoning prior can improve in directions that a statistically-trained system, whose reasoning is tied to its training distribution, cannot.


§6 — The Search-Theoretic Perspective

The convergence from three independent groups gains formal depth from a fourth angle. Shalev-Shwartz and Shashua (2025) examine why existing approaches to improving reasoning — Supervised Fine-Tuning, RL, Tree-of-Thoughts, Monte Carlo Tree Search — "often fail on complex reasoning tasks," and identify three structural obstacles that are independent of scale.

The first obstacle is distribution drift: the distribution of reasoning chains needed to solve difficult problems is systematically underrepresented in training data. A model trained on successful CoT traces learns to generate traces that look like correct reasoning without learning the search strategy that generates them. The surface mimics the substance without acquiring it. The second obstacle is lack of embedded search: standard CoT and RL approaches do not explicitly model backtracking — the ability to recognize a dead end and return to an earlier decision point. Without backtracking, the reasoning process is a walk, not a search. It can commit to an incorrect path early and have no mechanism to escape it. The third obstacle is exponential inference cost: search-augmented methods like MCTS scale exponentially with problem depth, making them impractical for precisely the complex multi-step problems where the capability gap is largest.

Shalev-Shwartz and Shashua (2025) introduce the Diligent Learner to address all three simultaneously. The Diligent Learner models reasoning explicitly as depth-first search guided by a validator, with backtracking supported upon failure at any node. Under two mild and realistic assumptions, they prove that the Diligent Learner can efficiently learn from CoT data while SFT and RL fail to do so. The formal guarantee is important: this is not an empirical improvement on a held-out benchmark, but a proof of polynomial efficiency where competing approaches fail.
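The search pattern described above can be made concrete with a toy. The following is a minimal sketch, not the authors' algorithm: `validated_dfs` and `queens_validator` are hypothetical names, and the N-Queens puzzle stands in for a reasoning problem whose partial chains can be checked step by step. It shows only the mechanics (a validator consulted at every step, with backtracking when a node has no valid continuation) and makes no claim about the paper's polynomial bound.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Validator = Callable[[Sequence[int]], bool]


def validated_dfs(
    candidates: Sequence[int], validator: Validator, depth: int
) -> Tuple[Optional[List[int]], int]:
    """Depth-first search over step chains, pruning prefixes the validator rejects.

    Returns the first complete chain found (or None) and the number of
    candidate steps that were tried along the way.
    """
    tried = 0

    def extend(prefix: List[int]) -> Optional[List[int]]:
        nonlocal tried
        if len(prefix) == depth:
            return prefix  # every prefix on this path passed the validator
        for step in candidates:
            tried += 1
            chain = prefix + [step]
            if not validator(chain):
                continue  # dead end detected immediately; try the next step
            found = extend(chain)
            if found is not None:
                return found
        return None  # all options failed here: backtrack to the caller's node

    return extend([]), tried


def queens_validator(chain: Sequence[int]) -> bool:
    """Toy validator: the newest queen must not attack any earlier one."""
    row, col = len(chain) - 1, chain[-1]
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(chain[:-1]))


if __name__ == "__main__":
    n = 6
    solution, tried = validated_dfs(range(n), queens_validator, n)
    print("solution:", solution)
    print("steps tried:", tried, "vs exhaustive enumeration:", n ** n)
```

Because the validator rejects a bad prefix as soon as it appears, the search never explores the subtree beneath it. A one-way walk, by contrast, would commit to the first choice at each row and fail without any way to revise it.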

The connection to DGM is structural and direct. DGM's outer loop is a validated search process: it maintains an archive of agent variants, generates modifications, tests them empirically, and uses the results to decide which lineages continue. The "benchmark evaluation" step in DGM plays the role of the Diligent Learner's validator applied at the code level, giving binary feedback (improvement vs. regression) at each decision node. DGM's archive with open-ended exploration plays the role of the Diligent Learner's depth-first-search-with-backtracking policy, applied to the space of agent code. Read this way, the formal framework suggests why DGM works where gradient descent plateaus: DGM conducts a genuine search with backtracking (evolutionary selection rejects regressions and preserves the successful paths, and any archived variant can be revisited as a parent), while gradient descent is a local walk with no mechanism to return to an earlier decision point on a non-convex landscape.
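The outer loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not DGM's implementation: a list of integers stands in for agent source code, `toy_benchmark` stands in for SWE-bench, and the names `Variant`, `evolve_archive`, `toy_benchmark`, and `toy_mutate` are hypothetical. The acceptance rule (keep a child only if it does not regress against its parent) follows this article's description of the validator and is a simplification.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Variant:
    genome: List[int]  # stand-in for an agent's source code
    score: float       # stand-in for its benchmark result
    parent: int        # archive index of the parent, -1 for the seed


def evolve_archive(
    seed: List[int],
    evaluate: Callable[[List[int]], float],
    mutate: Callable[[List[int], random.Random], List[int]],
    steps: int,
    rng: random.Random,
) -> List[Variant]:
    archive = [Variant(seed, evaluate(seed), -1)]
    for _ in range(steps):
        # Any archived variant can be chosen as a parent, so the search can
        # return to an older branch instead of only extending the newest one.
        weights = [v.score + 1e-6 for v in archive]
        idx = rng.choices(range(len(archive)), weights=weights, k=1)[0]
        child = mutate(archive[idx].genome, rng)
        score = evaluate(child)
        # Validator step (simplified): reject regressions, keep the rest.
        if score >= archive[idx].score:
            archive.append(Variant(child, score, idx))
    return archive


TARGET = [3, 1, 4, 1, 5, 9, 2, 6]  # hidden optimum for the toy benchmark


def toy_benchmark(genome: List[int]) -> float:
    return sum(g == t for g, t in zip(genome, TARGET)) / len(TARGET)


def toy_mutate(genome: List[int], rng: random.Random) -> List[int]:
    child = list(genome)
    child[rng.randrange(len(child))] = rng.randrange(10)
    return child


if __name__ == "__main__":
    archive = evolve_archive([0] * 8, toy_benchmark, toy_mutate, 300, random.Random(0))
    best = max(archive, key=lambda v: v.score)
    print(len(archive), "variants archived; best score:", best.score)
```

The `parent` index makes the archive a tree rather than a single chain, which is the property the comparison with backtracking search relies on.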

Super-intelligence, in this framework, is not a quantitative scaling problem. It is a search architecture problem.

(see Figure 6)


Figure 6 — The Diligent Learner: Reasoning as Search with Backtracking

  ══════════════════════════════════════════════════════════════════════════════════════
   DILIGENT LEARNER vs. STANDARD COT: SEARCH STRUCTURE COMPARISON
  ──────────────────────────────────────────────────────────────────────────────────────

  STANDARD CoT / RL (no backtracking):      DILIGENT LEARNER (with backtracking):

  Problem                                    Problem
      │                                          │
      ↓                                          ↓
  Step₁                                      Step₁
      │                                          │
      ↓                                          ↓
  Step₂                                      Step₂ ──→ Validator
      │                                          │          │
      ↓                                          ↓    INVALID? ──→ BACKTRACK to Step₁
  Step₃                                      Step₂'         │
      │                                          │    VALID? → continue
      ↓                                          ↓
  Step₄ (committed to wrong path;            Step₃'
   no reversal mechanism)                        │
      │                                          ↓
      ↓                                      SOLUTION (validator confirms)
  WRONG ANSWER
  (detected only at end)

  ──────────────────────────────────────────────────────────────────────────────────────
  Standard approaches fail because:          Diligent Learner succeeds because:
  • Distribution drift in training data      • Depth-first search with validator
  • No embedded backtracking                 • Backtracking at each dead end
  • Exponential cost at depth                • Polynomial cost (proven under 2
    (MCTS)                                     mild assumptions)
  Often "fail on complex reasoning tasks"    "Efficiently learns from CoT data
  (Shalev-Shwartz & Shashua, 2025)           where existing methods fail" (ibid.)
  ══════════════════════════════════════════════════════════════════════════════════════

Figure 6: The Diligent Learner's search structure versus standard CoT approaches. Standard methods treat reasoning as a one-way walk: once a step is committed, there is no mechanism to reverse direction. The Diligent Learner embeds backtracking explicitly, treating reasoning as depth-first search with a validator at each step. Shalev-Shwartz and Shashua (2025) prove this enables efficient CoT learning where SFT and RL fail. The structural parallel to DGM's archive-and-selection loop is the key insight: both are validator-guided search processes, not gradient walks — and this is why they can discover solutions that gradient descent cannot reach.


Key Takeaway: Shalev-Shwartz and Shashua (2025) prove that efficient reasoning from complex CoT data requires embedded search with backtracking, a structural property that SFT and RL lack and that MCTS-style search supplies only at exponential cost. The Diligent Learner provides this property and is proven efficient under two mild assumptions. The structural parallel between DGM's outer-loop evolution and the Diligent Learner's inner-loop reasoning is not a coincidence: both are instances of validated search over a growing tree. Super-intelligence, in this framework, is a search architecture problem, not a data scaling problem.


§7 — What Would Real Open-Ended AI Look Like?

From the preceding sections, the requirements for genuinely open-ended AI can be stated precisely. A system is open-ended if it satisfies three conditions. First, self-modification: it can modify its own algorithms, not merely its parameters. DGM satisfies this; no other system examined in this series does. Second, open-ended novelty: it can discover solutions that are qualitatively new, not refinements within a fixed representational regime. DGM's emergence of peer-review mechanisms and context-management tools in the coding domain is evidence of this property. Third, scalable search: the search process must remain computationally tractable as problem complexity increases, without the exponential cost that defeats naive tree search. ShinkaEvolve's sample efficiency (150 samples for a circle packing record) and the Diligent Learner's polynomial-cost backtracking both indicate that scalable search is achievable in specific domains.

No current system fully satisfies all three conditions in the general case. DGM is self-modifying and produces qualitatively novel solutions, but its demonstrated scope is well-defined coding tasks: it has not self-modified across the full range of tasks that constitute general intelligence. ShinkaEvolve is sample-efficient and broadly applicable across problem domains, but it modifies external programs, not its own reasoning architecture. Han et al.'s (2025) reward-based pretraining proposal — RL from scratch, synthetic task curriculum, small context window — is not yet a demonstrated system at scale; the full decoupling of reasoning from knowledge it theorizes remains to be empirically confirmed.

The gaps are quantifiable and structural. DGM achieves 50.0% on SWE-bench — a substantial improvement over its 20.0% starting point, and over all baselines without self-improvement. The remaining gap to the top of the benchmark reflects the structural difference between the closed domain of code editing with a fixed set of tool interfaces and the open domain of novel software engineering that requires designing new tool interfaces, reasoning about novel domain requirements, and writing tests for problems with no prior specification. Closing this gap requires exactly the capabilities that the theoretical frameworks in this article predict: exact reasoning for novel task types, backtracking search for multi-step problems without a ground-truth prefix, and a reasoning prior decoupled from training-distribution syntax.

The path from DGM 2025 to genuinely open-ended AI is therefore not "more compute on the same architecture." It is "different architecture": reward-based reasoning priors (Han et al., 2025) feeding into Diligent-Learner-style inner-loop search (Shalev-Shwartz and Shashua, 2025) feeding into DGM-style outer-loop self-modification (Zhang et al., 2025), applied to a system whose self-modification scope includes its own reasoning architecture, not just its code editing toolset. Whether this integration is navigable with current techniques, or requires further theoretical foundations, is the open research question that defines the frontier.

Key Takeaway: Three requirements for open-ended AI: self-modification, open-ended novelty, and scalable search. No current system satisfies all three in the general case. DGM satisfies the first two in coding tasks; ShinkaEvolve satisfies the third broadly; Han et al. (2025) theorize the structural prerequisite (decoupled reasoning prior) for all three to compose. The path is architecturally specific and experimentally open. "More scale" is not the path. "Different structure" is.


§8 — Is Evolution the Right Prior?

The DGM and ShinkaEvolve results raise a deeper question: is evolutionary search the right optimization prior for building open-ended AI, or is it a pragmatic approximation to something more fundamental?

The case for evolution is the case for open-ended exploration. Gradient descent is a local optimizer: it follows the gradient of a differentiable objective, making small incremental adjustments in the direction of improvement. It is highly efficient within a fixed representational regime where the loss landscape is smooth. But it cannot make discontinuous jumps to qualitatively different solution architectures, because a structural change to a program is a discrete edit with no gradient to follow. Evolutionary search does not require differentiability: it generates candidates through mutation and selects among them by fitness, which can be any measurable property. The DGM results are direct evidence that, in the domain of coding agent design, qualitatively new solutions are available that gradient descent would not reach from the same starting point.
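The contrast between the two optimizers can be seen on a deliberately artificial objective. The sketch below assumes a staircase fitness function (flat almost everywhere, with jumps at the integers); the function names are hypothetical, and the example says nothing about the real landscape of agent design. It shows only that a gradient-following update receives no signal on a plateau, while mutate-and-select still makes progress.

```python
import math
import random
from typing import Callable


def fitness(x: float) -> float:
    """Staircase objective: constant between integers, so its slope is zero there."""
    return float(math.floor(x))


def finite_difference_grad(f: Callable[[float], float], x: float, eps: float = 1e-4) -> float:
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def gradient_ascent(f: Callable[[float], float], x: float, lr: float = 0.1, steps: int = 200) -> float:
    for _ in range(steps):
        x += lr * finite_difference_grad(f, x)  # zero on a plateau: no movement
    return x


def mutate_and_select(
    f: Callable[[float], float], x: float, rng: random.Random, sigma: float = 0.5, steps: int = 200
) -> float:
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, sigma)  # random mutation, no gradient needed
        if f(candidate) >= f(x):               # selection by measured fitness only
            x = candidate
    return x


if __name__ == "__main__":
    start = 0.5
    stuck = gradient_ascent(fitness, start)
    evolved = mutate_and_select(fitness, start, random.Random(0))
    print("gradient ascent ends at fitness", fitness(stuck))
    print("mutate-and-select ends at fitness", fitness(evolved))
```

The staircase is chosen to make the point visible; on a smooth objective the gradient method would typically be the more efficient of the two, which is the trade-off the next paragraph takes up.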

The case against evolution as a sufficient prior is the case for efficiency and convergence guarantees. Evolutionary search is sample-inefficient in the general case: generating and evaluating large populations to find improvements is expensive relative to gradient steps that directly follow the performance landscape. ShinkaEvolve's 150-sample circle packing result is impressive precisely because it is an exception — it relies on three carefully designed mechanisms to achieve that efficiency, and those mechanisms are domain-aware. Without domain-specific parent sampling and novelty rejection criteria, the sample cost returns to the thousands.
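One way to see how novelty rejection saves samples is to filter proposals before paying for evaluation. The sketch below is an assumption-laden illustration, not ShinkaEvolve's mechanism: it swaps in plain string similarity from Python's standard `difflib` module as the novelty measure, and `is_novel`, `evaluate_with_rejection`, and the 0.9 threshold are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def is_novel(candidate: str, archive: Iterable[str], threshold: float = 0.9) -> bool:
    """True if the candidate is less than `threshold` similar to every archived program."""
    return all(
        SequenceMatcher(None, candidate, prior).ratio() < threshold for prior in archive
    )


def evaluate_with_rejection(
    proposals: Iterable[str],
    evaluate: Callable[[str], float],
    threshold: float = 0.9,
) -> Tuple[List[Tuple[str, float]], int]:
    """Evaluate only proposals that pass the novelty filter; count evaluations spent."""
    archive: List[Tuple[str, float]] = []
    evaluations = 0
    for program in proposals:
        if not is_novel(program, (p for p, _ in archive), threshold):
            continue  # near-duplicate: skip the expensive evaluation
        evaluations += 1
        archive.append((program, evaluate(program)))
    return archive, evaluations


if __name__ == "__main__":
    proposals = [
        "def f(x): return x + 1",
        "def f(x): return x + 1",  # exact duplicate
        "def f(x): return x + 2",  # trivial variation
        "import math\ndef g(r): return math.pi * r * r",
    ]
    # len() is a placeholder for a costly fitness evaluation.
    archive, spent = evaluate_with_rejection(proposals, evaluate=len)
    print(spent, "evaluations spent on", len(proposals), "proposals")
```

The threshold is where domain knowledge enters: set too high, near-duplicates consume the evaluation budget; set too low, useful small variations are discarded unevaluated.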

The deeper unresolved question is whether the fitness landscape for general cognitive capability has the structure that makes evolutionary search convergent in the relevant sense. For geometric optimization and code generation, the landscape is tractable. For the full space of cognitive tasks constituting general intelligence, the connectivity and smoothness of the landscape are not established.

The safety tension. Self-modifying systems introduce an alignment discontinuity that gradient-based systems do not present. A system that learns within a fixed architecture can degrade alignment only along the dimensions of its training objective, so misalignment tends to be gradual and detectable. A system that rewrites its own code can, in principle, rewrite its alignment constraints as well, if those constraints are represented in code that falls within the scope of self-modification. Zhang et al. (2025) report that all DGM experiments were conducted with safety precautions including sandboxing and human oversight, the appropriate response at current capability levels. As open-ended self-modification scales to more capable systems, the question of whether alignment constraints can be made modification-invariant (architecturally immutable even when the system rewrites its own reasoning code) becomes the defining safety problem of the open-ended AI research agenda. This is not a reason to stop the research; it is a reason to build safety architecture that scales alongside capability, not as an afterthought.
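The weakest possible form of modification-invariance can be sketched as a guard that sits outside the scope of self-modification and refuses any rewrite that touches a protected region. Everything here is hypothetical and illustrative: the marker strings, `protected_digest`, `accept_modification`, and the `ALLOW_NETWORK` constraint are invented for the example and do not describe the safeguards used in the DGM experiments.

```python
import hashlib

BEGIN = "# --- BEGIN PROTECTED ---"
END = "# --- END PROTECTED ---"


def protected_digest(source: str) -> str:
    """Hash the protected region; str.index raises ValueError if a marker is missing."""
    start = source.index(BEGIN)
    end = source.index(END, start) + len(END)
    return hashlib.sha256(source[start:end].encode("utf-8")).hexdigest()


def accept_modification(current: str, proposed: str) -> bool:
    """Accept a self-modification only if the protected region is byte-identical."""
    try:
        return protected_digest(proposed) == protected_digest(current)
    except ValueError:
        return False  # markers removed or malformed: reject outright


# Toy agent source: one editable tool and one protected constraint.
AGENT = "\n".join(
    [
        "def edit_tool(path): return 'v1'",
        BEGIN,
        "ALLOW_NETWORK = False",
        END,
    ]
)

if __name__ == "__main__":
    improved = AGENT.replace("'v1'", "'v2'")
    tampered = AGENT.replace("ALLOW_NETWORK = False", "ALLOW_NETWORK = True")
    print("tool rewrite accepted:", accept_modification(AGENT, improved))
    print("constraint rewrite accepted:", accept_modification(AGENT, tampered))
```

A check like this compares bytes, not behavior. A rewrite could leave the protected region intact and still override the constraint elsewhere in the code, which is why invariance at the level of behavior, and not just of text, is the hard part of the problem.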

Key Takeaway: Evolutionary search enables qualitative capability jumps that gradient descent cannot make — this is what DGM and ShinkaEvolve demonstrate. It is also sample-inefficient in the general case and relies on fitness landscape assumptions that have not been established for general intelligence. The safety tension is real and structural: self-modification that improves capability can also modify alignment constraints. The research agenda that builds open-ended AI must simultaneously build modification-invariant alignment — these are not two separate problems. They are the same problem.


§ What Comes Next

This is the philosophical high point of the series. We have moved from the alarm of benchmark failures [← A10] through the understanding of plasticity and world models [← A1, A3, A4], through the engineering of stable RL [← A6, A7], through the empirical record of frontier reasoning systems hitting a structural ceiling [← A9], and arrived here: systems that do not merely learn from data, but modify themselves. The next two articles descend from the vertigo into specific architectures.

[→ A11: Thinking Without Tokens: CTM] addresses the inner-loop problem. DGM's self-modification loop requires an agent capable of evaluating complex coding tasks quickly and accurately. The Continuous Thought Machine (CTM), also from Sakana AI, introduces per-neuron temporal memory — a mechanism for deep latent reasoning that operates within the forward pass without generating external reasoning tokens. If CTM's architectural approach scales, it provides the fast, deep inner-loop evaluator that DGM-style outer-loop self-modification requires. The Encode-Think-Decode framework from §2 of this article — iterating over reasoning-relevant layers — is the stepping stone between standard CoT and CTM: both are points on the spectrum from external token-based reasoning to fully internal latent reasoning. [→ A11] resolves the latent computation question that §2 opens.

[→ A12: RL as Educator] closes the series by asking whether RL can serve as the teacher in a recursive self-improvement loop — one where the system generates its own training curriculum and evaluates its own learning. LADDER, introduced in [→ A11] and extended in [→ A12], is a constrained form of DGM self-modification applied to curriculum design rather than code architecture. The recursive teacher loop is DGM's outer-loop logic applied one level down: instead of rewriting agent code, it rewrites the difficulty distribution of the agent's training problems. [→ A12] closes the arc the series has been building: from systems that fail at benchmarks [← A10] to systems that improve their own benchmarks.

For readers who came to this article first: the GVF-based open-ended prediction systems in [← A4] are the historical predecessors of DGM's open-ended agent archive. The open-ended prediction of future reward cumulants that those systems theorized is realized, in a qualitatively different form, in DGM's open-ended improvement of coding capability. The distance between those two points is the distance this series has covered.


Final Key Takeaways

  1. The formal ceiling is established. György et al. (2025) demonstrate by formal argument that statistical learning cannot guarantee exact deductive reasoning — a necessary condition for general intelligence. Every system in Articles 1–9 operates below this ceiling.

  2. DGM crosses the ceiling in a bounded domain. The Darwin Gödel Machine improves from 20.0% to 50.0% on SWE-bench and from 14.2% to 30.7% on Polyglot via open-ended self-modification — not gradient descent, not distillation, not scaling. The mechanism is qualitatively different from every prior system in this series, and it significantly outperforms baselines without self-improvement or without open-ended exploration.

  3. Three groups converge independently. The DeepMind team (formal theoretical limit), Sakana AI (empirical engineering proof), and the MIT-Harvard team of Han et al. (mechanistic explanation of why standard pretraining fails) arrive at the same conclusion: crossing the GI threshold requires self-modification that decouples reasoning from fixed training-distribution statistics.

  4. The current frontier is impressive but unreliable. The AI Scientist evaluation (Beel et al., 2025) documents the gap between autonomous research form and autonomous research reliability — 42% experiment failure rate, hallucinated results, goal inversion undetected. DGM's empirical validation loop addresses this gap structurally.

  5. Scalable search is the architectural key. Shalev-Shwartz and Shashua (2025) prove that efficient complex reasoning requires embedded search with backtracking. ShinkaEvolve demonstrates it with 150 samples for a state-of-the-art result. The DGM's archive-and-selection loop is the same structure at the outer loop. Super-intelligence is a search architecture problem.

  6. The safety question is not separable. Self-modifying systems can rewrite their own alignment constraints. Build modification-invariant alignment alongside capability — not after.


References

[1] Beel, J., Kan, M.-Y., & Baumgart, M. (2025). Evaluating Sakana's AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality Towards 'Artificial Research Intelligence' (ARI)? arXiv:2502.14297.

[2] Colelough, B. C., & Regli, W. (2024). Neuro-Symbolic AI in 2024: A Systematic Review. University of Maryland.

[3] Garcez, A. d'A., & Lamb, L. C. (2020). Neurosymbolic AI: The 3rd Wave. City, University of London / Universidade Federal do Rio Grande do Sul.

[4] György, A., Lattimore, T., Lazić, N., & Szepesvári, C. (2025). Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence. Google DeepMind. arXiv:2506.23908.

[5] Han, S., Pari, J., Gershman, S. J., & Agrawal, P. (2025). General Intelligence Requires Reward-based Pretraining. Proceedings of the 42nd International Conference on Machine Learning (ICML). arXiv:2502.19402.

[6] Koishekenov, Y., Lipani, A., & Cancedda, N. (2025). Encode, Think, Decode: Scaling Test-Time Reasoning with Recursive Latent Thoughts. FAIR at Meta / University College London.

[7] Lange, R. T., Imajuku, Y., & Cetin, E. (2025). ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution. Sakana AI. arXiv:2509.19349.

[8] Shalev-Shwartz, S., & Shashua, A. (2025). From Reasoning to Super-Intelligence: A Search-Theoretic Perspective. arXiv:2507.15865.

[9] Zhang, J., Hu, S., Lu, C., Lange, R., & Clune, J. (2025). Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents. University of British Columbia / Vector Institute / Sakana AI. arXiv:2505.22954.