Essay 12 of 12 · RL · RE

RL as Educator: Training Teachers, Not Just Students

Stop optimizing the student. Train the teacher. Reinforcement learning reframed as the design of curricula.

29 min read · 6,325 words ❖ Series: Continual Intelligence

Anchor papers: Cetin et al. 2025 (RLTs) · Gu et al. 2024 (TeaMs-RL) · Simonds & Yoshiyama arXiv:2503.00735 (LADDER) · Huang et al. 2026 (R2M)


A 7B model trained to explain outperforms distillation pipelines built on models orders of magnitude larger: a 32B reasoning model taught by its explanations scores higher on AIME, MATH, and GPQA than one distilled from far larger systems. The smaller model won not because it was smarter, but because it had been trained for a different objective entirely. It was not trained to solve problems. It was trained to make other models understand.

That gap — between optimizing a model to answer and optimizing a model to teach — is what this article is about.

Eleven articles ago [← A10], the series opened with a diagnosis: the field's benchmarks measure task-switching, not the structural capacity for continuous non-stationary adaptation. That diagnosis was not pessimism; it was the first move in a twelve-article argument. The plasticity crisis [← A1] is not a forgetting problem; it is a structural collapse of future learning capacity, the shape that the absence of a world model takes in a neural network [← A3]. World models, properly designed, restore plasticity by providing the compressed scaffold that prevents representational collapse [← A4]. Architecture must be redesigned, not just training [← A6]. Stable RL is the prerequisite for learning a world model that doesn't collapse mid-training [← A7]. And RL post-training can, under the right conditions, genuinely extend reasoning capacity, as long as the teacher is well-constructed [← A2].

This is the article about what the right teacher looks like when RL builds it.


§1 — Beyond RLHF: RL as a Curriculum Generator

Standard RL post-training, in the form that produced DeepSeek-R1, ProRL, and MiniMax-M1 [← A9], operates on a specific optimization target: the student's parameters θ. The reward signal is correctness — does the final answer match ground truth? The student model is the entity being improved, and it learns by attempting problems and receiving terminal feedback.

This architecture has a structural limitation that is easy to overlook: the teacher is fixed. The training data, the problem selection, the format of questions — everything that frames what the student sees — is curated by humans or inherited from the pre-training distribution. RL optimizes the student's response to that curriculum; it does not optimize the curriculum itself.

The consequence is the Invisible Leash [← A2]: RL can only amplify capabilities already latent in the base model. The curriculum determines the shape of the leash. If problems are consistently too easy, the learning signal degrades toward zero: the student always succeeds, so every attempt earns the same reward and there is nothing to prefer. If they are consistently too hard, RL hits the exploration barrier: the student never succeeds, gradients vanish, and optimization stalls. The curriculum is the binding constraint, and under standard RL post-training, it is not optimized.

The reframe proposed by three independent research programs — RLTs, TeaMs-RL, and LADDER — is simple to state but consequential to implement: apply RL to the teacher. Optimize not θ (student parameters) but the process that generates what the student sees. The reward for the teacher is student improvement — a delayed, non-differentiable signal that cannot be backpropagated through the student's learning process. RL is, architecturally, the right tool: it handles delayed rewards and non-differentiable feedback natively.
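A toy version of that loop, under loud assumptions: the teacher is a softmax policy over five difficulty levels, and `student_gain` is a made-up black box standing in for "train the student on this item, then measure how much it improved." The teacher only ever sees the scalar it returns. To keep the sketch deterministic it averages the score-function (REINFORCE) update over all five actions; a real teacher would sample one action at a time.

```python
import math

DIFFICULTIES = [0, 1, 2, 3, 4]

def student_gain(skill, difficulty):
    """Hypothetical black box: measured student improvement after one lesson.
    It peaks slightly above the student's skill; the teacher never sees this formula."""
    return 0.2 * math.exp(-((difficulty - (skill + 0.5)) ** 2))

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_teacher(skill, steps=200, lr=5.0):
    """Policy-gradient ascent on expected student improvement."""
    logits = [0.0] * len(DIFFICULTIES)
    for _ in range(steps):
        probs = softmax(logits)
        rewards = [student_gain(skill, d) for d in DIFFICULTIES]
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Expected REINFORCE update for a softmax policy:
        # dE[r]/dlogit_d = pi_d * (r_d - baseline). No gradient flows through the student.
        logits = [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

probs = train_teacher(skill=1.6)
best = max(DIFFICULTIES, key=lambda d: probs[d])
print("preferred difficulty:", best)
print("teacher policy:", [round(p, 3) for p in probs])
```

The point of the sketch is the direction of the arrows: the only thing the teacher optimizes against is a number reporting how much the student improved.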

Same RL machinery. Different system component being trained. Different consequence for what the system can learn.

(see Figure 1)


Figure 1 — The Paradigm Shift: From Student Optimization to Teacher Optimization

  ══════════════════════════════════════════════════════════════════════════════════════════════
   RL-AS-STUDENT (standard post-training)     vs     RL-AS-EDUCATOR (this article)
  ──────────────────────────────────────────────────────────────────────────────────────────────

   RL-AS-STUDENT                                    RL-AS-EDUCATOR
   ────────────────────────────────────             ──────────────────────────────────────────

   Fixed curriculum ──────────────────►             Teacher (trainable)
   (human-curated, static)                          → generates curriculum / explanations
          │                                                    │
          │ questions                                          │ curriculum items, explanations
          ▼                                                    ▼
   Student model (θ) ─────────────────►             Student model (θ_s)
   → attempts problem                               → attempts problem, learns from explanation
          │                                                    │
          │ answer                                             │ answer quality / log-probs
          ▼                                                    ▼
   Reward: correctness of answer                   Reward: student improvement
   (immediate, proximal)                           (delayed, non-differentiable)
          │                                                    │
          ▼                                                    ▼
   Gradient ──► update θ (student)                RL signal ──► update teacher params
                                                  (GRPO, policy gradient, or similar)

   OPTIMIZED COMPONENT: student θ                 OPTIMIZED COMPONENT: teacher / curriculum

   REWARD SIGNAL: terminal correctness            REWARD SIGNAL: downstream learning quality
  ──────────────────────────────────────────────────────────────────────────────────────────────
   Same RL machinery. Different optimization target. Different system component being trained.
  ══════════════════════════════════════════════════════════════════════════════════════════════

Figure 1: Standard RL post-training (left) optimizes student parameters θ using correctness as the reward. RL-as-educator (right) optimizes the teacher — the component that generates what the student sees — using student improvement as the reward signal. The RL machinery is identical; the optimization target is different. Opening curriculum design as a learnable component of the training pipeline is the move that A12 examines.


Key Takeaway: Standard RL post-training fixes the curriculum and optimizes the student. RL-as-educator fixes the student loop and optimizes the curriculum generator. The reward for the teacher is student improvement — delayed, non-differentiable, exactly the kind of signal RL handles well. When the teacher is trainable, the curriculum can be designed by the same optimizer that learns from it.


§2 — RLTs: When the Teacher Is Trained, Not Just the Student

The most direct implementation of RL-as-educator is Reinforcement-Learned Teachers (RLTs), proposed by Cetin, Zhao, and Tang (2025) at Sakana AI.

The starting observation is clean: standard RL reasoning trains a model to solve problems from scratch. This is the format examined in [← A9] — given a question, generate a chain of thought, produce an answer, receive a binary correctness reward. The exploration challenge is structural: if the model cannot already solve the problem with some non-zero probability at initialization, the gradient is always zero. RL reasoning is limited, in practice, to models that are already partially capable.

The RLT insight is a clean sidestep. A teacher's value is not measured by whether it can solve problems from scratch. Teachers have access to the solution. What matters is whether they can explain the solution in a way that maximizes student understanding.

Cetin et al. (2025) formalize this directly. An RLT is given both the question and the ground-truth solution in its prompt. Its task is to "connect the dots" — produce a detailed explanation that bridges question to solution in a way tailored to the student's comprehension. The reward is not correctness; it is a dense signal derived from the student model's log probabilities after receiving the explanation. This measures not "is the answer right?" but "does the student now understand the solution's logic?"
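The shape of such a reward can be sketched in a few lines. This is a simplification, not the reward from Cetin et al. (2025), which may include further terms: here the reward is assumed to be the gain in the student's mean per-token log-probability of the ground-truth solution, with and without the explanation in context. The token probabilities are invented for illustration.

```python
import math

def mean_logprob(token_probs):
    """Average log-probability the student assigns to the solution tokens."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def explanation_reward(probs_with_explanation, probs_without_explanation):
    """Assumed dense reward: how much the explanation raises the student's
    per-token log-likelihood of the ground-truth solution."""
    return mean_logprob(probs_with_explanation) - mean_logprob(probs_without_explanation)

# Hypothetical student probabilities for the four tokens of one solution.
without_explanation = [0.10, 0.05, 0.20, 0.10]
with_good_explanation = [0.60, 0.50, 0.70, 0.65]
with_vague_explanation = [0.12, 0.06, 0.22, 0.11]

good = explanation_reward(with_good_explanation, without_explanation)
vague = explanation_reward(with_vague_explanation, without_explanation)
print(f"good explanation reward:  {good:.3f}")
print(f"vague explanation reward: {vague:.3f}")
```

Unlike a binary correctness label, this signal is non-zero for every explanation and graded by how much it helps, which is what makes it dense.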

The training objective avoids exploration collapse entirely. Because the solution is provided, the teacher never needs to find answers it couldn't previously generate. The optimization pressure is entirely on explanation quality — pedagogical clarity, identification of logical leaps, decomposition of reasoning paths matched to the student's current capability.

The result is striking. A 7B RLT produces explanations that yield higher final performance on AIME, MATH, and GPQA than "existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs" (Cetin et al., 2025). Two additional results extend the finding. First, the 7B RLT is effective when teaching 32B students — models substantially larger than the teacher itself — confirming that the optimization target is explanation quality, not the teacher's parametric knowledge. Second, RLTs maintain their effectiveness when applied zero-shot to out-of-distribution tasks, suggesting the learned explanation strategy generalizes across problem types without retraining.

The pedagogical mechanism maps onto the observation from [← A5]: the shape of reasoning matters independent of correctness. An RL-trained teacher learns not just what to convey but how — the format, pacing, and decomposition of reasoning that maximizes student comprehension. The teacher's output distribution is optimized for learning transfer, not for correctness signaling.


Key Takeaway: RLTs decouple teaching ability from problem-solving ability. By giving the teacher the solution and rewarding explanation quality via dense student log-probability signals, Cetin et al. (2025) demonstrate that a 7B model trained to explain outperforms distillation from orders-of-magnitude-larger models on AIME, MATH, and GPQA. The exploration barrier that limits standard RL vanishes when the teacher's reward is downstream learning, not direct task performance.


§3 — TeaMs-RL: RL for Instruction Dataset Generation

RLTs optimize how a teacher explains a known solution. TeaMs-RL (Gu, Knoll, and Jin, 2024) relocates the optimization upstream: it optimizes what instructions the training dataset contains in the first place.

The target of standard RLHF pipelines is the post-SFT student model. The instruction dataset that seeds the SFT step is treated as fixed — assembled from human annotators or distilled from a large expert LLM such as ChatGPT. TeaMs-RL frames this as a missed optimization opportunity. The instruction dataset determines the student's starting distribution. A dataset with low diversity, concentrated on easy instructions or over-represented in certain domains, sets the student's capability ceiling before RL even begins.

Gu et al. (2024) train an "instructor LLM" using RL where the reward is diversity of the generated instructions, evaluated by a separate "reviewer LLM." The instructor LLM learns to generate instructions that maximize diversity — not instructions that maximize performance on any specific task. An expert LLM then generates responses to these RL-designed instructions; the result is a diverse instruction dataset used for a single SFT step, with no subsequent RLHF required.
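To see what a diversity reward might look like as a function, here is a deliberately crude stand-in. TeaMs-RL scores diversity with a reviewer LLM; this sketch substitutes mean pairwise Jaccard distance over word sets, which is an assumption made purely so the example runs without a model.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)

def diversity_reward(instructions):
    """Mean pairwise distance across a batch of generated instructions."""
    pairs = [(i, j) for i in range(len(instructions)) for j in range(i + 1, len(instructions))]
    if not pairs:
        return 0.0
    return sum(jaccard_distance(instructions[i], instructions[j]) for i, j in pairs) / len(pairs)

clustered = ["Add 2 and 3", "Add 4 and 5", "Add 6 and 7"]
diverse = [
    "Prove that the sum of two even numbers is even",
    "Write a haiku about autumn rain",
    "Translate 'good morning' into French",
]

print(f"clustered batch: {diversity_reward(clustered):.3f}")
print(f"diverse batch:   {diversity_reward(diverse):.3f}")
```

An instructor policy rewarded this way is pushed away from near-duplicate instructions, regardless of how any single instruction would score on a downstream task.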

The efficiency result is the headline: TeaMs-RL requires only 5.73% of the expert LLM queries needed by the strong baseline (Gu et al., 2024). For every 100 queries a standard pipeline uses to build its dataset, TeaMs-RL uses fewer than 6 — while achieving better instruction-following performance than WizardLM on complex instruction tasks.

The structural insight is that diversity, not volume, is the binding constraint. A large instruction dataset clustered in easy, similar examples trains a model that performs well on easy, similar examples. RL, with a diversity reward, finds instructions that span the space efficiently. Human annotators and brute-force scaling cannot easily optimize for diversity as an explicit objective; RL can.

From the series perspective, TeaMs-RL illustrates a principle that connects to [← A3]: the world model metaphor applies to curricula as well as environments. A curriculum that compresses diverse instruction-types into a compact, generalizable representation is doing for instruction diversity what a world model does for environment dynamics — reducing the information load while preserving the coverage that enables generalization.


Key Takeaway: TeaMs-RL demonstrates that the instruction dataset is a learnable component of the training pipeline. Applying RL to the dataset generator, with diversity as the reward, achieves better instruction-following than WizardLM while using only 5.73% of the baseline's expert LLM queries. The teacher here is the dataset generator itself, and it can be trained.


§4 — LADDER: Self-Improvement as Teacher-Student Recursion

Both RLTs and TeaMs-RL retain the distinction between teacher and student as separate models. LADDER collapses it.

Simonds and Yoshiyama (2025) introduce Learning through Autonomous Difficulty-Driven Example Recursion (LADDER): a framework where the model becomes its own teacher by recursively generating simpler variants of problems it cannot yet solve. The model trains on the variants it can solve, then on harder variants of those, then harder again — climbing a difficulty tree it constructed for itself.

The mechanism has three components. Variant generation: given a hard problem, the model generates a tree of progressively simpler versions, each one step easier than its parent, preserving the mathematical structure while reducing complexity. Solution verification: each variant is checked numerically, providing a reliable, dense reward signal without human involvement. RL training on the tree: the model trains using GRPO on variants starting from the simplest (leaf nodes) and working toward the root (hardest), in difficulty order.
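The tree and its training order can be sketched as a data structure. The variants below are written by hand for illustration; in LADDER the model generates them itself. The `Variant` class and `levels_bottom_up` helper are hypothetical names, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Variant:
    problem: str
    children: list = field(default_factory=list)   # each child is one step simpler

def levels_bottom_up(root):
    """Group problems by depth, then return the deepest (simplest) level first."""
    levels, frontier = [], [root]
    while frontier:
        levels.append([v.problem for v in frontier])
        frontier = [child for v in frontier for child in v.children]
    return list(reversed(levels))

# Hand-written stand-in for a model-generated difficulty tree.
root = Variant(
    "integrate x**2 * exp(x)",
    children=[
        Variant("integrate x * exp(x)", children=[Variant("integrate exp(x)")]),
        Variant("integrate x**2"),
    ],
)

for stage, problems in enumerate(levels_bottom_up(root), start=1):
    print(f"stage {stage}: {problems}")
```

RL training would then proceed stage by stage, so that each level is attempted only after the level beneath it has been practiced.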

The result breaks a barrier that previous articles identified as fixed. The Invisible Leash [← A2] constrains RL to what the base model can already solve. LADDER sidesteps this by generating the intermediate curriculum that the base model can solve — using the model's own capabilities to identify its competence boundary and construct a stepping-stone path to beyond it. The leash is not extended; it is fully exploited by positioning the student precisely at its edge.

Empirically: a Llama 3B model, starting at 1% accuracy on undergraduate integration problems, reaches 82% after LADDER training — with no human-curated training data, no architectural changes, and no distillation from a larger teacher (Simonds and Yoshiyama, 2025). A 7B model achieves 73% accuracy on the 2025 MIT Integration Bee examination. For reference: GPT-4o scores 42% on the same examination; typical human performance is 15–30%.

The test-time extension, TTRL (Test-Time Reinforcement Learning), applies the same recursion at inference: given a test problem, generate simpler variants, solve them, refine the solution approach, apply it to the original. With TTRL, the same 7B model reaches 90% on the MIT Integration Bee — surpassing o1 (Simonds and Yoshiyama, 2025).

The TTRL mechanism connects to the inference-time compute argument in [← A11]. CTM locates reasoning computation inside the forward pass, between activations. LADDER's TTRL locates it in a test-time RL micro-loop — solving related problems as a mini-curriculum for the specific test instance. Both are shifts from generating more tokens to spending computation differently: depth of recursion rather than length of trace.

The deepest implication is structural: the teacher-student boundary dissolves. The model is simultaneously the teacher (generating simpler variants as the curriculum), the student (training on those variants via RL), and the examiner (verifying solutions numerically). This is not a metaphor — it is the literal operational structure of LADDER. The recursive self-improvement loop described here is the DGM self-modification loop [← A8] applied to curriculum design: the system changes what it learns from, not just how it responds.

(see Figure 2)


Figure 2 — Curriculum Design Loop and LADDER Recursion

  ════════════════════════════════════════════════════════════════════════════════════════════
   LADDER: RECURSIVE TEACHER-STUDENT COLLAPSE
  ────────────────────────────────────────────────────────────────────────────────────────────

   STANDARD RL-AS-EDUCATOR (RLTs)               LADDER: SELF-REFERENTIAL RECURSION
   ──────────────────────────────────            ──────────────────────────────────────────────

   TEACHER MODEL (separate)                      TARGET PROBLEM (hard; model fails at init)
   ├─ receives question + solution                           │
   ├─ generates explanation                                  │ [teacher role] variant generation
   │                                                         ▼
   │  explanation ──────────────────►             DIFFICULTY TREE
   ▼                                              Level N:  original problem   ← training goal
   STUDENT MODEL (separate)                       Level N-1: one step simpler
   ├─ learns from explanation                     Level N-2: simpler still
   │                                                 ...
   │  log-prob reward ◄──────────────              Level 1:  solvable at initialization
   ▼                                                         │
   UPDATE: teacher parameters                               │ [student role] GRPO training
                                                             │  (bottom-up: leaves → root)
   Reward signal:                                            ▼
   student improvement (dense, via log-probs)     MODEL (same model: teacher + student + examiner)
                                                  ├─ generates variants      [teacher]
                                                  ├─ trains on variants      [student]
                                                  └─ verifies via numerics   [examiner]
                                                             │
                                                  TTRL: at test time, same loop runs
                                                  per test instance (micro-curriculum)

   TEACHER ≠ STUDENT (two models)               TEACHER = STUDENT = EXAMINER (one model)
   Inner loop only                               Nested loops: training + inference
  ════════════════════════════════════════════════════════════════════════════════════════════

Figure 2: Curriculum design loop for standard RL-as-educator (left, RLTs) versus LADDER's self-referential recursion (right). RLTs maintain a teacher-student distinction with dense log-probability rewards flowing back to update the teacher. LADDER collapses the distinction — one model generates, trains on, and verifies its own curriculum. The LADDER outer loop is the DGM self-modification architecture [← A8] applied to curriculum design: the system changes what it learns from, not just how it responds to a fixed curriculum.


Key Takeaway: LADDER achieves 1% → 82% on undergraduate integration with a 3B model and 73% on the MIT Integration Bee with a 7B model, surpassing GPT-4o (42%) and typical human performance (15–30%), with no human data or architectural scaling. TTRL extends the same recursion to test time, reaching 90% and surpassing o1. Teacher, student, and examiner are the same model at different positions in the recursion: self-improvement is not the outcome; it is the mechanism.


§5 — When the Evaluator Must Evolve Too

RLTs, TeaMs-RL, and LADDER each optimize some component of the teaching pipeline while treating one element as relatively stable: the evaluator. The reward signal — whether from correctness labels, diversity metrics, or numerical verification — is assumed to be a reliable proxy for learning quality. In long-horizon RL training, this assumption breaks.

Huang et al. (2026) identify the failure mode precisely. In standard RLHF, the reward model (RM) is trained on human preference data and then held fixed during RL training of the policy θ. But the policy evolves. As θ changes, the distribution of policy outputs changes; the RM, trained on the original distribution, increasingly assigns scores based on features that were correlated with quality in the original data but are no longer meaningful for the evolved policy. Policy models exploit these spurious patterns — response length, markdown formatting, superficial linguistic cues like specific n-grams — to maximize reward without genuinely improving alignment. This is reward overoptimization.

The distribution gap is not static, and it is not semantic. Huang et al. (2026) demonstrate that the hidden states of the evolving policy carry information about the current policy distribution that is not captured by semantic representations of response text alone. The correlation between policy hidden-state similarity and reward discrepancy is strongly negative: Pearson −0.683, Spearman −0.711, Kendall −0.545 (Huang et al., 2026). Responses that look semantically similar but belong to different regions of the policy's output distribution receive systematically different reward scores — differences a surface semantic reward model cannot track.
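For readers less familiar with the three statistics, the sketch below computes them on synthetic data. The numbers are invented and have no connection to the measurements in Huang et al. (2026). Pearson measures linear association, Spearman applies Pearson to ranks, and Kendall counts concordant versus discordant pairs. The rank-based helpers assume no ties.

```python
import math
from itertools import combinations

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    pairs = list(combinations(range(len(x)), 2))
    score = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1 for i, j in pairs)
    return score / len(pairs)

# Synthetic example: higher hidden-state similarity, lower reward discrepancy.
similarity = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
discrepancy = [0.05, 0.20, 0.10, 0.30, 0.25, 0.45, 0.35, 0.60]

print(f"Pearson  {pearson(similarity, discrepancy):+.3f}")
print(f"Spearman {spearman(similarity, discrepancy):+.3f}")
print(f"Kendall  {kendall(similarity, discrepancy):+.3f}")
```

All three agreeing in sign, as they do in the reported results, suggests the relationship is monotone rather than an artifact of a few outliers.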

R2M (Real-Time Aligned Reward Model) redesigns the reward model's scoring head to incorporate the evolving hidden states of the policy at each training step. Rather than retraining the full reward model, R2M introduces a lightweight aggregation component — a Sequence-to-Token Cross Attention module plus a Time-Step-Based Weighted Combination — that integrates policy hidden states into the reward computation with minimal overhead. The synchronization is continuous at every training round.

Formally, R2M provides a tighter upper bound on reward misalignment than the vanilla RM. Where ε measures the extent of reward misalignment, Huang et al. (2026) establish:

  ε(t)_R2M    ≤  (1 - γ(t))^(1/2) · C  +  ΔD(t) · L

  ε(t)_vanilla ≤  C  +  ΔD(t) · L

  where γ(t) ∈ [0, 1], C > 0

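Since (1 − γ(t))^(1/2) < 1 for any γ(t) > 0, R2M's bound is strictly tighter: the worst-case misalignment it guarantees is lower than the static RM's at every step where γ(t) > 0. This is a statement about the guarantee, not about realized misalignment; the empirical results address the latter.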
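A quick numeric check of the two bounds shows how the gap opens as γ(t) grows. The values chosen for C, ΔD(t), and L are arbitrary placeholders picked only to make the arithmetic visible; they are not taken from the paper.

```python
import math

def r2m_bound(gamma, C, delta_d, L):
    """Upper bound on misalignment with R2M: sqrt(1 - gamma) * C + delta_d * L."""
    return math.sqrt(1.0 - gamma) * C + delta_d * L

def vanilla_bound(C, delta_d, L):
    """Upper bound on misalignment for a static reward model: C + delta_d * L."""
    return C + delta_d * L

# Placeholder constants (assumptions for illustration only).
C, delta_d, L = 1.0, 0.2, 0.5

for gamma in [0.0, 0.25, 0.75, 1.0]:
    print(f"gamma={gamma:.2f}  R2M bound={r2m_bound(gamma, C, delta_d, L):.3f}  "
          f"vanilla bound={vanilla_bound(C, delta_d, L):.3f}")
```

At γ(t) = 0 the two bounds coincide; at γ(t) = 1 the C term disappears and only the ΔD(t) · L term remains.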

Empirically: RLOO + R2M improves AlpacaEval 2 win rate by 5.2%–8.0%, length-controlled win rate by 2.9%–6.1%, and TL;DR summarization win rate by 6.3% compared to vanilla RLOO (Huang et al., 2026), with minimal additional computational cost.

The implication for RL-as-educator is direct. An RL-trained teacher optimizes against a reward signal. If that reward signal is the student's log probabilities (as in RLTs), the effective reward model is the student model at each training step. The student is changing; the teacher's reward signal must track those changes. R2M addresses this problem when the evaluator is a learned proxy for human preferences, but the structural insight generalizes: the evaluator is a learnable component of the teaching pipeline, and it must be updated alongside the student it is evaluating.


Key Takeaway: A static reward model increasingly fails to track the evolving policy's true distribution — Huang et al. (2026) establish a negative correlation of Pearson −0.683 between hidden-state similarity and reward discrepancy. R2M's continuous incorporation of policy hidden states reduces this misalignment gap, yielding 5.2%–8.0% gains on AlpacaEval 2 with negligible overhead. For RL-as-educator: the evaluator judging the teacher's effectiveness must evolve alongside the teacher it is evaluating.


§6 — What RL-Teacher Research Reveals About the RL-Student Debate

Articles 2 and 5 pose complementary questions. Article 2 [← A2] asks: does RL teach LLMs to reason, or does it merely amplify pre-existing latent capability? The Invisible Leash constrains RL to the base model's frontier; grokking and skill composition can move that frontier, but only incrementally. Article 5 [← A5] asks: does the shape of reasoning traces matter independently of correctness? Yes — RLVR compresses trace diversity while improving pass@1, and there is an independent axis for trace quality that correctness alone does not capture.

RL-as-educator adds a third question that A2 and A5 did not ask: was the teacher appropriate for the student in the first place?

The Invisible Leash argument identifies a ceiling: RL cannot manufacture new capabilities. But it does not say where that ceiling sits for a given model, or whether the training curriculum is positioned to reach it. A student trained on problems that are systematically too easy will never approach the ceiling. A student trained on problems systematically too hard will stall at the exploration barrier. The teacher determines where in the difficulty space the student trains — and under standard RL post-training, the teacher is not optimized.

LADDER's core result illustrates this precisely. The 3B model at initialization scored 1% on integration problems — not 0%. The capability existed at the margin. Standard RL on full integration problems would have yielded near-zero gradients; the tasks sat at the exploration barrier. LADDER's curriculum construction brought those tasks into the trainable regime by decomposing them into problems the model could already solve. The Invisible Leash was not extended; it was fully exploited by positioning the student at its exact competence edge, consistently, for every training problem.
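One way to express "positioning the student at its competence edge" in code is to filter problems by estimated pass rate. This is not LADDER's procedure, which builds and climbs a variant tree; it is a simpler illustration of the same idea, and the pass rates and thresholds are invented.

```python
def reward_variance(pass_rate):
    """Variance of a binary correctness reward with success probability p: p * (1 - p)."""
    return pass_rate * (1.0 - pass_rate)

def select_trainable(pass_rates, low=0.1, high=0.9):
    """Keep problems the student sometimes, but not always, solves."""
    return {name: p for name, p in pass_rates.items() if low <= p <= high}

# Hypothetical pass rates estimated from repeated sampling.
pass_rates = {
    "full integral": 0.01,
    "one step simpler": 0.35,
    "two steps simpler": 0.70,
    "trivial": 0.99,
}

for name, p in pass_rates.items():
    print(f"{name:18s} pass rate={p:.2f}  reward variance={reward_variance(p):.4f}")

print("trainable:", sorted(select_trainable(pass_rates)))
```

The full problem and the trivial one both carry almost no reward variance, for opposite reasons. The intermediate variants are where a binary reward still discriminates between attempts.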

The implication: the RL-student debate in A2 and A5 is partly a debate about teacher quality. When the teacher is well-matched — placing training problems consistently at the student's edge of competence — RL achieves more of its theoretical potential. When the teacher is poorly matched, RL achieves less, and the resulting pessimism about RL's reach is partly an artifact of teacher quality rather than a fundamental constraint.

This is not a resolution of the Invisible Leash. It is a more precise statement of where the leash binds. The leash constrains the ceiling; teacher quality determines how close to that ceiling the student can train.


Key Takeaway: RL-as-educator reframes the debates from A2 and A5. The Invisible Leash constrains the ceiling; teacher quality determines how close to the ceiling the student trains. LADDER shows that optimal curriculum placement enables RL to extract gains that appeared bounded by exploration failure. The RL-teacher and RL-student research programs are not parallel — they are two lenses on the same optimization problem, one at the component level and one at the system level.


§7 — Limits: What RL-Trained Teachers Cannot Do

The RL-as-educator paradigm is powerful in its current form. Its limits are also worth naming precisely, not as disqualifications but as the boundary conditions that define what human curriculum design still provides.

Reward hacking in explanation generation. RLTs are trained on student log-probability rewards. This is a better proxy for learning quality than answer correctness, but it remains a proxy. A teacher that learns to maximize student log probabilities may optimize for explanations that are fluent, familiar, or highly predictable within the student's pre-training distribution — without necessarily producing explanations that build durable understanding. The distinction between "this explanation is easy to predict" and "this explanation transfers durably" is not captured by log probabilities alone. A more precise reward for genuine conceptual transfer does not yet exist.

Distribution mismatch between teacher training and student deployment. Cetin et al. (2025) demonstrate zero-shot effectiveness on out-of-distribution task types — a strong result. But "out-of-distribution" here means different problem topics within the same reasoning paradigm and student architecture class. A teacher trained to explain mathematical reasoning to one class of student models may produce explanations poorly calibrated for a student with a different pre-training distribution, different tokenization, or qualitatively different reasoning style. Teacher-student co-evolution under architecture distribution shift is not addressed by any of the four papers in this article.

Catastrophic forgetting in the teacher. Teacher models may themselves be trained sequentially: first on mathematical explanation, then on coding, then on scientific reasoning. In that setting, the same plasticity dynamics documented in [← A1] and [← A6] apply to them. A teacher that forgets how to explain mathematical reasoning after being fine-tuned for code explanation fails its students without any visible failure signal until downstream task performance degrades. Continual learning for teacher models is not addressed here; the problem is structural, and the solutions from earlier in this series (architecture redesign [← A6], plasticity repair [← A1], world model scaffolding [← A3]) apply just as much to teachers as to students.

Verification brittleness for domains without computational oracles. LADDER's self-improvement loop is bounded by the reliability of its verification mechanism. It works for integration because numerical integration checking is computationally tractable. For domains where correctness verification is hard — open-ended reasoning, commonsense inference, long-horizon planning — the recursion can amplify confident errors. A model that generates wrong-but-confident simpler variants will train on false premises; RL will reinforce the false chain. LADDER's 1% → 82% result is partly a property of the domain's verifiability, not a general guarantee about recursive self-improvement.
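What "computationally tractable" means here can be shown directly. The sketch below checks a proposed antiderivative by comparing it against a numerical integral on one interval. It is one plausible form of numerical verification, not the exact check used in the LADDER paper; the interval, tolerance, and function names are assumptions.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def verify_antiderivative(integrand, candidate, a=0.0, b=1.0, tol=1e-6):
    """Accept the candidate if F(b) - F(a) matches the numerical integral."""
    numeric = simpson(integrand, a, b)
    claimed = candidate(b) - candidate(a)
    return abs(numeric - claimed) < tol

integrand = lambda x: x * math.exp(x)
correct_answer = lambda x: (x - 1) * math.exp(x)   # true antiderivative of x * exp(x)
wrong_answer = lambda x: x * math.exp(x)           # a confident but incorrect guess

print("correct answer accepted:", verify_antiderivative(integrand, correct_answer))
print("wrong answer accepted:  ", verify_antiderivative(integrand, wrong_answer))
```

A check like this costs a few thousand function evaluations and needs no human. Open-ended reasoning or long-horizon planning has no analogous oracle, which is why the recursion does not transfer to those domains for free.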

What these four failure modes converge on: human curricula encode implicit knowledge about what students find difficult, what misconceptions are typical, and what trajectory of conceptual development leads to durable understanding. These are not accessible from output correctness or log probability. RL-trained teachers are optimized for measurable proxies; the unmeasured aspects of learning quality remain the province of human curriculum design.


Key Takeaway: RL-trained teachers have four characteristic failure modes — log-probability reward hacking, distribution mismatch under student architecture shift, catastrophic forgetting of teaching skills during sequential teacher training, and verification brittleness outside computationally checkable domains. These limits are not disqualifying; they define where human curriculum design remains irreplaceable, and where the CRL solutions from earlier in this series (plasticity repair, world model scaffolding) apply to teacher models as much as to student models.


§8 — Closing the Loop: From Diagnosis to Self-Teaching Systems

This is the twelfth article. It is time to name what the series argued.

The opening diagnosis [← A10]: the field's continual RL benchmarks measure task-switching, not the structural capacity for continuous non-stationary adaptation. Existing benchmarks fail at least one of the formal desiderata for CRL. The field has been evaluating the wrong thing, systematically, for years.

The crisis behind the diagnosis [← A1]: the problem is not forgetting in the classical sense. Forgetting is about the past; plasticity loss is about the future. Networks that have lost plasticity cannot form new representations regardless of what is presented to them. Plasticity loss is the collapse of future learning capacity — and it accumulates silently, task by task, until the network is effectively frozen.

The theoretical explanation [← A3, ← A4]: plasticity loss is the shape that the absence of a world model takes in a neural network. A network that has learned a compressed, generalizable model of its environment can reuse that model for new tasks without collapsing its representational capacity. A network that has not built such a model must overwrite existing representations to accommodate new tasks — and eventually runs out of room. GVFs were pointing toward world models without naming them. DreamerV3 built what GVFs were pointing toward.

The engineering response [← A6, ← A7]: architecture must be redesigned for continual learning, not just training procedures. And RL training must be stable before it can build anything; an unstable RL training loop destroys the world model before it can help.

The understanding of what RL actually does [← A2, ← A5]: RL post-training amplifies capabilities latent in the base model. It cannot manufacture new ones. But it can compress the reasoning trace distribution toward reliable correct paths, improving the shape of thought alongside its outcomes. These are both real effects. Neither alone is the whole story.

The scale limit [← A9]: prolonged RL training on frontier models hits a consistent wall — pre-training headroom, reward signal quality, entropy stability. The wall is not algorithmic; it is structural. Three escape routes exist: evolutionary self-modification [← A8], architectural compute [← A11], and curriculum self-design (this article).

The frontier [← A8, ← A11]: self-modifying systems change what they are. CTM changes where computation happens — inside the forward pass, without token cost. LADDER changes what the system learns from.

These are not separate escape routes. They are the same move applied to different components of the learning pipeline.

(see Figure 3)


Figure 3 — The 12-Article Arc: From Benchmark Diagnosis to Self-Teaching Systems

  ════════════════════════════════════════════════════════════════════════════════════════════════
   CONTINUAL INTELLIGENCE — THE 12-ARTICLE ARC
  ────────────────────────────────────────────────────────────────────────────────────────────────

  PART I: THE PROBLEM        PART II: FOUNDATION       PART III: SOLUTIONS        PART IV: FRONTIER
  ──────────────────────     ───────────────────────   ─────────────────────────  ─────────────────────────

  [A10]──────►[A1]           [A3]──────►[A4]           [A6]──────►[A7]            [A9]
  [EV][CL]   [CL]           [CL][WM]  [CL][WM][RL]   [CL][RL]   [RL]            [RE][RL]
  Benchmark  Plasticity     Big World  GVFs as         Forgetting  Stable RL at   Frontier
  Diagnosis  Crisis         Hypothesis Proto-WMs       Transformer Scale          Reasoning
                                                                                      │
                                                       [A2]◄──────►[A5]              │
                                                       [RL][RE]   [RE][RL]           │
                                                       Does RL     Shape of          │
                                                       Reason?     Thought           │
                                                                                      ▼
                                                                               [A8]────►[A11]
                                                                               [GI]     [RE][EV]
                                                                               Darwin-  Thinking
                                                                               Gödel    Without
                                                                                        Tokens
                                                                                           │
                                                                                           ▼
                                                              ┌────────────────────────────────────────┐
                                                              │                [A12]                   │
                                                              │               [RL][RE]                 │
                                                              │          RL AS EDUCATOR                │
                                                              │     (self-teaching systems)            │
                                                              └────────────────────────────────────────┘

  Node colors (domain tags):
  [CL] = blue    [WM] = green    [RL] = orange    [RE] = purple    [EV] = red    [GI] = multi

  ────────────────────────────────────────────────────────────────────────────────────────────────
                                 ◄──────── SERIES THESIS ────────►
    "The plasticity crisis in CRL is the shape that the absence of a world model takes in a
     neural network. Fix the world model, fix plasticity — then teach the system to teach itself."
  ════════════════════════════════════════════════════════════════════════════════════════════════

Figure 3: The 12-article arc. Reading order runs from A10 to A12 across the four parts shown. Node colors indicate primary domain tags: [CL] blue, [WM] green, [RL] orange, [RE] purple, [EV] red, [GI] multi-color. A12 closes at the convergence point: a system that designs its own curriculum, trains its own teacher, and recursively improves both. The series thesis is the center: fix the world model, fix plasticity, teach the system to teach itself.


We have traveled from "benchmarks measure the wrong things" to "systems that improve their own teachers." That is the distance the field has covered, and the distance this series has covered.

The arc is explicit: diagnosis (A10) → crisis (A1) → theory (A3, A4) → engineering (A6, A7) → understanding (A2, A5) → scale (A9) → frontier (A8, A11) → self-teaching (A12). Each article in the series extends and grounds the thesis by one component. Together they form an argument that the plasticity problem is one problem viewed from twelve angles — and that the solution is also one solution, approached from twelve directions.

The self-teaching systems examined here — RLTs, TeaMs-RL, LADDER, R2M — are not a completed program. LADDER's loop requires domain-verifiable tasks. RLTs require a stable student model to evaluate against. R2M requires that the policy's hidden states remain informative as proxies for preference alignment. These are real constraints. The Limits section named them without resolution.

What A12 proposes is an architecture for thinking about the training pipeline as a fully learnable object. Every component — the curriculum, the teacher, the reward evaluator — can be optimized. The optimizer for each is RL. The signal for each is the downstream component's performance. The result is a nested optimization loop where the system's capacity for self-improvement is itself a subject of optimization.
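A toy sketch can make the nested loop concrete. It is not the method of any paper discussed here: the student is a single scalar skill, the teacher is an epsilon-greedy bandit, and the names `LEVELS`, `student_step`, `run_fixed`, and `run_taught` are hypothetical. The only thing it is meant to show is the wiring, where the teacher's reward is the student's measured improvement rather than task correctness.

```python
import random

LEVELS = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical task difficulties


def student_step(skill, difficulty):
    """Toy student: gains most when the task sits slightly above its skill."""
    gap = difficulty - skill
    # Gain peaks at gap = 0.1 and is zero when the task is far too easy or too hard.
    gain = max(0.0, 0.05 - 0.25 * abs(gap - 0.1))
    return min(1.0, skill + gain)


def run_fixed(difficulty, steps=300):
    """Standard setup: the curriculum is fixed and only the student changes."""
    skill = 0.0
    for _ in range(steps):
        skill = student_step(skill, difficulty)
    return skill


def run_taught(steps=300, epsilon=0.1, alpha=0.3, seed=0):
    """Educator setup: a bandit teacher picks each task, rewarded by student gain."""
    rng = random.Random(seed)
    value = [0.0] * len(LEVELS)  # teacher's running estimate of each level's payoff
    skill = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(LEVELS))  # explore a random difficulty
        else:
            arm = max(range(len(LEVELS)), key=lambda i: value[i])  # exploit
        new_skill = student_step(skill, LEVELS[arm])
        reward = new_skill - skill  # signal = downstream component's improvement
        # Constant step size so old estimates fade as the student moves on.
        value[arm] += alpha * (reward - value[arm])
        skill = new_skill
    return skill


if __name__ == "__main__":
    print("fixed hard curriculum:", run_fixed(0.9))
    print("teacher-chosen curriculum:", run_taught())
```

In this toy, a curriculum fixed at the hardest level never moves the student, because every task sits outside the range where the student can gain anything. The teacher-driven loop makes progress because its choices are scored by what the student actually learns. A fuller version of the idea would put a second learner above the teacher, which is the nesting the paragraph above describes.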

This is the shape the field is moving toward. Not a single model getting better — a system that can redesign the conditions under which it learns.


Key Takeaway: The series thesis reaches its final extension at A12: fix the world model, fix plasticity — then teach the system to teach itself. RLTs train the teacher. TeaMs-RL trains the dataset generator. LADDER collapses teacher and student into one self-referential loop. R2M trains the evaluator to evolve alongside the policy. Four papers, four components of the training pipeline made learnable. The field is moving from optimizing models to optimizing the training pipeline as a whole.


Epilogue: Three Open Questions

This series deliberately leaves three questions open. Not because the literature is thin — it is substantial — but because the evidence does not yet resolve them. Stating them clearly is more useful than speculating past the data.


1. Open-ended CRL agents: Can a DGM-style self-modifying agent satisfy all of Khetarpal's CRL desiderata simultaneously?

The CRL desiderata established in [← A10] — learning continuously, generalizing across tasks, maintaining plasticity, not forgetting — have never been satisfied simultaneously by any deployed system. The Darwin-Gödel Machine [← A8] proposes self-modification as the mechanism for satisfying them in principle, but the empirical demonstrations are small-scale and domain-constrained. A system that self-modifies its architecture, curriculum, teacher, and reward signal in a unified loop — satisfying all desiderata continuously — does not exist. Whether it can exist is the organizing question for the next decade of continual RL research.


2. CTM at scale: Does per-neuron temporal memory scale to models with 70B+ parameters, or does the synchronization overhead dominate?

The Continuous Thought Machine [← A11] demonstrates a qualitatively different locus for reasoning computation at small scale — inside the forward pass, in the synchronization patterns between neurons, without reasoning tokens. The synchronization matrix S_t has dimensionality D × D for D neurons; at 70B-parameter scale, the memory cost and communication overhead of maintaining per-neuron temporal state are non-trivial. The CTM paper's experiments are small-scale. Whether the architectural principle survives the transition to frontier-model scale is unknown. The answer determines whether token-denominated reasoning can be structurally displaced, or only supplemented at the margin.
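The quadratic cost can be checked with back-of-envelope arithmetic. The sketch below assumes a dense D × D matrix stored at 2 bytes per entry; both the storage format and the neuron counts are assumptions chosen for illustration, since the text does not state how D relates to parameter count or whether the full matrix is ever materialized. The function names are hypothetical.

```python
def sync_matrix_bytes(d: int, bytes_per_entry: int = 2) -> int:
    """Bytes for one dense D x D matrix (default 2 bytes per entry, e.g. fp16)."""
    return d * d * bytes_per_entry


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical neuron counts, not figures from the CTM paper.
    for d in (4096, 32768, 262144):
        print(f"D={d}: {to_gib(sync_matrix_bytes(d)):.3f} GiB per matrix")
```

Under these assumptions, 32,768 neurons come to 2 GiB per matrix, and an eightfold increase in D multiplies that by 64. The point is only the scaling law: whatever constant applies in practice, doubling D quadruples the state that has to be stored and communicated.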


3. WM-native evaluation: What would a benchmark look like that evaluates world model quality directly, not just downstream task performance?

Every benchmark examined in this series — from the CRL desiderata audit [← A10] to the MIT Integration Bee [← A11, this article] — evaluates world model quality indirectly, through downstream task performance. A model that achieves high task performance through memorization, distributional exploitation, or curriculum overfit passes all current benchmarks. A model that has genuinely constructed a compressed, generalizable model of its environment may score identically. The field currently has no way to distinguish these cases. A benchmark that directly measures world model quality — generalization to novel environment configurations, calibrated predictive accuracy on counterfactual transitions, uncertainty over world states — does not yet exist. Building it is not optional; it is the prerequisite for the field knowing whether it is making progress on the problem this series identified in Article 1.
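One of the ingredients named above, calibrated predictive accuracy, already has standard metrics. The sketch below computes expected calibration error (ECE) over a model's predicted probabilities for binary transition outcomes. It is a hedged illustration of what a single component of such a benchmark might measure, not a proposal for the benchmark itself; the data, the binary framing of transitions, and the function name are all assumptions.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: weighted average gap between stated confidence and observed frequency.

    probs    -- predicted probabilities in [0, 1] that an event occurs
    outcomes -- 1 if the event occurred on that held-out transition, else 0
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        confidence = sum(p for p, _ in bucket) / len(bucket)
        frequency = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(confidence - frequency)
    return ece


if __name__ == "__main__":
    # Hypothetical predictions on four held-out transitions.
    print(expected_calibration_error([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0]))
    print(expected_calibration_error([1.0, 1.0], [1, 0]))
```

A metric like this looks at the predictions themselves rather than at downstream reward, so two models with identical task scores can still separate on it. It does not address the other two ingredients, generalization to novel configurations and uncertainty over world states, which would need their own measurements.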


The series ends here. The open questions are not a failure of the argument — they are its continuation. The distance from "benchmarks measure the wrong things" to "systems that improve their own teachers" is a decade of work by hundreds of researchers. The next decade has its own distance. The starting point is these three questions, and the observation that none of them has an answer yet. The final page of a good argument is not a summary. It is an instruction: go find out.


§ What Comes Next

This is the final article in the Continual Intelligence series. For readers arriving at A12 first: the reading order begins at [← A10: The Benchmark Gap] and builds the argument across twelve articles. Each article is self-contained; the series rewards sequential reading.

The argument in compressed form: systems fail to learn continuously because they lack world models. World models stabilize the representational substrate that plasticity requires. Reinforcement learning is the signal that trains both, provided training is stable. When stability is achieved, RL can be directed not just at the student but at every component of the training pipeline. That is the program this series constructed. It is also where the three open questions live: what would it mean to carry that program out completely?


Final Key Takeaways

  1. Standard RL post-training fixes the curriculum and optimizes the student. RL-as-educator fixes the student loop and optimizes the curriculum generator. Same RL machinery; different optimization target. When the teacher is trainable, the curriculum can be designed by the same optimizer that learns from it.

  2. RLTs decouple teaching ability from problem-solving ability. A 7B model optimized for explanation quality via dense student log-probability rewards outperforms distillation from orders-of-magnitude-larger models on AIME, MATH, and GPQA. The exploration barrier that limits standard RL vanishes when the reward is downstream learning, not direct task performance.

  3. TeaMs-RL demonstrates that the instruction dataset is a learnable component of the training pipeline. Applying RL to the dataset generator, with diversity as the explicit reward, achieves better instruction-following than WizardLM while using only 5.73% of the baseline's expert LLM queries.

  4. LADDER dissolves the teacher-student boundary. By recursively constructing its own difficulty curriculum, a 3B model improves from 1% to 82% on integration problems without human data or architectural changes. With TTRL, the 7B variant reaches 90% on the MIT Integration Bee, surpassing o1. Teacher, student, and examiner are the same model at different positions in the same recursion.

  5. The evaluator must evolve alongside the student. R2M's continuous incorporation of the policy's hidden states reduces reward overoptimization, yielding gains of 5.2%–8.0% in AlpacaEval 2 win rate (WR) at negligible overhead. A static reward model is a curriculum bottleneck; the evaluator is the fourth learnable component of the teaching pipeline.

  6. RL-as-educator reframes the Invisible Leash. The Leash constrains the ceiling; teacher quality determines how close to that ceiling the student trains. The debates in A2 and A5 about what RL teaches are partly debates about whether the teacher was appropriate for the student in the first place.

  7. The series thesis reaches its final extension: fix the world model, fix plasticity — then teach the system to teach itself. From benchmark diagnosis (A10) to self-teaching systems (A12), the argument is complete. The three open questions define what "complete" does not yet mean.


References

[1] Cetin, E., Zhao, T., & Tang, Y. (2025). Reinforcement Learning Teachers of Test Time Scaling. Sakana AI Technical Report.

[2] Gu, S., Knoll, A., & Jin, M. (2024). TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning. Preprint. Code: https://github.com/SafeRL-Lab/TeaMs-RL

[3] Simonds, T., & Yoshiyama, A. (2025). LADDER: Self-Improving LLMs Through Recursive Problem Decomposition. Tufa Labs Technical Report. arXiv:2503.00735

[4] Huang, Z., Xia, X., Ren, Y., Zheng, J., Xiao, X., Xie, H., Li, H., Liang, S., Dai, Z., Zhuang, F., Li, J., Ban, Y., & Wang, D. (2026). Real-Time Aligned Reward Model beyond Semantics. Preprint, March 10, 2026.