Essay 09 of 12 · RERL

Reasoning at Scale: What DeepSeek-R1, ProRL, and Prolonged RL Reveal

DeepSeek-R1, ProRL and prolonged RL reveal how far reasoning can be pushed when you simply do not stop training.

36 min read · 7,891 words · Series: Continual Intelligence

Anchor papers: DeepSeek-AI arXiv:2501.12948 · Liu et al. arXiv:2505.24864 · MiniMax arXiv:2506.13585 · Sun et al. arXiv:2509.21016


DeepSeek-R1-Zero trained with pure RL and no human-annotated reasoning data. On AIME 2024, it started at 15.6% pass@1 and finished at 77.9% — surpassing the average human competitor. NVIDIA's ProRL then trained for roughly 16,000 GPU-hours on 136,000 verifiable problems and extracted gains the standard pipeline had missed entirely: +54.8% on logic puzzles over the distilled 1.5B baseline, a domain so structurally different from mathematics that the improvement cannot be explained by residual training signal from pre-training. MiniMax-M1 subsequently demonstrated that achieving comparable performance requires only 25% of DeepSeek-R1's FLOPs at 100K generation length — if you redesign the attention mechanism around the scaling constraint.

Three systems. Three different answers to the question "what does prolonged RL actually buy?" This article builds the comparison table, reads the numbers, and names the wall these systems are all approaching. Part III of this series [← A7] gave you the stability toolkit. Part IV opens here, with the tools running into their limits.


§1 — The Frontier Reasoning Moment: 2024–2025

The 2024–2025 wave of frontier reasoning systems did not arrive all at once. It arrived as a controlled experiment that the research community ran simultaneously, without coordinating. DeepSeek published a recipe. NVIDIA extended it. MiniMax rearchitected it. Each team believed they were doing something different. What they were actually doing was sampling from the same distribution of what prolonged RL training does to a capable base model.

Zhang, Neubig, and Yue (2025) ran the controlled version of this experiment. Their central finding reframes the debate: RL produces genuine capability gains, measurable by pass@128 on out-of-distribution problems, only when two conditions hold simultaneously. First, pre-training must leave sufficient headroom — tasks that are already solved or completely out-of-reach provide no learning signal. Second, RL training data must target the model's edge of competence: problems that are difficult but not yet intractable. This is the zone where the reward signal is non-zero but imperfectly exploited, and therefore where optimization is meaningful. Zhang et al. (2025) call this the model's "edge of competence" — a phrase that belongs in the vocabulary of anyone planning RL post-training.
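
The "edge of competence" condition can be made concrete as a data filter. The sketch below is a minimal illustration, not the procedure from Zhang et al. (2025): the function name, the thresholds, and the example pass rates are all hypothetical. It keeps only problems whose empirical pass rate is neither zero nor one, which is the zone where an outcome reward carries a usable signal.

```python
def edge_of_competence(pass_rates, low=0.0, high=1.0):
    """Return problem ids whose empirical pass rate lies strictly between low and high.

    pass_rates maps a problem id to the fraction of sampled attempts that were
    correct. Thresholds are illustrative assumptions, not published values.
    """
    return [pid for pid, rate in pass_rates.items() if low < rate < high]


# Hypothetical pass rates estimated from repeated sampling of a base model.
pass_rates = {
    "already_solved": 1.0,       # no headroom: every sample is correct
    "hard_but_reachable": 0.125, # rare successes: useful signal
    "out_of_reach": 0.0,         # no successes: no signal
    "medium": 0.5,
}

selected = edge_of_competence(pass_rates)
print(selected)  # ['hard_but_reachable', 'medium']
```

Tightening `low` and `high` (for example to 0.05 and 0.9) would be one way to bias training toward harder problems; the right values are an empirical question the sketch does not answer.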

The corollary is uncomfortable. If a model's pre-training has already covered the task distribution exhaustively, RL will find nothing new to learn. The apparent capability gains from RL post-training in heavily benchmarked domains (competition mathematics, coding) may be amplification of near-threshold knowledge rather than creation of new reasoning capacity. What the Invisible Leash argument [← A2] established theoretically, Zhang et al. (2025) established empirically with controls.

A complementary lens comes from Yeo, Tong, Niu, Neubig, and Yue (2025), who systematically investigated the conditions under which long chain-of-thought reasoning emerges. Four findings shape the rest of this article: reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed — reward shaping is essential for stabilizing the growth of CoT length. Core abilities like error correction are present in base models but require significant compute and careful RL design to reliably incentivize. Scaling verifiable reward signals with filtered noisy web solutions shows strong out-of-distribution potential. And SFT, while not strictly necessary, simplifies training by providing a stable initialization before RL begins.

These two papers together define the problem space. What follows is the empirical record of six systems attempting to navigate it.

(see Figure 1)


Figure 1 — Cross-System Comparison: What Prolonged RL Buys

  ══════════════════════════════════════════════════════════════════════════════════════
   FRONTIER REASONING SYSTEMS: EMPIRICAL COMPARISON
  ──────────────────────────────────────────────────────────────────────────────────────
   SYSTEM                   RL APPROACH          KEY INNOVATION         PRIMARY GAIN REPORTED
  ──────────────────────────────────────────────────────────────────────────────────────
   DeepSeek-R1-Zero         Pure GRPO,           No SFT cold start;     AIME 2024: 15.6% →
   (DeepSeek-AI, 2025)      no SFT               pure RL emergence      77.9% pass@1

   DeepSeek-R1              GRPO, 4-stage        Cold start SFT +       Surpasses o1-preview
   (DeepSeek-AI, 2025)      pipeline             dual RL stages         on math, code, STEM

   ProRL / Nemotron-1.5B    GRPO + DAPO +        KL penalty +           +54.8% logic puzzles,
   (Liu et al., 2025)       KL control +         reference reset;       +14.7% math,
   ~16K GPU-hours           policy reset         136K diverse tasks     +13.9% code vs 1.5B base

   MiniMax-M1               Large-scale RL +     Lightning attention     25% FLOPs vs
   (MiniMax, 2025)          CISPO                + hybrid MoE;          DeepSeek-R1 @ 100K
   512 H800s × 3 weeks                           1M token context       tokens; SWE + tool use

   RL Grokking / DELTA      Staged RL:           Dense→binary           0% → 100% on
   (Sun et al., 2025)       dense reward         reward curriculum;     Manufactoria-HAS;
   Qwen3-4B-Instruct        warm-up → binary     DELTA benchmark        grokking after ~450 steps

   Long CoT / Demystified   RL with reward       Noisy web solution     Strong OOD potential;
   (Yeo et al., 2025)       shaping + SFT init   filtering at scale     error correction emerges
  ──────────────────────────────────────────────────────────────────────────────────────
   Compute is listed only where the source papers report it (ProRL, MiniMax-M1);
   GPU-hours are not reported uniformly across all systems.
  ══════════════════════════════════════════════════════════════════════════════════════

Figure 1: Six frontier reasoning systems compared on RL approach, innovation, and primary measured gain. The comparison table is the argument: despite different architectures, model sizes, and training recipes, every system encounters the same structural constraints — pre-training headroom, reward signal quality, and entropy stability. The specific gains are secondary; the consistent pattern is primary.


Key Takeaway: The 2024–2025 frontier reasoning moment is not a collection of isolated breakthroughs. It is a coordinated natural experiment on what prolonged RLVR does to a capable base model. The consistent finding across systems: RL expands capability only at the edge of competence, where the training signal is non-zero but the task is not yet solved.


§2 — DeepSeek-R1: The Baseline

The baseline for this empirical comparison is DeepSeek-R1, and its most revealing form is not the final polished model but its experimental predecessor: DeepSeek-R1-Zero.

DeepSeek-AI (2025) built DeepSeek-R1-Zero by applying GRPO directly to DeepSeek-V3-Base — no supervised fine-tuning, no human-annotated reasoning traces, no cold-start data. The reward signal was purely outcome-based: correctness of the final answer against ground truth. The training configuration sampled 16 outputs per question, used a maximum sequence length of 32,768 tokens (extended to 65,536 tokens after 8,200 training steps), set KL coefficient to 0.001, and ran with learning rate 3×10⁻⁶.
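
To see why a purely outcome-based reward is enough to drive learning, it helps to look at how GRPO turns a group of sampled outputs into per-output advantages. The sketch below is a simplified illustration under stated assumptions: rewards are binary correctness scores, normalization uses the population standard deviation, and `eps` is a hypothetical stabilizer. It is not DeepSeek's training code.

```python
from statistics import mean, pstdev

GROUP_SIZE = 16  # outputs sampled per question, as reported above


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread.

    No learned value function is involved: the group itself is the baseline.
    eps is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: 4 of 16 sampled answers match the ground truth.
rewards = [1.0] * 4 + [0.0] * (GROUP_SIZE - 4)
advantages = group_relative_advantages(rewards)

# A group where every answer is correct yields zero advantage everywhere,
# so that question contributes nothing to the update.
uniform = group_relative_advantages([1.0] * GROUP_SIZE)
```

Correct outputs receive a positive advantage and incorrect ones a negative advantage, while a uniformly solved (or uniformly failed) question produces no signal at all.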

The result was documented in real time on the AIME 2024 benchmark. Pass@1 accuracy climbed from 15.6% at training start to 77.9% at convergence. With majority voting across 16 samples (cons@16), the model achieved 86.7% — above the average human competitor. What DeepSeek-AI (2025) found more striking than the final number was the trajectory: the model developed self-reflection, verification, and dynamic strategy adaptation without being taught these behaviors. They emerged as the optimizer found them necessary to improve accuracy. "Rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies."
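
The gap between 77.9% pass@1 and 86.7% cons@16 comes from how the two metrics score the same set of samples. A minimal sketch for a single question, using made-up answers and hypothetical helper names:

```python
from collections import Counter


def pass_at_1(answers, truth):
    """Fraction of individual samples that are correct."""
    return sum(a == truth for a in answers) / len(answers)


def cons_at_k(answers, truth):
    """1.0 if the most common answer among the k samples is correct, else 0.0.

    Ties are broken by first occurrence, which is an arbitrary choice here.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == truth else 0.0


# Hypothetical 16 samples for one problem whose true answer is "204".
answers = ["204"] * 7 + ["210"] * 5 + ["96"] * 4

print(pass_at_1(answers, "204"), cons_at_k(answers, "204"))  # 0.4375 1.0
```

Fewer than half the samples are right, yet the majority vote is correct because the wrong answers disagree with each other. Averaged over a benchmark, that is how cons@16 can exceed pass@1.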

DeepSeek-R1-Zero had visible problems: language mixing within a single chain-of-thought (switching between English and Chinese mid-reasoning), and difficulty generalizing beyond reasoning-intensive domains to open-domain question answering and writing. These problems were not algorithmic failures of RL; they were downstream effects of an absence of behavioral anchoring. DeepSeek-R1 addressed this with a four-stage pipeline: cold-start SFT to establish behavioral priors → RL stage 1 on reasoning tasks → rejection-sampling SFT to distill good reasoning traces → RL stage 2 incorporating helpfulness and harmlessness signals alongside reasoning.

DeepSeek-R1's limitations deserve explicit acknowledgment, because they define the shape of the wall. The model excels on closed-form reasoning tasks with verifiable ground truth — competition mathematics, structured coding problems, STEM questions. It underperforms on long-horizon tasks that require maintaining coherent context across extended generation lengths; MiniMax-M1 (MiniMax, 2025) noted this gap explicitly, describing its 1-million-token native context window as "8x the context size of DeepSeek R1" and designed specifically for "complex tasks that require processing long inputs." Format generalization across novel problem presentations is also bounded — a limitation the skill composition and grokking work (§§4–5) illuminates mechanistically.

(see Figure 2)


Figure 2 — DeepSeek-R1 Training Pipeline: From Base Model to Reasoning System

  ══════════════════════════════════════════════════════════════════════════════
   DEEPSEEK-R1 MULTI-STAGE TRAINING PIPELINE
  ──────────────────────────────────────────────────────────────────────────────

   STAGE 0          STAGE 1           STAGE 2          STAGE 3
   ──────────       ──────────        ──────────        ──────────
   DeepSeek-V3      Cold-Start SFT    RL Stage 1        Rejection-
   Base Model   →   (thousands of →   GRPO on       →   Sampling SFT
                    long CoT          reasoning         (distill
                    examples;         tasks;            best traces
                    human-aligned     outcome-only      from Stage 1
                    format)           rewards)          RL model)

                                                              ↓

                                                       STAGE 4
                                                       ──────────
                                                       RL Stage 2
                                                       (reasoning
                                                       + helpfulness
                                                       + harmlessness)

                                                              ↓

                                                       DeepSeek-R1

  ──────────────────────────────────────────────────────────────────────────────
   R1-ZERO BYPASS: Skip Stage 1 (cold-start SFT); go Base → GRPO directly.
   Result: 15.6% → 77.9% AIME pass@1, but language mixing and format issues.
  ══════════════════════════════════════════════════════════════════════════════

Figure 2: DeepSeek-R1's four-stage training pipeline. The critical design choice is Stage 1 (cold-start SFT): a small set of human-aligned long-CoT examples provides behavioral anchoring before RL begins, preventing the language mixing and format instability observed in DeepSeek-R1-Zero. The R1-Zero bypass demonstrates that RL alone produces strong reasoning but not behavioral coherence.


Key Takeaway: DeepSeek-R1 demonstrates that a capable base model plus GRPO plus verifiable rewards is sufficient to produce self-reflective reasoning behaviors. The four-stage pipeline is not theoretical elegance — it is engineering response to the failure modes of pure RL (language mixing, narrow generalization). The failure modes of R1-Zero are the same failure modes that every subsequent system had to solve differently.


§3 — What Happens with Prolonged RL

NVIDIA's ProRL paper (Liu et al., 2025) poses the question directly: if the DeepSeek-R1 recipe works, what happens when you run it longer, on more tasks, with explicit stability controls? The short answer: capability boundaries expand, but they require active engineering to keep expanding. The failure mode waiting at the end of uncontrolled prolonged RL is entropy collapse — and it arrives before you would expect.

Liu, Diao, Lu, Hu, Dong, Choi, Kautz, and Dong (2025) trained Nemotron-Research-Reasoning-Qwen-1.5B using ProRL, a methodology they define by three properties: a training dataset of 136K verifiable problems spanning five domains (mathematics, code, STEM, logical puzzles, instruction following), training duration exceeding 2,000 steps, and explicit stability mechanisms to prevent premature convergence. Training ran on four 8×H100-80GB nodes for approximately 16,000 GPU-hours.
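
The hardware figures translate into wall-clock time with simple arithmetic. The sketch below assumes all GPUs were busy for the entire run, which the paper's summary figure does not guarantee, so treat the result as a rough lower bound on calendar time.

```python
NODES = 4
GPUS_PER_NODE = 8
GPU_HOURS = 16_000  # approximate figure reported above

gpus = NODES * GPUS_PER_NODE            # total accelerators in the job
wall_clock_hours = GPU_HOURS / gpus     # assumes full utilization throughout
wall_clock_days = wall_clock_hours / 24

print(gpus, wall_clock_hours, round(wall_clock_days, 1))  # 32 500.0 20.8
```

Roughly three weeks on 32 GPUs for a 1.5B model is a useful calibration point for what "prolonged" means in practice.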

The gains over the distilled 1.5B baseline (DeepSeek-R1-Distill-Qwen-1.5B) were domain-structured in a revealing way. Math: +14.7% average pass@1. Code: +13.9%. STEM reasoning (GPQA Diamond): +25.9%. Instruction following (IFEval): +22.0%. Logic puzzles (Reasoning Gym): +54.8%. The logic puzzle result is the signal. Competition mathematics had been the primary training domain for the base model; logic puzzles were genuinely novel territory. The 54.8% improvement on logic puzzles relative to a 14.7% improvement on mathematics is not random variation — it documents what access to the model's edge of competence buys. Mathematics was already partially solved; logic puzzles had headroom.

The stability mechanics of ProRL reveal what extended training encounters. Entropy collapse is the central failure mode: the model's output distribution becomes overly peaked during training, reducing exploration and causing the reward signal to stagnate as the model converges on a narrow set of strategies. Liu et al. (2025) address this with three interlocking mechanisms. DAPO's decoupled clipping (separate lower and upper bounds ε_low = 0.2, ε_high = 0.4) maintains exploration by resisting full distribution concentration. Dynamic sampling filters prompts where accuracy is uniformly 1 or 0 — keeping the training signal on the productive middle zone. And explicit KL divergence regularization maintains a penalty between the current policy π_θ and a reference policy π_ref: L_KL-RL(θ) = L_GRPO(θ) − β·D_KL(π_θ || π_ref).
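
The effect of decoupled clipping is easiest to see on a single token. The sketch below is a scalar illustration, not ProRL's implementation: it applies the clipped surrogate with the bounds quoted above and then subtracts a KL penalty. The `beta` value and the example ratios are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.4):
    """Per-token PPO-style surrogate with separate lower and upper clip bounds.

    ratio is pi_theta(token) / pi_old(token). A larger eps_high lets
    low-probability tokens with positive advantage grow further before clipping.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def kl_regularized_objective(surrogate, kl, beta):
    """Objective to maximize: surrogate minus beta times the KL to the reference."""
    return surrogate - beta * kl


# A token whose probability rose by 30% and that had positive advantage.
symmetric = clipped_surrogate(1.3, 1.0, eps_low=0.2, eps_high=0.2)  # clipped at 1.2
decoupled = clipped_surrogate(1.3, 1.0)                             # not yet clipped

# beta = 0.01 is a placeholder, not a value taken from the paper.
objective = kl_regularized_objective(decoupled, kl=0.5, beta=0.01)
```

With symmetric bounds the 30% increase is already capped; with the wider upper bound it still contributes in full, which is the sense in which decoupled clipping preserves exploration.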

The reference policy reset is the least-obvious mechanism and the most important for extended training. As training progresses, the KL penalty increasingly dominates the loss, producing diminishing policy updates. ProRL addresses this by periodically resetting the reference policy to the current online policy and reinitializing optimizer states — allowing the model to continue learning while retaining the stabilizing effect of KL regularization. This is not a trick; it is the engineering answer to the self-limiting property of KL-penalized RL over long training horizons.
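
A one-parameter toy makes the self-limiting behavior visible. Everything in the sketch below is an assumption chosen for illustration: the "policy" is a single number, the reward pulls it toward `target`, and a quadratic distance to the reference stands in for the KL penalty. It is not a model of ProRL's actual dynamics, and because plain gradient ascent has no optimizer state, the optimizer reinitialization step has no counterpart here.

```python
def train(steps, reset_every=None, target=10.0, beta=4.0, lr=0.1):
    """Gradient ascent on reward minus a penalty for drifting from a reference.

    reset_every: if set, copy the current parameter into the reference every
    that many steps (the toy analogue of a reference policy reset).
    """
    theta, theta_ref = 0.0, 0.0
    for step in range(1, steps + 1):
        reward_grad = target - theta               # pull toward higher reward
        penalty_grad = beta * (theta - theta_ref)  # pull back toward the reference
        theta += lr * (reward_grad - penalty_grad)
        if reset_every and step % reset_every == 0:
            theta_ref = theta                      # re-anchor at the current policy
    return theta


stalled = train(400)                   # fixed reference for the whole run
resumed = train(400, reset_every=50)   # reference re-anchored every 50 steps
```

Under these toy settings the run without resets stalls at target / (1 + beta) = 2.0, where the penalty exactly cancels the reward gradient. With periodic resets the parameter keeps moving toward the target while each individual segment remains anchored.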

Yeo et al. (2025) provide independent confirmation of the mechanism. In their systematic study of long chain-of-thought emergence, they establish that reward shaping is crucial for stabilizing CoT length growth — uncontrolled RL produces erratic length dynamics rather than the smooth scaling associated with improved reasoning. They further demonstrate that scaling verifiable reward signals with filtered noisy web solutions shows strong out-of-distribution potential, particularly for STEM reasoning, suggesting that data quality at scale can substitute for data quantity on narrow structured tasks.

(see Figure 3)


Figure 3 — The Prolonged RL Failure Mode Landscape

  ══════════════════════════════════════════════════════════════════════════════
   FAILURE MODES ALONG THE PROLONGED RL TRAINING TRAJECTORY
  ──────────────────────────────────────────────────────────────────────────────

   TRAINING PHASE        WHAT YOU SEE             WHAT IS ACTUALLY HAPPENING
  ──────────────────────────────────────────────────────────────────────────────
   EARLY (0–200 steps)   Rapid pass@1 gain        Base model headroom
                         ↑ reward, ↑ accuracy      being exploited

   MID (200–1000 steps)  Slower gain, plateau     Strategies near edge-of-
                         visible                   competence require longer
                                                   exploration periods

   LATE (>1000 steps)    ─── FAILURE MODE ONSET ───────────────────────────

   ENTROPY COLLAPSE      Output distribution       KL penalty begins
                         narrows; reward           dominating loss;
                         stagnates                 reference model anchor
                                                   kills gradient signal

   REWARD HACKING        Reward increases but      Model finds reward-
                         real accuracy plateaus    maximizing shortcuts
                         or decreases              (format gaming,
                                                   spurious token patterns)

   MODE COLLAPSE /       Diverse prompts yield     Optimization converges
   HOMOGENIZATION        near-identical outputs    on single dominant
                                                   reasoning strategy

  ──────────────────────────────────────────────────────────────────────────────
   PRORL FIXES:          Reference policy          KL penalty retains
   APPLIED THROUGHOUT    reset + optimizer         exploration; DAPO
                         reinitialization          decoupled clipping
                         restores gradient         maintains diversity
  ══════════════════════════════════════════════════════════════════════════════

Figure 3: Failure mode landscape for prolonged RL training, organized by training phase. The key engineering insight from ProRL is that entropy collapse and reward hacking are not endpoint events — they have onset signatures detectable mid-training. The reference policy reset mechanism is designed to interrupt entropy collapse before it terminates the learning signal. Process-level rewards (Zhang et al., 2025) reduce reward hacking by making the reward signal sensitive to reasoning quality, not just final-answer correctness.


Key Takeaway: ProRL's 54.8% logic puzzle improvement over the distilled 1.5B baseline is the clearest published evidence that prolonged RL can access capability beyond the base model's pre-trained distribution — not through creation of new primitives, but by reaching the edge of competence on domains the standard training pipeline undertrained. The prerequisite is entropy management: without KL control and reference policy resetting, the extended training run would terminate in premature convergence before those gains become accessible.


§4 — Grokking in RL: Sudden Algorithm Acquisition

Prolonged RL does not improve linearly. Sun, Cao, Huang, Bai, Hajishirzi, Dziri, and Song (2025) documented a phenomenon that practitioners have observed but rarely characterized precisely: models trained with RL on hard reasoning problems show a long exploratory plateau of near-zero reward followed by a sudden jump to near-perfect accuracy. They named it, after the analogous phenomenon in supervised learning, RL grokking.

The vehicle for this study is DELTA — Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding. DELTA is a controlled benchmark of synthetic programming problem families designed to isolate two distinct questions. Learnability asks: can RL instill a procedure that the base model cannot execute, even with extensive sampling (pass@K = 0 for large K)? Transferability asks: once learned, does the procedure generalize to out-of-distribution variants that demand the same underlying skill in different forms?

The Manufactoria-HAS family provides the clean demonstration. The reference model — Qwen3-4B-Instruct — achieves 0% full pass rate at pass@128 on this family. Standard GRPO training with binary full-pass rewards stagnates: with no positive signals in any rollout, there is no gradient. Sun et al. (2025) solve this with staged training. A dense reward warm-up phase provides partial credit for intermediate progress, moving the model into the region where full solutions become reachable. Switching to binary full-pass rewards then triggers the grokking dynamic: the model spends approximately 450 training steps in an exploration phase with full-pass rate below 1%, then abruptly discovers the key algorithmic strategy and converges to 100% accuracy. The improvement from 0% to 100% full-pass rate is essentially instantaneous on the training curve — not gradual, not smooth, sudden.
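
The staged schedule can be sketched as a reward function that changes behavior at a step threshold. The sketch assumes the dense signal is the fraction of test cases passed and uses a hypothetical `warmup_steps` value; the paper's exact reward definition and switch point may differ.

```python
def dense_reward(passed, total):
    """Partial credit: fraction of test cases the program passes (assumed form)."""
    return passed / total


def binary_reward(passed, total):
    """Full-pass reward: 1.0 only when every test case passes."""
    return 1.0 if passed == total else 0.0


def staged_reward(step, passed, total, warmup_steps=200):
    """Dense reward during warm-up, binary full-pass reward afterwards.

    warmup_steps is an illustrative placeholder, not a reported setting.
    """
    if step < warmup_steps:
        return dense_reward(passed, total)
    return binary_reward(passed, total)


# The same partially correct program is scored differently in each phase.
early = staged_reward(step=50, passed=3, total=10)
late = staged_reward(step=300, passed=3, total=10)
solved = staged_reward(step=300, passed=10, total=10)
```

During warm-up a program that passes 3 of 10 tests still earns 0.3, so rollouts differ in reward and produce a gradient. After the switch the same program earns nothing, and only complete solutions are reinforced.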

The generalization results, reported on the BouncingSim family (Figure 4), discriminate between types of skill transfer. Along the exploratory axis (extending known skills to harder variants within the same problem family), the model shows solid performance on the Basic (70–85% full-pass) and Easy (50–75%) tiers, with gains degrading on Medium (15–50%) and nearly vanishing on Hard (single digits). For compositional generalization (combining previously separate skills never seen in combination during training), the results are more striking: unseen compositions such as ROT BOX+MOV BOX, MOV BOX+GRAVITY, and MULTI BOX+MULTI OBJ achieve 60–70% full-pass accuracy even though the model never trained on any compositional examples. Transformative generalization (discovering unconventional strategies not structurally related to training problems) remains the weakest axis.

This three-axis structure maps directly onto the series thesis [← A2]: RL reorganizes existing capabilities before it creates new ones, but at sufficient scale, the reorganization enables compositions the base model could not access.

(see Figure 4)


Figure 4 — RL Grokking: The Exploration–Exploitation Phase Transition

  ══════════════════════════════════════════════════════════════════════════════
   RL GROKKING DYNAMICS: MANUFACTORIA-HAS (DELTA BENCHMARK)
  ──────────────────────────────────────────────────────────────────────────────

   Full-pass rate (%)
   100 ┤                                              ╔══════╗
       │                                              ║      ║
       │                                              ║ 100% ║
    75 ┤                                          ╔═══╝      ║
       │                                          ║          ║
       │                                          ║          ╚═══════  100%
    50 ┤                                          ║           achieved
       │                                          ║
    25 ┤                                          ║
       │                                          ║
       │     DENSE-REWARD WARM-UP  │ BINARY REWARD PHASE
     1 ┤─────── near-zero ─────────┼─── near-zero ──── GROKKING ─────
       │     (dense partial credit │  (~450 steps plateau)
       │      builds foothold)     │                   ↑
     0 ├──────────────────────────┼───────────────────────────────────
       0             100          200        350      450    500  steps

  ──────────────────────────────────────────────────────────────────────────────
   GENERALIZATION POST-GROKKING (BOUNCINGSIM):
   Exploratory  (same-family harder variants): 70–85% Basic; 15–50% Medium
   Compositional (unseen skill combinations):  60–70% full-pass
   Transformative (novel strategies):          near-zero (weakest axis)
  ══════════════════════════════════════════════════════════════════════════════

Figure 4: RL grokking dynamics on the Manufactoria-HAS family (DELTA benchmark). The staged training procedure — dense reward warm-up followed by binary full-pass reward — is necessary to produce the phase transition. Without the warm-up, naïve GRPO stagnates at 0% with no gradient signal. The ~450-step exploration plateau followed by sudden jump to 100% is the characteristic RL grokking signature. Compositional generalization (60–70% on unseen combinations) persists after grokking; transformative generalization does not.


Key Takeaway: RL grokking is not a curiosity — it is the mechanistic signature of capability acquisition on tasks that sit at the model's hard edge of competence. The implication for practitioners: training runs that appear to be making no progress (extended near-zero reward phases) may be accumulating the representational prerequisites for sudden capability emergence. Terminating early because "it's not working" destroys this process. Monitoring requires pass@K at multiple values of K, not just pass@1.
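
Monitoring pass@K at several values of K is cheap if you already draw n samples per problem. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where c is the number of correct samples. The estimator is standard practice rather than something this essay's anchor papers introduce, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples of which c are correct.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical plateau checkpoint: 1 correct sample out of 128.
for k in (1, 16, 128):
    print(k, pass_at_k(128, 1, k))
```

At this checkpoint pass@1 is below 1% and looks like failure, while pass@16 is 12.5% and pass@128 is 100%. A rising pass@K at large K during a flat pass@1 plateau is the kind of evidence that would argue against stopping the run.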


§5 — Skill Composition: LLMs Learn to Compose Functions

The grokking phenomenon establishes that RL can unlock capabilities the base model cannot access. Yuan, Chen, Zhang, Cui, Wang, You, Ding, Liu, Sun, and Peng (2025) provide the mechanistic explanation: the capability being unlocked is composition.

The experimental framework is deliberately minimal. Define a skill as the ability to infer the output of a string-transformation function f(x) given input x. If an LLM has learned f independently and g independently — if it can correctly compute f(x) and g(x) for arbitrary x — then what happens when it is asked to compute h(x) = g(f(x)), a composition it has never seen?

The answer without RL: the model fails systematically. The answer with RL training on compositional examples: the model learns h, and the compositional ability generalizes to chains of more than two functions that were never in the training set. Yuan et al. (2025) provide the cross-domain transfer result as the most striking finding: the compositional ability, acquired through RL on a source task, transfers to a target task that shares only the atomic skills — not the compositions and not the task domain. The only requirement is that the model already knows the atomic skills in the target domain before RL training on the source domain begins.
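
The task structure is simple enough to write down directly. The sketch below uses two made-up string transformations as stand-ins for the atomic skills; the specific functions in Yuan et al. (2025) are not reproduced here. It shows what h(x) = g(f(x)) means, why order matters, and what a chain of more than two functions looks like.

```python
from functools import reduce


def f(x):
    """Hypothetical atomic skill: rotate the string left by one character."""
    return x[1:] + x[:1]


def g(x):
    """Hypothetical atomic skill: duplicate the first character."""
    return x[:1] + x


def compose(*funcs):
    """Apply funcs left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), funcs, x)


h = compose(f, g)          # the two-function composition from the text
chain3 = compose(f, g, f)  # a longer chain of the kind held out of training

print(h("abc"), compose(g, f)("abc"), chain3("abc"))  # bbca abca bcab
```

Each atomic function is trivial on its own. The difficulty being measured is whether a model that can compute f and g separately can track the intermediate string and apply them in the right order.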

This is not what next-token prediction training produces. Yuan et al. (2025) ran the same training data through SFT and found that none of these findings replicate: no cross-domain compositional transfer, no generalization to unseen chains of functions. The RL signal is doing something structurally different from imitation learning on the same trajectories.

The mechanistic interpretation connects to the grokking timeline in §4. The ~450-step exploratory plateau before grokking is the model discovering the compositional structure of the task. The sudden jump is the moment the composition becomes executable. The reason compositional generalization survives (60–70% on unseen combinations in DELTA) while transformative generalization does not is that composition reuses known primitives in new arrangements — it requires no new atomic skills, only new wiring. Transformation requires new atomic skills, which RL cannot create without pre-training headroom.

The practical implication follows directly from Zhang et al. (2025) [← §1]: build base models with the necessary atomic skills, then apply RL with appropriate incentivization to enable composition of those skills at the target edge of competence. The base model is the library; RL is the linker.

(see Figure 5)


Figure 5 — The Composition Mechanism: From f(x) and g(x) to g(f(x))

  ══════════════════════════════════════════════════════════════════════════════
   HOW RL ENABLES SKILL COMPOSITION (Yuan et al., 2025)
  ──────────────────────────────────────────────────────────────────────────────

   PRE-RL STATE:               BASE MODEL KNOWLEDGE MAP
   ──────────────────────────────────────────────────────
   Skill f(x):  ✅ known         Source domain atomics
   Skill g(x):  ✅ known         (string transformations)
   h(x)=g(f(x)):❌ unknown       Composition NOT accessible
   Target domain atomics: ✅     Same skills, different domain
   Target domain h(x): ❌         Same composition, target domain

  ──────────────────────────────────────────────────────────────────────────────

   RL TRAINING ON SOURCE TASK COMPOSITIONS:

   Iteration:   0        100       300       450      post-grokk
   h accuracy:  0%       ~0%       ~1%       ~1%       → near 100%
                         ↑_____plateau: RL builds___↑___GROKK →
                                  compositional circuit

  ──────────────────────────────────────────────────────────────────────────────

   POST-RL STATE:              WHAT TRANSFERRED
   ──────────────────────────────────────────────────────
   Source h(x):        ✅     directly trained
   Source >2-func:     ✅     unseen during training
   Target h(x):        ✅     cross-domain transfer
   Transformative:     ❌     new atomic skills = no transfer

   SFT CONTROL (same data):   None of the above transfers ←─ key finding
  ══════════════════════════════════════════════════════════════════════════════

Figure 5: The RL-enabled composition mechanism from Yuan et al. (2025). The top panel shows the pre-RL knowledge state: atomic skills are known, compositions are not. The bottom panel shows what transfers after RL training, including cross-domain transfer to a target task that neither appeared in training nor shares problem structure with the source task. The critical control: SFT on identical data produces none of these transfer effects.


Key Takeaway: RL teaches composition, not new primitives. The capability gains visible in frontier reasoning systems — the logic puzzle improvements in ProRL, the sudden grokking jumps in DELTA — are composition gains: the model learning to chain existing skills in arrangements it could not previously access. The ceiling on this process is the library of atomic skills built during pre-training.


§6 — MiniMax-M1: Scaling Test-Time Compute

The compute-efficiency dimension of prolonged RL is addressed by MiniMax-M1. The previous systems (DeepSeek-R1, ProRL) operated within the transformer's quadratic attention constraint — longer reasoning chains cost quadratically more compute in both training and inference. MiniMax (2025) asked what becomes possible if you remove that constraint.

MiniMax-M1 is a hybrid Mixture-of-Experts architecture built on MiniMax-Text-01, with 456 billion total parameters and 45.9 billion parameters activated per token. The architectural centerpiece is lightning attention — a linear-time attention mechanism that eliminates the quadratic scaling of standard attention with respect to sequence length. The consequence for reasoning: generating 100,000 tokens costs 25% of the FLOPs required by DeepSeek-R1 for the same output length. The model natively supports 1 million token context windows — eight times DeepSeek-R1's maximum context size.
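
The difference between quadratic and linear scaling is worth seeing as numbers. The sketch below is a deliberately crude cost model that looks only at the attention term and ignores everything else in the forward pass, including the fact that MiniMax-M1 is a hybrid that retains some standard attention. It will not reproduce the reported 25% figure; it only shows how the two growth laws diverge with length.

```python
def relative_attention_cost(length, base_length, exponent):
    """Cost at `length` relative to `base_length` under a power-law growth model.

    exponent=2 models standard attention over the full sequence;
    exponent=1 models a linear-time mechanism. Both are idealizations.
    """
    return (length / base_length) ** exponent


BASE = 10_000  # hypothetical reference length

for length in (10_000, 50_000, 100_000):
    quadratic = relative_attention_cost(length, BASE, 2)
    linear = relative_attention_cost(length, BASE, 1)
    print(length, quadratic, linear)
```

Going from 10K to 100K tokens multiplies the quadratic term by 100 and the linear term by 10. The longer the reasoning chain, the larger the share of total compute that the attention mechanism determines.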

The training recipe added one algorithmic innovation: CISPO (Clipped Importance Sampling Policy Optimization), which clips importance sampling weights rather than per-token updates. MiniMax (2025) report that CISPO outperforms other competitive RL variants in their experiments. Full RL training completed on 512 H800 GPUs in three weeks, at a rental cost of approximately $534,700.
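
The distinction between clipping the update and clipping the importance weight shows up in the coefficient that multiplies each token's log-probability gradient. The sketch below works that coefficient out by hand for one token. It reflects a reading of the published objectives rather than MiniMax's code, and the clip ranges are hypothetical.

```python
def ppo_grad_coeff(ratio, advantage, eps=0.2):
    """Coefficient on grad log pi under the PPO-style clipped surrogate.

    When the clipped branch is the smaller one it is selected by the min, and
    since it is constant in the policy the token contributes no gradient.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    if clipped * advantage < ratio * advantage:
        return 0.0
    return ratio * advantage


def cispo_grad_coeff(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Coefficient when the clipped importance weight is treated as a constant
    multiplier on advantage * grad log pi (assumed form of the CISPO update).
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return clipped * advantage


# A token whose probability jumped 50% and that had positive advantage.
dropped = ppo_grad_coeff(1.5, 1.0)   # 0.0: the token is removed from the update
kept = cispo_grad_coeff(1.5, 1.0)    # the token still contributes, with a capped weight
```

Under the clipped surrogate a token that moves too far stops contributing; under the importance-weight clip it keeps contributing with a bounded weight. For rare tokens that mark turning points in long reasoning chains, that difference decides whether they are ever reinforced.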

The benchmark profile shows where lightning attention's advantages materialize. MiniMax-M1 matches or surpasses DeepSeek-R1 and Qwen3-235B on standard mathematical and coding benchmarks, with particular strengths in complex software engineering (SWE-bench Verified), tool utilization (TAU-bench), and long-context retrieval tasks (MRCR 4-needle). These are precisely the domains where the 8× context advantage is load-bearing: software engineering requires tracking global code state across long files; tool utilization requires integrating information from extended tool execution traces; long-context retrieval requires attending to all 1M tokens, not a compressed summary.

The performance profile also shows where the standard reasoning recipe was already saturated. On AIME 2024, MiniMax-M1 achieves 86.0%, an improvement over DeepSeek-R1's reported results, but in a domain that the entire field has been optimizing toward. The more diagnostically interesting results are on SWE-bench Verified (56.0%) and TAU-bench (62.8%), which are not saturated. In those regimes the bottleneck was the quadratic attention constraint, not the model's reasoning capability.

(see Figure 6)


Figure 6 — MiniMax-M1: Compute Efficiency vs. Capability Profile

  ══════════════════════════════════════════════════════════════════════════════
   MINIMAX-M1: ARCHITECTURE TRADE-OFF AND BENCHMARK PROFILE
  ──────────────────────────────────────────────────────────────────────────────

   COMPUTE AT GENERATION LENGTH (FLOPs, relative to DeepSeek-R1):
   ──────────────────────────────────────────────────────────────
   At 100K generated tokens:   MiniMax-M1 ≈ 0.25× DeepSeek-R1 (reported)
   DeepSeek-R1:                standard attention (quadratic); 128K context limit
   MiniMax-M1:                 lightning attention (linear-time); 1M native context
   Trend:                      M1's relative cost falls as generation length grows

  ──────────────────────────────────────────────────────────────────────────────

   BENCHMARK PROFILE (MiniMax-M1 reported results):
   Benchmark                    Score    Domain
   ────────────────────────────────────────────────
   AIME 2024                    86.0%    Math (near-saturated)
   LiveCodeBench                65.0%    Code
   SWE-bench Verified           56.0%    Long-context software eng.
   TAU-bench                    62.8%    Tool utilization
   MRCR 4-needle                73.4%    Long-context retrieval

  ──────────────────────────────────────────────────────────────────────────────

   RL training cost: 512 H800 GPUs × 3 weeks ≈ $534,700 (rental)
   Architecture: 456B total params / 45.9B active per token
   Context: 1M tokens natively
  ══════════════════════════════════════════════════════════════════════════════

Figure 6: MiniMax-M1's compute efficiency profile relative to DeepSeek-R1. With lightning attention (linear-time) in place of standard attention (quadratic-time), MiniMax-M1's cost relative to DeepSeek-R1 falls as generation length grows; the reported anchor is 25% of DeepSeek-R1's FLOPs at 100K generated tokens. The benchmark profile reflects this: competitive on math (where outputs are comparatively short), and strongest on SWE-bench, TAU-bench, and long-context retrieval, where long-context efficiency is the primary constraint.


Key Takeaway: MiniMax-M1 demonstrates that RL reasoning capability is not architecture-agnostic. Lightning attention's linear scaling changes which tasks are tractable, shifting the performance frontier from short-reasoning benchmarks (where DeepSeek-R1 was already competitive) to long-horizon, high-context tasks (software engineering, tool use) where quadratic attention was the practical ceiling. The $534,700 rental cost of the RL phase (pre-training of the base model not included) suggests that competitive RL reasoning training has moved within reach of well-funded research labs, not just hyperscalers.


§7 — The Hivemind Problem: Homogeneity at Scale

Every system in §§1–6 improves performance on structured reasoning tasks with verifiable ground truth. Jiang, Chai, Li, Liu, Fok, Dziri, Tsvetkov, Sap, Albalak, and Choi (2025) document a cost that shows up elsewhere. When many models are optimized toward similar reward signals, with GRPO-family algorithms, on similar data distributions, their outputs converge: not just on the same answers to closed-form questions, but on the same ideas and phrasings for open-ended ones.

Jiang et al. (2025) introduce the Artificial Hivemind concept through a large-scale empirical study. The dataset is INFINITY-CHAT: 26,000 diverse real-world open-ended user queries drawn from WildChat, spanning 6 top-level categories (creative content generation, brainstorm and ideation, analytical and interpretive questions, speculative and hypothetical scenarios, skill development, and others) across 17 subcategories. Each query admits a wide range of plausible answers with no single correct response. The study covers 70+ open-source and closed-source language models (with 25 detailed in the main paper), with 31,250 human annotations at 25 annotations per example.

The finding is two-level. Intra-model repetition: a single model consistently generates similar responses across repeated sampling on the same open-ended prompt. Inter-model homogeneity: different models independently converge on similar ideas, with variation limited to minor phrasing differences. This inter-model homogeneity is the more serious concern. The paper's illustrative example — GPT-4o and phi-4 producing nearly identical metaphors for time ("a silent weaver, meticulously threading moments into a tapestry of memories") — is not cherry-picked; it reflects a systematic empirical pattern across 26K queries and 70+ models.
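The two levels can be made concrete with a toy measurement. The sketch below scores repetition with word-set (Jaccard) overlap, which is a crude lexical stand-in chosen for self-containment; it is not the similarity metric used by Jiang et al. (2025), and the example responses are invented for illustration.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses (1.0 = identical vocabulary)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def intra_model_similarity(samples: list[str]) -> float:
    """Mean similarity between repeated samples from one model (needs >= 2)."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inter_model_similarity(samples_a: list[str], samples_b: list[str]) -> float:
    """Mean similarity between samples drawn from two different models."""
    pairs = list(product(samples_a, samples_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented responses to one open-ended prompt; not taken from the study.
model_a = ["time is a river that carries us", "time is a river carrying us forward"]
model_b = ["time is a river that moves us", "time is a quiet river that carries us"]

print("intra-model (A):  ", round(intra_model_similarity(model_a), 2))
print("inter-model (A/B):", round(inter_model_similarity(model_a, model_b), 2))
```

In a healthy ecosystem the inter-model score would sit well below the intra-model score. The Hivemind finding is that, on open-ended prompts, the two are uncomfortably close.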

A plausible mechanism is structural. Models trained with RLVR on verifiable tasks converge on efficient reasoning strategies. The most efficient strategies become dominant. The same optimizer, applied to the same reward signal, across different architectures and training runs, finds similar attractors. For closed-form reasoning (mathematics, code), this convergence is useful: it suggests the field is actually discovering correct reasoning procedures. For open-ended generation, it represents a capability collapse: the diversity a human population would produce on these queries is compressed into a narrow band of model-typical responses.

The implication for anyone building reasoning-capable systems is not merely aesthetic. A population of models that all produce similar outputs is fragile to adversarial inputs, less useful as an ensemble, and less robust to distribution shift. The entire justification for model diversity in safety-critical applications collapses if all models are, functionally, the same model.

(see Figure 7)


Figure 7 — The Artificial Hivemind Effect: From Diversity to Convergence

  ══════════════════════════════════════════════════════════════════════════════
   ARTIFICIAL HIVEMIND: HOW RLVR PRODUCES INTER-MODEL HOMOGENEITY
  ──────────────────────────────────────────────────────────────────────────────

   PRE-RLVR TRAINING:               POST-RLVR OPTIMIZATION:
   ─────────────────                ────────────────────────
   Model A: output₁                 Model A: output_α
   Model B: output₂      →          Model B: output_α′   ← similar
   Model C: output₃      RLVR       Model C: output_α″   ← similar
   Model D: output₄      training   Model D: output_α‴   ← similar
            ↑ diverse               ↑ homogeneous
   Different training               Same optimizer + same reward
   data → different                 signal → convergent attractors
   representations

  ──────────────────────────────────────────────────────────────────────────────

   WHERE HIVEMIND MATTERS MOST:
   ──────────────────────────────────────────────────────────────
   CLOSED-FORM TASKS  (math, code): Convergence is USEFUL
                                    — models converge on correct procedures

   OPEN-ENDED TASKS   (creative,    Convergence is HARMFUL
                      advisory,    — diversity collapses;
                      hypothetical): ensemble = single model

   ADVERSARIAL INPUTS:              Convergence is DANGEROUS
                                    — shared failure modes; no diversity
                                    as defense

  ──────────────────────────────────────────────────────────────────────────────
   Study: 70+ models, 26K queries, 31,250 human annotations (Jiang et al., 2025)
  ══════════════════════════════════════════════════════════════════════════════

Figure 7: The Artificial Hivemind effect, from Jiang et al. (2025). RLVR training with shared reward signals produces inter-model homogeneity across architectures: different models independently converge on similar outputs. For closed-form reasoning, this convergence is evidence that models are learning correct procedures. For open-ended generation, it represents a capability collapse that makes diverse model populations functionally equivalent. Reward design, not just architecture, determines whether a frontier system contributes to or subtracts from the diversity of the model ecosystem.


Key Takeaway: The Hivemind problem is not a future risk — it is a current empirical finding across 70+ models. The same optimization pressure that makes frontier reasoning systems competitive on benchmarks makes them similar to each other on everything else. This is one of the mechanisms behind the wall this article is documenting: the more models you build with RLVR, the less diverse the model ecosystem becomes, and the less additional capability diversity you can extract from scaling up the number of models.


§8 — Reasoning Environments: The Infrastructure Layer

Every system in §§2–7 depends on a verifiable reward signal. The quality of that signal — its coverage, precision, difficulty gradation, and absence of gaming artifacts — determines the ceiling of what the RL training can discover. Stojanovski, Stanley, Sharratt, Jones, Adefioye, Kaddour, and Köpf (2025) address this bottleneck directly.

REASONING GYM is a library of reasoning environments for reinforcement learning with verifiable rewards. The key innovation is procedural generation: unlike fixed datasets (which are consumed and then memorized), REASONING GYM's 100+ data generators produce virtually infinite training data with adjustable difficulty. Each generator produces new problem instances at runtime; each verifier confirms correctness automatically. This decouples the quality of the reward signal from the size of any static dataset.
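A minimal sketch of the generator-plus-verifier pattern follows. It does not use the REASONING GYM API; the `Problem` class, the `generate_chain_sum` task, and the way `difficulty` maps to operand count and size are all hypothetical, chosen to show how a seed and a difficulty knob yield an unbounded stream of automatically checkable instances.

```python
import random
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Problem:
    question: str
    answer: int


def generate_chain_sum(seed: int, difficulty: int) -> Problem:
    """Build one add/subtract chain; higher difficulty means more, larger terms."""
    rng = random.Random(seed)  # same seed and difficulty -> same problem
    max_value = 10 ** min(difficulty, 4)
    terms = [rng.randint(1, max_value) for _ in range(difficulty + 1)]
    expr, total = str(terms[0]), terms[0]
    for term in terms[1:]:
        sign = rng.choice("+-")
        expr += f" {sign} {term}"
        total = total + term if sign == "+" else total - term
    return Problem(question=f"Compute: {expr}", answer=total)


def verify(problem: Problem, model_output: str) -> float:
    """Binary verifiable reward: 1.0 if the last token is the correct integer."""
    try:
        return 1.0 if int(model_output.strip().split()[-1]) == problem.answer else 0.0
    except (ValueError, IndexError):
        return 0.0  # unparseable or empty output earns nothing


def problem_stream(difficulty: int, start_seed: int = 0) -> Iterator[Problem]:
    """Endless stream of fresh instances; nothing is stored, so nothing is reused."""
    seed = start_seed
    while True:
        yield generate_chain_sum(seed, difficulty)
        seed += 1


stream = problem_stream(difficulty=3)
first = next(stream)
print(first.question, "| reward for correct answer:", verify(first, str(first.answer)))
```

Because instances are derived from a seed rather than stored, the training loop can keep drawing new problems for as long as the run lasts, and raising `difficulty` moves the distribution without any new annotation effort.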

The domain coverage spans eight categories: algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and common games. The games category includes Rubik's cube, Rush Hour, and grid-based reasoning puzzles — domains that require planning and state tracking rather than symbol manipulation, and where the space of valid problems is infinite. This is precisely the class of tasks where fixed datasets fail: a model can memorize the training instances, the dataset cannot be expanded without human effort, and the reward signal degrades as the training set is exhausted.

The connection to ProRL is explicit. Liu et al. (2025) trained on Reasoning Gym tasks as their logic puzzle domain and observed the largest relative improvement (+54.8%) compared to the base model. The procedural generation ensured that every training step encountered genuinely novel instances — the model could not memorize its way to improvement. The 54.8% number is, in part, a tribute to the infrastructure layer: you cannot get it without an environment that generates fresh verifiable problems continuously.

The practical implication is architectural. The correct mental model for prolonged RL training is not "dataset × compute → capability." It is "environment quality × compute → capability." A fixed dataset of 136K math problems and a procedurally generated stream of 136K logic puzzle instances are not equivalent training resources: the second has higher effective data diversity and therefore a higher ceiling. This is one reason, alongside the base model's greater headroom in the domain, that REASONING GYM's logic puzzles produced larger ProRL improvements than the fixed math dataset.


Key Takeaway: The verifiable environment, not the model architecture, is the primary bottleneck in prolonged RL training. Procedural generation — environments that produce novel verifiable instances on demand — is the infrastructure prerequisite for RL training that continues to improve past the point where fixed datasets are memorized. REASONING GYM's 100+ generators across eight domains provide the scaffolding; models trained against them show correspondingly larger gains in those domains.


§9 — Societies of Thought: Multi-Agent Reasoning

The capability gains documented in §§2–8 require a mechanistic explanation. Why does prolonged RL training produce qualitatively different reasoning behavior — the self-reflection, verification, dynamic strategy adaptation described in the DeepSeek-R1 paper — rather than simply amplified token prediction? Kim, Lai, Scherrer, Agüera y Arcas, and Evans (2026) provide one answer, and it is structurally surprising.

Enhanced reasoning in frontier models emerges not from extended computation alone, but from the implicit simulation of complex, multi-agent-like interactions. Kim et al. (2026) call this structure a "society of thought" — a framework in which the model's reasoning process is organized around heterogeneous cognitive perspectives that debate and converge, rather than a single monolithic chain of inference.

The evidence comes from quantitative analysis of reasoning traces and mechanistic interpretability methods applied to DeepSeek-R1 and QwQ-32B. Compared to baseline models and instruction-tuned models without RL reasoning training, these models activate broader conflict between heterogeneous personality- and expertise-related features during reasoning. The internal state of the model during reasoning is not a single coherent sequence; it exhibits a multi-agent structure that manifests as question-answering sequences, perspective shifts, reconciliation of conflicting views, and socio-emotional role dynamics — the structural features of a conversation, not a monologue.

The controlled RL experiments establish causal direction. Base models spontaneously increase conversational behaviors when solely rewarded for reasoning accuracy — no explicit multi-agent training, no conversational scaffolding in the reward signal. The reward for correctness is sufficient to induce the internal debate structure. Kim et al. (2026) further show that fine-tuning models with conversational scaffolding substantially accelerates reasoning improvement compared to both base models and models fine-tuned with monologue-like reasoning traces.

This connects the phenomenology observed in DeepSeek-R1-Zero's emergent behaviors (self-reflection, verification, backtracking) to a structural explanation. On this account, those behaviors are not sophisticated pattern matching of human reasoning traces. They are the outputs of an internal society of thought: multiple cognitive perspectives, organized by the RL reward signal, producing through their debate a reasoning quality that no single perspective would produce alone. The parallel to collective intelligence in human groups is not merely rhetorical; Kim et al. (2026) propose it as a computational analogy with testable structural properties.

The practical corollary: if you want models that reason better, design for the internal debate. Conversational scaffolding in fine-tuning data, diverse role perspectives in prompting, and reward signals that credit multi-step verification rather than just final-answer correctness — all of these address the same underlying mechanism.

(see Figure 8)


Figure 8 — Society of Thought: The Internal Structure of Frontier Reasoning

  ══════════════════════════════════════════════════════════════════════════════
   SOCIETY OF THOUGHT: HOW REASONING MODELS ORGANIZE INTERNAL COGNITION
  ──────────────────────────────────────────────────────────────────────────────

   INSTRUCTION-TUNED MODEL (no RL reasoning):
   ─────────────────────────────────────────────
   Input → [single perspective] → output
            monologue; single cognitive stance
            Low perspective diversity

   REASONING-RL MODEL (DeepSeek-R1, QwQ-32B):
   ─────────────────────────────────────────────
                    ┌──── [Perspective A: domain expert]
                    │         "approach via algebra"
   Input ──→ Debate ┤
                    │     ← conflict, reconciliation →
                    │
                    ├──── [Perspective B: skeptic]
                    │         "check edge cases"
                    │
                    └──── [Perspective C: synthesizer]
                              "combined approach passes"
                                      ↓
                               Final output
            HIGHER perspective diversity than instruction-tuned models
            Activates broader personality- and expertise-related features

  ──────────────────────────────────────────────────────────────────────────────

   CAUSAL EVIDENCE (Kim et al., 2026):
   Base models rewarded only for accuracy → spontaneously increase
   conversational behaviors. Multi-agent structure is RL-induced, not
   pre-programmed. Conversational scaffolding in fine-tuning accelerates
   reasoning improvement vs. monologue fine-tuning.
  ══════════════════════════════════════════════════════════════════════════════

Figure 8: The Society of Thought structure in frontier reasoning models (Kim et al., 2026). Instruction-tuned models reason as a monologue; RL-trained reasoning models implicitly simulate multi-agent debate, activating heterogeneous cognitive perspectives that conflict and reconcile before producing output. The causal evidence — base models spontaneously develop conversational behaviors when rewarded only for accuracy — establishes that this structure is induced by RL, not inherited from training data.


Key Takeaway: The emergent reasoning behaviors in frontier models — self-reflection, verification, backtracking — have a mechanistic explanation: the model has learned to simulate a society of thought. This is not a metaphor for good prompting. It is a structural property of RL-trained reasoning models, verified through mechanistic interpretability. The implication for training design: reward signals that incentivize multi-step verification and perspective shifts should produce models with more robust internal debate structures.


§10 — Practical Guidance: What These Results Mean for Engineers

The empirical record assembled in §§1–9 supports five principles for practitioners planning or evaluating RL reasoning training. They are organized by the decision point they address.

Principle 1: Pre-train for headroom, then apply RL at the edge. Zhang et al. (2025) establish this as the governing constraint: RL produces genuine capability gains only when pre-training has left headroom — tasks the model cannot yet solve reliably at large pass@K. If your base model has been overtrained on the target domain (common for competition mathematics), RL will amplify rather than extend. The ProRL logic puzzle result (+54.8% over a math-saturated baseline) is the cleaner signal: find the domains your base model underperforms on, build training data there, and expect larger RL gains.

Principle 2: Mid-training is underexplored and high-return. Zhang et al. (2025) show that mid-training — targeted data augmentation between pre-training and RL post-training — significantly enhances performance under fixed compute compared to RL alone. This stage is "often underexamined" in the field's current pipeline designs. A practitioner who allocates 100% of post-pre-training compute to RL is leaving mid-training gains on the table.

Principle 3: Manage entropy or training will terminate itself. ProRL's three-mechanism stability stack (KL penalty + DAPO decoupled clipping + reference policy reset) is not optional. Entropy collapse arrives before the capability gains plateau. If you see reward stagnating around 500–1,000 training steps in a prolonged RL run, do not assume convergence — check output entropy first. Monitor Shannon entropy of the token distribution, KL divergence from the reference policy (not just loss), and pass@K at multiple values of K simultaneously.
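The three monitors named in Principle 3 are cheap to compute. The sketch below is a standard-library illustration, not ProRL's tooling: it computes Shannon entropy and KL divergence over a single next-token distribution, plus the widely used unbiased pass@K estimator from n samples with c correct. The `entropy_collapse_suspected` helper and its `floor_fraction` default are hypothetical; pick thresholds from your own runs.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(policy: list[float], reference: list[float]) -> float:
    """KL(policy || reference); assumes reference > 0 wherever policy > 0."""
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def entropy_collapse_suspected(entropy_history: list[float],
                               floor_fraction: float = 0.2) -> bool:
    """Flag a run whose latest entropy fell below a fraction of its start.

    The 0.2 default is an arbitrary illustration, not a published threshold.
    """
    if len(entropy_history) < 2 or entropy_history[0] <= 0:
        return False
    return entropy_history[-1] < floor_fraction * entropy_history[0]


reference = [0.25, 0.25, 0.25, 0.25]   # toy 4-token vocabulary
policy = [0.97, 0.01, 0.01, 0.01]      # a sharply peaked policy
print("entropy:", round(shannon_entropy(policy), 3))
print("KL from reference:", round(kl_divergence(policy, reference), 3))
print("pass@1, pass@8:", pass_at_k(16, 2, 1), round(pass_at_k(16, 2, 8), 3))
```

Tracking pass@K at several K matters because a collapsing policy can hold pass@1 steady while pass@K at large K falls: the model is getting more consistent, not more capable.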

Principle 4: Stage your curriculum for hard domains. Sun et al. (2025) show that standard GRPO stagnates completely on pass@K=0 tasks — no gradient signal, no learning. The staged approach (dense-reward warm-up → binary full-pass reward) is necessary to unlock the grokking transition. For any domain where your model achieves near-zero accuracy at large K, the dense-reward warm-up stage is a prerequisite, not an optimization.
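The "no gradient signal" claim in Principle 4 follows directly from how group-relative advantages are computed. The sketch below uses the usual mean-and-standard-deviation normalization over a group of rollouts. Scoring each rollout by the fraction of unit tests it passes is my stand-in for a dense reward; the exact warm-up reward used by Sun et al. (2025) may differ.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def binary_reward(tests_passed: int, tests_total: int) -> float:
    """Full-pass reward: 1.0 only when every test passes."""
    return 1.0 if tests_passed == tests_total else 0.0


def dense_reward(tests_passed: int, tests_total: int) -> float:
    """Partial credit: fraction of tests passed (illustrative warm-up reward)."""
    return tests_passed / tests_total


# Four rollouts on a hard task: none passes all 10 tests, but they differ.
TESTS_TOTAL = 10
tests_passed = [0, 1, 3, 2]

binary = [binary_reward(p, TESTS_TOTAL) for p in tests_passed]
dense = [dense_reward(p, TESTS_TOTAL) for p in tests_passed]

print("binary advantages:", group_advantages(binary))
print("dense advantages: ", [round(a, 2) for a in group_advantages(dense)])
```

With the binary reward every rollout scores zero, so every advantage is zero and the policy gradient vanishes. The dense reward ranks the same four rollouts, which is what gives the warm-up stage something to climb.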

Principle 5: Use process-level rewards to prevent hacking. Zhang et al. (2025) demonstrate that process-level rewards — rewards that evaluate intermediate reasoning steps, not just final answers — reduce reward hacking and improve reasoning fidelity. The binary correctness reward is vulnerable to format gaming and spurious token patterns that increase reward without improving reasoning quality. Process-level evaluation is more expensive to implement but produces models whose reward gains reflect actual capability improvements.
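A toy contrast between outcome-level and process-level scoring is sketched below. It assumes reasoning traces written as arithmetic steps of the form `a op b = c`, which is a hypothetical format chosen so that each step can be checked mechanically; it is not the reward design evaluated by Zhang et al. (2025).

```python
def outcome_reward(final_answer: int, target: int) -> float:
    """Binary reward on the final answer only."""
    return 1.0 if final_answer == target else 0.0


def step_is_valid(step: str) -> bool:
    """Check one step of the form 'a op b = c' with op in {+, -, *}."""
    try:
        lhs, rhs = step.split("=")
        a, op, b = lhs.split()
        x, y = int(a), int(b)
        expected = {"+": x + y, "-": x - y, "*": x * y}[op]
        return expected == int(rhs)
    except (ValueError, KeyError):
        return False  # malformed steps count as invalid


def process_reward(steps: list[str]) -> float:
    """Fraction of intermediate steps that verify."""
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)


# Task: (2 + 3) * 4, target 20. Both traces end at the right number.
faithful = ["2 + 3 = 5", "5 * 4 = 20"]
lucky = ["2 + 3 = 6", "6 * 4 = 20"]   # two errors that happen to cancel

for name, trace in (("faithful", faithful), ("lucky", lucky)):
    print(name, "outcome:", outcome_reward(20, 20), "process:", process_reward(trace))
```

Both traces earn the full outcome reward, but only the faithful one earns process reward. A policy trained on outcome alone has no reason to prefer it.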

(see Figure 9)


Figure 9 — The Prolonged RL Scaling Curve: Gains, Inflection, and Failure

  ══════════════════════════════════════════════════════════════════════════════
   PROLONGED RL: SCHEMATIC CAPABILITY CURVE (qualitative; axes not to scale)
  ──────────────────────────────────────────────────────────────────────────────

   Capability gain
   (relative to base model)

   HIGH  ┤                         ╔═══════════════
         │                      ╔══╝ PROLONGED RL
         │                   ╔══╝    region: gains
         │                ╔══╝       depend on headroom
         │             ╔══╝          + entropy control
    MED  ┤          ╔══╝
         │       ╔══╝  ← STANDARD RL CONVERGENCE
         │    ╔══╝       (typical 500–2000 step runs)
         │ ╔══╝
    LOW  ┤═╝  Base model capability
         │
       0 └──────────────────────────────────────────
         Low                                    High
                     RL compute (training steps)

  ──────────────────────────────────────────────────────────────────────────────
   FAILURE MODE ONSET (without stability controls):
   [A] ENTROPY COLLAPSE: gain curve flattens prematurely (mid-training)
   [B] REWARD HACKING: reward ↑ but real capability flat/declines
   [C] HIVEMIND EFFECT: gains real but output diversity collapses

   AUTHOR NOTE: Cross-system GPU-hour inflection point not reported as a
   unified metric in the papers reviewed; schematic is qualitative.
  ══════════════════════════════════════════════════════════════════════════════

Figure 9: Schematic scaling curve for prolonged RL training. The gains over the base model grow with training compute, but only under stability controls (KL penalty, reference reset, staged curriculum) that prevent the three failure modes (entropy collapse, reward hacking, Hivemind homogenization). The precise inflection point in GPU-hours is not reported as a unified metric across the papers reviewed; the curve is qualitative and system-dependent. The consistent message across ProRL, DELTA, and MiniMax-M1 is that standard RL run times (typically terminated after 500–2,000 steps) leave substantial capability on the table — gains accessible to longer, stability-controlled runs.


Key Takeaway: The five principles form a pipeline: pre-training headroom → mid-training augmentation → entropy-controlled RL → staged curriculum for hard domains → process-level reward shaping. Each stage addresses a distinct failure mode. Skipping any stage does not reduce gains proportionally — it redirects compute into the corresponding failure mode instead of capability growth.


§ What Comes Next

Part III of this series (A6–A7) gave you the architectural and stability toolkit: attention mechanisms that resist forgetting, KL control that prevents policy drift, gradient monitoring that surfaces failures invisible in the loss curve. That toolkit, applied by every system in this article, explains why some frontier systems succeed where early RL experiments failed. The toolkit is not sufficient.

The wall this article has been documenting is not a training stability wall; stability is solved, at least for the systems reviewed here. It is a capability ceiling. Zhang et al. (2025) characterize it as pre-training headroom: RL extracts from pre-trained knowledge, it does not create it. Yuan et al. (2025) characterize it as compositional reach: the ceiling is the library of atomic skills, not the RL algorithm. The Hivemind results (Jiang et al., 2025) characterize it as optimization pressure: the more systems you train toward the same reward signal, the less additional diversity you extract.

Three escape routes appear in the next articles:

Evolutionary self-modification [→ A8]: If the capability ceiling is the fixed architecture plus fixed pre-training knowledge, the response is to modify the architecture itself. The Darwin-Gödel Machine (A8) is the engineering proof that this is achievable in a defined setting. The connection to the results here: DeepSeek-R1 is hitting a ceiling because it cannot modify itself. DGM's outer loop is the mechanism that allows capability to continue growing when the inner RL loop saturates.

Architectural compute beyond tokens [→ A11]: MiniMax-M1 demonstrated that attention mechanism choice determines which tasks are tractable. CTM (A11) extends this further: per-neuron temporal memory as a form of compute that does not require generating additional tokens. The society of thought (Kim et al., 2026) is currently implemented through long chain-of-thought token generation — CTM proposes to achieve a richer version of the same structure through internal synchronization, without the generation-length cost.

RL as educator [→ A12]: The final article in the series proposes inverting the optimization target. Instead of RL that trains the student model to get better answers, RL that trains the teacher model to generate better training curricula. This is the recursive form of the ProRL insight: if the right curriculum (edge-of-competence data) is the primary driver of capability gains, optimizing the curriculum generator is more efficient than running more RL steps on a fixed dataset.

The pivot from Part III to Part IV is not a failure of the toolkit. It is a recognition that the toolkit built a very good ceiling, and now the question is how to build beyond it.


Final Key Takeaways

  1. Prolonged RL expands reasoning capability, but only at the edge of competence. Zhang et al. (2025) establish this as the governing constraint. RL amplifies before it extends; extension requires pre-training headroom in the target domain.
  2. The specific gains are domain-structured. ProRL's +54.8% on logic puzzles (vs. +14.7% on mathematics) is not random — it reflects the difference between a domain the base model had headroom in and one it had already exploited.
  3. Entropy collapse is the dominant failure mode in prolonged RL training. KL penalty, reference policy reset, and DAPO decoupled clipping are engineering responses, not training tricks. Without them, the reward signal terminates before capability gains are complete.
  4. RL grokking is a feature, not a bug. The ~450-step exploratory plateau before sudden capability emergence is the signature of the model building compositional circuit prerequisites. Terminating training during the plateau eliminates the capability gain.
  5. RL teaches composition, not new primitives. Yuan et al. (2025) establish this mechanistically: RL enables f(g(x)) from f and g already known; it cannot create new f or g from scratch. The atomic skill library built during pre-training is the permanent ceiling.
  6. Architectural efficiency (lightning attention) changes which tasks are tractable. MiniMax-M1's 8× context advantage materializes specifically in long-horizon tasks (SWE-bench, TAU-bench) — not in short-form mathematics. Architecture and RL training are not independent variables.
  7. The Hivemind effect is the hidden cost of ecosystem-scale RLVR. As more models are trained toward the same reward signal, inter-model homogeneity grows. The diversity assumed in ensembles and safety-critical deployments is eroded by shared optimization pressure.
  8. Frontier reasoning models implicitly simulate societies of thought. Kim et al. (2026) establish a mechanistic basis for the emergent self-reflection and verification observed in RL-trained models. Designing reward signals and fine-tuning scaffolds that reinforce this structure is the highest-leverage intervention currently visible.

References

[1] DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.

[2] Liu, M., Diao, S., Lu, X., Hu, J., Dong, X., Choi, Y., Kautz, J., & Dong, Y. (2025). ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models. arXiv:2505.24864.

[3] MiniMax. (2025). MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv:2506.13585.

[4] Sun, Y., Cao, Y., Huang, P., Bai, H., Hajishirzi, H., Dziri, N., & Song, D. (2025). RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs? arXiv:2509.21016.

[5] Yuan, L., Chen, W., Zhang, Y., Cui, G., Wang, H., You, Z., Ding, N., Liu, Z., Sun, M., & Peng, H. (2025). From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones. arXiv:2509.25123.

[6] Jiang, L., Chai, Y., Li, M., Liu, M., Fok, R., Dziri, N., Tsvetkov, Y., Sap, M., Albalak, A., & Choi, Y. (2025). Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond). Advances in Neural Information Processing Systems 38 (NeurIPS 2025). arXiv:2510.22954.

[7] Stojanovski, Z., Stanley, O., Sharratt, J., Jones, R., Adefioye, A., Kaddour, J., & Köpf, A. (2025). Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards. arXiv:2505.24760.

[8] Kim, J., Lai, S., Scherrer, N., Agüera y Arcas, B., & Evans, J. (2026). Reasoning Models Generate Societies of Thought. arXiv:2601.10825.

[9] Zhang, C., Neubig, G., & Yue, X. (2025). On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models. arXiv:2512.07783.

[10] Yeo, E., Tong, Y., Niu, M., Neubig, G., & Yue, X. (2025). Demystifying Long Chain-of-Thought Reasoning in LLMs. Proceedings of ICML 2025. Code available on GitHub: eddycmu/demystify-long-cot.