Ornith 1.0: Self-Scaffolding Coding Agents

A new family of open-source coding models learns to write its own agent harness during training — and the 397B variant beats Claude Opus 4.7 on Terminal-Bench.

June 29, 2026 · 5 min read · New Release Analysis

📋 In This Article

🔑 Self-Scaffolding Explained — The model jointly learns to generate both the agent harness and the solution, instead of relying on human-designed scaffolding
🏗️ Two-Stage RL Process — Scaffold generation → solution rollout → GRPO reward flows back to both stages
🛡️ Anti-Reward-Hacking — Three-layer defense: immutable environment, deterministic monitor, LM judge veto
📊 Benchmarks — 397B: 82.4% SWE-bench, 77.5% Terminal-Bench. 9B: 69.4% SWE-bench, edge-deployable
💡 Why It Matters — First framework where the agent orchestration layer is learned, not hand-engineered

The Innovation

DeepReinforce's Ornith-1.0 is a family of four models (9B Dense, 31B Dense, 35B MoE, 397B MoE) fine-tuned from Qwen 3.5 and Gemma 4. The breakthrough isn't the base architecture — it's the training method: the model learns to write its own agent harness during reinforcement learning, instead of relying on human-designed scaffolding.

How Self-Scaffolding Works

Traditional coding agents use a fixed harness (memory, tools, error handling) designed by humans. Ornith treats the scaffold as a learnable object that co-evolves with the model's policy:

Scaffold Generation — Given a task and previous scaffold, the model proposes a refined harness.
Solution Rollout — Using that scaffold, it generates a solution. Reward flows back to both stages.
GRPO Optimization — Group Relative Policy Optimization updates weights for both scaffold and solution quality.
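
The loop above can be sketched roughly as follows. This is an illustration under stated assumptions, not DeepReinforce's implementation: `propose_scaffold`, `rollout_solution`, and `score` are hypothetical stand-ins for the model and the task grader, and the advantage uses the standard GRPO group normalization (reward minus the group mean, divided by the group standard deviation). The source does not say how credit is split between scaffold tokens and solution tokens, so the sketch simply applies the same advantage to both stages.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Episode:
    scaffold: str   # harness text proposed in stage 1
    solution: str   # solution produced under that harness in stage 2
    reward: float   # scalar task reward


def propose_scaffold(task: str, previous: str, rng: random.Random) -> str:
    # Hypothetical stand-in for the model's scaffold-generation step.
    return f"{previous}+rev{rng.randint(0, 999)}"


def rollout_solution(task: str, scaffold: str, rng: random.Random) -> str:
    # Hypothetical stand-in for the solution rollout under a given scaffold.
    return f"patch-for-{task}-{rng.randint(0, 999)}"


def score(task: str, solution: str, rng: random.Random) -> float:
    # Hypothetical grader: a coin flip standing in for "did the tests pass".
    return 1.0 if rng.random() < 0.5 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style normalization: each sample is compared to its own group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sample_group(task, previous_scaffold, group_size=4, seed=0):
    rng = random.Random(seed)
    episodes = []
    for _ in range(group_size):
        scaffold = propose_scaffold(task, previous_scaffold, rng)
        solution = rollout_solution(task, scaffold, rng)
        episodes.append(Episode(scaffold, solution, score(task, solution, rng)))
    advantages = group_relative_advantages([e.reward for e in episodes])
    # Assumption: scaffold and solution tokens share one advantage value.
    return [(e, a, a) for e, a in zip(episodes, advantages)]
```

In a real trainer the two advantage values would weight the policy-gradient loss over the scaffold tokens and the solution tokens respectively; the sketch stops at the point where those weights are computed.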

Anti-Reward-Hacking Defenses

Letting a model write its own harness invites cheating. Ornith uses three layers of protection:

Immutable environment — Sandbox tools and execution environment can't be modified by the model.
Deterministic monitor — Watches for unauthorized tool use, verification script modification, or sandbox escapes. Immediate penalty.
LM judge veto — Even if the scaffold passes automated checks, an LLM judge can veto suspicious solutions.
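
One way to picture how the layers combine into a single reward is the sketch below. Everything in it is an assumption made for illustration: the tool allowlist, the protected path prefixes, the penalty value, and the `judge_vetoes` callback are hypothetical, and the source does not specify the reward assigned after a veto. The first layer (the immutable environment) is enforced by the sandbox itself, so it does not appear in the code.

```python
from dataclasses import dataclass
from typing import Callable, List

VIOLATION_PENALTY = -1.0  # assumed value; the source only says "immediate penalty"
ALLOWED_TOOLS = frozenset({"read_file", "write_file", "run_tests"})  # hypothetical
PROTECTED_PREFIXES = ("tests/", "verify/")  # hypothetical verification-script paths


@dataclass
class Trajectory:
    tool_calls: List[str]
    modified_paths: List[str]
    tests_passed: bool


def monitor_violations(traj: Trajectory) -> List[str]:
    # Layer 2: deterministic checks, no model in the loop.
    violations = [
        f"unauthorized tool: {tool}"
        for tool in traj.tool_calls
        if tool not in ALLOWED_TOOLS
    ]
    violations += [
        f"protected path modified: {path}"
        for path in traj.modified_paths
        if path.startswith(PROTECTED_PREFIXES)
    ]
    return violations


def gated_reward(traj: Trajectory, judge_vetoes: Callable[[Trajectory], bool]) -> float:
    if monitor_violations(traj):
        return VIOLATION_PENALTY
    # Layer 3: the LM judge can still veto a run that passed the monitor.
    if judge_vetoes(traj):
        return 0.0  # assumed: a veto zeroes the reward rather than penalizing
    return 1.0 if traj.tests_passed else 0.0
```

Running the cheap deterministic monitor before the judge means obvious violations never reach the more expensive and less predictable LLM check.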

Benchmark Results

Model	SWE-Bench Verified (%)	Terminal-Bench 2.1 (%)	Params
Ornith-1.0 397B	82.4	77.5	397B MoE
Ornith-1.0 35B	75.6	64.4	35B MoE
Ornith-1.0 9B	69.4	43.1	9B Dense
Claude Opus 4.7	80.8	70.3	~500B+
DeepSeek V4 Pro	80.6	67.9	1.6T/49B

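Key takeaway: The 397B variant beats Claude Opus 4.7 on Terminal-Bench (77.5 vs 70.3) and edges past it on SWE-bench Verified (82.4 vs 80.8). The 9B model punches above its weight on SWE-bench Verified at 69.4%, within about 6 points of the 35B MoE variant at roughly a quarter of the parameter count, though the gap widens on Terminal-Bench (43.1 vs 64.4). All models are MIT-licensed with open weights.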
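
The margins behind that takeaway can be checked directly from the table. The short script below only restates the published scores and subtracts them; the `margin` helper is a name introduced here for illustration.

```python
# Scores copied from the table: (SWE-Bench Verified, Terminal-Bench 2.1).
scores = {
    "Ornith-1.0 397B": (82.4, 77.5),
    "Ornith-1.0 35B": (75.6, 64.4),
    "Ornith-1.0 9B": (69.4, 43.1),
    "Claude Opus 4.7": (80.8, 70.3),
    "DeepSeek V4 Pro": (80.6, 67.9),
}


def margin(a: str, b: str):
    # Point difference (a minus b) on each benchmark, rounded to one decimal.
    return tuple(round(x - y, 1) for x, y in zip(scores[a], scores[b]))


if __name__ == "__main__":
    for rival in ("Claude Opus 4.7", "DeepSeek V4 Pro"):
        print("397B vs", rival, margin("Ornith-1.0 397B", rival))
    print("35B vs 9B", margin("Ornith-1.0 35B", "Ornith-1.0 9B"))
```

Against Claude Opus 4.7 the 397B lead works out to 1.6 points on SWE-Bench Verified and 7.2 points on Terminal-Bench 2.1.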

Official Announcement · Model Weights (HuggingFace)