Ornith 1.0: Self-Scaffolding Coding Agents
A new family of open-source coding models learns to write its own agent harness during training, and the 397B variant beats Claude Opus 4.7 on Terminal-Bench.
In This Article
- Self-Scaffolding Explained: The model jointly learns to generate both the agent harness and the solution, instead of relying on human-designed scaffolding
- Two-Stage RL Process: Scaffold generation → solution rollout → GRPO reward flows back to both stages
- Anti-Reward-Hacking: Three-layer defense of immutable environment, deterministic monitor, and LM judge veto
- Benchmarks: The 397B model scores 82.4% on SWE-Bench Verified and 77.5% on Terminal-Bench; the 9B model scores 69.4% on SWE-Bench Verified and is edge-deployable
- Why It Matters: First framework where the agent orchestration layer is learned, not hand-engineered
The Innovation
DeepReinforce's Ornith-1.0 is a family of four models (9B Dense, 31B Dense, 35B MoE, 397B MoE) fine-tuned from Qwen 3.5 and Gemma 4. The breakthrough is not the base architecture but the training method: the model learns to write its own agent harness during reinforcement learning instead of relying on human-designed scaffolding.
How Self-Scaffolding Works
Traditional coding agents use a fixed harness (memory, tools, error handling) designed by humans. Ornith treats the scaffold as a learnable object that co-evolves with the model's policy:
- Scaffold Generation: Given a task and the previous scaffold, the model proposes a refined harness.
- Solution Rollout: Using that scaffold, the model generates a solution. Reward flows back to both stages.
- GRPO Optimization: Group Relative Policy Optimization updates the weights for both scaffold quality and solution quality.
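The reward flow in the steps above can be sketched in a few lines. GRPO scores each rollout relative to the other rollouts sampled for the same task, by standardizing rewards within the group. The sketch below assumes the simplest credit scheme, in which the scaffold and the solution from one rollout receive the same advantage. The `Rollout` class, its fields, and `credit_both_stages` are hypothetical names for illustration; Ornith's actual credit assignment may differ.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    scaffold: str   # harness text proposed in stage 1 (hypothetical field)
    solution: str   # solution produced with that harness in stage 2
    reward: float   # scalar task reward, e.g. 1.0 if the tests pass


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


def credit_both_stages(group):
    """Assumed scheme: one advantage per rollout, shared by both stages."""
    advantages = group_relative_advantages([r.reward for r in group])
    return [
        {"rollout": r, "scaffold_adv": a, "solution_adv": a}
        for r, a in zip(group, advantages)
    ]


group = [
    Rollout("scaffold A", "patch A", 1.0),
    Rollout("scaffold B", "patch B", 0.0),
    Rollout("scaffold C", "patch C", 1.0),
    Rollout("scaffold D", "patch D", 0.0),
]
for item in credit_both_stages(group):
    print(item["rollout"].scaffold, round(item["scaffold_adv"], 3))
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one, so a scaffold is reinforced only when the solution built on it outperforms its siblings.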
Anti-Reward-Hacking Defenses
Letting a model write its own harness invites cheating. Ornith uses three layers of protection:
- Immutable environment: The model cannot modify the sandbox tools or the execution environment.
- Deterministic monitor: Watches for unauthorized tool use, verification-script modification, or sandbox escapes. Any violation draws an immediate penalty.
- LM judge veto: Even if the scaffold passes automated checks, an LLM judge can veto suspicious solutions.
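One way to picture how the three layers combine is as a gate in front of the task reward. The following is a minimal sketch under stated assumptions, not Ornith's implementation: the event names, the penalty value of -1.0, the zero reward on veto, and the `toy_judge` keyword check (a stand-in for a real LLM judge call) are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType

# Layer 1: a read-only tool registry stands in for the immutable environment.
TOOLS = MappingProxyType({"run_tests": "pytest -q", "read_file": "cat"})

# Assumed event names that a sandbox trace might record.
FORBIDDEN_EVENTS = frozenset(
    {"unauthorized_tool", "verifier_modified", "sandbox_escape"}
)
PENALTY = -1.0      # assumed value; the article only says "immediate penalty"
VETO_REWARD = 0.0   # assumed reward when the judge vetoes a solution


@dataclass(frozen=True)
class Episode:
    events: tuple       # trace of events observed in the sandbox
    solution: str       # final solution text
    task_reward: float  # reward from the task's own verifier


def monitor_violations(episode):
    """Layer 2: deterministic check of the trace against forbidden events."""
    return sorted(set(episode.events) & FORBIDDEN_EVENTS)


def final_reward(episode, judge_vetoes):
    """Apply the monitor first, then the judge veto, then the task reward."""
    if monitor_violations(episode):
        return PENALTY
    if judge_vetoes(episode.solution):  # Layer 3: LM judge veto
        return VETO_REWARD
    return episode.task_reward


def toy_judge(solution):
    """Stand-in for an LLM judge; a real one would call a model."""
    return "skip all tests" in solution


clean = Episode(("run_tests",), "fix off-by-one in parser", 1.0)
cheat = Episode(("run_tests", "verifier_modified"), "edit the checker", 1.0)
shady = Episode(("run_tests",), "skip all tests and exit 0", 1.0)
for ep in (clean, cheat, shady):
    print(ep.solution, "->", final_reward(ep, toy_judge))
```

Running the deterministic monitor before the judge keeps the cheap, reproducible check in front, so the judge only sees episodes that already passed it.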
Benchmark Results
| Model | SWE-Bench Verified (%) | Terminal-Bench 2.1 (%) | Params |
|---|---|---|---|
| Ornith-1.0 397B | 82.4 | 77.5 | 397B MoE |
| Ornith-1.0 35B | 75.6 | 64.4 | 35B MoE |
| Ornith-1.0 9B | 69.4 | 43.1 | 9B Dense |
| Claude Opus 4.7 | 80.8 | 70.3 | ~500B+ |
| DeepSeek V4 Pro | 80.6 | 67.9 | 1.6T/49B |
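The margins behind the headline comparison can be read straight from the table. The short sketch below copies the reported scores and computes point differences between any two models; the `scores` dictionary and the `margin` helper are illustrative names, and the numbers are the article's reported figures, not independently measured ones.

```python
# (SWE-Bench Verified, Terminal-Bench 2.1) scores as reported in the table
scores = {
    "Ornith-1.0 397B": (82.4, 77.5),
    "Ornith-1.0 35B": (75.6, 64.4),
    "Ornith-1.0 9B": (69.4, 43.1),
    "Claude Opus 4.7": (80.8, 70.3),
    "DeepSeek V4 Pro": (80.6, 67.9),
}


def margin(model, baseline):
    """Point difference (model minus baseline) on each benchmark."""
    return tuple(
        round(m - b, 1) for m, b in zip(scores[model], scores[baseline])
    )


for name in ("Ornith-1.0 397B", "Ornith-1.0 35B", "Ornith-1.0 9B"):
    swe, term = margin(name, "Claude Opus 4.7")
    print(f"{name} vs Claude Opus 4.7: SWE {swe:+.1f}, Terminal {term:+.1f}")
```

On these figures the 397B model leads Claude Opus 4.7 by 1.6 points on SWE-Bench Verified and 7.2 points on Terminal-Bench 2.1, while the 9B model trails by 11.4 and 27.2 points respectively.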