VibeThinker 3B: Frontier Reasoning at 3 Billion Parameters
A 3-billion-parameter model from WeiboAI matches Claude Opus 4.5, DeepSeek V3.2, and GLM-5 on hard math and code, tasks we assumed required hundreds of billions of parameters.
In This Article
- The Spectrum-to-Signal Principle: Why verifiable reasoning needs far fewer parameters than knowledge storage
- Training Pipeline: 4-stage post-training (curriculum SFT, MGPO reinforcement learning, self-distillation, long-to-short RL)
- Verified Benchmarks: AIME 2026 94.3, LeetCode 96.1% first-attempt, IMO-AnswerBench 76.4-80.6
- Limitations: Narrow domain (math/code/STEM only), not a general-purpose model
- Why It Matters: Proves that parameter efficiency, not just scale, is the next frontier for open models
The Claim
WeiboAI's VibeThinker-3B is a 3-billion-parameter dense model that matches or beats models 200-300x its size on verifiable reasoning benchmarks, specifically math, coding, and STEM tasks where answers can be objectively checked. On IMO-AnswerBench (400 olympiad-level problems), it scores 76.4 base and 80.6 with test-time scaling, putting it in the same range as DeepSeek V3.2 (671B), GLM-5 (744B), and Kimi K2.5 (1T).
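As a quick sanity check on the size gap, the ratios can be computed directly from the parameter counts quoted above. This is only arithmetic on the article's own numbers; treating "1T" as 1,000B is an assumption, and the variable names are illustrative.

```python
# Parameter counts in billions, as quoted in the text.
VIBETHINKER_PARAMS_B = 3
larger_models_b = {
    "DeepSeek V3.2": 671,
    "GLM-5": 744,
    "Kimi K2.5": 1000,  # assumption: "1T" taken as 1,000B
}

# How many times larger each comparison model is than VibeThinker-3B.
ratios = {
    name: size / VIBETHINKER_PARAMS_B
    for name, size in larger_models_b.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x")
```

Under these numbers the ratios come out at roughly 224x, 248x, and 333x, so the largest comparison model sits slightly above the quoted 200-300x range.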
How It Works: The Spectrum-to-Signal Principle
The key insight from the technical report (arXiv 2606.16140) is that not all intelligence requires the same parameter budget. The team proposes splitting capabilities into two categories:
- Verifiable reasoning (math, code, logic): relies on search, constraint satisfaction, and error correction rather than on memorized facts.
- Broad knowledge (science facts, long-tail trivia): genuinely requires raw parameter capacity.
Their training pipeline has four stages:
- Two-stage curriculum SFT: Stage 1 covers broad capabilities (math, code, STEM, dialogue). Stage 2 filters to hard problems only, discarding reasoning traces under 5,000 tokens to force long-horizon thinking.
- Multi-domain RL with MGPO: A GRPO variant (MaxEnt-Guided Policy Optimization) that weights samples by difficulty, focusing on problems that are neither too easy nor too hard. Applied sequentially: math → code → STEM.
- Offline self-distillation: High-quality trajectories from domain-specific RL checkpoints are filtered and distilled back into a unified model using a "learning-potential score."
- Long-to-short RL: First optimize for accuracy, then reward shorter correct trajectories. This reduces the token bloat common in reasoning models.
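Three of the mechanisms above can be illustrated with a minimal Python sketch. The 5,000-token cutoff comes from the text; everything else is an assumption made for illustration. The function names, the `lam` and `alpha` parameters, the exponential weighting centred on a 50% pass rate, and the linear brevity bonus are hypothetical and are not taken from the technical report.

```python
import math

MIN_TRACE_TOKENS = 5_000  # stage-2 SFT threshold stated in the text


def keep_for_stage2(trace_token_count: int) -> bool:
    """Stage-2 SFT filter: discard reasoning traces under 5,000 tokens."""
    return trace_token_count >= MIN_TRACE_TOKENS


def difficulty_weight(pass_rate: float, lam: float = 1.0) -> float:
    """Hypothetical difficulty weight that peaks at pass_rate = 0.5.

    Computes exp(-lam * KL(Bernoulli(p) || Bernoulli(0.5))), so problems
    the policy always solves or never solves are down-weighted.
    """
    p = min(max(pass_rate, 0.0), 1.0)
    # KL(p || 0.5) = p*log(p) + (1-p)*log(1-p) + log(2), with 0*log(0) = 0.
    kl = math.log(2.0)
    for q in (p, 1.0 - p):
        if q > 0.0:
            kl += q * math.log(q)
    return math.exp(-lam * kl)


def length_aware_reward(
    correct: bool, n_tokens: int, max_tokens: int, alpha: float = 0.2
) -> float:
    """Hypothetical long-to-short reward.

    Incorrect answers earn nothing; correct answers earn 1.0 plus a
    bonus proportional to the fraction of the token budget left unused.
    """
    if not correct:
        return 0.0
    saved = 1.0 - min(n_tokens, max_tokens) / max_tokens
    return 1.0 + alpha * saved


# Example: weights for problems at different observed pass rates.
for rate in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(f"pass_rate={rate:.2f} weight={difficulty_weight(rate):.3f}")
```

With `lam = 1.0`, a problem solved half the time gets weight 1.0, while one that is always or never solved gets 0.5. In the reward, correctness gates the brevity bonus, so a short wrong answer can never outscore a long correct one, which matches the "accuracy first, then length" ordering described above.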
Verified Results
| Benchmark | VibeThinker-3B | Comparison / Notes |
|---|---|---|
| AIME 2026 | 94.3 | = DeepSeek V3.2 (671B) |
| IMO-AnswerBench | 76.4 (80.6 w/ CLR) | = GLM-5 (744B) |
| LeetCode Weekly (Apr-May 2026) | 96.1% pass@1 | 123/128 first attempts |
| LiveCodeBench v6 | 80.2 | > all models under 120B |
| GPQA Diamond | 70.2 | Behind frontier (90+) |
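Two figures in the table can be cross-checked with simple arithmetic. The sketch below uses only numbers from the table; the variable names are illustrative.

```python
# LeetCode Weekly: 123 of 128 problems solved on the first attempt.
solved, attempted = 123, 128
pass_at_1 = solved / attempted
print(f"LeetCode pass@1: {pass_at_1:.1%}")

# IMO-AnswerBench: gain from the base score to the "w/ CLR" score.
base_score, clr_score = 76.4, 80.6
clr_gain = round(clr_score - base_score, 1)
print(f"IMO-AnswerBench gain: {clr_gain} points")
```

The first figure agrees with the 96.1% reported in the table, and the second shows a gain of 4.2 points between the two IMO-AnswerBench scores.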