VibeThinker 3B: Frontier Reasoning at 3 Billion Parameters

A 3-billion-parameter model from WeiboAI matches Claude Opus 4.5, DeepSeek V3.2, and GLM-5 on hard math and code — tasks we assumed required hundreds of billions of parameters.

June 29, 2026 · 5 min read · Research Analysis

📋 In This Article

🔑 The Spectrum-to-Signal Principle — Why verifiable reasoning needs far fewer parameters than knowledge storage
🧪 Training Pipeline — 4-stage post-training: curriculum SFT, MGPO reinforcement learning, self-distillation, long-to-short RL
📊 Verified Benchmarks — AIME 2026: 94.3, LeetCode: 96.1% first-attempt, IMO-AnswerBench: 76.4-80.6
⚠️ Limitations — Narrow domain (math/code/STEM only), not a general-purpose model
💡 Why It Matters — Proves that parameter efficiency, not just scale, is the next frontier for open models

The Claim

WeiboAI's VibeThinker-3B is a 3-billion-parameter dense model that matches or beats models roughly 220 to 330 times its size on verifiable reasoning benchmarks: math, coding, and STEM tasks where answers can be objectively checked. On IMO-AnswerBench (400 olympiad-level problems), it scores 76.4 in its base configuration and 80.6 with test-time scaling, which puts it in the same range as DeepSeek V3.2 (671B), GLM-5 (744B), and Kimi K2.5 (1T).
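The size gap can be checked directly from the parameter counts quoted above. The snippet below is a small sanity check, not anything from the report; the variable and function names are made up for illustration, and "1T" is read as 1,000B.

```python
# Parameter counts in billions, as quoted in this article.
VIBETHINKER_B = 3
LARGER_MODELS_B = {
    "DeepSeek V3.2": 671,
    "GLM-5": 744,
    "Kimi K2.5": 1000,  # assumption: "1T" taken as 1,000 billion
}


def size_ratios(small_b, larger_b):
    """Return how many times larger each model is, rounded to an integer."""
    return {name: round(size / small_b) for name, size in larger_b.items()}


ratios = size_ratios(VIBETHINKER_B, LARGER_MODELS_B)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio}x the parameters of VibeThinker-3B")
```

The ratios come out at about 224x, 248x, and 333x respectively.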

How It Works: The Spectrum-to-Signal Principle

The key insight from the technical report (arXiv 2606.16140) is that not all intelligence requires the same parameter budget. The team proposes splitting capabilities into two categories:

Verifiable reasoning (math, code, logic) — search, constraint satisfaction, error correction. Doesn't require memorizing facts.
Broad knowledge (science facts, long-tail trivia) — genuinely requires raw parameter capacity.

Their training pipeline has four stages:

Two-stage curriculum SFT — Stage 1 covers broad capabilities (math, code, STEM, dialogue). Stage 2 filters to hard problems only, discarding reasoning traces under 5,000 tokens to force long-horizon thinking.
Multi-domain RL with MGPO — A GRPO variant (MaxEnt-Guided Policy Optimization) that weights samples by difficulty, focusing on problems that are neither too easy nor too hard. Applied sequentially: math → code → STEM.
Offline self-distillation — High-quality trajectories from domain-specific RL checkpoints are filtered and distilled back into a unified model using a "learning-potential score."
Long-to-short RL — First optimize for accuracy, then reward shorter correct trajectories. This reduces the token bloat common in reasoning models.
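Three of the stages above lend themselves to a short sketch: the stage-2 length filter, the difficulty weighting in MGPO, and the long-to-short reward. This article does not reproduce the report's formulas, so everything below is an assumption made for illustration: the function names, the `lam` and `alpha` parameters, the choice of a KL-based weight that peaks at a 50% pass rate, and the linear brevity bonus. Only the 5,000-token threshold comes from the text.

```python
import math

MIN_TRACE_TOKENS = 5_000  # stage-2 SFT threshold stated in the article


def keep_for_stage2(trace_token_count: int) -> bool:
    """Stage-2 SFT filter: drop reasoning traces under 5,000 tokens."""
    return trace_token_count >= MIN_TRACE_TOKENS


def difficulty_weight(pass_rate: float, lam: float = 1.0) -> float:
    """Hypothetical MGPO-style weight for one prompt.

    Equals 1.0 when the model solves the prompt half the time (maximum
    uncertainty) and decays as the pass rate approaches 0 or 1. The form
    exp(-lam * KL(Bernoulli(p) || Bernoulli(0.5))) is an assumption.
    """
    eps = 1e-12
    p = min(max(pass_rate, eps), 1.0 - eps)  # avoid log(0)
    kl = p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)
    return math.exp(-lam * kl)


def long_to_short_reward(correct: bool, n_tokens: int, max_tokens: int,
                         alpha: float = 0.2) -> float:
    """Hypothetical long-to-short reward.

    Wrong answers earn 0. Correct answers earn 1 plus a bonus of up to
    `alpha` that grows as the trajectory gets shorter.
    """
    if not correct:
        return 0.0
    frac_used = min(n_tokens, max_tokens) / max_tokens
    return 1.0 + alpha * (1.0 - frac_used)


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(f"pass rate {p:.2f} -> weight {difficulty_weight(p):.3f}")
    print(long_to_short_reward(True, 4_000, 16_000))
```

In this sketch a correct answer always outscores a wrong one regardless of length, so brevity is only rewarded once accuracy is in place, which mirrors the ordering described for the long-to-short stage.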

Verified Results

Benchmark	VibeThinker-3B	Comparison / notes
AIME 2026	94.3	On par with DeepSeek V3.2 (671B)
IMO-AnswerBench	76.4 (80.6 w/ CLR)	On par with GLM-5 (744B)
LeetCode Weekly (Apr-May 2026)	96.1% pass@1	123/128 solved on first attempt
LiveCodeBench v6	80.2	Higher than all models under 120B
GPQA Diamond	70.2	Behind frontier models (90+)
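A few of the figures in the table can be cross-checked with simple arithmetic. The snippet below is only a consistency check on the numbers quoted here; the helper name is hypothetical, and the frontier GPQA score is taken as exactly 90, the lower bound implied by "90+".

```python
def pass_at_1_percent(solved_first_try: int, total: int) -> float:
    """Share of problems solved on the first attempt, as a percentage."""
    return round(100.0 * solved_first_try / total, 1)


# LeetCode Weekly: 123 of 128 problems solved on the first attempt.
leetcode_pass_at_1 = pass_at_1_percent(123, 128)

# IMO-AnswerBench: gain from test-time scaling, in points.
imo_gain = round(80.6 - 76.4, 1)

# GPQA Diamond: minimum gap to frontier models, assuming a score of 90.
gpqa_gap = round(90.0 - 70.2, 1)

print(f"LeetCode pass@1: {leetcode_pass_at_1}%")
print(f"IMO-AnswerBench gain: {imo_gain} points")
print(f"GPQA Diamond gap: at least {gpqa_gap} points")
```

The LeetCode figure works out to 96.1%, matching the table, while the GPQA Diamond gap of at least 19.8 points shows where the model still trails on knowledge-heavy questions.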

Key takeaway: VibeThinker doesn't claim general-purpose parity. It is explicitly narrow, covering math, code, and STEM reasoning, and its GPQA Diamond score of 70.2 trails frontier models at 90+. For open-domain chat, writing, or knowledge tasks, larger models still win. For verifiable reasoning loops such as coding agents and math solvers, it delivers benchmark scores on par with models more than 200 times its size.

Technical Report (arXiv) · Model Weights (HuggingFace)