
VibeThinker 3B: Frontier Reasoning at 3 Billion Parameters

A 3-billion-parameter model from WeiboAI matches Claude Opus 4.5, DeepSeek V3.2, and GLM-5 on hard math and code, tasks we assumed required hundreds of billions of parameters.

June 29, 2026 · 5 min read · Research Analysis

📋 In This Article

  • 🔑 The Spectrum-to-Signal Principle: Why verifiable reasoning needs far fewer parameters than knowledge storage
  • 🧪 Training Pipeline: Four-stage post-training (curriculum SFT, MGPO reinforcement learning, self-distillation, long-to-short RL)
  • 📊 Verified Benchmarks: AIME 2026 94.3, LeetCode 96.1% first-attempt, IMO-AnswerBench 76.4 to 80.6
  • ⚠️ Limitations: Narrow domain (math/code/STEM only), not a general-purpose model
  • 💡 Why It Matters: Proves that parameter efficiency, not just scale, is the next frontier for open models

The Claim

WeiboAI's VibeThinker-3B is a 3-billion-parameter dense model that matches or beats models roughly 200-300x its size on verifiable reasoning benchmarks, specifically math, coding, and STEM tasks where answers can be objectively checked. On IMO-AnswerBench (400 olympiad-level problems), it scores 76.4 base and 80.6 with test-time scaling, putting it in the same range as DeepSeek V3.2 (671B), GLM-5 (744B), and Kimi K2.5 (1T).
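As a quick sanity check on the size comparison, the ratios implied by the parameter counts quoted above can be computed directly. This is only illustrative arithmetic: the dictionary and the `size_ratio` helper are hypothetical names, and treating "1T" as exactly 1,000B is an assumption.

```python
# Parameter counts in billions, as quoted in the paragraph above.
# Assumption: "1T" for Kimi K2.5 is taken as exactly 1000B.
PARAMS_B = {
    "VibeThinker-3B": 3,
    "DeepSeek V3.2": 671,
    "GLM-5": 744,
    "Kimi K2.5": 1000,
}


def size_ratio(model: str, baseline: str = "VibeThinker-3B") -> float:
    """How many times larger `model` is than `baseline`, by parameter count."""
    return PARAMS_B[model] / PARAMS_B[baseline]


for name in ("DeepSeek V3.2", "GLM-5", "Kimi K2.5"):
    print(f"{name}: about {size_ratio(name):.0f}x the size of VibeThinker-3B")
```

Under these figures the two smaller comparison models fall inside the quoted 200-300x range, while the 1T model lands slightly above it.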

How It Works: The Spectrum-to-Signal Principle

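The key insight from the technical report (arXiv 2606.16140) is that not all intelligence requires the same parameter budget. The team proposes splitting capabilities into two categories: verifiable reasoning, which they argue needs far fewer parameters, and knowledge storage, which needs far more.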

Their training pipeline has four stages:

  1. Two-stage curriculum SFT: Stage 1 covers broad capabilities (math, code, STEM, dialogue). Stage 2 filters to hard problems only, discarding reasoning traces under 5,000 tokens to force long-horizon thinking.
  2. Multi-domain RL with MGPO: A GRPO variant (MaxEnt-Guided Policy Optimization) that weights samples by difficulty, focusing on problems that are neither too easy nor too hard. Applied sequentially: math → code → STEM.
  3. Offline self-distillation: High-quality trajectories from domain-specific RL checkpoints are filtered and distilled back into a unified model using a "learning-potential score."
  4. Long-to-short RL: First optimize for accuracy, then reward shorter correct trajectories. This reduces the token bloat common in reasoning models.
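To make stages 2 and 4 more concrete, here is a minimal Python sketch of the two ideas: weighting a prompt by how uncertain the policy is about it, and rewarding shorter correct answers. It is not the report's implementation. The binary-entropy weight, the linear length bonus, the `bonus` value, and all function names are assumptions chosen for illustration; only the group-normalized advantage follows the standard GRPO form.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a pass/fail outcome with pass rate p (peaks at p = 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def difficulty_weight(rewards: list[int]) -> float:
    """Assumed weighting: prompts the policy solves about half the time count most.

    `rewards` holds 0/1 correctness for a group of sampled answers to one prompt.
    Always-solved or never-solved prompts get weight 0.
    """
    pass_rate = sum(rewards) / len(rewards)
    return binary_entropy(pass_rate)


def weighted_group_advantages(rewards: list[int]) -> list[float]:
    """GRPO-style group-normalized advantages, scaled by the difficulty weight."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # All samples agree, so the group carries no learning signal.
        return [0.0] * n
    weight = difficulty_weight(rewards)
    return [weight * (r - mean) / std for r in rewards]


def length_aware_reward(correct: bool, n_tokens: int, max_tokens: int,
                        bonus: float = 0.2) -> float:
    """Assumed long-to-short shaping: correctness first, then a bonus for brevity.

    Wrong answers earn 0 regardless of length, so brevity never outranks accuracy.
    """
    if not correct:
        return 0.0
    frac_used = min(n_tokens, max_tokens) / max_tokens
    return 1.0 + bonus * (1.0 - frac_used)


print(weighted_group_advantages([1, 1, 0, 0]))  # mixed group: full weight
print(weighted_group_advantages([1, 1, 1, 1]))  # too easy: no signal
print(length_aware_reward(True, 2000, 8000), length_aware_reward(True, 6000, 8000))
```

In this sketch a prompt that is always or never solved contributes nothing to the update, which matches the "neither too easy nor too hard" description, and a shorter correct answer always scores above a longer correct one but never below 1.0.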

Verified Results

| Benchmark | VibeThinker-3B | vs Larger Models |
| --- | --- | --- |
| AIME 2026 | 94.3 | = DeepSeek V3.2 (671B) |
| IMO-AnswerBench | 76.4 (80.6 w/ CLR) | = GLM-5 (744B) |
| LeetCode Weekly (Apr-May 2026) | 96.1% pass@1 | 123/128 first attempts |
| LiveCodeBench v6 | 80.2 | > all models under 120B |
| GPQA Diamond | 70.2 | Behind frontier (90+) |

Key takeaway: VibeThinker doesn't claim general-purpose parity. It's explicitly narrow: math, code, and STEM reasoning. For open-domain chat, writing, or knowledge tasks, larger models still win. But for verifiable reasoning loops (coding agents, math solvers), it's a breakthrough in parameter efficiency.

Technical Report (arXiv) · Model Weights (HuggingFace)