Qwythos 9B: Claude Mythos Distillation into Open Weights
A 9B model fine-tuned on 500M+ tokens of Claude Mythos reasoning traces gains 34 points on MMLU and adds native tool calling, showing how far targeted distillation can push small open models.
In This Article
- What Is Qwythos: Full-parameter fine-tune of Qwen3.5-9B on Claude Mythos traces, Apache 2.0 licensed
- Performance Gains: +34 MMLU, +30 gsm8k-strict, +19 gsm8k-flex over base Qwen3.5-9B
- Capabilities: 1M context via YaRN, native function calling, optional vision projector
- Benchmark Comparison: How it stacks up against base Qwen3.5 and other 9B-class models
- Trade-offs: GPQA regression, narrow improvement domain, not general-purpose
What It Is
Qwythos-9B is a full-parameter fine-tune of Qwen3.5-9B by Empero AI, trained on 500M+ tokens of Claude Mythos chain-of-thought reasoning traces. It ships with a 1M-token context window (extended from the native 262K via YaRN RoPE scaling), native function calling, and an optional vision projector. It is released under the Apache 2.0 license.
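The context extension described above implies a specific scaling factor. As a rough sketch, assuming "262K" means 262,144 tokens and "1M" means 1,048,576 tokens (the article gives only the rounded figures), the factor can be worked out as below. The dictionary mirrors the `rope_scaling` layout commonly seen in Hugging Face-style configs for YaRN; whether Qwythos ships exactly these values is an assumption, not something stated in the source.

```python
# Assumed exact token counts; the article only gives the rounded "262K" and "1M".
NATIVE_CONTEXT = 262_144      # 2**18
EXTENDED_CONTEXT = 1_048_576  # 2**20


def yarn_factor(native: int, extended: int) -> float:
    """Return the RoPE scaling factor needed to stretch `native` to `extended`."""
    if native <= 0 or extended < native:
        raise ValueError("extended context must be >= native context > 0")
    return extended / native


factor = yarn_factor(NATIVE_CONTEXT, EXTENDED_CONTEXT)

# Illustrative config fragment (hypothetical values, not copied from the model repo).
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(rope_scaling)
```

Under these assumptions the factor is 4. If "1M" instead means exactly 1,000,000 tokens, the factor would be slightly lower, about 3.8.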
Performance Gains Over Base
| Metric | Qwen3.5-9B (base) | Qwythos-9B | Delta |
|---|---|---|---|
| MMLU | 23.2% | 57.5% | +34.3 |
| gsm8k (strict) | 51.0% | 81.0% | +30.0 |
| gsm8k (flex) | 67.0% | 86.0% | +19.0 |
| GPQA Diamond | 63.0% | 58.0% | -5.0 |
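The Delta column is a plain percentage-point difference. The short sketch below recomputes it from the table values and also compares the base MMLU score with the 25% expected from random guessing on four-option questions. The variable and function names are illustrative only.

```python
# Scores copied from the table above, in percent: (base, fine-tuned).
scores = {
    "MMLU": (23.2, 57.5),
    "gsm8k (strict)": (51.0, 81.0),
    "gsm8k (flex)": (67.0, 86.0),
    "GPQA Diamond": (63.0, 58.0),
}


def delta_points(base: float, tuned: float) -> float:
    """Percentage-point change from base to fine-tuned, rounded to one decimal."""
    return round(tuned - base, 1)


deltas = {name: delta_points(base, tuned) for name, (base, tuned) in scores.items()}

# MMLU questions have four answer options, so random guessing scores about 25%.
MMLU_CHANCE = 25.0
base_vs_chance = round(scores["MMLU"][0] - MMLU_CHANCE, 1)

for name, d in deltas.items():
    print(f"{name}: {d:+.1f} points")
print(f"Base MMLU relative to chance: {base_vs_chance:+.1f} points")
```

The recomputed deltas match the table, and the base MMLU score lands 1.8 points below the random-guess level.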
Key Observations
- The +34-point MMLU gain is inflated by an unusually low base score: 23.2% is below the 25% expected from random guessing on four-option questions, so part of the jump likely reflects better answer formatting rather than improved knowledge recall alone.
- GPQA Diamond drops 5 points (63.0% to 58.0%), suggesting the fine-tuning trades some graduate-level science reasoning for better performance on structured benchmarks.
- Independent testing shows Qwythos runs at 100-150 tok/s on consumer GPUs (RTX 4090) and fits in 6-8GB VRAM at Q4_K_M quantization.
- Tool-calling correctness: 7/7 on a harness combining Python execution and web search.
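The 6-8GB VRAM figure in the list above can be sanity-checked with back-of-the-envelope arithmetic. This is a minimal sketch under two assumptions that do not come from the article: a parameter count of exactly 9 billion, and an average of roughly 4.85 bits per weight for Q4_K_M (an approximate figure that varies by architecture).

```python
# Assumptions (not from the article): exactly 9e9 parameters, and roughly
# 4.85 bits per weight on average for Q4_K_M. Real files vary by architecture.
PARAMS = 9_000_000_000
BITS_PER_WEIGHT_Q4_K_M = 4.85


def weight_memory_gib(params: int, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights alone, in GiB."""
    total_bytes = params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


weights_gib = weight_memory_gib(PARAMS, BITS_PER_WEIGHT_Q4_K_M)
print(f"Quantized weights: ~{weights_gib:.1f} GiB")
```

Under these assumptions the weights alone take a little over 5 GiB. The rest of the quoted range would plausibly go to the KV cache, activations, and runtime overhead, all of which grow with context length.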
Key takeaway: Qwythos demonstrates that targeted fine-tuning on high-quality reasoning traces can dramatically improve small model performance on structured tasks. It's not a general-purpose replacement for frontier models, but for math, code, and tool-use workflows at the 9B scale, it's one of the strongest open options available.