Qwythos 9B: Claude Mythos Distillation into Open Weights

A 9B model fine-tuned on 500M+ tokens of Claude Mythos reasoning traces gains 34 points on MMLU and adds native tool calling, showing how far targeted distillation can push small open models.

June 29, 2026 · 4 min read · Fine-tune Analysis

📋 In This Article

🔑 What Is Qwythos — Full-parameter fine-tune of Qwen3.5-9B on Claude Mythos traces, Apache 2.0 licensed
📈 Performance Gains — +34 MMLU, +30 gsm8k-strict, +19 gsm8k-flex over base Qwen3.5-9B
🔧 Capabilities — 1M context via YaRN, native function calling, optional vision projector
📊 Benchmark Comparison — How it stacks against base Qwen3.5 and other 9B-class models
⚠️ Trade-offs — GPQA regression, narrow improvement domain, not general-purpose

What It Is

Qwythos-9B is a full-parameter fine-tune of Qwen3.5-9B by Empero AI, trained on 500M+ tokens of Claude Mythos chain-of-thought reasoning traces. It ships with a 1M-token context window (extended from the native 262K via YaRN RoPE scaling), native function calling, and an optional vision projector. It is released under the Apache 2.0 license.
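The context extension implies a RoPE scaling factor of about 4. Here is a minimal sketch of that arithmetic, assuming "262K" means 262,144 tokens and "1M" means 1,048,576 tokens (the article gives only the rounded figures). The function name is illustrative and not taken from the model's configuration.

```python
def yarn_scaling_factor(native_ctx: int, target_ctx: int) -> float:
    """Ratio by which the native context window must be stretched."""
    if native_ctx <= 0:
        raise ValueError("native_ctx must be positive")
    if target_ctx < native_ctx:
        raise ValueError("target_ctx must be at least native_ctx")
    return target_ctx / native_ctx


# Assumed exact token counts behind the rounded "262K" and "1M" figures.
NATIVE_CTX = 262_144
TARGET_CTX = 1_048_576

factor = yarn_scaling_factor(NATIVE_CTX, TARGET_CTX)
print(f"Implied YaRN scaling factor: {factor:.1f}")
```

Under these assumptions the factor is a clean 4x; if the real context lengths differ from the powers of two assumed here, the factor shifts accordingly.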

Performance Gains Over Base

Metric	Qwen3.5-9B (base)	Qwythos-9B	Delta (points)
MMLU	23.2%	57.5%	+34.3
gsm8k (strict)	51.0%	81.0%	+30.0
gsm8k (flex)	67.0%	86.0%	+19.0
GPQA Diamond	63.0%	58.0%	-5.0
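The Delta column is simply the fine-tuned score minus the base score, in percentage points. A short sketch that recomputes it from the figures in the table, so the numbers can be rechecked if the scores are updated; the dictionary layout and function name are illustrative choices.

```python
# (base score, fine-tuned score) in percent, copied from the table above.
RESULTS = {
    "MMLU": (23.2, 57.5),
    "gsm8k (strict)": (51.0, 81.0),
    "gsm8k (flex)": (67.0, 86.0),
    "GPQA Diamond": (63.0, 58.0),
}


def point_deltas(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return fine-tuned minus base, in percentage points, per benchmark."""
    return {
        name: round(tuned - base, 1)
        for name, (base, tuned) in results.items()
    }


for name, delta in point_deltas(RESULTS).items():
    print(f"{name:<16} {delta:+.1f}")
```

These are absolute differences in percentage points, not relative improvements, which matters when comparing a benchmark that starts at 23% with one that starts at 67%.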

Key Observations

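The large MMLU gain (+34.3 points) owes much to the base model's unusually low starting score: 23.2% is below the 25% expected from random guessing on four-option questions. Part of the improvement therefore likely reflects better answer-format adherence, although the fine-tuned score of 57.5% still points to a substantial gain in knowledge recall.
GPQA Diamond drops by 5 points, suggesting the fine-tuning trades some graduate-level science reasoning for stronger results on structured benchmarks such as MMLU and gsm8k.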
Independent testing shows Qwythos runs at 100-150 tok/s on consumer GPUs (RTX 4090) and fits in 6-8GB VRAM at Q4_K_M quantization.
Tool-calling correctness: 7/7 on a harness combining Python execution and web search.
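The 6-8GB VRAM figure can be sanity-checked with a back-of-envelope estimate. The sketch below assumes exactly 9.0 billion parameters and roughly 4.85 bits per weight for Q4_K_M; that bits-per-weight value is an approximate, commonly quoted figure rather than one stated in the article, and real GGUF files vary by architecture. KV cache and runtime buffers are not modeled.

```python
def quantized_weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# Assumptions, not figures from the article:
NUM_PARAMS = 9.0e9        # nominal "9B" parameter count
Q4_K_M_BITS = 4.85        # approximate average bits per weight for Q4_K_M

weights_gb = quantized_weights_gb(NUM_PARAMS, Q4_K_M_BITS)
print(f"Estimated weights: {weights_gb:.2f} GB")

# Headroom left for KV cache and buffers within the reported 6-8GB range.
for reported_gb in (6.0, 8.0):
    print(f"Headroom at {reported_gb:.0f} GB: {reported_gb - weights_gb:.2f} GB")
```

Under these assumptions the weights alone take roughly 5.5GB, which is consistent with the reported range once context-dependent KV cache is added; long contexts would push usage well past the low end.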

Key takeaway: Qwythos demonstrates that targeted fine-tuning on high-quality reasoning traces can dramatically improve small model performance on structured tasks. It's not a general-purpose replacement for frontier models, but for math, code, and tool-use workflows at the 9B scale, it's one of the strongest open options available.

Model Card (Hugging Face) · GGUF Quantizations