LLM Comparison Dashboard

2025–2026 models — closed-source flagships + popular open-weight/local LLMs

20 MODELS · 12+ BENCHMARKS · UPDATED JUNE 2026

🏷️ Model Overview

🧠 Task Performance

Select a task to compare all models side-by-side

🏭 Industry Suitability

Click an industry to see model scores (includes local LLMs)

📊 Benchmark Scores

Select a benchmark — box size and color reflect score

💰 Cost vs Performance

[Scatter plot. Axes: cost per 1M input tokens (USD, $0 to $12+) and overall score (60 to 95), with a highlighted "sweet spot" region. Models plotted: DS V4, Qwen3.7, M3, Gemma4, GPT-5.5, Claude, GLM-5, Kimi, VibeT, Ornith.]

⚡ Strengths & Weaknesses

💪 Strengths

GPT-5.5: Best all-rounder, native audio, enterprise agents
Claude 4.8: SWE-bench king, nuanced reasoning, long context
DS V4 Pro: Best open math/code, MIT, 1M context
Qwen3.7: Top Chinese model, 1M ctx, Apache 2.0
GLM-5.2: SOTA open coding/agents, MIT, async RL
Gemini 3.1: Best reasoning (ARC-AGI-2 77%), 1M ctx, multimodal
MiniMax M3: 1M ctx + multimodal, 23B active, cheap
Gemma 4: Best small model, Apache 2.0, edge-ready
VibeThinker: 3B params, frontier math reasoning, MIT
Ornith-1.0: SOTA open coding at each size tier, MIT
Kimi K2.6: Agent swarms (300 sub-agents), multimodal

⚠️ Weaknesses

GPT-5.5: Expensive, closed, rate limits
Claude 4.8: Highest cost, export restrictions
DS V4 Pro: Chinese data law, needs multi-GPU
Qwen3.7: Proprietary (no weights), hallucination
GLM-5.2: 744B total, needs serious hardware
MiniMax M3: Community license, self-reported scores
Gemma 4: Lower ceiling on hard reasoning
VibeThinker: Reasoning-only, bad at general chat
Ornith-1.0: Brand new, limited independent verification
Kimi K2.6: Modified MIT, API-only for best variant

📋 Task-by-Task Breakdown

Select a task category to see all model scores

🖥️ Popular Local / Open-Weight Models by Hardware Tier