LLM Comparison Dashboard

🏷️ Model Overview

Select a task to compare all models side-by-side

Click an industry to see model scores (includes local LLMs)

Select a benchmark — box size and color reflect score

GPT-5.5: Best all-rounder, native audio, enterprise agents

Claude 4.8: SWE-bench king, nuanced reasoning, long context

DS V4 Pro: Best open math/code, MIT, 1M context

Qwen3.7: Top Chinese model, 1M ctx, Apache 2.0

GLM-5.2: SOTA open coding/agents, MIT, async RL

Gemini 3.1: Best reasoning (ARC-AGI-2 77%), 1M ctx, multimodal

MiniMax M3: 1M ctx + multimodal, 23B active, cheap

Gemma 4: Best small model, Apache 2.0, edge-ready

VibeThinker: 3B params, frontier math reasoning, MIT

Ornith-1.0: SOTA open coding at each size tier, MIT

Kimi K2.6: Agent swarms (300 sub-agents), multimodal

GPT-5.5: Expensive, closed, rate limits

Claude 4.8: Highest cost, export restrictions

DS V4 Pro: Chinese data law, needs multi-GPU

Qwen3.7: Proprietary (no weights), hallucination

GLM-5.2: 744B total, needs serious hardware

MiniMax M3: Community license, self-reported scores

Gemma 4: Lower ceiling on hard reasoning

VibeThinker: Reasoning-only, bad at general chat

Ornith-1.0: Brand new, limited independent verification

Kimi K2.6: Modified MIT, API-only for best variant
