Benchmarks

Comparative LLM performance data across standardized benchmarks including SWE-Bench, ARC-AGI, GPQA Diamond, and LMSYS Arena.

Model Intel Benchmarks

Overall LLM Rankings (Awesome Agents, Feb 26)

Comprehensive ranking combining MMLU-Pro (knowledge), GPQA Diamond (reasoning), SWE-Bench Verified (coding), Chatbot Arena Elo (human preference), and cost-adjusted value.
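
The exact weighting behind the composite is Awesome Agents' own. As a rough illustration of how such a cost-adjusted blend can be computed, the sketch below normalizes each benchmark, averages them, and discounts by a blended price; the equal weights, the Elo rescaling, and the blended_price helper are hypothetical assumptions, not the published methodology.

```python
import math

# Hypothetical sketch of a cost-adjusted composite score. The equal weights,
# the Elo rescaling, and the blended_price helper are illustrative assumptions,
# NOT the published Awesome Agents methodology.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Assume a 3:1 input:output token mix for a rough $/1M-token blend."""
    return 0.75 * input_per_m + 0.25 * output_per_m

def composite_score(mmlu_pro, gpqa, swe_bench, arena_elo,
                    input_per_m, output_per_m) -> float:
    quality = 0.25 * (mmlu_pro / 100)             # knowledge
    quality += 0.25 * (gpqa / 100)                # reasoning
    quality += 0.25 * (swe_bench / 100)           # coding
    quality += 0.25 * ((arena_elo - 1200) / 300)  # human preference, rescaled to ~0-1
    # Divide by a log of price so cheap models get credit for value without
    # letting price dominate the quality signal outright.
    return quality / math.log10(10 + blended_price(input_per_m, output_per_m))

# e.g. Gemini 3 Pro, using its row from the Top 10 list below:
print(round(composite_score(89.8, 87.5, 63.2, 1389, 1.25, 5.00), 3))  # ~0.70
```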

Top 10 (as of Feb 26, 2026; pricing shown as $ per 1M input / output tokens):

  1. GPT-5.2 Pro (OpenAI): 88.7% MMLU-Pro, 93.2% GPQA Diamond, 65.8% SWE-Bench, 1402 Elo, $10/$30
  2. Claude Opus 4.6 (Anthropic): 88.2% MMLU-Pro, 89.0% GPQA Diamond, 72.5% SWE-Bench, 1398 Elo, $15/$75
  3. Gemini 3 Pro (Google): 89.8% MMLU-Pro (highest), 87.5% GPQA Diamond, 63.2% SWE-Bench, 1389 Elo, $1.25/$5.00 (best price/performance)
  4. Grok 4 Heavy (xAI): 86.4% MMLU-Pro, 88.9% GPQA Diamond, 61.0% SWE-Bench, 1375 Elo, $3/$15
  5. DeepSeek V3.2-Speciale (DeepSeek): 85.9% MMLU-Pro, 85.3% GPQA Diamond, 77.8% SWE-Bench (highest), 1361 Elo, $0.28/$1.10
  6. Claude Opus 4.5 (Anthropic): 87.1% MMLU-Pro, 86.5% GPQA Diamond, 68.4% SWE-Bench, 1370 Elo, $12/$60
  7. Qwen 3.5 (Alibaba): 84.6% MMLU-Pro, 82.1% GPQA Diamond, 62.5% SWE-Bench, 1342 Elo, $0.50/$2.00
  8. GPT-5.2 (OpenAI): 86.3% MMLU-Pro, 88.0% GPQA Diamond, 58.2% SWE-Bench, 1380 Elo, $2.50/$10
  9. Llama 4 Maverick (Meta): 83.2% MMLU-Pro, 78.5% GPQA Diamond, 55.8% SWE-Bench, 1320 Elo, Free (open-weight)
  10. Mistral 3 (Mistral): 82.8% MMLU-Pro, 79.3% GPQA Diamond, 54.1% SWE-Bench, 1315 Elo, $1/$3

Key trends:

  • Gap narrowing: just 6.9pp between #1 and #10 on MMLU-Pro (vs 25pp two years ago)
  • Coding is the differentiator: widest variance among top models
  • Open-weights competitive: Llama 4 and DeepSeek V3.2 in top 10
  • Pricing varies ~50x: among the paid APIs in the top 10, the most expensive model costs roughly 50x more per token than the cheapest

Source: Awesome Agents Overall Rankings (updated Feb 26, 2026)

Gemini 3.1 Pro reasoning haul (WorldofAI, Feb 19)

  • ARC-AGI-2 (verified): 77.1%, a step-change over Gemini 3 Pro and an indicator that the model can infer novel logic patterns rather than memorize outputs.
  • GPQA Diamond: 94.3%, showing the model’s scientific/technical reasoning hit new highs.
  • SWE-Bench Verified: 80.6%, placing it within a hair of Claude Sonnet 4.6 while still improving fidelity on structured code dumps.
  • BrowseComp: 85.9% (tool-grounded reasoning); MCP Atlas: 69.2% (multi-step workflow execution). The distribution of wins across reasoning, coding, and vision contexts suggests a deeper architecture update rather than narrow tuning.
  • Source: WorldofAI · “Google DROPS Gemini 3.1 Pro”

SWE-bench Verified (Feb 2026)

Source: marc0.dev / Scale AI leaks + WorldofAI

  • Claude Opus 4.5: 80.9%
  • Claude Sonnet 4.6: 80.8% (WorldofAI calls it the most reliable coding workhorse for Feb)
  • Gemini 3.1 Pro: 80.6% (WorldofAI reporting of the Feb 19 release)
  • MiniMax M2.5: 80.2%
  • GPT-5.2: 80.0%
  • GLM-5: 77.8% (open-source model, 1451 Arena Elo - highest open-source rating)

SWE-bench Pro (Feb-Mar 2026)

A new, harder private benchmark launched as Verified approaches saturation. Early reports (late Feb) showed top models (GPT-5, Opus 4.1) scoring only ~23% on Pro vs 70%+ on Verified.

Updated leaderboard (as of March 2, 2026) shows significant improvement:

  • Claude Opus 4.5: 45.89% (rank 1)
  • Claude Sonnet 4.5: 43.60% (rank 1)
  • Gemini 3 Pro Preview: 43.30% (rank 1)
  • Claude Sonnet 4.0: 42.70% (rank 1)
  • GPT-5 2025-08-07 (High): 41.78% (rank 2)
  • GPT-5.2 Codex: 41.04% (rank 2)
  • Claude Haiku 4.5: 39.45% (rank 2)
  • Qwen3-coder-480b: 38.70% (rank 2)
  • MiniMax-2.1: 36.81% (rank 2)
  • Gemini 3 Flash: 34.63% (rank 2)

This near-doubling of scores from ~23% to ~42-46% for top models suggests some combination of (a) rapid model improvement in code reasoning over two weeks, (b) scaffold/agent-framework optimization, and (c) differences in evaluation methodology. The benchmark remains the frontier for measuring complex, multi-file, real-world software engineering tasks.
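
For context on what these harnesses actually measure, and where scaffold differences enter, a generic SWE-bench-style evaluation step looks roughly like the sketch below; the field names and commands are illustrative conventions, not the SWE-bench Pro harness itself.

```python
import pathlib
import subprocess
import tempfile

# Generic sketch of a SWE-bench-style evaluation step: apply a model-generated
# patch to a pinned repo checkout and run the task's fail-to-pass tests.
# Field names (repo, base_commit, patch, fail_to_pass) mirror common SWE-bench
# conventions but are assumptions here, not the SWE-bench Pro schema.

def evaluate(instance: dict) -> bool:
    workdir = tempfile.mkdtemp()
    subprocess.run(["git", "clone", instance["repo"], workdir], check=True)
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=workdir, check=True)
    # Apply the model's proposed fix (produced by whatever agent scaffold is under test).
    patch_file = pathlib.Path(workdir) / "model.patch"
    patch_file.write_text(instance["patch"])
    subprocess.run(["git", "apply", str(patch_file)], cwd=workdir, check=True)
    # The instance counts as resolved only if the previously failing tests now pass.
    result = subprocess.run(["python", "-m", "pytest", *instance["fail_to_pass"]],
                            cwd=workdir)
    return result.returncode == 0
```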

LMSYS Chatbot Arena / Text Arena (Feb 2026)

Current Leaderboard (as of Feb 26):

  1. Claude Opus 4.6 Thinking: ~1506 Elo (complex reasoning, multi-step logic)
  2. Gemini 2.5 Pro: ~1450 Elo
  3. GPT-5.2-high: ~1400 Elo
  • GPT-5.3 “Vortex” & “Zephyr”: appeared Feb 25 as mystery models and are accumulating Elo votes, following the GPT-5 “zenith”/“summit” pattern (flagship + reasoning variants). Expected release: mid-March to early April 2026.
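
Arena Elo figures come from pairwise human votes. As background only, the standard Elo expected-score and update formulas are sketched below; the Arena leaderboard actually fits a Bradley-Terry model over all battles rather than running this simple online update, so treat it as an approximation of how vote streams turn into ratings.

```python
# Standard online Elo update for a single head-to-head vote ("battle").
# Background illustration only: the Arena leaderboard fits a Bradley-Terry
# model over all votes rather than this sequential update.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B given their ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

# A ~106-point gap (e.g. ~1506 vs ~1400 above) implies the higher-rated model
# is expected to win roughly 65% of blind matchups:
print(round(expected_score(1506, 1400), 2))  # 0.65
```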

Category Leaders:

  • GPT-5.2: Leading for general chat/daily assistant
  • Claude Sonnet 4.6: #1 for writing tasks (preferred over Sonnet 4.5 in 70% of blind tests, preferred over Opus 4.5 in 59% of coding tasks)
  • Claude Opus 4.6: Leading agentic/coding tier (65.4% Terminal-Bench 2.0, 1606 Elo GDPval-AA)
  • ByteDance Seed 2.0: 6th overall, 3rd in vision
  • DeepSeek R1: Strong performer in coding leaderboard alongside specialized Claude variants
  • Gemini 3.1 Pro: Rolling into leaderboard with strong reasoning metrics
  • Grok 4.1: Notable for creativity/unconstrained style

Additional Notable Benchmarks (Feb 2026)

  • ARC-AGI-2: Gemini 3.1 Pro leads at 77.1% (2.5x improvement over Gemini 3 Pro’s 31.1%)
  • Terminal-Bench 2.0: Claude Opus 4.6 at 65.4%; Gemini 3.1 Pro at 68.5%
  • GDPval-AA (expert office work): Claude Opus 4.6 at 1606 Elo
  • OSWorld (computer use): Claude Sonnet 4.6 jumped from 14.9% to 72.5%

Industry Milestones

  • Anthropic acquires Vercept: Feb 25-26, 2026 - Seattle AI startup specializing in desktop “computer use” technology. Strategic acquisition to advance Claude’s autonomous agent capabilities for live app interaction.
  • Claude Code: $2.5B ARR (as of Feb 2026)
  • Claude Code Security: Launched Feb 20, 2026
  • Claude Haiku 3: Deprecation announced, retiring April 19, 2026 (migrate to Haiku 4.5)
  • Anthropic RSP v3: Third version of Responsible Scaling Policy released (Feb 2026)
  • OpenAI Lockdown Mode: Enterprise security feature announced for ChatGPT (deterministic protections against prompt injection)

MiniMax M2.5 & M2.1 (Feb 12, 2026)

Key metrics:

  • SWE-Bench Verified: 80.2% (M2.5), competitive with Claude/Gemini frontier
  • Multi-SWE-Bench: 51.3% (first model to break 50% on this benchmark)
  • BrowseComp: 76.3% (with context management)
  • Terminal-Bench 2.0: Tested using Claude Code scaffolding with expanded sandbox specs
  • VIBE-Pro: Internal benchmark, on par with Opus 4.5
  • Office work (GDPval-MM): 59.0% average win rate vs mainstream models
  • Speed: M2.5 completes SWE-Bench Verified 37% faster than M2.1, matching Claude Opus 4.6 (22.8 min avg)

Architecture:

  • MoE model trained via RL in 200K+ real-world environments
  • Trained on 10+ languages across full-stack projects (Web, Android, iOS, Windows)
  • Covers entire development lifecycle: 0-to-1 system design → 1-to-10 development → 10-to-90 iteration → 90-to-100 testing
  • Spec-writing tendency: actively decomposes projects like a software architect before coding

Cost efficiency (flagship feature):

  • M2.5: $0.3/M input, $1.2/M output (50 TPS)
  • M2.5-Lightning: $0.3/M input, $2.4/M output (100 TPS)
  • 10-20x cheaper than Claude Opus 4.6, Gemini 3 Pro, GPT-5 on output tokens
  • Example: running M2.5-Lightning continuously at 100 TPS for one hour costs roughly $1; M2.5 at 50 TPS for one hour costs roughly $0.30
  • Four M2.5 instances running continuously for a year = $10,000
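
A quick sanity check of those figures from the listed output prices (output tokens only; input tokens add a small amount on top, which is roughly how the rounded hourly and yearly totals above are reached):

```python
# Back-of-the-envelope check of the cost claims above, using output prices only.
# Input tokens add a small additional amount, which is how the rounded figures
# above ($1/hour, $0.30/hour, ~$10K/year for four instances) are reached.

def output_cost_per_hour(tps: float, usd_per_m_output: float) -> float:
    tokens_per_hour = tps * 3600
    return tokens_per_hour / 1_000_000 * usd_per_m_output

print(output_cost_per_hour(100, 2.4))                # M2.5-Lightning: ~$0.86/hour
print(output_cost_per_hour(50, 1.2))                 # M2.5: ~$0.22/hour
print(output_cost_per_hour(50, 1.2) * 24 * 365 * 4)  # four M2.5 instances, one year: ~$7.6K output-only
```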

Framework generalization:

  • Tested across multiple agent scaffolds (Droid, OpenCode, Claude Code)
  • M2.5 on Droid: 79.7% > Opus 4.6 (78.9%)
  • M2.5 on OpenCode: 76.1% > Opus 4.6 (75.9%)

Deployment at MiniMax:

  • 30% of overall company tasks autonomously completed by M2.5
  • 80% of newly committed code generated by M2.5
  • Over 10,000 user-built “Experts” (domain-specific + Office Skills combos)

Source: MiniMax M2.5 announcement | MiniMax M2.1 announcement

GLM-5 Full Technical Details (arXiv Feb 2026)

Performance:

  • Intelligence Index v4.0: Score of 50 (first open-weights model to hit 50, up from GLM-4.7’s 42)
  • LMArena: #1 open model in both Text Arena and Code Arena, on par with Claude Opus 4.5 and Gemini 3 Pro
  • Vending-Bench 2: #1 among open-source models ($4,432 final balance, approaches Opus 4.5)
  • SWE-Bench Verified: 77.8%
  • Chatbot Arena Elo: 1451 (highest open-source rating)
  • Terminal-Bench 2.0, BrowseComp, MCP-Atlas, τ²-Bench: ~20% improvement over GLM-4.7

Architecture:

  • 744B parameters (40B active), MoE with 256 experts
  • DSA (DeepSeek Sparse Attention): Dynamic content-aware token selection, reduces attention compute by 1.5-2× for long sequences
  • Training: 28.5T tokens total (pre-training 27T + mid-training 1.5T)
  • Context: Progressive extension across stages (32K → 128K → 200K)
  • Multi-token Prediction: Parameter sharing across 3 MTP layers, longer acceptance length than DeepSeek-V3.2
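
For readers unfamiliar with MTP heads, the sketch below shows the general shape of the idea: extra blocks reuse the main model's output projection to predict tokens further ahead, which is what the "acceptance length" during speculative decoding measures. The depth, dimensions, and sharing scheme here are illustrative assumptions, not GLM-5's published configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of multi-token prediction (MTP) heads with a shared output
# projection. Depth, dimensions, and the exact sharing scheme are illustrative
# assumptions, not GLM-5's published configuration. Causal masking is omitted
# for brevity.

class MTPHeads(nn.Module):
    def __init__(self, d_model: int, vocab: int, depth: int = 3):
        super().__init__()
        # One shared unembedding matrix reused by every MTP layer.
        self.shared_out = nn.Linear(d_model, vocab, bias=False)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: [batch, seq, d_model] from the main trunk; block k refines it
        # to predict one more token further ahead (the lookahead tokens that the
        # main model later verifies during speculative decoding).
        logits = []
        h = hidden
        for block in self.blocks:
            h = block(h)
            logits.append(self.shared_out(h))
        return logits

heads = MTPHeads(d_model=512, vocab=32000)
print([t.shape for t in heads(torch.randn(2, 16, 512))])
```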

Training innovations:

  • Asynchronous RL infrastructure: Fully decoupled generation from training, maximizes GPU utilization
  • Sequential RL pipeline: Reasoning RL → Agentic RL → General RL with On-Policy Cross-Stage Distillation
  • Agentic RL: 10K+ real-world SWE + terminal + multi-hop search environments, token-level clipping for off-policy stability

Full-stack Chinese GPU support:

  • Deep optimization across 7 domestic chip platforms: Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, MetaX, Enflame
  • DSA deterministic top-k operator critical for RL stability (torch.topk preferred over non-deterministic custom CUDA variants)
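
Determinism matters for RL because the sparse-attention token selection must replay identically between rollout generation and gradient computation; if tied scores resolve differently between runs, the policy's log-probabilities drift. A toy illustration of content-aware top-k selection built on torch.topk follows; the relevance scoring is a stand-in, not GLM-5's actual DSA indexer.

```python
import torch

# Toy illustration of content-aware sparse attention via top-k selection.
# The relevance scoring (a dot product over indexer projections) is a stand-in
# and not GLM-5's actual DSA indexer; the point is selecting the kept tokens
# with torch.topk, which the paper reports behaves deterministically where
# some custom CUDA selection kernels do not.

def sparse_attention_indices(q_index: torch.Tensor,   # [seq, d] per-query indexer states
                             k_index: torch.Tensor,   # [seq, d] per-key indexer states
                             k_keep: int) -> torch.Tensor:
    scores = q_index @ k_index.T                       # [seq, seq] relevance scores
    causal = torch.tril(torch.ones_like(scores)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    # Deterministic selection of the k_keep most relevant prior tokens per query;
    # full attention is then computed only over these positions.
    return torch.topk(scores, k=k_keep, dim=-1).indices  # [seq, k_keep]

idx = sparse_attention_indices(torch.randn(128, 64), torch.randn(128, 64), k_keep=32)
print(idx.shape)  # torch.Size([128, 32])
```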

Source: GLM-5 arXiv paper

Video Intel Highlights

  • Feb 19 · WorldofAI — “Google DROPS Gemini 3.1 Pro” (link): new ARC-AGI/GPQA/BrowseComp stats and confirmation that structured outputs have tightened compared to 3 Pro.
  • Feb 22 · AICodeKing — “Gemini 3.1 Pro (Fixed with KingMode) + GLM-5: This SIMPLE TRICK makes Gemini 3.1 PRO A BEAST!” (link): demonstrates the KingMode refinement, with GLM-5 joining the session for agentic coding loops, and hints that the new Gemini still benefits from shell scripting.
  • Feb 17 · WorldofAI — “Claude Sonnet 4.6: The Best AI Coding Model Ever” (still the reference point for production agents despite the Gemini release).

This page was last updated on March 9, 2026.