Benchmarks

Comparative LLM performance data across standardized benchmarks including SWE-Bench, ARC-AGI, GPQA Diamond, and LMSYS Arena.

Model Intel Benchmarks

Overall LLM Rankings (Awesome Agents, Feb 26)

Comprehensive ranking combining MMLU-Pro (knowledge), GPQA Diamond (reasoning), SWE-Bench Verified (coding), Chatbot Arena Elo (human preference), and cost-adjusted value.
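
The exact weighting behind the composite is Awesome Agents' own. As a rough illustration of how such a cost-adjusted blend can be computed, the sketch below normalizes each benchmark, averages them, and discounts by a blended price; the equal weights, the Elo rescaling, and the blended_price helper are hypothetical assumptions, not the published methodology.

```python
import math

# Hypothetical sketch of a cost-adjusted composite score. The equal weights,
# the Elo rescaling, and the blended_price helper are illustrative assumptions,
# NOT the published Awesome Agents methodology.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Assume a 3:1 input:output token mix for a rough $/1M-token blend."""
    return 0.75 * input_per_m + 0.25 * output_per_m

def composite_score(mmlu_pro, gpqa, swe_bench, arena_elo,
                    input_per_m, output_per_m) -> float:
    quality = 0.25 * (mmlu_pro / 100)             # knowledge
    quality += 0.25 * (gpqa / 100)                # reasoning
    quality += 0.25 * (swe_bench / 100)           # coding
    quality += 0.25 * ((arena_elo - 1200) / 300)  # human preference, rescaled to ~0-1
    # Divide by a log of price so cheap models get credit for value without
    # letting price dominate the quality signal outright.
    return quality / math.log10(10 + blended_price(input_per_m, output_per_m))

# e.g. Gemini 3 Pro, using its row from the Top 10 list below:
print(round(composite_score(89.8, 87.5, 63.2, 1389, 1.25, 5.00), 3))  # ~0.70
```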

Top 10 (as of Feb 26, 2026; pricing shown as $ per 1M input / output tokens):

  1. GPT-5.2 Pro (OpenAI): 88.7% MMLU-Pro, 93.2% GPQA Diamond, 65.8% SWE-Bench, 1402 Elo, $10/$30
  2. Claude Opus 4.6 (Anthropic): 88.2% MMLU-Pro, 89.0% GPQA Diamond, 72.5% SWE-Bench, 1398 Elo, $15/$75
  3. Gemini 3 Pro (Google): 89.8% MMLU-Pro (highest), 87.5% GPQA Diamond, 63.2% SWE-Bench, 1389 Elo, $1.25/$5.00 (best price/performance)
  4. Grok 4 Heavy (xAI): 86.4% MMLU-Pro, 88.9% GPQA Diamond, 61.0% SWE-Bench, 1375 Elo, $3/$15
  5. DeepSeek V3.2-Speciale (DeepSeek): 85.9% MMLU-Pro, 85.3% GPQA Diamond, 77.8% SWE-Bench (highest), 1361 Elo, $0.28/$1.10
  6. Claude Opus 4.5 (Anthropic): 87.1% MMLU-Pro, 86.5% GPQA Diamond, 68.4% SWE-Bench, 1370 Elo, $12/$60
  7. Qwen 3.5 (Alibaba): 84.6% MMLU-Pro, 82.1% GPQA Diamond, 62.5% SWE-Bench, 1342 Elo, $0.50/$2.00
  8. GPT-5.2 (OpenAI): 86.3% MMLU-Pro, 88.0% GPQA Diamond, 58.2% SWE-Bench, 1380 Elo, $2.50/$10
  9. Llama 4 Maverick (Meta): 83.2% MMLU-Pro, 78.5% GPQA Diamond, 55.8% SWE-Bench, 1320 Elo, Free (open-weight)
  10. Mistral 3 (Mistral): 82.8% MMLU-Pro, 79.3% GPQA Diamond, 54.1% SWE-Bench, 1315 Elo, $1/$3

Key trends:

  • Gap narrowing: just 6.9pp between #1 and #10 on MMLU-Pro (vs 25pp two years ago)
  • Coding is the differentiator: widest variance among top models
  • Open-weights competitive: Llama 4 and DeepSeek V3.2 in top 10
  • Pricing varies ~50x: among the paid APIs in the top 10, the most expensive model costs roughly 50x more per token than the cheapest

Source: Awesome Agents Overall Rankings (updated Feb 26, 2026)

Gemini 3.1 Pro reasoning haul (WorldofAI, Feb 19)

  • ARC-AGI-2 (verified): 77.1%, a step-change over Gemini 3 Pro and an indicator that the model can infer novel logic patterns rather than memorize outputs.
  • GPQA Diamond: 94.3%, showing the model’s scientific/technical reasoning hit new highs.
  • SWE-Bench Verified: 80.6%, placing it within a hair of Claude Sonnet 4.6 while still improving fidelity on structured code dumps.
  • BrowseComp: 85.9% (tool-grounded reasoning); MCP Atlas: 69.2% (multi-step workflow execution). The distribution of wins across reasoning, coding, and vision contexts suggests a deeper architecture update rather than narrow tuning.
  • Source: WorldofAI · “Google DROPS Gemini 3.1 Pro”

SWE-bench Verified (Feb 2026)

Source: marc0.dev / Scale AI leaks + WorldofAI

  • Claude Opus 4.5: 80.9%
  • Claude Sonnet 4.6: 80.8% (WorldofAI calls it the most reliable coding workhorse for Feb)
  • Gemini 3.1 Pro: 80.6% (WorldofAI reporting of the Feb 19 release)
  • MiniMax M2.5: 80.2%
  • GPT-5.2: 80.0%
  • GLM-5: 77.8% (open-source model, 1451 Arena Elo - highest open-source rating)

SWE-bench Pro (Feb-Mar 2026)

A new, harder private benchmark launched as Verified approaches saturation. Early reports (late Feb) showed top models (GPT-5, Opus 4.1) scoring only ~23% on Pro vs 70%+ on Verified.

Updated leaderboard (as of March 2, 2026) shows significant improvement:

  • Claude Opus 4.5: 45.89% (rank 1)
  • Claude Sonnet 4.5: 43.60% (rank 1)
  • Gemini 3 Pro Preview: 43.30% (rank 1)
  • Claude Sonnet 4.0: 42.70% (rank 1)
  • GPT-5 2025-08-07 (High): 41.78% (rank 2)
  • GPT-5.2 Codex: 41.04% (rank 2)
  • Claude Haiku 4.5: 39.45% (rank 2)
  • Qwen3-coder-480b: 38.70% (rank 2)
  • MiniMax-2.1: 36.81% (rank 2)
  • Gemini 3 Flash: 34.63% (rank 2)

This near-doubling of scores from ~23% to ~42-46% for top models suggests some combination of (a) rapid model improvement in code reasoning over two weeks, (b) scaffold/agent-framework optimization, and (c) differences in evaluation methodology. The benchmark remains the frontier for measuring complex, multi-file, real-world software engineering tasks.
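
For context on what these harnesses actually measure, and where scaffold differences enter, a generic SWE-bench-style evaluation step looks roughly like the sketch below; the field names and commands are illustrative conventions, not the SWE-bench Pro harness itself.

```python
import pathlib
import subprocess
import tempfile

# Generic sketch of a SWE-bench-style evaluation step: apply a model-generated
# patch to a pinned repo checkout and run the task's fail-to-pass tests.
# Field names (repo, base_commit, patch, fail_to_pass) mirror common SWE-bench
# conventions but are assumptions here, not the SWE-bench Pro schema.

def evaluate(instance: dict) -> bool:
    workdir = tempfile.mkdtemp()
    subprocess.run(["git", "clone", instance["repo"], workdir], check=True)
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=workdir, check=True)
    # Apply the model's proposed fix (produced by whatever agent scaffold is under test).
    patch_file = pathlib.Path(workdir) / "model.patch"
    patch_file.write_text(instance["patch"])
    subprocess.run(["git", "apply", str(patch_file)], cwd=workdir, check=True)
    # The instance counts as resolved only if the previously failing tests now pass.
    result = subprocess.run(["python", "-m", "pytest", *instance["fail_to_pass"]],
                            cwd=workdir)
    return result.returncode == 0
```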

LMSYS Chatbot Arena / Text Arena (Feb 2026)

Current Leaderboard (as of Feb 26):

  1. Claude Opus 4.6 Thinking: ~1506 Elo (complex reasoning, multi-step logic)
  2. Gemini 2.5 Pro: ~1450 Elo
  3. GPT-5.2-high: ~1400 Elo
  • GPT-5.3 “Vortex” & “Zephyr”: appeared Feb 25 as mystery models and are accumulating Elo votes, following the GPT-5 “zenith”/“summit” pattern (flagship + reasoning variants). Expected release: mid-March to early April 2026.
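
Arena Elo figures come from pairwise human votes. As background only, the standard Elo expected-score and update formulas are sketched below; the Arena leaderboard actually fits a Bradley-Terry model over all battles rather than running this simple online update, so treat it as an approximation of how vote streams turn into ratings.

```python
# Standard online Elo update for a single head-to-head vote ("battle").
# Background illustration only: the Arena leaderboard fits a Bradley-Terry
# model over all votes rather than this sequential update.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B given their ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

# A ~106-point gap (e.g. ~1506 vs ~1400 above) implies the higher-rated model
# is expected to win roughly 65% of blind matchups:
print(round(expected_score(1506, 1400), 2))  # 0.65
```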

Category Leaders:

  • GPT-5.2: Leading for general chat/daily assistant
  • Claude Sonnet 4.6: #1 for writing tasks (preferred over Sonnet 4.5 in 70% of blind tests, preferred over Opus 4.5 in 59% of coding tasks)
  • Claude Opus 4.6: Leading agentic/coding tier (65.4% Terminal-Bench 2.0, 1606 Elo GDPval-AA)
  • ByteDance Seed 2.0: 6th overall, 3rd in vision
  • DeepSeek R1: Strong performer in coding leaderboard alongside specialized Claude variants
  • Gemini 3.1 Pro: Rolling into leaderboard with strong reasoning metrics
  • Grok 4.1: Notable for creativity/unconstrained style

Additional Notable Benchmarks (Feb 2026)

  • ARC-AGI-2: Gemini 3.1 Pro leads at 77.1% (2.5x improvement over Gemini 3 Pro’s 31.1%)
  • Terminal-Bench 2.0: Claude Opus 4.6 at 65.4%; Gemini 3.1 Pro at 68.5%
  • GDPval-AA (expert office work): Claude Opus 4.6 at 1606 Elo
  • OSWorld (computer use): Claude Sonnet 4.6 jumped from 14.9% to 72.5%

Industry Milestones

  • Anthropic acquires Vercept: Feb 25-26, 2026 - Seattle AI startup specializing in desktop “computer use” technology. Strategic acquisition to advance Claude’s autonomous agent capabilities for live app interaction.
  • Claude Code: $2.5B ARR (as of Feb 2026)
  • Claude Code Security: Launched Feb 20, 2026
  • Claude Haiku 3: Deprecation announced, retiring April 19, 2026 (migrate to Haiku 4.5)
  • Anthropic RSP v3: Third version of Responsible Scaling Policy released (Feb 2026)
  • OpenAI Lockdown Mode: Enterprise security feature announced for ChatGPT (deterministic protections against prompt injection)

MiniMax M2.5 & M2.1 (Feb 12, 2026)

Key metrics:

  • SWE-Bench Verified: 80.2% (M2.5), competitive with Claude/Gemini frontier
  • Multi-SWE-Bench: 51.3% (first model to break 50% on this benchmark)
  • BrowseComp: 76.3% (with context management)
  • Terminal-Bench 2.0: Tested using Claude Code scaffolding with expanded sandbox specs
  • VIBE-Pro: Internal benchmark, on par with Opus 4.5
  • Office work (GDPval-MM): 59.0% average win rate vs mainstream models
  • Speed: M2.5 completes SWE-Bench Verified 37% faster than M2.1, matching Claude Opus 4.6 (22.8 min avg)

Architecture:

  • MoE model trained via RL in 200K+ real-world environments
  • Trained on 10+ languages across full-stack projects (Web, Android, iOS, Windows)
  • Covers entire development lifecycle: 0-to-1 system design → 1-to-10 development → 10-to-90 iteration → 90-to-100 testing
  • Spec-writing tendency: actively decomposes projects like a software architect before coding

Cost efficiency (flagship feature):

  • M2.5: $0.3/M input, $1.2/M output (50 TPS)
  • M2.5-Lightning: $0.3/M input, $2.4/M output (100 TPS)
  • 10-20x cheaper than Claude Opus 4.6, Gemini 3 Pro, GPT-5 on output tokens
  • Example: running M2.5-Lightning continuously at 100 TPS for one hour costs roughly $1; M2.5 at 50 TPS for one hour costs roughly $0.30
  • Four M2.5 instances running continuously for a year = $10,000
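
A quick sanity check of those figures from the listed output prices (output tokens only; input tokens add a small amount on top, which is roughly how the rounded hourly and yearly totals above are reached):

```python
# Back-of-the-envelope check of the cost claims above, using output prices only.
# Input tokens add a small additional amount, which is how the rounded figures
# above ($1/hour, $0.30/hour, ~$10K/year for four instances) are reached.

def output_cost_per_hour(tps: float, usd_per_m_output: float) -> float:
    tokens_per_hour = tps * 3600
    return tokens_per_hour / 1_000_000 * usd_per_m_output

print(output_cost_per_hour(100, 2.4))                # M2.5-Lightning: ~$0.86/hour
print(output_cost_per_hour(50, 1.2))                 # M2.5: ~$0.22/hour
print(output_cost_per_hour(50, 1.2) * 24 * 365 * 4)  # four M2.5 instances, one year: ~$7.6K output-only
```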

Framework generalization:

  • Tested across multiple agent scaffolds (Droid, OpenCode, Claude Code)
  • M2.5 on Droid: 79.7% > Opus 4.6 (78.9%)
  • M2.5 on OpenCode: 76.1% > Opus 4.6 (75.9%)

Deployment at MiniMax:

  • 30% of overall company tasks autonomously completed by M2.5
  • 80% of newly committed code generated by M2.5
  • Over 10,000 user-built “Experts” (domain-specific + Office Skills combos)

Source: MiniMax M2.5 announcement | MiniMax M2.1 announcement

GLM-5 Full Technical Details (arXiv Feb 2026)

Performance:

  • Intelligence Index v4.0: Score of 50 (first open-weights model to hit 50, up from GLM-4.7’s 42)
  • LMArena: #1 open model in both Text Arena and Code Arena, on par with Claude Opus 4.5 and Gemini 3 Pro
  • Vending-Bench 2: #1 among open-source models ($4,432 final balance, approaches Opus 4.5)
  • SWE-Bench Verified: 77.8%
  • Chatbot Arena Elo: 1451 (highest open-source rating)
  • Terminal-Bench 2.0, BrowseComp, MCP-Atlas, τ²-Bench: ~20% improvement over GLM-4.7

Architecture:

  • 744B parameters (40B active), MoE with 256 experts
  • DSA (DeepSeek Sparse Attention): Dynamic content-aware token selection, reduces attention compute by 1.5-2× for long sequences
  • Training: 28.5T tokens total (pre-training 27T + mid-training 1.5T)
  • Context: Progressive extension across stages (32K → 128K → 200K)
  • Multi-token Prediction: Parameter sharing across 3 MTP layers, longer acceptance length than DeepSeek-V3.2
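
For readers unfamiliar with MTP heads, the sketch below shows the general shape of the idea: extra blocks reuse the main model's output projection to predict tokens further ahead, which is what the "acceptance length" during speculative decoding measures. The depth, dimensions, and sharing scheme here are illustrative assumptions, not GLM-5's published configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of multi-token prediction (MTP) heads with a shared output
# projection. Depth, dimensions, and the exact sharing scheme are illustrative
# assumptions, not GLM-5's published configuration. Causal masking is omitted
# for brevity.

class MTPHeads(nn.Module):
    def __init__(self, d_model: int, vocab: int, depth: int = 3):
        super().__init__()
        # One shared unembedding matrix reused by every MTP layer.
        self.shared_out = nn.Linear(d_model, vocab, bias=False)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: [batch, seq, d_model] from the main trunk; block k refines it
        # to predict one more token further ahead (the lookahead tokens that the
        # main model later verifies during speculative decoding).
        logits = []
        h = hidden
        for block in self.blocks:
            h = block(h)
            logits.append(self.shared_out(h))
        return logits

heads = MTPHeads(d_model=512, vocab=32000)
print([t.shape for t in heads(torch.randn(2, 16, 512))])
```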

Training innovations:

  • Asynchronous RL infrastructure: Fully decoupled generation from training, maximizes GPU utilization
  • Sequential RL pipeline: Reasoning RL → Agentic RL → General RL with On-Policy Cross-Stage Distillation
  • Agentic RL: 10K+ real-world SWE + terminal + multi-hop search environments, token-level clipping for off-policy stability

Full-stack Chinese GPU support:

  • Deep optimization across 7 domestic chip platforms: Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, MetaX, Enflame
  • DSA deterministic top-k operator critical for RL stability (torch.topk preferred over non-deterministic custom CUDA variants)
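
Determinism matters for RL because the sparse-attention token selection must replay identically between rollout generation and gradient computation; if tied scores resolve differently between runs, the policy's log-probabilities drift. A toy illustration of content-aware top-k selection built on torch.topk follows; the relevance scoring is a stand-in, not GLM-5's actual DSA indexer.

```python
import torch

# Toy illustration of content-aware sparse attention via top-k selection.
# The relevance scoring (a dot product over indexer projections) is a stand-in
# and not GLM-5's actual DSA indexer; the point is selecting the kept tokens
# with torch.topk, which the paper reports behaves deterministically where
# some custom CUDA selection kernels do not.

def sparse_attention_indices(q_index: torch.Tensor,   # [seq, d] per-query indexer states
                             k_index: torch.Tensor,   # [seq, d] per-key indexer states
                             k_keep: int) -> torch.Tensor:
    scores = q_index @ k_index.T                       # [seq, seq] relevance scores
    causal = torch.tril(torch.ones_like(scores)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    # Deterministic selection of the k_keep most relevant prior tokens per query;
    # full attention is then computed only over these positions.
    return torch.topk(scores, k=k_keep, dim=-1).indices  # [seq, k_keep]

idx = sparse_attention_indices(torch.randn(128, 64), torch.randn(128, 64), k_keep=32)
print(idx.shape)  # torch.Size([128, 32])
```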

Source: GLM-5 arXiv paper

Video Intel Highlights

  • Feb 19 · WorldofAI — “Google DROPS Gemini 3.1 Pro” (link): new ARC-AGI/GPQA/BrowseComp stats and confirmation that structured outputs have tightened compared to 3 Pro.
  • Feb 22 · AICodeKing — “Gemini 3.1 Pro (Fixed with KingMode) + GLM-5: This SIMPLE TRICK makes Gemini 3.1 PRO A BEAST!” (link): demonstrates the KingMode refinement, with GLM-5 joining the session for agentic coding loops, and hints that the new Gemini still benefits from shell scripting.
  • Feb 17 · WorldofAI — “Claude Sonnet 4.6: The Best AI Coding Model Ever” (still the reference point for production agents despite the Gemini release).

This page was last updated on March 9, 2026.