Model Intel Benchmarks
Overall LLM Rankings (Awesome Agents, Feb 26)
Comprehensive ranking combining MMLU-Pro (knowledge), GPQA Diamond (reasoning), SWE-Bench Verified (coding), Chatbot Arena Elo (human preference), and cost-adjusted value; one way such a composite could be computed is sketched after the list below.
Top 10 (as of Feb 26, 2026; prices listed as $ per 1M input / output tokens):
1. GPT-5.2 Pro (OpenAI): 88.7% MMLU-Pro, 93.2% GPQA Diamond, 65.8% SWE-Bench, 1402 Elo, $10/$30
2. Claude Opus 4.6 (Anthropic): 88.2% MMLU-Pro, 89.0% GPQA Diamond, 72.5% SWE-Bench, 1398 Elo, $15/$75
3. Gemini 3 Pro (Google): 89.8% MMLU-Pro (highest), 87.5% GPQA Diamond, 63.2% SWE-Bench, 1389 Elo, $1.25/$5.00 (best price/performance)
4. Grok 4 Heavy (xAI): 86.4% MMLU-Pro, 88.9% GPQA Diamond, 61.0% SWE-Bench, 1375 Elo, $3/$15
5. DeepSeek V3.2-Speciale (DeepSeek): 85.9% MMLU-Pro, 85.3% GPQA Diamond, 77.8% SWE-Bench (highest), 1361 Elo, $0.28/$1.10
6. Claude Opus 4.5 (Anthropic): 87.1% MMLU-Pro, 86.5% GPQA Diamond, 68.4% SWE-Bench, 1370 Elo, $12/$60
7. Qwen 3.5 (Alibaba): 84.6% MMLU-Pro, 82.1% GPQA Diamond, 62.5% SWE-Bench, 1342 Elo, $0.50/$2.00
8. GPT-5.2 (OpenAI): 86.3% MMLU-Pro, 88.0% GPQA Diamond, 58.2% SWE-Bench, 1380 Elo, $2.50/$10
9. Llama 4 Maverick (Meta): 83.2% MMLU-Pro, 78.5% GPQA Diamond, 55.8% SWE-Bench, 1320 Elo, Free (open-weight)
10. Mistral 3 (Mistral): 82.8% MMLU-Pro, 79.3% GPQA Diamond, 54.1% SWE-Bench, 1315 Elo, $1/$3
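Awesome Agents does not publish its exact weighting here; the sketch below shows one plausible way to fold the five signals into a single cost-adjusted score. The weights, the Elo normalization range, and the inverse-cost value term are all illustrative assumptions, not the actual methodology.

```python
# Hypothetical composite scorer. Weights, Elo range, and the cost term
# are illustrative assumptions, NOT the published Awesome Agents method.

def composite_score(mmlu_pro, gpqa, swe_bench, elo, in_price, out_price,
                    elo_floor=1300, elo_span=120):
    knowledge = mmlu_pro / 100          # normalize benchmarks to [0, 1]
    reasoning = gpqa / 100
    coding = swe_bench / 100
    preference = (elo - elo_floor) / elo_span
    blended = (in_price + out_price) / 2 or 0.01   # guard free models
    value = 1 / (1 + blended)           # cheaper $/M tokens -> higher value
    return (0.25 * knowledge + 0.20 * reasoning + 0.25 * coding
            + 0.20 * preference + 0.10 * value)

# Example: Gemini 3 Pro's row from the list above scores ~0.73 under
# these assumed weights, with its low blended price boosting the value term.
print(round(composite_score(89.8, 87.5, 63.2, 1389, 1.25, 5.00), 3))
```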
Key trends:
- Gap narrowing: only ~7pp separates the highest (89.8%) and lowest (82.8%) MMLU-Pro scores in the top 10, vs ~25pp two years ago
- Coding is the differentiator: widest variance among top models
- Open-weights competitive: Llama 4 and DeepSeek V3.2 in top 10
- Pricing spans ~50-70x: the priciest paid model runs ~54x the cheapest on input tokens ($15 vs $0.28/M) and ~68x on output tokens ($75 vs $1.10/M), with open-weight Llama 4 free outright
Source: Awesome Agents Overall Rankings (updated Feb 26, 2026)
Gemini 3.1 Pro reasoning haul (WorldofAI, Feb 19)
- ARC-AGI-2 (verified): 77.1%, a step-change over Gemini 3 Pro and an indicator that the model can infer novel logic patterns rather than memorize outputs.
- GPQA Diamond: 94.3%, showing the model’s scientific/technical reasoning hit new highs.
- SWE-Bench Verified: 80.6%, placing it within a hair of Claude Sonnet 4.6 while still improving fidelity on structured code dumps.
- BrowseComp: 85.9% (tool-grounded reasoning) and MCP Atlas: 69.2% (multi-step workflow execution); the distribution of wins across reasoning, coding, and vision contexts suggests a deeper architecture update rather than narrow tuning.
- Source: WorldofAI · “Google DROPS Gemini 3.1 Pro”
SWE-bench Verified (Feb 2026)
Source: marc0.dev / Scale AI leaks + WorldofAI
- Claude Opus 4.5: 80.9%
- Claude Sonnet 4.6: 80.8% (WorldofAI calls it the most reliable coding workhorse for Feb)
- Gemini 3.1 Pro: 80.6% (WorldofAI reporting of the Feb 19 release)
- MiniMax M2.5: 80.2%
- GPT-5.2: 80.0%
- GLM-5: 77.8% (open-source model, 1451 Arena Elo - highest open-source rating)
SWE-bench Pro (Feb-Mar 2026)
A new, harder private benchmark launched as Verified approaches saturation. Early reports (late Feb) showed top models (GPT-5, Opus 4.1) scoring only ~23% on Pro vs 70%+ on Verified.
Updated leaderboard (as of March 2, 2026) shows significant improvement; rank numbers repeat where the source groups models into tiers:
- Claude Opus 4.5: 45.89% (rank 1)
- Claude Sonnet 4.5: 43.60% (rank 1)
- Gemini 3 Pro Preview: 43.30% (rank 1)
- Claude Sonnet 4.0: 42.70% (rank 1)
- GPT-5 2025-08-07 (High): 41.78% (rank 2)
- GPT-5.2 Codex: 41.04% (rank 2)
- Claude Haiku 4.5: 39.45% (rank 2)
- Qwen3-coder-480b: 38.70% (rank 2)
- MiniMax-2.1: 36.81% (rank 2)
- Gemini 3 Flash: 34.63% (rank 2)
This near-doubling of top scores from ~23% to ~42-46% within roughly two weeks suggests one of three causes: (a) rapid model improvement in code reasoning, (b) scaffold/agent-framework optimization, or (c) a change in evaluation methodology. The benchmark remains the frontier for measuring complex, multi-file, real-world software engineering tasks.
LMSYS Chatbot Arena / Text Arena (Feb 2026)
Current Leaderboard (as of Feb 26):
- Claude Opus 4.6 Thinking: ~1506 Elo (complex reasoning, multi-step logic)
- Gemini 2.5 Pro: ~1450 Elo
- GPT-5.2-high: ~1400 Elo
- GPT-5.3 “Vortex” & “Zephyr”: appeared Feb 25 as mystery models and are accumulating Arena votes, following the GPT-5 zenith/summit pattern (flagship + reasoning variants). Expected release: mid-March to early April 2026.
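These Elo figures come from pairwise human votes on blind model battles. Below is a minimal sketch of the classic online Elo update, for intuition only; the real Arena leaderboard fits a Bradley-Terry model over all battles at once, so treat this as illustrative.

```python
def expected(r_a, r_b):
    # Win probability of A over B under the Elo logistic model.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=4):
    # A small K factor keeps ratings stable once thousands of votes accrue.
    delta = k * ((1.0 if a_won else 0.0) - expected(r_a, r_b))
    return r_a + delta, r_b - delta

# Example: a ~1506-rated model beating a ~1450-rated one moves both
# ratings by under two points.
print(elo_update(1506, 1450, a_won=True))
```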
Category Leaders:
- GPT-5.2: Leading for general chat/daily assistant
- Claude Sonnet 4.6: #1 for writing tasks (preferred over Sonnet 4.5 in 70% of blind tests and over Opus 4.5 in 59% of coding tasks)
- Claude Opus 4.6: Leading agentic/coding tier (65.4% Terminal-Bench 2.0, 1606 Elo GDPval-AA)
- ByteDance Seed 2.0: 6th overall, 3rd in vision
- DeepSeek R1: Strong performer on the coding leaderboard alongside specialized Claude variants
- Gemini 3.1 Pro: Rolling into leaderboard with strong reasoning metrics
- Grok 4.1: Notable for creativity/unconstrained style
Additional Notable Benchmarks (Feb 2026)
- ARC-AGI-2: Gemini 3.1 Pro leads at 77.1% (2.5x improvement over Gemini 3 Pro’s 31.1%)
- Terminal-Bench 2.0: Claude Opus 4.6 at 65.4%; Gemini 3.1 Pro at 68.5%
- GDPval-AA (expert office work): Claude Opus 4.6 at 1606 Elo
- OSWorld (computer use): Claude Sonnet 4.6 jumped from 14.9% to 72.5%
Industry Milestones
- Anthropic acquires Vercept (Feb 25-26, 2026): Seattle AI startup specializing in desktop “computer use” technology; a strategic acquisition to advance Claude’s autonomous-agent capabilities for live app interaction.
- Claude Code: $2.5B ARR (as of Feb 2026)
- Claude Code Security: Launched Feb 20, 2026
- Claude Haiku 3: Deprecation announced, retiring April 19, 2026 (migrate to Haiku 4.5)
- Anthropic RSP v3: Third version of Responsible Scaling Policy released (Feb 2026)
- OpenAI Lockdown Mode: Enterprise security feature announced for ChatGPT (deterministic protections against prompt injection)
MiniMax M2.5 & M2.1 (Feb 12, 2026)
Key metrics:
- SWE-Bench Verified: 80.2% (M2.5), competitive with Claude/Gemini frontier
- Multi-SWE-Bench: 51.3% (first model to break 50% on this benchmark)
- BrowseComp: 76.3% (with context management)
- Terminal-Bench 2.0: Tested using Claude Code scaffolding with expanded sandbox specs
- VIBE-Pro: Internal benchmark, on par with Opus 4.5
- Office work (GDPval-MM): 59.0% average win rate vs mainstream models
- Speed: M2.5 completes SWE-Bench Verified 37% faster than M2.1, matching Claude Opus 4.6 (22.8 min avg)
Architecture:
- MoE model trained via RL in 200K+ real-world environments
- Trained on 10+ languages across full-stack projects (Web, Android, iOS, Windows)
- Covers entire development lifecycle: 0-to-1 system design → 1-to-10 development → 10-to-90 iteration → 90-to-100 testing
- Spec-writing tendency: actively decomposes projects like a software architect before coding
Cost efficiency (flagship feature):
- M2.5: $0.3/M input, $1.2/M output (50 TPS)
- M2.5-Lightning: $0.3/M input, $2.4/M output (100 TPS)
- 10-20x cheaper than Claude Opus 4.6, Gemini 3 Pro, GPT-5 on output tokens
- Example: running M2.5 continuously at 100 TPS for 1 hour ≈ $1; at 50 TPS for 1 hour ≈ $0.30 (reproduced in the sketch after this list)
- Four M2.5 instances running continuously for a year ≈ $10,000
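The hourly figures above follow from straightforward token arithmetic. A minimal sketch reproducing them, assuming sustained decoding at the quoted TPS and counting output tokens only (input-token costs would add a little on top, which is likely where the rounded $0.30 and $1 figures come from):

```python
# Reproduce MiniMax's cost examples from the quoted output prices.
def hourly_cost(tps, out_price_per_m):
    tokens_per_hour = tps * 3600           # sustained decode for one hour
    return tokens_per_hour / 1e6 * out_price_per_m

print(hourly_cost(50, 1.2))    # M2.5:           ~$0.22/hr (doc rounds to $0.30)
print(hourly_cost(100, 2.4))   # M2.5-Lightning: ~$0.86/hr (doc rounds to $1)
print(4 * 8760 * 0.30)         # four instances for a year at ~$0.30/hr: ~$10,512
```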
Framework generalization:
- Tested across multiple agent scaffolds (Droid, OpenCode, Claude Code)
- M2.5 on Droid: 79.7% > Opus 4.6 (78.9%)
- M2.5 on OpenCode: 76.1% > Opus 4.6 (75.9%)
Deployment at MiniMax:
- 30% of overall company tasks autonomously completed by M2.5
- 80% of newly committed code generated by M2.5
- Over 10,000 user-built “Experts” (domain-specific + Office Skills combos)
Source: MiniMax M2.5 announcement | MiniMax M2.1 announcement
GLM-5 Full Technical Details (arXiv Feb 2026)
Performance:
- Intelligence Index v4.0: Score of 50 (first open-weights model to hit 50, up from GLM-4.7’s 42)
- LMArena: #1 open model in both Text Arena and Code Arena, on par with Claude Opus 4.5 and Gemini 3 Pro
- Vending-Bench 2: #1 among open-source models ($4,432 final balance, approaches Opus 4.5)
- SWE-Bench Verified: 77.8%
- Chatbot Arena Elo: 1451 (highest open-source rating)
- Terminal-Bench 2.0, BrowseComp, MCP-Atlas, τ²-Bench: ~20% improvement over GLM-4.7
Architecture:
- 744B parameters (40B active), MoE with 256 experts
- DSA (DeepSeek Sparse Attention): dynamic content-aware token selection that cuts attention compute by 1.5-2× for long sequences (the selection idea is sketched after this list)
- Training: 28.5T tokens total (pre-training 27T + mid-training 1.5T)
- Context: Progressive extension across stages (32K → 128K → 200K)
- Multi-token Prediction: Parameter sharing across 3 MTP layers, longer acceptance length than DeepSeek-V3.2
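A minimal sketch of the content-aware token selection idea behind DSA, assuming a plain dot-product relevance score and single-head shapes for brevity; the real operator uses a learned indexer and fused kernels:

```python
import torch

def sparse_attention_select(q, k, v, keep=512):
    # q: (d,), k and v: (seq, d). Score every cached token against the
    # current query, keep only the top-`keep` most relevant, and attend
    # over that subset instead of the full sequence.
    scores = k @ q                                           # content-aware relevance
    idx = torch.topk(scores, min(keep, k.size(0))).indices   # deterministic top-k
    k_sel, v_sel = k[idx], v[idx]
    attn = torch.softmax(k_sel @ q / k.size(-1) ** 0.5, dim=0)
    return attn @ v_sel                                      # (d,) attention output

# Example: a 200K-token cache reduced to 512 attended tokens.
q, k, v = torch.randn(128), torch.randn(200_000, 128), torch.randn(200_000, 128)
out = sparse_attention_select(q, k, v)
```

Attending over 512 tokens instead of 200K is where the claimed 1.5-2× long-sequence savings come from; the gain is bounded because the indexer still scores the full cache.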
Training innovations:
- Asynchronous RL infrastructure: Fully decoupled generation from training, maximizes GPU utilization
- Sequential RL pipeline: Reasoning RL → Agentic RL → General RL with On-Policy Cross-Stage Distillation
- Agentic RL: 10K+ real-world SWE + terminal + multi-hop search environments, with token-level clipping for off-policy stability (sketched below)
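A minimal sketch of token-level clipping in the PPO style, under the assumption that “token-level” means applying the importance-ratio clip per generated token rather than per sequence; the paper's exact objective may differ:

```python
import torch

def token_clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2):
    # Per-token importance ratios between the current policy and the
    # (possibly stale) policy that generated the rollout. Clipping each
    # token's ratio bounds the update when async generation lags training.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()

# Example: three tokens from a slightly off-policy rollout.
loss = token_clipped_pg_loss(torch.tensor([-1.0, -0.5, -2.0]),
                             torch.tensor([-1.1, -0.4, -2.3]),
                             torch.tensor([0.5, 0.5, 0.5]))
```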
Full-stack Chinese GPU support:
- Deep optimization across 7 domestic chip platforms: Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, MetaX, Enflame
- DSA's deterministic top-k operator is critical for RL stability (plain torch.topk preferred over nondeterministic CUDA variants)
Source: GLM-5 arXiv paper
Video Intel Highlights
- Feb 19 · WorldofAI — “Google DROPS Gemini 3.1 Pro” (link): new ARC-AGI/GPQA/BrowseComp stats and confirmation that structured outputs have tightened compared to 3 Pro.
- Feb 22 · AICodeKing — “Gemini 3.1 Pro (Fixed with KingMode) + GLM-5: This SIMPLE TRICK makes Gemini 3.1 PRO A BEAST!” (link): demonstrates the KingMode refinement, with GLM-5 joining the session for agentic coding loops; hints that the new Gemini still benefits from shell scripting.
- Feb 17 · WorldofAI — “Claude Sonnet 4.6: The Best AI Coding Model Ever” (still the reference point for production agents despite the Gemini release).