Changelog

Chronological intelligence log of model releases, updates, and routing changes with source citations.

Model Intel Changelog

March 2026

2026-03-08 (PM - 5:34 PM): Sunday Evening Model Intel Update (quick check)

  • Claude Cowork launched March 7: Anthropic released Claude Cowork, a new desktop agent feature bringing Claude Code-style capabilities to general office work. Available in the Claude desktop app (paid plans, research preview). Key capabilities: organize files/folders, analyze spreadsheets, generate formatted reports, compile research from multiple documents, schedule automated tasks (daily/weekly/monthly). Extends Claude beyond chat to act as an autonomous digital coworker. Tom’s Guide published a hands-on review on March 7 confirming file organization, document compilation, and task scheduling functionality. Positions Claude to compete with Gemini Workspace integrations.
  • Claude Code 2.1.70 released: Comprehensive stability update (~2 days ago) fixing API 400 errors with third-party gateways, tool search issues, and adding VS Code session visuals + native MCP server management dialog.
  • SWE-Bench Verified leaderboard confirmed stable: March 2026 scores match existing documentation (Opus 4.5 at 80.9%, Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, Sonnet 4.6 at 79.6%). No changes to routing recommendations.
  • Gemini 3 Deep Think upgrade noted: Wikipedia confirms major upgrade in February 2026 targeting science/research/engineering applications (already in general knowledge but not explicitly documented in BENCHMARKS.md).
  • Gemini 3.1 Pro thinking levels detail: Three-tier system (Low/Medium/High) for balancing speed vs reasoning depth.
  • Frontier landscape stable: No new model releases from major providers. GPT-5.3 “Vortex” and “Zephyr” continue accumulating Arena Elo; expected mid-March to early April release on track.
  • Search limitations: Hit Brave API rate limits (429) after 3 successful queries. Relied on successful Anthropic/Google/SWE-bench searches plus web_fetch for detailed Claude Cowork coverage.
  • Video intel: No new WorldofAI or AICodeKing videos in last 24 hours based on search results.
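The three-tier thinking system noted above could be wired into a dispatcher roughly like this. A minimal sketch only: the function name, the 0-1 difficulty score, and the thresholds are illustrative assumptions, not Gemini's documented API.

```python
# Hypothetical sketch: map an estimated task difficulty (0-1) onto
# Gemini 3.1 Pro's three thinking levels (Low/Medium/High) to trade
# latency for reasoning depth. Thresholds are illustrative.

def pick_thinking_level(task: str, est_difficulty: float) -> str:
    """Return a thinking level for a task, given a 0-1 difficulty estimate."""
    if est_difficulty < 0.3:
        return "low"      # latency-sensitive: extraction, reformatting, chat
    if est_difficulty < 0.7:
        return "medium"   # routine coding, summarization
    return "high"         # multi-step reasoning, hard debugging

assert pick_thinking_level("rename variables", 0.1) == "low"
assert pick_thinking_level("prove loop invariant", 0.9) == "high"
```

In practice the difficulty estimate would come from the task classifier that already drives routing; the point is that the speed/depth trade-off becomes one tunable knob.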

2026-03-08 (AM - 5:43 AM): Sunday Morning Model Intel Update

  • GPT-5.4 comprehensive analysis: Deep dive into OpenAI’s March 5 release confirms first genuinely competitive alternative to Claude for agentic coding. Key features: 1M context window, native computer use (75% OSWorld-Verified vs 72.4% human performance), tool search (47% token reduction on MCP Atlas), 57.7% SWE-Bench Pro, 83% GDPval (matches/exceeds human experts). Independent testing (Every team, Cursor, Windsurf, Augment, Databricks) shows developers switching or going 50/50 Claude/GPT-5.4. Pricing: $2.50/$15 per M tokens (vs Claude Opus 4.6 at similar range). Caveats: prompt leaking in UI elements, occasional fabrication/scope creep, design aesthetics behind Gemini 3.1 Pro and Claude Opus 4.6 (GPT-5.4 tends toward gradients/shadows vs Claude’s minimal clean style). Status: Available across ChatGPT, API, and Codex. Added to models.json, MODEL_GUIDE.md, ROUTING_RULES.md.
  • Gemini 3.1 Flash-Lite detailed specs: Google’s March 3 preview launch analyzed. Pricing: $0.25/$1.50 per M tokens (10x cheaper than GPT-5.4, 6x cheaper than Gemini 3 Flash). Performance: 2.5X faster Time to First Answer vs 2.5 Flash, 45% output speed increase, 1432 Arena.ai Elo, 86.9% GPQA Diamond, 76.8% MMMU Pro (outperforms prior-gen Gemini models at fraction of cost). Use cases: High-volume translation, content moderation, UI/simulation generation, tasks requiring adaptive thinking levels at scale. Available via Google AI Studio and Vertex AI preview. Recommendation: Consider for ultra-high-volume low-cost workloads; significantly faster and cheaper than competitors in its tier. Added detailed specs to models.json and MODEL_GUIDE.md.
  • Claude Opus 4.6 real-world validation expanded: Beyond Donald Knuth’s March 3 endorsement for automatic deduction, new finding surfaced: Claude Opus 4.6 identified 22 Firefox vulnerabilities (14 high severity) helping Mozilla patch Firefox 148. Concrete security research validation beyond benchmarks. Reinforces routing recommendation for security-review tasks.
  • Routing updates: Added agentic-coding and computer-use task classes to ROUTING_RULES.md. GPT-5.4 now recommended for agentic workflows (native computer use, tool search), Claude Opus 4.6 for security review and complex problem-solving, Gemini 3.1 Flash-Lite for ultra-high-volume low-latency tasks. Long-context split: GPT-5.4 (1M) or Claude Opus 4.6 (1M, 76% MRCR v2) with caveat that GPT-5.4 needle-in-haystack drops from 97% at 16-32K to 36% at 512K-1M (compact regularly).
  • Frontier landscape stable: No new releases beyond previously tracked GPT-5.4 and Gemini 3.1 Flash-Lite. GPT-5.3 “Vortex” and “Zephyr” continue accumulating Arena Elo votes; expected release mid-March to early April remains on track.
  • Search coverage: Scanned Anthropic, OpenAI, Google Gemini, Z.AI/GLM, LMSYS Arena, SWE-bench, WorldofAI, AICodeKing. Hit Brave API rate limits (429) after 3 successful queries; relied on web_fetch for detailed GPT-5.4 and Gemini 3.1 Flash-Lite specs from The Neuron and Google Blog.
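The routing updates above can be summarized as a dispatch table plus one long-context override. This is a sketch of the logic, not the actual ROUTING_RULES.md: the model IDs, task-class names, and the 512K cutoff (taken from the needle-in-haystack caveat) are illustrative.

```python
# Illustrative sketch of the routing rules described in this entry.
# Model IDs and the dispatch table are assumptions for illustration.

ROUTES = {
    "agentic-coding":    "gpt-5.4",               # native computer use + tool search
    "computer-use":      "gpt-5.4",
    "security-review":   "claude-opus-4.6",       # Firefox vuln findings
    "complex-reasoning": "claude-opus-4.6",
    "high-volume":       "gemini-3.1-flash-lite", # $0.25/$1.50 per M tokens
}

def route(task_class: str, context_tokens: int = 0) -> str:
    # Long-context split: both flagships offer 1M context, but GPT-5.4's
    # needle-in-haystack recall drops sharply past ~512K, so prefer Opus
    # there (or compact the context regularly).
    if context_tokens > 512_000:
        return "claude-opus-4.6"
    return ROUTES.get(task_class, "claude-sonnet-4.6")  # default workhorse

assert route("security-review") == "claude-opus-4.6"
assert route("agentic-coding", context_tokens=800_000) == "claude-opus-4.6"
```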

2026-03-07 (PM - 5:47 PM): Saturday Evening Model Intel Update (quick check)

  • Claude Opus 4.6 real-world validation: Found 22 Firefox vulnerabilities (14 high severity) helping Mozilla patch Firefox 148. Concrete security research validation beyond benchmarks.
  • Opus 4.6 long-context performance: 76% mean match ratio on MRCR v2’s 8-needle 1M variant (vs Sonnet 4.5 at 18.5%). Radically improved due to internal retrieval enhancements. METR task-completion: 50% horizon 14h 30m, 80% horizon 1h 3m.
  • GLM-5 status: Compute shortages + restricted signups continue. Leads tau2-Bench for agentic tool use. GPT-5.3 Codex Spark now on Cerebras Inference Cloud.
  • Gemini 3 confirmation: Record-breaking 1501 LMSYS Arena score confirmed.
  • Landscape stable: No new releases. Routing unchanged.

2026-03-07 (AM - 5:33 AM): Saturday Early Morning Model Intel Update

  • Landscape stable: No major new releases overnight. All frontier models (Sonnet 4.6, Opus 4.6, Gemini 3.1 Pro, GPT-5.4 family) maintaining positions.
  • Anthropic Academy: Launched 13 free AI courses covering Claude 101 through MCP development (March 6-7). Educational initiative, not a model release.
  • Vertex AI confirmation: Google/Microsoft confirmed Claude models remain available via Vertex AI outside defense projects (March 6 CNBC report).
  • Search coverage: Scanned Anthropic, OpenAI, Google Gemini, Z.AI/GLM updates. Hit rate limits but covered major providers.
  • Routing status: No changes. Sonnet 4.6 for coding/writing, Opus 4.6 Thinking for complex reasoning, Gemini 3.1 Pro for ARC-AGI work, GPT-5.4 family competitive for coding.

2026-03-06 (PM - 5:35 PM): Friday Evening Model Intel Update (quick check)

  • Landscape stable: No major new releases. Frontier models holding position.
  • Claude Opus 4.6 validation: Donald Knuth (March 3) stated Claude Opus 4.6 solved an open problem he’d been pursuing for weeks, calling it “a dramatic advance in automatic deduction and creative problem solving.” High-profile endorsement of reasoning capability.
  • Search rate limits: Hit Brave API rate limits during discovery; relied on successful queries for Anthropic/Google updates.
  • Routing status: No changes. Evening check confirms morning recommendations remain valid.

2026-03-06 (AM - 5:41 AM): Friday Morning Model Intel Update

  • GPT-5.4 rollout complete: GPT-5.4, GPT-5.4 Thinking, and GPT-5.4 Pro now fully available in ChatGPT, API, and Codex (confirmed March 5 afternoon). Multiple independent comparisons vs Claude Opus 4.6 for coding show competitive performance. GPT-5.4 Thinking matches/outperforms human experts 83% of the time on real-world job evaluations.
  • Anthropic Claude Code cleanup: Removed Opus 4 and 4.1 from Claude Code API (March 5); users with these models pinned are automatically migrated to Opus 4.6. Streamlining model lineup.
  • Google Gemini 3.1 Flash-Lite: Preview launched March 3, a new speed-optimized variant of Gemini 3.1 Flash for low-latency tasks. Available via Vertex AI preview.
  • GPT-5.3 Instant tuning: March 3 update focused on reducing “cringe” responses and improving naturalness (already noted in prior update).
  • Reviewer consensus: Multiple sources (Tom’s Guide, Bind AI, CNET, Digital Applied) comparing GPT-5.4 vs Claude Opus 4.6 for coding. General finding: Claude Opus 4.6 holds highest SWE-Bench Verified score for production coding; GPT-5.4 strong on agentic workflows and computer use. Both competitive at frontier level.
  • Routing status: No changes recommended. Sonnet 4.6 for coding/writing, Opus 4.6 Thinking for complex reasoning, Gemini 3.1 Pro for ARC-AGI work remain top picks.

2026-03-05 (PM - 6:26 PM): Thursday Evening Model Intel Update

  • OpenAI GPT-5.4 Released: OpenAI released GPT-5.4 today, available as a reasoning model (GPT-5.4 Thinking) and optimized for high performance (GPT-5.4 Pro). GPT-5.3 Instant is also now available for all users.
  • Anthropic Claude Updates: Claude Code Security was introduced to review codebases and identify vulnerabilities. A “Claude remote-control” update fixed crash issues, and the memory feature now lets users migrate seamlessly.
  • Google Gemini: Gemini 3.1 Pro dominating ARC-AGI-2 and LMSYS arena. Nano Banana 2 rapid image generation noted for Gemini.
  • Z.AI / GLM-5: Gradual rollout continues, still tracking highly competitive metrics for agentic workflows.

2026-03-03 (AM - 5:30 AM): Tuesday Early Morning Model Intel Update

  • No major new model releases: Stable day across all major providers. All frontier models (Sonnet 4.6, Opus 4.6, Gemini 3.1 Pro, GPT-5.3 arena entries) holding position.
  • Claude Sonnet 5 speculation: Logs show “claude-sonnet-5-20260219” appearing in Vertex AI (source: Medium article, Feb 26). Likely release window: February or March 2026. Not yet publicly available.
  • GLM-5 expanding availability: Now accessible via OpenRouter (with free trial limits) and Z.AI API (pay-as-you-go with starting quota). Gradual rollout continues since Feb 11 launch.
  • SWE-Bench Pro update confirmed: Leaderboard shows sustained improvement (42-46% for top models) vs earlier ~23% reports. Suggests either rapid model gains, scaffold optimization, or methodology differences.
  • LMArena video support: Added video evaluation capability in January 2026 (noted in Wikipedia).
  • GPT-5.3 watch: “Vortex” and “Zephyr” still accumulating Arena Elo votes. Expected release mid-March to early April remains on track.
  • Routing status: No changes. Sonnet 4.6 for coding/writing, Opus 4.6 Thinking for complex reasoning, Gemini 3.1 Pro for ARC-AGI work.

2026-03-02 (PM - Evening): Monday Evening Model Intel Update

  • No new model releases: Stable day across all major providers (Anthropic, OpenAI, Google, Z.AI).
  • Claude service status: Brief outage this morning (March 2), now fully operational per Anthropic status page.
  • SWE-Bench Pro leaderboard update: Scale AI leaderboard shows frontier models significantly outperforming previous reports. Top 3: Claude Opus 4.5 (45.89%), Claude Sonnet 4.5 (43.60%), Gemini 3 Pro Preview (43.30%). Previous narrative suggested ~23% ceiling; current scores suggest continued improvement on this harder benchmark.
  • Vertex AI limitation noted: Code execution tool available on Claude Sonnet 4.6/Opus 4.6 via direct API and Azure, but NOT available on Vertex AI (source: Groundy article, Feb 28).
  • Infrastructure news: Nvidia announced new AI processing chip (WSJ report, Feb 28) and $4B investment in photonics companies Lumentum + Coherent (today) to enhance data center capabilities.
  • Routing status: No changes. Sonnet 4.6 for coding/writing, Opus 4.6 Thinking for complex reasoning, Gemini 3.1 Pro for ARC-AGI work.

2026-03-02 (AM): Monday Morning Model Intel Update

  • No major new releases: Model landscape stable over past 7 days. All tracked frontier models holding position.
  • Anthropic Claude Cowork: Announced for 2026, expanding Claude beyond development work to knowledge work. VentureBeat: “In 2025 Claude transformed how developers work, and in 2026 it will do the same for knowledge work.”
  • Google Project Genie: World model (released Jan 2026) generates interactive 2D game environments from images + text. Training ground for AI agents in simulated environments.
  • Moonshot Kimi 2.5: Late January release; a Chinese model focused on coding and agentic task completion (privately held, not yet accessible).
  • Market signal: Claude app #1 on App Store (Feb 28-Mar 1) amid OpenAI Pentagon controversy driving user migration.
  • Status: Routing recommendations unchanged. Sonnet 4.6 for coding, Opus 4.6 Thinking for complex reasoning, Gemini 3.1 Pro for ARC-style tasks.

2026-03-01 (PM - Evening): Sunday Evening Model Intel Update

  • MiniMax M2.5 details surfaced: Feb 12 release (80.2% SWE-Bench Verified, 51.3% Multi-SWE-Bench, 76.3% BrowseComp). Flagship feature: extreme cost efficiency ($0.3-$2.4/M tokens, 10-20x cheaper than Opus/Gemini/GPT-5). 100 TPS native throughput (2x faster than other frontier models). Trained on 200K+ real-world environments. Added to models.json and BENCHMARKS.md with full technical details.
  • GLM-5 arXiv paper published: Full technical details now available. 744B params (40B active), DSA architecture, 28.5T token training, score of 50 on Intelligence Index v4.0 (first open-weights to hit 50). #1 open model in both Text Arena and Code Arena. Comprehensive architecture/training innovations documented.
  • Claude App Store dominance: Claude hit #1 on App Store (Feb 28-March 1) driven by OpenAI Pentagon controversy and user migration from ChatGPT. Market signal of brand momentum shift.
  • Anthropic funding: $30B Series G at $380B valuation (Feb 12, previously noted but reconfirmed).
  • Model landscape: Stable. All frontier models (Sonnet 4.6, Opus 4.6, Gemini 3.1 Pro) maintaining positions. GPT-5.3 mystery models continue accumulating Arena votes.
  • Recommendation: No routing changes. MiniMax M2.5 is a compelling low-cost option for high-volume agentic workloads if access becomes available.
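The cost-efficiency claim above is easy to sanity-check from the per-million-token prices cited in this log. A back-of-envelope calculator, using this log's quoted input/output prices as snapshots (MiniMax's "$0.3-$2.4/M" is read here as input/output; that reading is an assumption):

```python
# Back-of-envelope cost comparison from per-million-token prices cited
# in this changelog. Prices change; treat these as dated snapshots.

PRICES = {                        # (input $/M, output $/M)
    "minimax-m2.5":          (0.30, 2.40),
    "gemini-3.1-flash-lite": (0.25, 1.50),
    "gpt-5.4":               (2.50, 15.00),
}

def job_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of a single job for the given model."""
    p_in, p_out = PRICES[model]
    return (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# A 1M-token-in / 200K-token-out agentic run:
assert job_cost("gpt-5.4", 1_000_000, 200_000) == 5.50
assert round(job_cost("minimax-m2.5", 1_000_000, 200_000), 2) == 0.78
```

At these quoted rates the same run is roughly 7x cheaper on MiniMax M2.5, consistent with the 10-20x claim once output-heavy workloads are factored in.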

February 2026

2026-02-28 (PM): Evening Model Intel Update

  • No major new releases: Model landscape remains stable. All tracked models (Sonnet 4.6, Opus 4.6, Gemini 3.1 Pro, GPT-5.3 arena entries) holding position.
  • Current leaders: Sonnet 4.6 for coding (79.6% SWE-bench), Gemini 3.1 Pro for reasoning (77.1% ARC-AGI-2), Opus 4.6 Thinking for problem-solving (#1 Text Arena).
  • Arena watch: GPT-5.3 “Vortex” and “Zephyr” continue accumulating Elo votes; expected release mid-March to early April.
  • Status: No routing changes recommended. Weekend stability confirmed.

2026-02-27 (PM): Evening Model Intel Update

  • No major new releases: Model landscape stable since morning check.
  • Overall LLM Rankings (Feb 26): Added consolidated benchmark data from Awesome Agents leaderboard to BENCHMARKS.md. Key findings: Gemini 3 Pro leads MMLU-Pro at 89.8%, DeepSeek V3.2 leads SWE-Bench Verified at 77.8%, Claude Opus 4.6 strong across reasoning + coding at competitive pricing.
  • Trend confirmation: Gap narrowing between frontier models (6.9pp on MMLU-Pro vs 25pp in 2024), coding becoming the key differentiator, open-weights (Llama 4, DeepSeek) now competitive with proprietary models.
  • Rate limits: Hit Brave Search API rate limits during discovery phase; relied on web fetch for detailed leaderboard data.

2026-02-27 (AM): Daily Model Intel Update

  • Anthropic acquires Vercept: Feb 25-26 acquisition of Seattle AI startup specializing in desktop “computer use” technology. Vercept’s team and tech folding into Claude to advance autonomous agent capabilities for live app interaction. Strategic move toward full computer interaction agents.
  • SWE-Bench Pro: New harder private benchmark subset launched. Top models (GPT-5, Opus 4.1) score only ~23% vs 70%+ on Verified, indicating Verified is near-saturated and Pro is the new frontier for code reasoning.
  • GLM-5 Arena performance: Open-source leaderboard reports 1451 Chatbot Arena Elo (highest open-source rating), 77.8% SWE-bench Verified. Strong showing for open-weights agent/coding model.
  • No major new model releases this week beyond previously tracked (Sonnet 4.6, Gemini 3.1 Pro, GPT-5.3 arena entries).
  • Recommendation stability: Sonnet 4.6 for coding/writing, Opus 4.6 Thinking for problem-solving, Gemini 3.1 Pro for complex reasoning remain top picks.

2026-02-26 (PM): Evening Model Intel Update

  • GPT-5.3 on Arena: Two mystery models (“vortex” and “zephyr”) appeared on LMSYS Chatbot Arena Feb 25, following the same dual-codename pattern as GPT-5’s “zenith/summit”. Likely flagship + reasoning variants. Expected release: mid-March to early April 2026. Currently accumulating Elo votes.
  • Anthropic Remote Control: Mobile version of Claude Code launched Feb 24-25, extending Claude’s agentic capabilities to mobile devices.
  • Claude service outage: Feb 25 outage resolved; no major model changes.
  • Arena leaderboard: Claude Opus 4.6 Thinking remains #1 at ~1506 Elo. Gemini 2.5 Pro #2 (~1450), GPT-5.2-high #3 (~1400). GPT-5.3 Arena entries (vortex/zephyr) are the challenger.
  • DeepSeek R1: Noted as strong performer in coding leaderboard alongside specialized Claude variants.

2026-02-25: Model Intel Update

  • Claude performance details: Sonnet 4.6 preferred over Sonnet 4.5 in 70% of blind tests; preferred over Opus 4.5 in 59% of coding tasks. Opus 4.6 Thinking ranks #1 on Text Arena (Feb 2026) for problem-solving.
  • Claude milestones: Claude Code hit $2.5B ARR; Claude Code Security launched Feb 20; Haiku 3 retiring April 19, 2026 (migrate to Haiku 4.5).
  • Gemini 3 Deep Think: Announced Feb 12 as specialized reasoning mode for science/research/engineering; early access via Gemini API.
  • OpenAI Lockdown Mode: New enterprise security feature for ChatGPT with deterministic protections against prompt injection.
  • Anthropic RSP v3: Third version of Responsible Scaling Policy released with separated industry recommendations vs unilateral commitments.
  • Benchmark updates: Opus 4.6 at 65.4% Terminal-Bench 2.0, 1606 Elo GDPval-AA; Sonnet 4.6 jumped to 72.5% OSWorld; Gemini 3.1 Pro at 77.1% ARC-AGI-2 (2.5x improvement), 68.5% Terminal-Bench 2.0.
  • Market positioning: GPT-5.2 leading general chat, Sonnet 4.6 leading writing, Opus 4.6 leading coding/agentic, Gemini 3.1 Pro leading reasoning benchmarks, Grok 4.1 notable for creativity.

2026-02-23: Model Intel Update

  • Gemini 3.1 Pro’s Feb 19 launch (77.1% ARC-AGI-2, 94.3% GPQA, 80.6% SWE-Bench, 85.9% BrowseComp, 69.2% MCP) is now documented in the guide, routing rules, and models.json, and preview access via Antigravity/Vertex/Copilot is highlighted for reasoning work.
  • Added openai-codex/gpt-5.3-codex, zai/glm-5, and google-antigravity/gemini-3-1-pro to models.json and flagged a complex-reasoning fallback for 3.1 Pro; kept Sonnet 4.6/Opus in the long-context/coding fallbacks.
  • Benchmarks were refreshed with WorldofAI’s reasoning haul plus GLM-5 (Reuters) context and GPT-5.3/Codex-Spark updates, and the Video Intel page now calls out the Feb 19 WorldofAI article plus AICodeKing’s Feb 22 KingMode video.
  • Logged OpenAI’s Feb 4 note about restoring GPT-5.2 Thinking’s Extended level and the ongoing rollout of GPT-5.3-Codex/Codex-Spark previews.
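For reference, the models.json additions above might look something like the following. The schema (field names, nesting) is a hypothetical sketch, not the file's actual format; the benchmark figures are the ones cited in this entry.

```python
# Hypothetical models.json entry for the Feb 19 Gemini 3.1 Pro addition.
# Field names are illustrative; scores are those cited in this changelog.
import json

entry = {
    "id": "google-antigravity/gemini-3-1-pro",
    "released": "2026-02-19",
    "access": ["antigravity", "vertex", "copilot"],
    "benchmarks": {
        "arc_agi_2": 77.1,
        "gpqa": 94.3,
        "swe_bench_verified": 80.6,
        "browsecomp": 85.9,
        "mcp": 69.2,
    },
    "routing": {"complex_reasoning_fallback": True},
}

# Round-trips cleanly through JSON, so it can live in models.json as-is.
assert json.loads(json.dumps(entry)) == entry
```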

2026-02-22: Model Intel Update

  • Google released Gemini 3.1 Pro (smartest model for complex reasoning).
  • OpenAI released GPT-5.3 Codex Spark (Cerebras-served coding model, 1000 tps).
  • Anthropic’s Claude Sonnet 4.6 is now fully available on Vertex AI (Day 0 support).
  • ByteDance Seed 2.0 entered LMSYS top 10 (6th overall, 3rd vision).

2026-02-19 (PM): Evening check - no major new releases

  • Sonnet 4.6 continues as top coding model. OpenAI retired legacy models (GPT-4o, GPT-4.1) but no impact on current routing.

2026-02-19 (AM): Added Claude Haiku 4.5 and Claude Sonnet 4.6 release notes

  • Claude Haiku 4.5 (fast/efficient) and Claude Sonnet 4.6 (new default) released.

2026-02-19: Noted data residency controls (inference_geo) for Anthropic

2026-02-17: Gemini Pro (High) reported issues in Vertex Model Garden

This page was last updated on March 9, 2026.