Model Guide (living)
Defaults (today)
- Coding / refactors / PRs: openai-codex/gpt-5.2
- Fast drafts + summaries: google-antigravity/gemini-3-flash
- Lightweight deep research: openai/o4-mini-deep-research
- Vision (image understanding): zai/glm-4.6v
- Complex/multi-step reasoning (ARC-style, agent coordination): google-antigravity/gemini-3-1-pro
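The defaults above can be captured as a small routing table; a minimal sketch (the task labels and the `pick_model` helper are our own shorthand, not part of the gateway config — model ids match the gateway names):

```python
# Illustrative routing table for the defaults listed above.
# Task labels are shorthand; model ids are the gateway identifiers.
DEFAULTS = {
    "coding": "openai-codex/gpt-5.2",
    "fast_drafts": "google-antigravity/gemini-3-flash",
    "deep_research": "openai/o4-mini-deep-research",
    "vision": "zai/glm-4.6v",
    "complex_reasoning": "google-antigravity/gemini-3-1-pro",
}

def pick_model(task: str) -> str:
    """Return the default model for a task, falling back to the coding default."""
    return DEFAULTS.get(task, DEFAULTS["coding"])

print(pick_model("vision"))        # zai/glm-4.6v
print(pick_model("unknown_task"))  # openai-codex/gpt-5.2
```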
New Releases (March 2026)
| Model | Released | Key Features | Status |
|---|---|---|---|
| GPT-5.4 | Mar 5 | Major consolidation: Codex features merged into general-purpose model. 1M context window, native computer use (75% OSWorld-Verified vs 72.4% human), tool search (47% token reduction), 57.7% SWE-Bench Pro, 83% GDPval. $2.50/$15 per M tokens. Competitive with Claude Opus 4.6 for agentic coding. | ⚡ Available in ChatGPT, API, Codex |
| GPT-5.4 Thinking | Mar 5 | Reasoning variant. 83% GDPval (matches/outperforms human experts on real-world job tasks), 73.3% ARC-AGI-2, 47.6% FrontierMath Tier 1-3. Default effort is 'none'; set to medium/high/xhigh for deeper thinking. | ⚡ Available |
| GPT-5.4 Pro | Mar 5 | Premium high-performance variant. $30/$180 per M tokens. 83.3% ARC-AGI-2 Semi Private at $16.41/task. | ⚡ Available (Pro/Enterprise) |
| Gemini 3.1 Flash-Lite | Mar 3 | Fastest and most cost-efficient Gemini 3 series. $0.25/$1.50 per M tokens (vs $2.50/$15 for GPT-5.4). 2.5X faster TTFA, 45% faster output vs 2.5 Flash. 1432 Arena Elo, 86.9% GPQA Diamond, 76.8% MMMU Pro. Ideal for high-volume translation, content moderation, UI generation. | 🧪 Preview (AI Studio + Vertex AI) |
| GPT-5.3 Instant | Mar 5 | Available to all users in ChatGPT and API as gpt-5.3-chat-latest. | ⚡ Available |
| GPT-5.3 “Vortex” & “Zephyr” | Feb 25 | Two mystery models on LMSYS Arena. Vortex (flagship), Zephyr (reasoning). Following GPT-5 zenith/summit pattern. Expected release: mid-March to early April. | 🧪 Arena testing (accumulating Elo) |
| Gemini 3.1 Pro | Feb 19 | ARC-AGI-2 77.1% verified, 94.3% GPQA Diamond, 80.6% SWE-Bench Verified, 85.9% BrowseComp, 69.2% MCP Atlas; tighter structured output and agent coordination. | ⚡ Rolling preview via google-antigravity / Vertex / Copilot |
| Claude Sonnet 4.6 | Feb 17 | 1M context, 80.8% SWE-bench Verified, near-Opus reasoning at ~$3/$15M tokens. Preferred over Sonnet 4.5 in 70% blind tests. | ⚡ Available via google-antigravity |
| Qwen 3.5 | Feb 16 | 397B MoE (17B active), native multimodal stack; claimed to outscore Opus 4.5/Gemini 3 on open benchmarks. | 🔜 Open weights (hardware heavy) |
| MiniMax M2.5 & M2.1 | Feb 12 | 80.2% SWE-Bench Verified, extreme cost efficiency ($0.3-$2.4/M tokens, 10-20x cheaper than Opus/Gemini/GPT-5). 100 TPS native (2x faster than other frontier models). Trained on 200K+ real-world environments. Compelling for high-volume agentic workloads. | 🔜 Not yet in gateway (Chinese model, MiniMax API) |
| Gemini 3 Deep Think | Feb 12 | Specialized reasoning mode for science, research, engineering. Purpose-built for high-stakes technical problems. | ⚡ Early access via Gemini API |
| GLM-5 | Feb 11 | 744B open-weights (40B active); DSA architecture, score of 50 on Intelligence Index v4.0 (first open-weights to hit 50). #1 open model in Text Arena and Code Arena, on par with Opus 4.5/Gemini 3 Pro. | 🔜 Not yet in gateway |
| GPT-5.3-Codex | Feb 5 | Codex + GPT-5 stack yields a general-purpose coding agent; ~25% faster, shipping to GitHub Copilot preview. | ⚡ Preview via Copilot & Codex endpoints |
| Claude Opus 4.6 | Feb 5 | #1 on Text Arena (Feb 2026), 65.4% Terminal-Bench 2.0, 1606 Elo GDPval-AA. Thinking mode for complex multi-step logic. | ⚡ Available |
| GPT-5.3-Codex-Spark | Feb 12 | Cerebras-tuned variant for high-throughput inference inside ChatGPT Pro Spark. | ⚡ ChatGPT Pro preview |
| GPT-5.2 Thinking (Extended) | Feb 4 | Extended thinking level restored to its January setting after the inadvertent reduction in Jan. | ⚡ Applies to GPT-5.2 Thinking sessions |
Recommendation: Keep claude-opus-4-6 in the fallbacks for high-stakes long-context work, and add google-antigravity/gemini-3-1-pro to the reasoning fallback chain once preview tokens stay stable. Claude Opus 4.6 Thinking is now #1 on Text Arena for problem-solving tasks.
New This Week (Mar 8):
- Claude Cowork launched: March 7 release brings Claude Code-style agent capabilities to desktop app for general office work. Paid plans (research preview). Organizes files, analyzes spreadsheets, generates reports, compiles research, schedules automated tasks. Extends Claude beyond chat to autonomous digital coworker role. Tom’s Guide hands-on confirms functionality. Positions Claude to compete with Gemini Workspace.
- GPT-5.4 analysis complete: Comprehensive independent testing confirms GPT-5.4 as first OpenAI model competitive with Claude for agentic coding. Native computer use (75% OSWorld vs 72.4% human), 1M context, tool search. Multiple developers (Every team, Cursor, Windsurf) report switching or going 50/50 Claude/GPT-5.4. Caveats: prompt leaking, occasional over-eagerness, design aesthetics behind Gemini 3.1/Opus 4.6.
- Gemini 3.1 Flash-Lite deep dive: $0.25/$1.50 per M tokens (10x cheaper than GPT-5.4). 2.5X faster TTFA, 45% output speed increase vs 2.5 Flash. Beats prior-gen Gemini models on GPQA/MMMU. Thinking levels standard in AI Studio/Vertex. Ideal for high-volume translation, content moderation, UI/simulation generation.
- Claude Opus 4.6 real-world validation: Found 22 Firefox vulnerabilities (14 high severity) helping Mozilla patch Firefox 148. Concrete security research beyond benchmarks.
- Claude Code 2.1.70: Comprehensive stability release (Mar 6) fixing API 400s with third-party gateways, tool search issues. Added VS Code session visuals + native MCP management.
- Frontier landscape: GPT-5.4 + Claude Opus 4.6 now tier-1 for agentic coding. Gemini 3.1 Flash-Lite emerging as ultra-cheap fast model. Sonnet 4.6 still preferred for writing/structure.
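Using the list prices quoted above ($2.50/$15 per M tokens for GPT-5.4 vs $0.25/$1.50 for Gemini 3.1 Flash-Lite), the "10x cheaper" claim is easy to sanity-check; a rough sketch (the token counts are made-up example workloads, not measurements):

```python
def request_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Dollar cost of one request; prices are $ per million tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Example workload: 20k input tokens, 2k output tokens (illustrative).
gpt54 = request_cost(20_000, 2_000, 2.50, 15.00)
flash_lite = request_cost(20_000, 2_000, 0.25, 1.50)

print(f"GPT-5.4:    ${gpt54:.4f}")          # $0.0800
print(f"Flash-Lite: ${flash_lite:.4f}")     # $0.0080
print(f"ratio: {gpt54 / flash_lite:.0f}x")  # 10x
```

The ratio holds at 10x for any input/output mix because both prices differ by the same factor.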
Previous Week Highlights (Mar 3):
- Stable model landscape: No major new releases. All frontier models (Sonnet 4.6, Opus 4.6, Gemini 3.1 Pro) holding position.
- Claude Sonnet 5 speculation: Logs show “claude-sonnet-5-20260219” appearing in Vertex AI (Medium source, Feb 26). Likely release window: February or March 2026. Not yet publicly available; unconfirmed but follows Anthropic’s naming pattern.
- GLM-5 expanding availability: Now accessible via OpenRouter (free trial with limits) and Z.AI API (pay-as-you-go). Gradual rollout continues since Feb 11 launch.
- LMArena video support: Added video evaluation capability in January 2026.
- GPT-5.3 Arena watch: “Vortex” and “Zephyr” continue accumulating Elo votes. Expected release mid-March to early April remains on track.
Previous Week Highlights (Mar 2):
- Anthropic Claude Cowork: Announced for 2026 release - expanding Claude beyond programming to general knowledge work. “In 2025 Claude transformed how developers work, and in 2026 it will do the same for knowledge work.”
- Google Project Genie: World model released January 2026 - generates interactive 2D game environments from single images + text descriptions. Training ground for AI agents to learn real-world tasks in simulation.
- Moonshot Kimi 2.5: Released late January - Chinese startup’s model focused on coding and agentic task completion (still privately held, not yet in gateway).
- Market momentum: Claude app hit #1 on App Store (Feb 28-Mar 1) driven by user migration from ChatGPT amid OpenAI Pentagon controversy.
Previous Week Highlights:
- GPT-5.3 Arena Testing: Two mystery models (“vortex” and “zephyr”) appeared on LMSYS Arena Feb 25. Following GPT-5 launch pattern (zenith/summit), these likely represent GPT-5.3 flagship and reasoning variants. Expected release: mid-March to early April 2026.
- Anthropic Remote Control: Mobile version of Claude Code launched Feb 24-25.
- Claude Haiku 3 deprecation notice: Haiku 3 retiring April 19, 2026. Migrate to Haiku 4.5.
- Claude Code milestones: Claude Code Security launched Feb 20; Claude Code hit $2.5B ARR.
- OpenAI security: Lockdown Mode announced for ChatGPT Enterprise (deterministic protections against prompt injection).
Models Available On This Host (from gateway config)
Primary:
openai-codex/gpt-5.2 (current gateway primary)
Configured fallbacks (selected highlights):
- OpenAI Codex provider:
  openai-codex/gpt-5.2-codex, openai-codex/gpt-5.1, openai-codex/gpt-5.1-codex-mini, openai-codex/gpt-5.1-codex-max
- OpenAI (non-codex ids referenced in fallbacks):
  openai/o4-mini-deep-research, openai/gpt-5.2, openai/gpt-5.2-pro, openai/gpt-5.2-codex, openai/gpt-5.2-chat-latest
- ZAI:
  zai/glm-4.7 (alias: GLM), zai/glm-4.6v (vision)
- Google Antigravity:
  google-antigravity/gemini-3-flash, google-antigravity/gemini-3-pro-high, google-antigravity/gemini-3-pro-low, google-antigravity/gemini-3-1-pro, google-antigravity/claude-sonnet-4-5, google-antigravity/claude-sonnet-4-5-thinking, google-antigravity/claude-opus-4-5-thinking, google-antigravity/gpt-oss-120b-medium
- Gemini CLI:
  google-gemini-cli/gemini-3-pro-preview, google-gemini-cli/gemini-3-flash-preview
Aliases:
- gpt → openai/gpt-5.2
- GLM → zai/glm-4.7
Note: google-antigravity/claude-opus-4-5-thinking is present in the configured model list, but it is not currently in the global fallback chain. If we want Opus to be a true auto-fallback for high-stakes work, we should add it to agents.defaults.model.fallbacks (proposal only; no config changes made in this run).
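If we do promote Opus to a true auto-fallback, the change would be a one-line addition to that list. A hypothetical sketch of the relevant fragment (key names beyond agents.defaults.model.fallbacks are assumed, and nothing here has been applied):

```yaml
# Hypothetical gateway config fragment -- proposal only, not applied.
agents:
  defaults:
    model:
      primary: openai-codex/gpt-5.2
      fallbacks:
        - google-antigravity/gemini-3-1-pro            # reasoning fallback, once preview tokens are stable
        - google-antigravity/claude-opus-4-5-thinking  # proposed: high-stakes long-context fallback
```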
Anthropic (goal)
We want latest Claude models available via Vertex AI so we can:
- keep credentials + billing centralized in GCP
- use enterprise-friendly auth (service accounts)
- swap to new Claude versions quickly
Selection heuristics
- If task needs correctness + code changes: prefer the best coding model, then run tests.
- If task needs lots of web reading: prefer deep-research model.
- If task is latency-sensitive: prefer fast model.
(Next: add a per-task matrix once Vertex/Claude is enabled.)
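Until that matrix exists, the heuristics above can be sketched as a single priority-ordered dispatch (the flag names are our own; the model choices mirror the Defaults section):

```python
def choose_model(needs_code: bool = False, heavy_web: bool = False,
                 latency_sensitive: bool = False) -> str:
    """Apply the selection heuristics in order: correctness+code, web reading, latency."""
    if needs_code:
        return "openai-codex/gpt-5.2"               # best coding model, then run tests
    if heavy_web:
        return "openai/o4-mini-deep-research"       # deep-research model
    if latency_sensitive:
        return "google-antigravity/gemini-3-flash"  # fast model
    return "openai-codex/gpt-5.2"                   # gateway primary as default

print(choose_model(heavy_web=True))  # openai/o4-mini-deep-research
```

Ordering matters: a task that both edits code and reads the web should land on the coding model, since the tests are the final arbiter.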
Macro context (2026-02-23 update)
- Gemini 3.1 Pro now leads ARC-style reasoning (77.1% ARC-AGI-2 verified) while posting broad gains on GPQA Diamond (94.3%), SWE-Bench Verified (80.6%), BrowseComp (85.9%), and MCP Atlas (69.2%); preview access is rolling across Google Antigravity, Vertex, NotebookLM, and GitHub Copilot.
- Video reviewers: WorldofAI’s Feb 19 write-up celebrated the reasoning jump and structured-output discipline, while AICodeKing’s Feb 22 KingMode video rates 3.1 Pro as stable and benchmarks it alongside GLM-5.
- GLM-5 release (Feb 11, Reuters) is being promoted as an open-source alternative that matches Claude Opus 4.5 on coding and outruns Gemini 3 Pro on select benchmarks, running on Ascend/Moore Threads inference stacks.
- OpenAI updates: GPT-5.3-Codex (Feb 5) and Codex-Spark (Feb 12) continue to ship the agentic coding stack, and the Feb 4 release note restored GPT-5.2 Thinking’s Extended thinking level to its January setting.
- Existing favorites: Claude Sonnet 4.6 still anchors coding fallback, and Qwen 3.5 keeps leading open-weights despite the proprietary rush.