Claude Opus 4.6: 1M Context, Agent Teams, Adaptive Thinking, and a Showdown with GPT-5.3

What Changed

Anthropic released Claude Opus 4.6 on February 5, 2026. The headline features: a 1 million token context window (beta), 128K output tokens, adaptive thinking controls, a Compaction API for longer conversations, and "agent teams" in Claude Code. Twenty-seven minutes later, OpenAI released GPT-5.3-Codex -- turning launch day into a direct comparison.

Model ID: claude-opus-4-6. Available on claude.ai, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Azure Foundry, and GitHub Copilot.
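
For orientation, here is a minimal request against the new model ID, using the Anthropic Python SDK. The model string comes from the announcement; the prompt and token limit are placeholders.

```python
# Minimal sketch: calling Opus 4.6 through the Anthropic Python SDK.
# The model ID is from the announcement; everything else is illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Summarize the changes in this diff: ..."}],
)
print(response.content[0].text)
```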

The 1M Context Window

Opus-class models now support a 1 million token context window in beta -- roughly 1,500 pages per prompt. The benchmark backing this claim: 76% on the 8-needle MRCR v2 (1M variant), compared to 18.5% for Sonnet 4.5. A 4x gap over the next-best model indicates reliable retrieval across massive contexts, not just a larger window that degrades at scale.

Output tokens doubled from 64K to 128K. For agentic coding tasks that generate long outputs, this removes a hard ceiling.

The Compaction API (beta) auto-summarizes older conversation segments as context approaches the window limit. Multi-step agentic tasks that previously stalled mid-execution now continue by compressing earlier context.
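
A sketch of what a long-context call might look like. Two assumptions worth flagging: the 1M window is assumed to sit behind an anthropic-beta header (the flag value shown below is the one used for the earlier Sonnet 1M beta and may differ for Opus 4.6), and max_tokens is set to the new 128K ceiling.

```python
# Sketch of a long-context request. Assumption: the 1M window is gated behind an
# "anthropic-beta" header; the value below is the flag from the earlier Sonnet
# 1M beta and may differ for Opus 4.6 -- check the current docs.
import anthropic

client = anthropic.Anthropic()

with open("entire_codebase.txt") as f:  # illustrative: a very large prompt
    big_prompt = f.read()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=128_000,  # new output ceiling per the announcement
    messages=[{"role": "user", "content": big_prompt}],
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},  # assumed flag
)
```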

Coding and Agentic Capabilities

Anthropic initially claimed the highest score on Terminal-Bench 2.0. Within 27 minutes, OpenAI released GPT-5.3-Codex and took the lead. Launch-day press reported 65.4% for Opus 4.6 and 77.3% for GPT-5.3-Codex. The live leaderboard now shows a narrower gap: 69.9% vs. 75.1% -- scores shift as agent configurations improve.

Where Opus 4.6 holds the lead: reasoning-heavy benchmarks like GPQA Diamond, MMLU Pro, and TAU-bench. GPT-5.3-Codex dominates terminal and computer-use workloads.

Agent teams in Claude Code spin up parallel agents on independent subtasks. Early testers report the model identifies blockers between tasks reliably. Whether this holds on codebases with circular dependencies and undocumented legacy code remains untested.

Adaptive thinking adds four effort levels -- low, medium, high, max. Developers match reasoning depth to task complexity: a simple classification call costs less compute than a multi-file refactor. The max level is new to Opus 4.6.
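
A sketch of routing tasks to effort levels. The four level names come from the announcement; nesting an effort field inside the thinking object is an assumption based on the deprecation note under Breaking Changes below.

```python
# Sketch: pick a reasoning effort level per task. Level names are from the
# announcement; placing "effort" inside the thinking object is an assumption.
import anthropic

client = anthropic.Anthropic()

EFFORT_BY_TASK = {
    "classify_ticket": "low",        # simple classification -> cheap
    "review_pull_request": "medium",
    "multi_file_refactor": "high",
    "debug_race_condition": "max",   # "max" is new in Opus 4.6
}

def run(task: str, prompt: str):
    return client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8_000,
        thinking={"type": "adaptive", "effort": EFFORT_BY_TASK[task]},
        messages=[{"role": "user", "content": prompt}],
    )
```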

Benchmarks

  • GDPval-AA (economically valuable knowledge work): 144 Elo points ahead of GPT-5.2, 190 ahead of Opus 4.5. Translates to winning roughly 70% of head-to-head comparisons (see the quick Elo check below)
  • Humanity's Last Exam: 53.1% with tools (vs. GPT-5.2 Pro's 50.0%, Gemini 3 Pro's 45.8%). Without tools: 40.0% vs. Opus 4.5's 30.8%
  • BrowseComp (agentic search): 84.0% vs. Opus 4.5's 67.8%
  • Life sciences: Nearly 2x improvement over Opus 4.5 in computational biology, structural biology, organic chemistry, and phylogenetics
  • Terminal-Bench 2.0: 69.9% per live leaderboard (GPT-5.3-Codex leads at 75.1%)

All benchmark claims originate from Anthropic's announcement. Independent verification pending on most.
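
The ~70% figure in the GDPval bullet is consistent with the standard Elo expected-score formula; a quick check, assuming GDPval-AA Elo behaves like standard Elo:

```python
# Quick check of the GDPval claim: a 144-point Elo lead implies roughly a 70%
# win rate under the standard Elo expected-score formula.
def elo_win_probability(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

print(round(elo_win_probability(144), 3))  # ~0.70 vs. GPT-5.2
print(round(elo_win_probability(190), 3))  # ~0.75 vs. Opus 4.5
```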

Pricing and Availability

  • Standard tier (up to 200K input tokens): $5.00 per M input tokens / $25.00 per M output tokens
  • Long-context tier (over 200K input tokens): $10.00 per M input tokens / $37.50 per M output tokens
  • US-only inference: 1.1x multiplier on both input and output

For comparison:

  • GPT-5.2: $1.75 input / $14.00 output per M tokens
  • Gemini 2.5 Pro: $1.25 input / $10.00 output per M tokens
  • MiMo V2 Flash: $0.10 input / $0.30 output per M tokens

GPT-5.3-Codex pricing has not been disclosed at the time of writing. Gemini offers 1M-token context at standard pricing; Claude charges 2x for prompts exceeding 200K tokens.
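
To make the tiers concrete, a rough cost calculator. One assumption: once input exceeds 200K tokens, the whole request bills at the long-context rate, mirroring how Anthropic's earlier 1M-context pricing worked; confirm against current docs.

```python
# Rough cost estimate for an Opus 4.6 request using the prices above.
# Assumption: once input exceeds 200K tokens the whole request bills at the
# long-context rate, mirroring Anthropic's earlier 1M-context pricing.
def opus_46_cost(input_tokens: int, output_tokens: int, us_only: bool = False) -> float:
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # long-context tier, per M tokens
    else:
        in_rate, out_rate = 5.00, 25.00    # standard tier, per M tokens
    cost = (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate
    return cost * 1.1 if us_only else cost

# e.g. a 500K-token prompt with a 20K-token response:
print(f"${opus_46_cost(500_000, 20_000):.2f}")  # $5.75
```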

Market Impact and Early Reactions

Software stocks dropped on the announcement: Thomson Reuters fell 15.83% and LegalZoom dropped nearly 20% -- extending a pattern where frontier model launches rattle SaaS valuations. Separately, Anthropic's red team used Opus 4.6 to discover 500+ previously unknown zero-day vulnerabilities across open-source libraries including Ghostscript, OpenSC, and CGIF, each validated by security researchers.

Early developer feedback splits along use-case lines. Coding and agentic tasks show clear improvements. Writing quality drew complaints, with some users on Reddit and Hacker News calling the model "nerfed" for prose. One theory: reinforcement learning optimizations for reasoning came at the cost of prose quality. Developers upgrading for code review and debugging benefit. Teams relying on Claude for long-form writing should consider keeping Opus 4.5 in their workflow.

Breaking Changes

Two changes that affect existing integrations:

Prefill removal: Opus 4.6 does not support prefilling assistant messages. Requests with prefilled content return a 400 error. Alternatives include structured outputs and system prompt instructions.
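
A sketch of the system-prompt workaround for a common prefill use case, forcing JSON-only output. The prompt wording is illustrative, not Anthropic's recommended phrasing.

```python
# Sketch: replacing an assistant-message prefill with a system prompt instruction.
# On Opus 4.6 the prefilled variant returns a 400 error, per the announcement.
import anthropic

client = anthropic.Anthropic()

# Old pattern (no longer supported on Opus 4.6): prefill the assistant turn
# with "{" to force JSON output.
#
# messages=[
#     {"role": "user", "content": "Extract the invoice fields as JSON."},
#     {"role": "assistant", "content": "{"},   # prefill -> now a 400 error
# ]

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system="Respond with a single JSON object and nothing else.",  # illustrative wording
    messages=[{"role": "user", "content": "Extract the invoice fields as JSON."}],
)
```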

Deprecated features: thinking: {type: "enabled"} and budget_tokens still work but are deprecated. The replacement: thinking: {type: "adaptive"} with the effort parameter.
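
A before/after sketch of that migration. Where exactly the effort field sits is an assumption based on the wording above; check the API reference before relying on it.

```python
# Migration sketch for the deprecated thinking parameters.

# Deprecated form (still works for now, per the announcement):
legacy_thinking = {"type": "enabled", "budget_tokens": 10_000}

# Replacement: adaptive thinking with an effort level. Nesting "effort" inside
# the thinking object is an assumption based on the announcement's wording.
adaptive_thinking = {"type": "adaptive", "effort": "high"}

# Passed the same way in either case, e.g.:
#   client.messages.create(model="claude-opus-4-6", max_tokens=8_000,
#                          thinking=adaptive_thinking, messages=[...])
```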

