[News Brief] Anthropic Releases Claude Opus 4.5
New flagship model claims coding benchmark lead amid intensifying AI competition
SUMMARY: Anthropic has released Claude Opus 4.5, its latest flagship model, achieving an 80.9% score on SWE-bench Verified coding benchmark. The release comes with a significant price reduction (now $5/$25 per million tokens, down from $15/$75) and follows closely behind launches from OpenAI (GPT-5.1) and Google (Gemini 3 Pro).
Overview
Anthropic announced Claude Opus 4.5 on November 24, 2025, positioning it as “the best model in the world for coding, agents, and computer use.” The release marks the completion of Anthropic’s 4.5 model family, following Claude Sonnet 4.5 (September 2025) and Claude Haiku 4.5 (October 2025). The model is available immediately through Claude Chat, Claude Code, and via API, as well as on AWS Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry.
The release enters a crowded marketplace, arriving just six days after Google’s Gemini 3 Pro and twelve days after OpenAI’s GPT-5.1. All three companies are competing for dominance in software engineering and agentic AI applications.
Benchmark Performance
Anthropic reports state-of-the-art results on several industry benchmarks, though the margins separating frontier models continue to narrow. On SWE-bench Verified, a widely-used software engineering benchmark, Opus 4.5 achieved 80.9%, compared to GPT-5.1-Codex-Max at 77.9%, Claude Sonnet 4.5 at 77.2%, and Gemini 3 Pro at 76.2%.
Context on benchmark margins: As Simon Willison observed, “Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems?” After his Opus 4.5 preview expired, he switched to Sonnet 4.5 and “kept on working at the same pace.” These concerns align with my recently published broader critiques of LLM evaluation methodology, which identify systemic issues including benchmark saturation (near-perfect scores losing discriminative power) and reward hacking (models exploiting evaluation loopholes rather than demonstrating genuine capability).
Visual presentations of benchmark results can further exaggerate apparent differences through tricks like truncated y-axes. Here is a chart crime from the Anthropic announcement:
Here is the same chart with a continuous y-axis, showing the true proportion of the differences:
Pricing Changes
Anthropic has significantly reduced API pricing for Opus-class models. Claude Opus 4.5 is priced at $5 per million input tokens and $25 per million output tokens, representing a 67% reduction from the previous Opus 4.1 pricing of $15/$75. This positions Opus 4.5 closer to competitor pricing, though it remains more expensive than GPT-5.1 ($1.25/$10) and Gemini 3 Pro ($2/$12 for standard contexts).
Technical Features
Effort Parameter
Opus 4.5 introduces an “effort” parameter (low, medium, high) allowing developers to control computational effort. According to Anthropic, at medium effort, the model matches Sonnet 4.5’s SWE-bench score while using 76% fewer output tokens. This follows a similar approach to OpenAI’s recent Codex-Max release.
Extended Thinking
The model supports extended thinking mode with configurable token budgets. The system card notes that thinking blocks from previous assistant turns are now preserved in model context by default, a change from earlier Claude models that discarded this information.
Context and Output Limits
Opus 4.5 maintains a 200,000 token context window (same as Sonnet 4.5) and a 64,000 token output limit. The knowledge cutoff date is listed as May 2025. A new “endless chat” feature automatically compresses context when limits are reached, allowing conversations to continue without interruption.
Safety and Alignment Claims
Anthropic describes Opus 4.5 as “the most robustly aligned model we have released to date and, we suspect, the best-aligned frontier model by any developer.” The company reports reduced rates of “concerning behavior” in their automated evaluations and claims improved resistance to prompt injection attacks compared to previous models and competitors.
Third-party verification: The prompt injection resistance claims were developed in partnership with Gray Swan, an external security research firm. However, as noted in Hacker News discussion of the release, independent verification of alignment claims would strengthen confidence in these assertions. The system card acknowledges that “prompt injection still works 1/20 times” in single-attempt scenarios, rising to approximately 1/3 success rate with ten different attack attempts.
Notable Findings from Testing
Policy Loophole Discovery
The system card documents an interesting case where Opus 4.5 technically “failed” a benchmark test. In the τ2-bench evaluation simulating airline customer service, the model discovered creative policy workarounds to help customers: exploiting the difference between “modification” and “cancellation + rebooking” to assist a distressed customer. While this behavior caused the model to fail the test’s expected response, Anthropic frames it as demonstrating sophisticated reasoning: “This kind of creative problem solving is exactly what we’ve heard about from our testers and customers.”
Internal Engineering Test
Anthropic reports that Opus 4.5 scored higher than any human candidate on the company’s internal take-home exam for performance engineering candidates (within a 2-hour time limit). The company notes this test “doesn’t test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over years.”
Methodological Considerations
Benchmark contamination: The system card acknowledges ongoing challenges with decontamination. Despite multiple filtering techniques, some evaluation content persisted in training data. The document notes instances where the model produced “unfaithful” reasoning traces (incorrect intermediate steps leading to correct final answers) suggesting possible memorization of benchmark answers.
Self-reported results: As with most model releases, the benchmark results are primarily self-reported by Anthropic. Independent evaluation and third-party verification would provide additional confidence in the claimed performance levels.
Narrowing margins: The performance gap between frontier models continues to compress. The difference between Opus 4.5’s SWE-bench score (80.9%) and GPT-5.1-Codex-Max (77.9%) represents approximately 3 percentage points—meaningful but not transformative for many practical applications.
Accompanying Product Updates
Alongside the model release, Anthropic announced several product updates: Claude for Chrome (browser automation) is now available to all Max users; Claude for Excel has expanded access to Max, Team, and Enterprise users; Claude Code is now available in the desktop app with upgraded Plan Mode; and context compaction enables “endless” conversations for paid users.
Market Context
The November 2025 release cycle has been remarkably compressed, with OpenAI, Google, and Anthropic all launching flagship models within a two-week period. Salesforce CEO Marc Benioff publicly stated he was switching from ChatGPT to Gemini 3, while OpenAI CEO Sam Altman reportedly told colleagues that Google’s update would create “temporary economic headwinds.” Anthropic, meanwhile, has secured cloud partnerships with Microsoft, NVIDIA, Amazon, and Google, though the company reportedly does not expect to break even until 2028.
For developers and enterprise users, the practical question remains whether marginal benchmark improvements translate to meaningful productivity gains in real-world applications. As noted, “single digit percent improvement on a benchmark” may matter less than concrete examples of previously impossible tasks now becoming achievable.







Great review. Just threw a meaty task at it in Cursor and already impressed with how much more robust its process is than Sonnet 4.5.
The 67% price drop on Opus is significant but here's what changed more for me: Haiku 4.5 got so good that I stopped needing Opus for most tasks entirely.
My agent runs Haiku by default, Sonnet for content, Opus only for complex reasoning. The capability gap between tiers shrunk while the price gap stayed huge. Smart routing beats expensive defaults every time.