Sonnet 4.6 Drops: Not Sonnet 5, But Maybe Something Better

No, It's Not Sonnet 5 (And That's a Good Thing): Meet Sonnet 4.6

TLDR: Anthropic shipped Sonnet 4.6 instead of the rumored Sonnet 5, and honestly, the version number is the least interesting part. This model closes the gap with Opus 4.6 on benchmarks like SWE-bench Verified while introducing granular thinking controls and a one-million-token context window in beta.

Summary:

Alright, let me just get the elephant out of the room. If you spent the last week refreshing X and Reddit waiting for "Sonnet 5" to appear, I feel you, but you should probably redirect that energy. Anthropic went sideways instead of forward with the naming, and what they delivered with Sonnet 4.6 is genuinely impressive. The SWE-bench Verified score of 79.6 percent puts it uncomfortably close to Opus 4.6's 80.9 percent. That is not a Sonnet-class gap anymore. That is a rounding error away from the flagship model.

What makes this interesting from an engineering standpoint is the thinking mode system. You get three options: thinking disabled, extended thinking, and adaptive thinking. The article from the Kilo team suggests extended thinking at medium effort is the sweet spot for heavy coding tasks, and that tracks with what we know about how reasoning models work. The model essentially pre-flights its logic before writing code, catching edge cases upstream. If you are migrating from Sonnet 4.5, though, just turn thinking off. You get the intelligence upgrade without having to rework your prompts or agentic pipelines. That is a smart backward-compatibility move from Anthropic.

Now, the one-million-token context window in beta is the headline feature that the article kind of breezes past, and I want to challenge that. A million tokens sounds incredible for enterprise codebases, but context window size and context window utilization are two very different beasts. We have seen this pattern before: models get enormous context windows, and then the retrieval quality in the middle of that window drops off a cliff. The article does not address this at all. What is the actual recall accuracy at token 500,000? What happens to latency and cost when you are actually pushing that limit? These are the questions that matter for production use, and the Kilo team — who clearly have skin in the game as an AI coding platform — skips right past them.

The article also leans heavily on the "Opus and Sonnet are the premium choice" narrative without really engaging with the competitive landscape. Google's Gemini 2.5 models and OpenAI's latest offerings are pushing hard on coding benchmarks too. The framing here is very much "Anthropic's ecosystem is the only game in town," which is a bit of a blind spot. If you are making infrastructure decisions about which model to standardize on, you owe it to yourself to run your own evals against your actual codebase, not just trust SWE-bench numbers.

What is genuinely exciting, and what I think the article undersells, is the convergence of several API features going to general availability at the same time: code execution, web fetch, tool search, and memory improvements. Individually, these are incremental. Together, they represent a much more capable agentic substrate. The model is not just smarter — the tooling around it is maturing in ways that make autonomous coding workflows actually practical rather than just demo-worthy.

Key takeaways:

Sonnet 4.6 scores 79.6 percent on SWE-bench Verified, nearly matching Opus 4.6's 80.9 percent, at a lower price point
Three thinking modes (disabled, extended, adaptive) give you fine-grained control over the intelligence-versus-speed tradeoff
One-million-token context window is available in beta, but real-world retrieval quality at that scale remains unproven
Multiple API features hit general availability simultaneously: code execution, web fetch, tool search, and memory
For Sonnet 4.5 migration, disabling thinking mode provides the smoothest upgrade path

Tradeoffs: The core tradeoff here is the classic one: Sonnet 4.6 is cheaper than Opus but now nearly as capable on benchmarks. The question is whether that benchmark parity holds up on your specific codebase and workflows. Extended thinking improves output quality but increases latency and token cost. The million-token context window is powerful but unproven at scale for precision retrieval. You are trading known Sonnet 4.5 behavior for potentially better but less battle-tested 4.6 behavior.

No, It's Not Sonnet 5 (And That's a Good Thing): Meet Sonnet 4.6