Anthropic Rents Colossus, ARR Grows 8000% Annualized, and the Agent Platform Wars Begin
Published on 07.05.2026
Anthropic's Second Annual Developer Day: Compute, Not Models
TLDR: Anthropic's "Code with Claude" event delivered no new model release but announced a massive compute deal with SpaceX/xAI for access to Colossus 1, immediately translating into doubled Claude Code rate limits. The real story: 80x demand growth had forced Anthropic to quietly throttle users for months.
The thing that strikes me about this developer day is that the biggest news was infrastructure, not intelligence. No new Claude model dropped. What dropped instead was the curtain on a problem Anthropic had apparently been wrestling with: they were simply running out of compute. Dario and Daniela Amodei confirmed in an interview that usage grew roughly 80x unexpectedly, and the SpaceX deal is the first serious attempt to address it. Anthropic CTO Tom Brown said Claude inference would start ramping on Colossus "in the next few days," which is a remarkable turnaround.
The scale figures circulating are worth taking seriously even if Anthropic hasn't canonized every number. Third-party estimates put the arrangement at over 300 megawatts and 220,000 NVIDIA GPUs, with Colossus 1 containing approximately 150,000 H100s, 50,000 H200s, and 30,000 GB200s. One estimate put the annual value at around $5B. Elon Musk confirmed xAI was comfortable leasing Colossus 1 because xAI had already moved training to Colossus 2, which is reportedly already at roughly 500,000 Blackwells. So what looks like Anthropic's big win is also xAI monetizing last year's hardware while upgrading to next year's.
The practical impact landed immediately. Claude Code's 5-hour rate limits doubled for Pro, Max, Team, and seat-based Enterprise users. Peak-hours throttling was removed for Pro and Max. Opus API rate limits went up substantially. Anthropic's product lead Amol Avasare explained that weekly limits stayed put because only a small fraction of users hit those, while a much larger cohort hits the 5-hour window. More changes may follow as compute ramps. That's a rational prioritization, and I appreciate that they said it plainly.
What is less comfortable to sit with is the competitive angle. Some observers argued Anthropic blundered by waiting too long to secure this capacity, possibly surrendering meaningful ARR to OpenAI during the constrained months. The developer community was already hedging toward Codex and Cursor whenever Claude went down. The flip side is that 8000% annualized ARR growth is a difficult number to argue with, even if the starting base was small.
Anthropic x SpaceX compute announcement
Claude Managed Agents: Dreaming, Outcomes, and Whether Any of It Is Defensible
TLDR: Anthropic's "Code with Claude" event introduced managed agent features including "Dreaming" (memory/cross-session context) and "Outcomes" (rubric-based grading), prompting real debate about whether these are genuine platform primitives or just first-party packaging of patterns any open framework can replicate.
The agent product story at this event was interesting because it exposed a genuine disagreement among technical practitioners. Anthropic is positioning Claude toward structured agent systems: memory that persists across sessions, separate grader agents to evaluate quality and prevent reward hacking, and orchestration infrastructure. The framing from the event was that "routines are higher-order prompts" and the remaining gap in deployment is operationalization, not raw capability. I find that framing honest. Getting a model to answer a question well is a solved problem. Getting it to do reliable, verifiable work over a long horizon inside a production system is not.
The skeptics have a point though. "Dreaming" is essentially memory and context management. "Outcomes" is a rubric system. Both of these patterns have been implemented in open frameworks for a while. The question isn't whether Anthropic can build these things; obviously they can. The question is whether a first-party platform version has advantages that open alternatives don't, specifically in terms of model-native integration, reliability guarantees, and safety properties. If Anthropic's managed agent features are tightly coupled to how Claude reasons internally, that might be defensible. If they're just system prompts and tool scaffolding with a nice UI, the moat is thin.
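Anthropic hasn't published the internals of "Outcomes," but the rubric-grading pattern itself is straightforward to sketch. Here is a minimal, hypothetical version in which a separate grader scores an agent's output against weighted criteria; in a production system each `check` would be a call to a grader model rather than a heuristic, and none of these names reflect Anthropic's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One line of a grading rubric: a name, a weight, and a check."""
    name: str
    weight: float
    check: Callable[[str], bool]  # real systems: a grader-model call

def grade(output: str, rubric: list[Criterion]) -> float:
    """Return a weighted score in [0, 1] for an agent's output."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total

# Illustrative rubric for a "write a summary" task.
rubric = [
    Criterion("non_empty", 1.0, lambda o: len(o.strip()) > 0),
    Criterion("cites_source", 2.0, lambda o: "http" in o),
    Criterion("under_limit", 1.0, lambda o: len(o.split()) <= 200),
]

score = grade("A short summary. Source: http://example.com", rubric)
```

The point of separating the grader from the worker, as the event framing suggested, is that the agent being graded never sees or optimizes against the rubric directly, which makes naive reward hacking harder.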
There's a broader signal here that I think matters a lot for architects: harness engineering is now a first-class variable in agent performance. Research showed that the same base model with different prompting, tooling, and middleware can score 10-20 points differently on agent benchmarks like tau2-bench. Cursor added context usage breakdowns to help debug context issues. Cognition shipped AI-generated code review tools in Windsurf 2.0. The battleground shifted from "which model is smartest" to "which system has the best memory, decomposition, grading, and verification pipeline." Anthropic is placing a bet that developers want a first-party answer to that question from the model provider itself.
Code with Claude event coverage
Anthropic's Safety Brand: Bifurcated, Under Pressure, and Now Endorsed by Elon
TLDR: Anthropic's governance and safety positioning attracted renewed criticism alongside a surprising Elon Musk endorsement after the SpaceX deal, with critics questioning whether it's coherent to build powerful closed systems while warning about AI risk, and insiders pushing back that the internal view is "no one can be trusted with AGI" rather than "only us."
The governance debate around Anthropic is one of the more interesting things to follow because it operates on multiple levels simultaneously. On one side, critics claim Anthropic employees privately believe only Anthropic should be trusted to build AI, which is a difficult posture to sustain publicly. Aidan Clark articulated this criticism directly from his own conversations with Anthropic colleagues. On the other side, Anthropic-adjacent voices pushed back saying the more common internal belief is closer to "no one can be trusted with AGI," which is a different and more defensible epistemic position, even if it leads to the same behavior: keep building.
What made this particularly strange is that Elon Musk, after meeting Anthropic leadership, offered what observers described as a surprising endorsement. Musk's comfort leasing Colossus 1 to Anthropic while simultaneously suing OpenAI is hard to read as anything other than strategic. It does suggest that whatever ideological differences exist between xAI and Anthropic, they're fungible enough when compute economics are involved. The criticism that Anthropic warns about AI risk while building "Mythos," a secretive internal model, isn't new but feels sharper after a week where the company's commercial aggressiveness was on full display.
I'll say plainly that I think the safety-focused branding and the competitive aggression are in real tension. That tension isn't disqualifying, and being safety-conscious while still building at the frontier is a coherent position. But when you announce a $5B/year compute expansion, double rate limits, and celebrate 8000% ARR growth in the same week, the "we move deliberately because we worry about the risks" framing needs more than a slide deck to hold up. Dario's invocation of Amdahl's Law as a framework for thinking about bottlenecks in software engineering was genuinely interesting, and his observation about tiny teams with enormous leverage feels accurate. But the governance questions won't be answered by product announcements.
Infrastructure Week: OpenAI's MRC Protocol, Perplexity's ROSE Engine, and vLLM's 3.8x Throughput Jump
TLDR: While Anthropic dominated headlines, OpenAI released an open networking protocol for AI training clusters, Perplexity revealed an in-house inference engine, and vLLM combined with Mooncake to demonstrate 3.8x throughput improvements for agentic workloads.
This was genuinely infrastructure week across the entire industry, not just Anthropic. OpenAI released MRC, a Multipath Reliable Connection protocol for large AI training clusters, already deployed on their biggest supercomputers. The technical focus was on multipath routing and microsecond failover, treating networking as a primary bottleneck rather than an afterthought. I find it interesting that OpenAI is open-sourcing networking infrastructure; it suggests they see commoditizing this layer as strategically neutral while keeping the model training itself proprietary.
Perplexity revealed they built ROSE, an in-house inference engine covering everything from embeddings to trillion-parameter LLMs, using CuTeDSL to accelerate kernel development on Hopper and Blackwell. The fact that Perplexity built this themselves rather than relying on vLLM or TensorRT-LLM says something about the performance requirements they're hitting at scale. There's a recurring pattern here: once you get large enough, the general-purpose inference stacks become limiting, and you start investing in custom systems.
The vLLM plus Mooncake result for agentic workloads is worth a closer look. They reported 3.8x throughput, 46x lower P50 time-to-first-token, 8.6x lower end-to-end latency, and a cache hit improvement from 1.7% to 92.2%, scaling to 60 GB200 GPUs. The mechanism is reusable prefixes, which makes sense for agent workloads where system prompts and context preambles are repeated across requests. Unsloth, working with NVIDIA, published three training optimizations claiming roughly 25% faster training on home-GPU setups. NVIDIA also showed lossless speculative decoding inside RL delivering up to 2.5x faster end-to-end RL at 235B scale without changing the policy distribution.
Infrastructure coverage in AINews
Model Releases: Zyphra ZAYA1-8B, Google Gemma 4, and DeepSeek's $45B Valuation Talk
TLDR: Zyphra released an open-weight reasoning MoE with under 1B active parameters claiming competitive math performance; Google's Gemma 4 moved into the top 20 open models on Code Arena; DeepSeek is reportedly targeting a $45B valuation from a Chinese state-backed semiconductor fund.
Zyphra's ZAYA1-8B deserves more attention than it's getting. It's an 8B parameter model with under 1B active parameters, released under Apache 2.0, and claims strong math and reasoning efficiency with test-time compute. The architecture and post-training stack drew praise from several technical observers, and the AMD partnership is notable given that most of the open model ecosystem runs on NVIDIA. A competitive reasoning model that runs efficiently on AMD hardware opens up meaningful deployment options for teams that aren't deep in the NVIDIA supply chain.
Google's Gemma 4 moved the open model Pareto frontier in Code Arena, with Gemma-4-31B landing at number 13 and Gemma-4-26B-A4B at number 17 among open models. The DFlash draft model for Gemma 4 was described by Google researchers as one of the strongest draft models they've trained, particularly in coding and math, which is relevant for speculative decoding applications. A consumer-hardware result worth noting: Qwopus3.6-35B-A3B-v1 claimed 162 tokens per second on a single RTX 5090 for one-shot frontend generation. That's a usable speed on hardware that isn't a datacenter.
The DeepSeek story is harder to read. Fundraising talks are reportedly targeting a $45B valuation led by a major Chinese state-backed semiconductor fund, which would place it among the most valuable private AI companies globally. At the same time, evaluators are noting weak WeirdML performance for V4-Pro versus competitors including GLM and Kimi, and questions about whether DeepSeek V4-Pro is actually better than its predecessors on some benchmarks. The valuation and the benchmark story don't obviously align, but fundraising and model performance have never been strongly correlated in AI.
Benchmarks, Evals, and the Harness-First Reality
TLDR: New benchmarks including ProgramBench, Terminal-Bench 2.1, OBLIQ-Bench, and Harvey's LAB appeared alongside a recurring finding that harness engineering is worth 10-20 benchmark points on agent tasks independent of model choice.
The benchmark ecosystem is unusually active right now, and not just because everyone is racing to claim top positions. ProgramBench extends beyond repair-style SWE tasks to ask whether models can rebuild programs from scratch. Terminal-Bench 2.1 patched 28 of 89 tasks from the 2.0 release and found that absolute scores moved by up to 12 points even though relative rankings held. That second finding is important: benchmark maintenance matters, and teams relying on absolute score comparisons across versions are probably drawing false conclusions.
OBLIQ-Bench addresses what I think is an underappreciated gap: hard first-stage retrieval, specifically where current retrievers fail to surface subtly relevant documents from large corpora. RAG systems in production often fail not because the generative model is weak but because retrieval didn't surface the right documents. A benchmark focused on this problem is more useful to production engineers than another reasoning benchmark. Harvey launched LAB as an open-source long-horizon legal agent benchmark covering 1,200 tasks across 24 practice areas, which is interesting because it's an actual domain expert organization defining the evaluation rather than ML researchers approximating what legal work looks like.
The strongest signal across all the benchmark coverage is that harness engineering is a first-class variable. The same base model with different prompting, tooling, and middleware showed 10-20 point jumps on tau2-bench. LangChain published harness profiles for OpenAI, Anthropic, and Google models. One practitioner argued that a tailored harness can outperform default Codex or Claude Code on many tasks, and that usable context windows in practical agent designs are still effectively 50-100k tokens regardless of what the model nominally supports. Another observation: "If you cannot get your work done in the Claude CLI, Claude will not be able to work for you" is a blunt but fair way to frame the alignment between human workflow and agent capability.
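If usable context really tops out around 50-100k tokens in practical agent designs, then every harness needs an explicit trimming policy rather than hoping the nominal window saves it. A minimal sketch of one common policy, keep the system prompt plus the newest turns that fit a budget (hypothetical, not any specific harness; `count` is a word-count stand-in for a real tokenizer):

```python
from typing import Callable

def trim_history(system: str, turns: list[str], budget: int,
                 count: Callable[[str], int] = lambda s: len(s.split())
                 ) -> list[str]:
    """Keep the system prompt plus the newest turns that fit the budget."""
    remaining = budget - count(system)
    kept: list[str] = []
    for turn in reversed(turns):       # walk newest-first
        cost = count(turn)
        if cost > remaining:
            break                      # oldest surviving turn found
        kept.append(turn)
        remaining -= cost
    return [system] + kept[::-1]       # restore chronological order

msgs = trim_history("system prompt", ["a b c", "d e", "f g h i"], budget=9)
```

Policies like this are precisely the "harness engineering" the benchmarks are measuring: two deployments of the same model with different trimming, memory, and grading choices can land 10-20 points apart.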