The Arrival of GPT-5.5: OpenAI's New Deep-Thinking Powerhouse

TLDR: OpenAI shipped GPT-5.5 and GPT-5.5 Pro, both available in the Kilo Gateway. The headline numbers are real, especially for terminal workflows and long-horizon coding, but the price went up and in-memory prompt caching is gone for this tier.

Summary:

The Kilo team got an alpha pre-test of GPT-5.5 and they are pretty bullish. The new model lands at the top of the Artificial Analysis Intelligence Index, edging out the previous three-way tie at the frontier by three points. On Terminal-Bench 2.0 it scores 82.7 percent, which is a chunky jump from GPT-5.4 at 75.1 and Claude Opus 4.7 at 69.4. Expert-SWE, the internal long-horizon coding benchmark that simulates roughly 20-hour tasks, climbs from 68.5 to 73.1. GDPval, BrowseComp, OSWorld-Verified all move in the right direction too.

What I find more interesting than the leaderboard is the qualitative claim. The Kilo folks say GPT-5.5 stops losing the plot during massive refactors and deep-reasoning passes over large codebases. Anyone who has watched a previous-generation model wander off mid-migration knows exactly what they mean. They are using it as a daily driver for KiloClaw, their always-on agent product, and the word they keep reaching for is reliability. Fewer retries. Less babysitting. That is the actual unlock for agentic workflows.

Pricing is the part that needs a sober reading. GPT-5.5 lands at five dollars per million input tokens and thirty per million output, with a fifty-cent cache rate. That is a step up from GPT-5.4. The argument from OpenAI and from Kilo is that the 5.5 series is more token-efficient on hard tasks, so the per-completed-task cost can actually drop because you do not need the retry loop. That math is true when it is true and false when it is false, and you only find out by running your own workloads. Add to that the note that in-memory prompt caching is not supported here, only extended prompt caching on GPU-local storage, and there is real engineering homework to do before you swap defaults in production.

The Kilo prediction is the line worth circling. They expect GPT-5.5 to fight the latest Opus for coding work but to settle in as a top-of-stack orchestrator in agent setups, with sub-agents falling to smaller or open-source models for cost reasons. That is the shape the industry has been telegraphing for a while now. Frontier models drive the plan, cheaper workhorses do the volume. If you are designing agent systems today, this is the architecture you should already be sketching on a napkin.

Key takeaways:

GPT-5.5 leads the Artificial Analysis Intelligence Index by three points and posts strong gains on Terminal-Bench, Expert-SWE, GDPval, OSWorld, and BrowseComp.
Big reliability improvements on long-horizon coding and refactors; described as a real daily driver for always-on agent work.
Pricing rose to five dollars input and thirty dollars output per million tokens, with token efficiency claimed to offset cost on hard tasks.
No in-memory prompt caching at this tier; only extended prompt caching, cached on GPU-local storage during inference.
Top-agent driver pattern is emerging: GPT-5.5 in the orchestrator seat, smaller or open-source models running sub-agents for cost.
Some weakness reported on UI design tasks, mitigated with more specific prompts.

Why do I care:

I have been around long enough to be allergic to benchmark theater, and I will tell you exactly what I look at first when a new frontier model drops. Not the headline score. The long-horizon coding number and the terminal-tooling number. Those are the ones that map to the work I actually pay for. A model that can hold its line through a twenty-hour refactor without inventing a phantom function or rewriting an interface I never asked it to touch is worth real money. Everything else is a demo.

So the GPT-5.5 numbers are the kind I respect. Terminal-Bench 2.0 in the eighties means tool coordination is finally crossing into reliable territory. Expert-SWE at 73 means it is not just one-shot snippet generation. Pair that with the reliability claim from a team that has actually been running it on production-grade agent workloads, and yes, I am paying attention.

But I am not jumping. The pricing is the thing. Five and thirty per million is fine for the hard problem at the top of the stack: the architectural decisions, the gnarly migrations, the test-suite triage where one model getting it right beats five cheaper models getting it almost right. It is not fine for the inner loop of an agent that is calling tools two hundred times a session. The Kilo prediction about orchestrator-and-sub-agents is the architecture you should be planning around. Top model for the plan, cheap model for the labor. If your agent stack does not already split that way, the new pricing is going to teach you that lesson the expensive way.

The missing in-memory prompt cache is the other detail engineers should not gloss over. If you have been leaning on cached system prompts to keep latency and cost down, you have to re-architect for the extended caching model here. That is a real migration cost, and it deserves a spot in the spreadsheet alongside the per-token numbers. As with every frontier release, the right move is to run the eval on your own workload, on your own data, with your own tools wired in. Benchmarks tell you the model exists. Your eval tells you whether you should ship it.

The Arrival of GPT-5.5: OpenAI's New Deep-Thinking Powerhouse