OpenAI GPT-Realtime-2: Voice Agents Enter a New Era of Reasoning and Real-Time Translation

GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper: OpenAI's Big Voice API Push

TLDR: OpenAI released three new streaming audio models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — available now in the Realtime API. GPT-Realtime-2 brings what OpenAI calls "GPT-5-class reasoning" to voice agents, with a 128K context window and adjustable reasoning effort levels. This is a meaningful step forward for developers building production voice agents.

Summary: Three months ago, OpenAI quietly shipped realtime-1.5 and it barely made a ripple — it was still riding 4o-class intelligence with only a modest benchmark bump. Today's realtime-2 launch is a different story. The confidence in the release comes through clearly, and the numbers back it up: a 15+ percentage point improvement on Big Bench Audio, instruction retention jumping from 36.7% to 70.8% on Scale AI's Audio MultiChallenge leaderboard. That's not incremental progress — that's a model that's now genuinely usable for complex voice applications.

The most interesting thing about this release is what OpenAI chose to focus on. It's not audio quality — it's usability and agent behavior. GPT-Realtime-2 is designed around how real conversations actually work: the model can issue short preambles like "let me check on that" before responding, it can call multiple tools in parallel while narrating what it's doing ("checking your calendar now"), and it recovers gracefully from awkward situations rather than silently failing. These feel like small details, but they're exactly the kind of friction that makes or breaks a deployed voice product.

The context window expansion from 32K to 128K tokens is significant for anyone building long-running voice agents — things like customer support bots, meeting assistants, or continuous hands-free workflows. Pair that with adjustable reasoning effort (five levels: minimal through xhigh, defaulting to low), and developers now have real controls over the latency-vs-intelligence tradeoff. At minimal reasoning, time-to-first-audio is 1.12 seconds. At high reasoning, it's 2.33 seconds. That's a real engineering dial, not just a marketing bullet point.

The companion models are worth taking seriously too. GPT-Realtime-Translate supports live streaming speech translation from 70+ input languages into 13 output languages — Greg Brockman described this as something OpenAI had been working toward since the company's early days. Vimeo already demoed live dubbing with it, generating translations fully live with no pre-loaded captions. GPT-Realtime-Whisper brings streaming transcription as speech is produced, targeting real-time captions, meeting notes, and continuous speech understanding. Justin Uberti — who's been building demos with these models — described it as "Whisper, but now with realtime streaming," which is a clean summary.

What I'd push back on here: the framing around "GPT-5-class reasoning" deserves scrutiny. The benchmark numbers are strong, but there's a gap between benchmark performance and the messy, ambiguous conversations that real users have. The article notes that ChatGPT Voice Mode itself has not yet been upgraded to use these models — Simon Willison flagged this clearly. So the consumer-facing product is still running on older capabilities while developers get the shiny new API. That asymmetry is worth watching. The real test of how good these models are will come when hundreds of millions of ChatGPT users start using them, not just developers running controlled demos. The question of whether voice as an interface is actually sticky also remains genuinely open — Will Depue's comparison to VR (frequently exciting, historically not sticky) is one I find myself agreeing with more than I'd like to admit.

Key takeaways:

GPT-Realtime-2 is now the top performer on Scale AI's Audio MultiChallenge S2S leaderboard, with instruction retention doubling versus the previous model
Context window expanded from 32K to 128K tokens; max output is 32K tokens; pricing unchanged at $1.15/hour input and $4.61/hour output
Five reasoning effort levels (minimal through xhigh) let developers tune latency vs. intelligence tradeoffs directly
GPT-Realtime-Translate supports live streaming translation from 70+ languages into 13 output languages
GPT-Realtime-Whisper enables streaming transcription as speech is produced, in real time
ChatGPT Voice Mode has not yet received these upgrades — the launch is API-only for now
Enterprise early results are strong: Glean saw 42.9% helpfulness improvement, Genspark reported +26% effective conversation rate

Why do I care: This is the release that makes real-time voice agents a serious engineering discipline rather than a demo novelty. The combination of 128K context, parallel tool calls with audible transparency, interruption recovery, and reasoning effort controls means voice applications now require the same architectural thinking as any stateful real-time system — latency budgets, interruption semantics, conversational memory, failure recovery. If you're building anything in the voice or accessibility space, this is the API to be testing against. The live translation capability specifically is worth prototyping with immediately — the use cases for live multilingual meetings, customer support, and content accessibility are real and the previous barriers were largely technical. My concern is that the engineering bar just got raised significantly: these models reward thoughtful harness design, and teams that treat voice as "just prompt-response with audio" are going to hit walls fast.

[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs