How OpenAI's Codex Is Built: Inside the Multi-Agent Coding Assistant Used by a Million Developers

Published on 17.02.2026

AI & AGENTS

How Codex Is Built

TLDR: OpenAI's Codex has exploded to over a million weekly users, with usage up five-fold since January. The team chose Rust over TypeScript for the core agent, open-sourced it, and claims that more than ninety percent of the app's own code was generated by Codex itself.

Summary:

Alright, let's dig into this one because it's a fascinating look behind the curtain at OpenAI. Gergely Orosz from The Pragmatic Engineer sat down with three people at OpenAI: Thibault Sottiaux (Tibo), the head of Codex; Shao-Qian Mah, a researcher who trains the models powering it; and Emma Tang, whose data infrastructure team uses Codex heavily but isn't part of the Codex team itself. The result is one of the more revealing pieces about how a major AI coding tool actually gets made.

The origin story goes back to late 2024, when OpenAI declared building an Autonomous Software Engineer a top-line goal for 2025. Greg Brockman and Sam Altman pushed the vision hard. Two teams attacked different angles: Codex Web for async, cloud-based work, and Codex CLI for iterative local development. The CLI launched in April 2025, and Codex in ChatGPT followed in May. Then in early February 2026, the Codex desktop app dropped for macOS, which Sam Altman reportedly calls "the most loved internal product we've ever had." Days later, they shipped GPT-5.3-Codex, described as the first model that helped create itself. That is a wild sentence to write, and I want to push back on it a little. "Helped create itself" is doing a lot of heavy lifting. The model generated code that was reviewed and merged by humans. That is not self-creation. It is assisted development. The framing matters.

Now, the language choice is genuinely interesting. They picked Rust over TypeScript and Go. The reasoning: performance at scale where milliseconds matter, correctness via strong typing and memory safety, and a deliberate culture signal about engineering quality. There was also a practical concern: TypeScript means npm, and npm means a sprawling dependency tree you might not fully understand. With Rust, they keep dependencies minimal and auditable. They even hired the maintainer of Ratatui, the Rust TUI library, to work full-time on the team. The long-term vision includes running Codex on embedded systems, which makes Rust's performance profile appealing. But here is what I think the article dances around: the early performance of Codex with Rust was reportedly less standout than with TypeScript. So they bet on the model catching up. That is a significant gamble that happened to pay off, but it is not presented as one. It is presented as inevitable, which it was not.

The core agent loop is a state machine that orchestrates user input, model inference, and tool calls. Prompt assembly includes system instructions, available tools, MCP servers, images, files, and AGENTS.md contents. Inference streams back reasoning steps, tool calls, or responses. If a tool call fails, the error goes back to the model, which tries to diagnose and retry. Compaction kicks in when the context window fills up, calling a special Responses API endpoint to generate a compressed representation of the conversation history. This avoids the quadratic scaling problem with self-attention. Safety-wise, Codex defaults to sandboxed execution with restricted network and filesystem access, which Tibo admits hurts general adoption but prevents unsafe defaults for less technical users. That is a genuinely responsible stance, and I appreciate it.

The team's development practices are where things get really interesting. Engineers typically run four to eight parallel agents simultaneously handling feature implementation, code review, security audits, and codebase summarization. They have built over a hundred "Agent Skills" internally, essentially task-specific extensions. The "Yeet" skill takes any code change, writes the PR title and description, and creates a draft PR in one step. They trained a bespoke model specifically for AI code review, claiming nine out of ten comments point out valid issues, on par with or slightly better than human reviewers. For non-critical code, AI review alone can greenlight a merge. For core agent code and open source components, human review is mandatory. What is missing from this discussion is the failure mode. When the AI reviewer misses something on that one out of ten, what happens? In non-critical code with no human review, bugs ship. The article does not address this at all.

There is a fascinating meta-circularity here. Codex writes its own code, tests itself using a specific skill, and even debugged its own systems during a team meeting by connecting to logs, SSHing into research dev boxes, and analyzing ML instabilities. It generated a report the team presented on screen. That is genuinely impressive and also a little alarming if you think about it deeply. The team also runs Codex overnight to generate suggested fixes, so every morning engineers wake up to a list of issues and proposed patches. New engineers onboard by pairing with existing team members, observing how they develop with Codex, and are expected to ship to production on day one. All of this is enabled by unlimited Codex usage for employees, which is worth flagging: most companies will not have this luxury, so the practices may not transfer directly.

Key takeaways:

  • Codex has over a million weekly active developers, with usage growing five-fold since January 2026
  • The core agent and CLI are fully open source, written in Rust for performance, correctness, and minimal dependencies
  • The agent loop is a state machine handling prompt assembly, inference, tool calls, and compaction to manage context window limits
  • Over ninety percent of Codex's own code was generated by Codex, similar to what Anthropic reports for Claude Code
  • Engineers on the team run four to eight parallel agents simultaneously, acting as "agent managers" rather than traditional coders
  • AI code review runs automatically on every PR, with a bespoke model achieving roughly ninety percent accuracy on valid issues
  • Non-critical code can merge with AI review only; core code still requires human review
  • AGENTS.md files serve as instructions for AI agents, much like README files for humans, and have become a de facto standard
  • The team structures codebases explicitly to maximize agent success: clear module boundaries, comprehensive tests, and validation instructions

Tradeoffs:

  • Rust over TypeScript trades early productivity and model familiarity for long-term performance, correctness, and dependency control. The bet paid off, but only because the models improved fast enough.
  • Defaulting to sandboxed execution trades adoption and convenience for safety. The team acknowledges this cost explicitly.
  • Allowing non-critical code to merge with AI review only trades thoroughness for velocity. The one-in-ten miss rate is acknowledged but the downstream impact is not explored.
  • Running four to eight parallel agents per engineer trades deep focus on individual changes for broader throughput. The quality implications of context-switching between agent outputs deserve more scrutiny.

How Codex is built

External Links (1)