Agentic Code Review: When the Machine Writes Faster Than Humans Can Read

TLDR: AI coding agents produce code faster than humans can review it, and four independent datasets suggest the gap is real and growing. In the Faros AI data alone, code churn is up 861%, defect rates rose from 9% to 54%, review times are up 441%, and zero-review merges climbed 31%. Osmani argues review is now the most leveraged skill in software and proposes a tiered framework based on blast radius rather than uniform process.

Summary: The essay opens with a fact that deserves to land: senior engineers used to be able to read code faster than junior developers could write it. That relationship was the accidental foundation of code review working at all. An agent breaks it completely. A thousand lines of well-formatted code arrives in less time than it takes to read a paragraph, while a human's reading speed has not changed since we started staring at screens. The constraint moved downstream, to the one step that did not get faster.

The data Osmani assembles is the strongest part of this piece. Faros AI instrumented 22,000 developers across 4,000 teams and tracked what happened as AI adoption grew. Throughput per engineer climbed and merged PRs increased; the upside is real. Then come the rest of the numbers: code churn up 861%, the incidents-to-PR ratio up 242.7%, defect rate rising from 9% to 54%, review duration up 441.5%, zero-review merges up 31.3%. That last figure is the uncomfortable one. Nobody decided to stop reviewing. Reviewers simply could not keep pace with volume, so code began merging unread, and that became normal. Mature, disciplined engineering teams were hit just as hard. Good process did not protect them because the volume arrived faster than any process was built to absorb. CodeRabbit's study of 470 open source PRs found AI-coauthored changes carried roughly 1.7x more issues, with security problems 1.5 to 2x more common. GitClear found AI users producing around 4x the raw output but only 12% more delivered value. The gap between those two numbers is the review problem stated in a single line.

The spectrum argument is where the essay earns its keep. Blast radius, code longevity, and team size determine what "good review" actually means for your situation. A solo developer on a greenfield project with no users has almost none of the concerns driving the Faros numbers. The real danger is the crossing point nobody notices: the moment a project gets users, review's bug-catching role suddenly matters, and its knowledge-sharing role switches on, while teams keep their solo-era habits a few months too long. That is where postmortems happen. The Faros data about mature teams holds at the far end of the spectrum, where an unreviewed change becomes comprehension debt that becomes someone's on-call incident. The point the essay makes well is that most advice in circulation is one position on this spectrum prescribing to another.
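The spectrum can be made concrete with a small routing function. This is a minimal sketch, not Osmani's framework: the `Change` fields, the tier names ("light", "standard", "strict"), and the 12-month threshold are all hypothetical choices made for illustration. The only thing taken from the essay is the idea that blast radius, code longevity, and team size decide how much review a change gets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    # All fields are hypothetical stand-ins for the three factors in the essay.
    has_users: bool                # blast radius: can this break things for real people?
    touches_critical_path: bool    # blast radius: e.g. auth, billing, migrations
    expected_lifetime_months: int  # code longevity
    team_size: int                 # how many people must understand this later


def review_tier(change: Change) -> str:
    """Pick an illustrative review tier; thresholds are assumptions, not data."""
    # Highest blast radius: real users and a critical path get a human gate.
    if change.has_users and change.touches_critical_path:
        return "strict"
    # The "crossing point": users, teammates, or long-lived code switch review on.
    if (
        change.has_users
        or change.team_size > 1
        or change.expected_lifetime_months >= 12
    ):
        return "standard"
    # Solo, greenfield, no users: few of the concerns behind the Faros numbers apply.
    return "light"


if __name__ == "__main__":
    print(review_tier(Change(False, False, 1, 1)))   # solo prototype
    print(review_tier(Change(True, False, 3, 1)))    # solo project that just got users
    print(review_tier(Change(True, True, 120, 8)))   # team on a long-lived system
```

The middle branch is the one worth arguing about on a real team: it encodes the moment a project acquires users or collaborators, which is exactly where the essay says solo-era habits linger too long.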

The AI-reviewing-AI section is one of the most practically useful things in the piece. An engineer ran four AI reviewers in parallel (CodeRabbit, Sentry Seer, Greptile, and Cursor BugBot) across 146 real PRs. Of 617 distinct flagged locations, 93.4% were caught by exactly one of the four tools. Almost none were caught by three, and none at all by all four: the four tools never once converged on the same location. Each was strong at a different class of problem. This is the adversarial review argument demonstrated on a real codebase rather than in a paper. Four copies of one model amount to a single reviewer with a larger invoice. Four genuinely different reviewers surface bugs no single member could find alone. CodeRabbit leads on overall recall, Greptile shows near-zero false positives on correctness and architecture, and Seer performs best on production-failure severity. The right move is not picking the best one; it is running two with deliberately different characters.
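The overlap figure is easy to reproduce on your own PRs. The sketch below assumes each reviewer's findings have already been normalized to `(file, line)` locations; that normalization, the function names, and the toy data are all assumptions for illustration and are not taken from the study. For scale, 93.4% of 617 locations is about 576 flagged by a single tool.

```python
from collections import Counter


def overlap_histogram(findings):
    """Map 'number of tools that flagged a location' -> 'how many locations'.

    `findings` is a dict of tool name -> iterable of hashable locations.
    """
    tools_per_location = Counter()
    for locations in findings.values():
        # set() so one tool flagging the same location twice counts once.
        for location in set(locations):
            tools_per_location[location] += 1
    return Counter(tools_per_location.values())


def unique_share(findings):
    """Fraction of distinct locations flagged by exactly one tool."""
    histogram = overlap_histogram(findings)
    total = sum(histogram.values())
    return histogram[1] / total if total else 0.0


if __name__ == "__main__":
    # Toy data with made-up tool names, not results from the study.
    toy = {
        "reviewer_a": {("x.py", 1), ("x.py", 2)},
        "reviewer_b": {("x.py", 2), ("y.py", 5)},
        "reviewer_c": {("z.py", 9)},
    }
    print(overlap_histogram(toy))
    print(unique_share(toy))
```

A high unique share between two tools is the signal that they have different characters. A low one suggests you are paying twice for the same reviewer.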

The "human moves up a level" conclusion is where Osmani lands, and it is the right read. The agent writes the code. Another agent reviews it. A third judges it. The human stops reviewing every diff and starts owning the things that do not transfer to a model: the judgment of whether this is even the right change to build, the high-blast-radius gates where being wrong is expensive, and the requirements nobody thought to write down. A closed loop of models with correlated blind spots can be both very sure and very wrong, with no human left to tell the difference. The example of Kun Chen shipping 40 PRs a day as a solo ex-Meta L8 is instructive precisely because it is so context-specific: he writes detailed plans up front, the models execute against them for hours, and he maintains an automated review gate before merge. His conditions do not transfer to a team maintaining a decade-old system.

Key takeaways:

Four independent datasets confirm that AI-generated code volume is outpacing human review capacity, with defect rates and review times rising sharply.
Running multiple AI reviewers with different architectures catches dramatically more issues than any single tool: in a four-way comparison, 93.4% of flagged locations were caught by only one of the four tools.
The right amount of review scales with blast radius, code longevity, and team size, not with uniform process applied to every diff.
The missing-intent problem (the reviewer becoming "the first human to ever lay eyes on this code") is solvable by having agents capture their reasoning as a decision log on the PR.
AI review is already doing more reviewing than humans on many codebases; the only decision left is whether teams will be deliberate about it or let it happen by default.

Why do I care: The 441% increase in review time is the number I keep coming back to. That load falls on senior engineers, the people least replaceable and most bottlenecked already. I've seen teams declare velocity wins based on merged PR counts while their senior engineers are drowning in review queues that never shrink. The framing of "review capacity as a real resource to be measured and protected" is the most operationally useful thing in this essay, and it is the thing most engineering leaders are not tracking. I also think the adversarial multi-reviewer finding deserves more attention than it gets. Running two AI reviewers with genuinely different characters is cheap, it surfaces significantly more bugs than either alone, and it still leaves the merge decision with a human. That is a concrete practice teams can adopt tomorrow. The test-changes-first heuristic is also something I apply personally: if an agent rewrites 200 test assertions to match new behavior, that diff needs to be read assertion by assertion, not waved through because the suite is green.
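Tracking review capacity does not need much tooling to start. The following is a minimal sketch under stated assumptions: the `MergedPR` record and its two fields are hypothetical, and you would populate them from whatever your code host exposes. It computes the two signals the Faros figures point at, the share of merges with no human review and the typical time spent in review.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class MergedPR:
    # Hypothetical record; field names are illustrative, not a real API.
    human_approvals: int   # human approvals recorded before merge
    review_hours: float    # hours from review request to merge


def review_capacity_report(prs):
    """Summarize zero-review merges and median review duration."""
    if not prs:
        return {"zero_review_share": 0.0, "median_review_hours": 0.0}
    zero_review = sum(1 for pr in prs if pr.human_approvals == 0)
    return {
        "zero_review_share": zero_review / len(prs),
        "median_review_hours": median(pr.review_hours for pr in prs),
    }


if __name__ == "__main__":
    sample = [
        MergedPR(human_approvals=0, review_hours=1.0),
        MergedPR(human_approvals=2, review_hours=5.0),
        MergedPR(human_approvals=1, review_hours=3.0),
        MergedPR(human_approvals=0, review_hours=0.5),
    ]
    print(review_capacity_report(sample))
```

Watching these two numbers week over week, next to merged PR counts, is one way to see whether a velocity win is being paid for by the review queue.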
