How Kilo Engineers Actually Run Parallel AI Agents Without Drowning

How 7 Kilo Code Engineers Run Up to 20 Parallel Agents and Still Ship Clean Code

TLDR: Kilo asked seven of its senior engineers how they actually run parallel coding agents day to day. The honest answer is far less dramatic than the social media flex: two to four agents in the foreground, a handful of fire-and-forget agents in the background, and a strong verification loop on top of everything. The real bottleneck has moved from writing code to checking it.

Summary: The piece opens by setting up two camps. On one side you have people like Steve Yegge, who say you can train yourself to hand-manage ten or more agents at once. On the other you have Armin Ronacher, the creator of Flask, who describes watching someone at three in the morning running their tenth parallel session and seeing not productivity but a person who probably needs to step away from the machine. Kilo decided to skip the theater and just ask its own engineers, the kind of people who not only use coding agents but build them. Mark has spent over a year on the VS Code extension and rebuilt the engine behind it. Evgeny shipped a deploy agent in a week. John is building a hosted orchestrator meant to run twenty-plus agents at once. Igor, Florian, Kirill and Imanol round out the group, with backgrounds at Netflix, Netlify, JetBrains and Rackspace.

The conversation started with Evgeny being genuinely puzzled by posts claiming people run fifty to a hundred agents at once and stay productive. John's view is blunt: it is next to impossible to keep twenty agents working in the background and pay equal attention to all of them, let alone a hundred. So they talked about the actual sweet spot, and the number that came back was small. Most of the engineers keep two to four foreground agents going at a time. The word foreground matters here. These are the agents you actively watch, review, steer and interrupt when they start drifting. Practitioners outside Kilo report similar or smaller numbers: Simon Willison says he focuses on reviewing and landing one significant change at a time, and Mitchell Hashimoto, the co-founder of HashiCorp, calls himself "the mayor," managing at most two agents. Igor sits firmly in the two-to-three club because, in his experience, output quality on complex tasks drops fast unless you stay close, and the higher the stakes the less you can treat the agent as a black box.

The hundred-agent claims do not survive contact with the definition of "running." Mark, who genuinely does run more than twenty agents, runs only one to three in the foreground. The rest are background jobs that do not demand attention. He might spin up a cloud agent to fix change markers in a file while he keeps working on a larger merge in the foreground. Each background agent ships a pull request, the tests pass or fail, and Mark looks at the result later. The agent only asks for his attention once, when it finishes and he decides to merge or reject. This is becoming common practice. Addy Osmani at Google runs four or five background agents on low-to-medium complexity work. Mark uses a gardening metaphor: instead of doing the grunt work yourself, you apply judgment, tell the agent to take care of something, then come back later to prune. The agents do the building, you do the gardening.
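To make the fire-and-forget pattern concrete, here is a minimal Python sketch of it. It is an illustration under stated assumptions, not Kilo's implementation: `run_background_agent`, `launch_background_agents` and `ready_for_review` are hypothetical names, and the agent itself is a stub that pretends to open a pull request and run tests.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AgentResult:
    task: str
    tests_passed: bool


async def run_background_agent(task: str) -> AgentResult:
    # Hypothetical stand-in for a cloud agent that opens a PR and runs tests.
    # A real agent would call out to a service here; this stub just yields
    # control and fakes a test result so the sketch stays runnable.
    await asyncio.sleep(0)
    return AgentResult(task=task, tests_passed="flaky" not in task)


async def launch_background_agents(tasks: list[str]) -> list[AgentResult]:
    # Start every agent at once; nobody watches them while they run.
    jobs = [asyncio.create_task(run_background_agent(t)) for t in tasks]
    # Foreground work (the one to three agents you actively steer) would
    # happen here, before the results are collected.
    return await asyncio.gather(*jobs)


def ready_for_review(results: list[AgentResult]) -> list[str]:
    # The single point of human attention: only green PRs reach the reviewer.
    return [r.task for r in results if r.tests_passed]
```

Calling `asyncio.run(launch_background_agents([...]))` and passing the result to `ready_for_review` gives the list of tasks whose PRs are worth a look. The human shows up once per agent, at the end, which is the gardening step.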

Then the article gets into task sizing, which is the part I found most useful. Too small and prompting is slower than just writing the code yourself. Too large and the agent runs out of context and becomes confidently wrong. Igor's rule of thumb is that quality usually drops around sixty percent context fill, well before the ninety-five percent mark where compaction kicks in. By the time auto-compaction runs, hallucinations have already started. The community calls this context rot, and an Anthropic employee wrote a whole piece warning about bad compaction when a window grows too large. Igor's fix is to split work across smaller sub-agents so each finishes before its context fills. The practical heuristic the piece offers is to size a task by reviewability: if the diff is too big to inspect carefully in one sitting, the task was too big. And do not mix task types in one run. "Refactor this service, improve performance, add analytics, and clean up tests" is four tasks, not one.
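The sixty percent rule of thumb can be treated as a budget check. The Python sketch below is only an illustration: the sixty percent figure comes from the article, while the function names, the token estimates and the 200,000-token window in the usage note are assumptions made for the example.

```python
QUALITY_DROP_PERCENT = 60  # Igor's rule of thumb, as reported in the article


def usable_budget(window_tokens: int) -> int:
    # Tokens available before quality is expected to degrade.
    # Integer arithmetic avoids floating-point rounding surprises.
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return window_tokens * QUALITY_DROP_PERCENT // 100


def should_split(estimated_tokens: int, window_tokens: int) -> bool:
    # True when one agent would be pushed past the quality-drop threshold.
    return estimated_tokens > usable_budget(window_tokens)


def sub_agents_needed(estimated_tokens: int, window_tokens: int) -> int:
    # Smallest number of sub-agents that keeps each one within the budget,
    # assuming the work divides evenly (a simplification).
    budget = usable_budget(window_tokens)
    return max(1, -(-estimated_tokens // budget))  # ceiling division
```

With an assumed 200,000-token window the budget is 120,000 tokens, so a task estimated at 300,000 tokens would be split across three sub-agents, while a 50,000-token task stays with one. Estimating tokens up front is the hard part in practice; the sketch only shows where the threshold sits relative to the advertised window size.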

The other strong thread is the plan-then-execute split. Florian uses research agents to prepare plans so he ends up with a queue of pre-investigated problems ready for execution agents. Imanol tried GPT-5.5 in fast mode and found it quick but error-prone, so he settled on a slow thinking model for planning and the fast model for implementation, and called the result "so much better." Kirill pairs GPT-5.5 with thinking enabled for planning, then switches to Sonnet for fast execution. Boris Cherny, the creator of Claude Code, does the same thing: iterate on a plan, then let the model one-shot the implementation.

Finally there is verification, which the article treats as the real unlock. Florian has an agent that reads PR review comments, makes the changes, and reports back, and that saved time compounds across hundreds of PRs. Boris Cherny says giving an agent the ability to verify its own work makes the final result two to three times better, and a fresh agent reviews better than the original because the person who wrote the code is the worst person to review it. The closing line lands: we are not in the age of agent psychosis, we are in the age where the bottleneck shifted from writing code to verifying it.
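The plan-then-execute split and the fresh-reviewer idea fit into one small pipeline. The Python sketch below assumes each model is just a function from prompt to text; `Model`, `RunResult`, `plan_execute_verify` and the APPROVE/REJECT convention are hypothetical names invented for this example, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a model takes a prompt and returns text.
Model = Callable[[str], str]


@dataclass
class RunResult:
    plan: str
    diff: str
    approved: bool


def plan_execute_verify(
    task: str, planner: Model, executor: Model, reviewer: Model
) -> RunResult:
    # Slow, thinking model: produce the plan first.
    plan = planner(f"Write a step-by-step plan for: {task}")
    # Fast model: implement the plan in one shot.
    diff = executor(f"Implement this plan exactly:\n{plan}")
    # Fresh agent: it sees only the task and the diff, never the
    # executor's conversation, so the author is not grading its own work.
    verdict = reviewer(
        f"Task: {task}\nReview this diff. Reply APPROVE or REJECT.\n{diff}"
    )
    approved = verdict.strip().upper().startswith("APPROVE")
    return RunResult(plan=plan, diff=diff, approved=approved)
```

Keeping the three roles as separate parameters makes it easy to swap a slow model in for planning and a fast one for execution, and it keeps the reviewer's context free of the executor's conversation.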

Key takeaways:

The realistic setup is two to four foreground agents you actively manage, plus a handful of fire-and-forget background agents that each produce a reviewable PR.
Quality degrades around sixty percent context fill, long before compaction triggers, so split work into sub-agents that finish before context rot sets in.
Size each task by whether a human can review the diff in one sitting, and keep one task type per run instead of bundling refactor, performance and tests together.
Separate planning and execution: a slow thinking model writes the plan, a fast model implements it.
A separate verification agent reviewing the work matters more than agent count, because the author is the worst reviewer of their own code.

Why do I care: If you lead a team that is about to be pressured into "everyone should run ten agents," this piece is the calm counterargument you can hand your manager. The honest number is two to four foreground agents, and the engineers who built these tools agree on that. What actually scales is not concurrency but the discipline around it: scope tasks to a reviewable diff, watch the context window instead of trusting the advertised size, and put a fresh agent on review duty so the original author is not grading their own homework. For architects the context-rot point is the one to internalize, because it reframes prompt and task design as a budgeting problem rather than a "bigger window is better" problem. The gardening metaphor is nicer than most, but the load-bearing idea is that your review and verification pipeline is now the constraint, so that is where the investment should go.
