GitHub's Reliability Meltdown: When AI Load Meets 18 Years of Tech Debt

The Pulse: AI Load Breaks GitHub — Why Not Other Vendors?

TLDR: GitHub has been running at 85-90% uptime for months — meaning parts of the platform are down for two to three hours every single day. The culprit is AI agent load that grew 3.5x over two years, colliding with 18 years of accumulated technical and organizational debt, all while the company is mid-migration to Azure.
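To make those uptime figures concrete, here is a minimal sketch of the conversion from an uptime percentage to average daily downtime. It assumes downtime is spread evenly across a 24-hour day, which real outages are not, and the function name is purely illustrative.

```python
def daily_downtime_hours(uptime_percent: float, hours_per_day: float = 24.0) -> float:
    """Average hours of downtime per day implied by an uptime percentage.

    Assumption: downtime is averaged evenly over every day, which is a
    simplification; real outages cluster into incidents.
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return downtime_fraction * hours_per_day


if __name__ == "__main__":
    # The two ends of the 85-90% range quoted in the TLDR.
    for uptime in (90.0, 85.0):
        hours = daily_downtime_hours(uptime)
        print(f"{uptime:.0f}% uptime -> {hours:.1f} hours down per day")
```

Under that even-spread assumption, 90% works out to about 2.4 hours per day and 85% to about 3.6 hours, the same ballpark as the two to three hours cited above.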

Summary: Let's talk about what's actually happening at GitHub, because the story is more interesting than a simple "they got too much traffic" headline. The situation started deteriorating noticeably last month when GitHub's uptime dropped to a single nine — right at 90%. This month it fell further to around 85-86%, which means the platform responsible for hosting most of the world's professional software development was effectively unusable for a meaningful chunk of every working day. And it's not just flakiness: we're talking data integrity failures. On April 23rd, the merge queue's squash merge feature started silently dropping commits when more than one pull request was in a merge group simultaneously. Over two thousand pull requests were affected, and GitHub could offer zero assistance to the teams who had to manually dig through git history to recover lost code. The company's COO tried to minimize the impact by framing it as a tiny percentage of overall merges, which understandably infuriated affected teams. A data integrity failure is categorically different from an availability blip, and the muted executive response felt tone-deaf.

The Elasticsearch outage that followed days later took down pull requests, issues, and projects from GitHub's web UI for six hours. Then more Actions outages. Then a critical security vulnerability disclosed by Wiz that would have allowed a bad actor to access any repository with just a git push command. This is a remarkable sequence of failures in a very short window.

GitHub's CTO, Vlad Fedorov, eventually came forward with an explanation. The charts he shared were unfortunately missing Y-axes, a curious omission that made the load growth look dramatic without conveying the actual magnitude. But Fedorov did provide numbers: load grew roughly 3.5x over the past two years, driven primarily by AI agents hammering the platform. A single agent-driven PR can touch a dozen different subsystems at once: pull requests, workflows, branch protection checks, webhooks, notifications. At scale, queue depths grow, cache misses cascade into database load, and slow dependencies start pulling down adjacent services. The company only started planning for a 10x capacity increase in October 2025, and by February 2026 they realized they actually needed to design for 30x. That gap between when the problem became undeniable and when planning started goes a long way toward explaining the current failures.
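For a rough sense of scale, the 3.5x figure can be turned into an implied yearly growth rate and extrapolated toward the 10x and 30x targets. This is only a sketch under two assumptions the source does not make: that load compounds at a steady rate, and that the 10x and 30x targets are measured against the same baseline as the 3.5x figure. The function names are hypothetical.

```python
import math


def annual_growth_factor(total_growth: float, years: float) -> float:
    """Yearly multiplier implied by a total growth multiple over a period.

    Assumption: growth compounds at a constant rate throughout.
    """
    return total_growth ** (1.0 / years)


def years_to_reach(target_multiple: float, annual_factor: float) -> float:
    """Years of steady compounding needed to reach a target multiple of the baseline."""
    return math.log(target_multiple) / math.log(annual_factor)


if __name__ == "__main__":
    # 3.5x over two years, as reported by GitHub's CTO.
    annual = annual_growth_factor(3.5, 2.0)
    print(f"Implied growth: about {annual:.2f}x per year")

    # Hypothetical extrapolation to the planning targets mentioned above.
    for target in (10.0, 30.0):
        years = years_to_reach(target, annual)
        print(f"{target:.0f}x the baseline after about {years:.1f} years")
```

The implied rate is roughly 1.87x per year. If agent traffic accelerates rather than compounding steadily, the targets arrive sooner than this estimate suggests, which may be part of why the planning figure jumped from 10x to 30x.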

What makes this story genuinely puzzling is the contrast with other infrastructure companies. Vercel, Linear, Railway, Sentry, Resend — they all appear to be handling record growth driven by the same AI wave without the kind of compounding failures GitHub is experiencing. GitHub's direct competitors, GitLab and Bitbucket, presumably face similar load trajectories, but their status pages tell a quieter story. So why is GitHub struggling so uniquely? Several factors compound: GitHub is 18 years old and carries an enormous amount of technical debt in systems that were built before anyone was thinking about horizontally scaling millions of parallel agent workflows. Scaling stateful systems like workflow queues and databases is fundamentally harder than scaling stateless compute. And GitHub has roughly 4,000 employees, only about a quarter of whom are engineers — the kind of organizational overhead that makes "just fix it" a much slower process than at leaner shops. Additionally, the company is in the middle of migrating from its own data centers to Azure, a project that was already expected to take twelve months even under stable load conditions. Doing that migration while load is spiking rapidly is a recipe for exactly the kind of visible failures we're seeing.

Mitchell Hashimoto, co-founder of HashiCorp and creator of Ghostty, kept a daily journal of GitHub outages affecting his work and ended up with an X next to nearly every day for a month. After 18 years on the platform, he announced he was leaving. That framing matters: this isn't just someone being impatient. This is a creator of foundational developer tooling saying the platform has become incompatible with serious work. His complaint is simple and fair: professional tools should help you ship, not block you for hours per day.

Key takeaways:

GitHub's reliability has been at zero to one nines for months, with parts of the service down an average of two to three hours daily over the last ninety days
A data integrity bug in the merge queue silently dropped commits from over two thousand pull requests, and GitHub provided no remediation assistance
AI agent traffic grew load by roughly 3.5x over two years; GitHub only began planning for a 10x capacity increase in October 2025, and now realizes 30x is needed
The combination of 18 years of tech debt, a mid-migration to Azure, and organizational overhead at scale is making GitHub unusually slow to respond compared to smaller, leaner infra competitors
Alternatives like GitLab, Bitbucket, and self-hosted solutions like Forgejo are worth evaluating seriously for teams where GitHub outages are causing real lost productivity

Why do I care: I use GitHub every day, and so does almost every developer I know. But the real concern here isn't just uptime — it's the architectural lesson. GitHub's situation is what the innovator's dilemma looks like for infrastructure: a platform that grew to dominance under one usage model (humans making pull requests) is now straining under a fundamentally different one (AI agents making thousands of automated pull requests per minute). The failure to predict this shift, or to plan capacity for it until too late, is a strategic and engineering planning failure, not just a scaling problem. For anyone building platforms or services today, the question worth asking is: what's your 30x scenario, and are you designing for it now or waiting until you're in crisis mode?
