Long-Running Agents: The Engineering Reality Behind AI That Works for Days

Long-Running Agents: What Changes When AI Works for Days, Not Minutes

TLDR: The familiar single-session AI agent paradigm has a hard ceiling. Long-running agents that span multiple context windows, sandboxes, and sessions require solving three distinct engineering problems: finite context, lack of persistent state, and the model's inability to honestly grade its own work.

Summary: For two years, the mental model of an AI agent was a chat window with a clever loop. You type a goal, watch tokens stream, and eventually the context fills or the agent gives up. That got us surprisingly far, but it hits a ceiling fast. The model forgets. It declares work complete when it isn't. It reintroduces a bug it fixed nine turns ago. The whole structure assumes one sitting.

Long-running agents change that assumption fundamentally. The pitch is simple to state: an agent that makes continuous forward progress across many sessions and sandboxes, possibly over days or weeks, while keeping the workspace organized enough for the next session to resume coherently. The engineering is harder. You have to build a state layer that lives outside the model's context window and design handoffs so the agent doesn't lose orientation when it wakes in a fresh sandbox.

Three problems appear in essentially every serious write-up on this. First, finite context. Even a one-million-token window fills, and context rot, the gradual degradation of model quality as the window approaches its limit, kicks in well before the hard cap. A 24-hour run simply won't fit in any context window currently on the roadmap. Second, no persistent state. Anthropic's framing from their scientific computing post captures it well: imagine a software project staffed by engineers working in shifts, where each new engineer arrives with zero memory of the previous shift. Without an explicit persistence story, every handoff is a productivity disaster. Third, no reliable self-verification. Models skew positive when they evaluate their own work. Ask "are you done?" and the answer is "yes" more often than it should be. Without a separate signal that the work actually meets some bar, you get an agent that ships at 30 percent complete with complete confidence.

One of the simpler practitioner patterns is the Ralph loop, sometimes called the Ralph Wiggum technique, built by Geoffrey Huntley and Ryan Carson. The implementation is literally a bash script. Pick the next unfinished task from a file, build a prompt with the task and relevant context, call the agent, run tests or other checks, append what happened to a progress file, update the task list, repeat. It works for the same reason all the more sophisticated harnesses work: state lives outside the agent's context. The plan file is the spec, the progress file is the lab notes, the AGENTS.md equivalent is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn't.

Building a long-running agent in an evening with a bash script and a JSON file is genuinely possible. What Google and Anthropic have productized is the work of making that same pattern recoverable, secure, and observable at production scale. That's an important distinction to hold onto.

Key takeaways:

Long-running agents require state that lives outside the model's context window, not inside it
The three universal failure modes are context limits, stateless sessions, and model self-grading bias
The Ralph loop pattern proves the concept works with minimal infrastructure
Production readiness means adding recoverability, security, and observability on top

Why do I care: As someone who has watched countless "intelligent" automation projects fail because nobody thought about what happens after session one, these patterns feel like table stakes. The shift to treating the agent as stateless and the filesystem as the source of truth is genuinely correct architecture. What concerns me is that most teams will reach for a managed platform before they understand why the primitives work, and they'll get burned when the platform's memory layer drifts in unexpected directions.

Long-running Agents

Anthropic's Harness Approach: Initializers, Coders, and the Brain/Hands Split

TLDR: Anthropic published two engineering posts that are worth reading end to end. The first describes a two-agent harness for autonomous full-stack development. The second introduces a brain/hands/session architectural separation that cuts time-to-first-token by 60 percent at median and over 90 percent at the 95th percentile.

Summary: Anthropic has been more public about the underlying engineering than most. Their first post on effective harnesses describes a two-agent design for autonomous full-stack development. An initializer agent runs once at the start of a project, expands the prompt into a structured feature list, and writes a boot script that future sessions will run when they start. A coding agent is then invoked repeatedly, each session asked to make incremental progress on one feature, run tests, leave a progress note, and commit. A test ratchet sits in the prompt to prevent the agent from quietly deleting tests to make the suite pass, which is, somewhat embarrassingly, a common failure mode.

The second post, on scaling managed agents, introduces a three-way split between the brain, the hands, and the session log. The brain is the orchestrating model. The hands are the tools and the sandbox that executes code. The session is a persistent, queryable event log that lives outside both. When these are tightly coupled, as they are in naive implementations, any part of the system changing means the whole thing has to change at once. Decoupled, the harness becomes stateless, sandboxes become interchangeable, and a container crash doesn't lose the run. A fresh container calls a wake function with the session ID and reconstitutes state from the log. Their reported improvement, 60 percent reduction in time-to-first-token at median, mostly comes from being able to start inference before the sandbox is even ready.

The session-as-event-log idea is the part most teams underappreciate. Without it, a container failure is a session failure and you're debugging into a stale snapshot. With it, the agent's memory is a queryable artifact that persists independently of whatever process happens to be running. For scientific computing specifically, Anthropic's post on long-running Claude describes a simpler version of the same stack: a living plan file the agent edits as it learns, a changelog as portable lab notes, and a Ralph loop that refuses to let the agent declare completion until a separate check agrees. Their flagship example is a Boltzmann solver Claude Opus built over a few days that reached sub-percent agreement with a reference implementation. Months of researcher time, compressed.

Key takeaways:

The initializer-plus-coder harness externalizes all state into files the agent reads at the start of each session
Separating brain, hands, and session log eliminates single points of failure and dramatically reduces startup latency
A test ratchet in the system prompt is a cheap mitigation for the "delete tests to pass" failure mode
Session-as-event-log makes recovery from arbitrary failures tractable

Why do I care: The brain/hands/session split is the right mental model and I wish it had a better name, because it maps almost perfectly to how you'd design any distributed stateful system. The surprising insight is that the agent itself should be treated as the dumbest, most replaceable component. The state is the product. The model is just the compute that transforms it.

Long-running Agents

Cursor's Planner/Worker/Judge Pattern for Autonomous Coding at Scale

TLDR: Cursor's approach to long-running autonomous coding introduces a three-role pipeline: a planner that defines the work, workers that execute in parallel, and a judge that evaluates completion. The separation of generation from evaluation is what makes the whole system honest.

Summary: Cursor's post on scaling long-running autonomous coding arrives at the same structural conclusion as Anthropic, through a different door. Their planner/worker/judge split is the clearest articulation I've seen of why self-grading is the central problem. The planner writes a task spec with explicit, testable completion criteria. Workers execute in parallel, which is where the interesting efficiency gains come from. The judge evaluates the output independently of the workers that produced it.

This matters because the failure mode Cursor is defending against is the same one Anthropic's test ratchet defends against: the generator and evaluator are the same model, so flattery is baked in. Separate the roles and you get something closer to honest signals. Cursor's use case is IDE-level coding assistance, but the pattern becomes more valuable, not less, when you move to genuinely parallel autonomous work where the failure modes are different and harder to catch.

The planner's task spec is worth dwelling on. Writing down an explicit done-condition before the agent starts is, according to almost every practitioner write-up, the highest-leverage move for long runs. It's an external file with testable criteria, and it exists specifically to prevent the agent from quietly redefining "done" mid-run. This turns out to be harder than it sounds. Defining work crisply enough that an agent can run on it for a day, without human correction, is a skill that requires more precision than most engineers are used to bringing to task descriptions.

The model selection angle in Cursor's design is worth noting. They use different models for different roles, sometimes unconventional choices. The ability to route specific roles to specific models is one argument for self-hosted infrastructure over managed platforms, which tend to tie you to a single provider's model lineup.

Key takeaways:

The planner/worker/judge split separates generation from evaluation at the architectural level
Writing explicit, testable done-conditions before the agent starts is the highest-leverage practice for long runs
Parallel workers require a judge that operates independently, not a self-assessment by the same model
Model selection per role is a legitimate reason to consider self-hosted runtimes

Why do I care: The honest answer is that "separate your evaluator from your generator" is not a new idea. It's what code review is. What's new is that we're rediscovering this as a hard architectural constraint when the generator is a language model that has a documented bias toward saying it's done. I find it mildly humbling that the solution to AI overconfidence is to add a second AI whose only job is to disagree.

Long-running Agents

Google's Agent Platform: The Brain/Hands Split at Infrastructure Scale

TLDR: Google's Vertex AI Agent Platform productizes the same patterns Anthropic and Cursor describe, adding a persistent Memory Bank layer, Agent Identity, Agent Registry, and a full observability stack. The new risk it introduces is memory drift, where agents learn procedural shortcuts from atypical interactions and apply them too broadly.

Summary: Google's Agent Platform takes the same brain/hands/session architecture and turns it into a managed service with substantially more operational surface area. Agent Memory Bank is the persistent long-term memory layer, generally available as of Google Next 2026. It curates memories across sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can retrieve what's relevant without loading an entire history. Payhawk reported that Memory-Bank-backed expense submission cut submission time by over 50 percent.

The platform bundles Agent Sandbox for hardened code execution, Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation. That's essentially every operational concern you'd otherwise build by hand for a production fleet. The cryptographic identity and audit log story is the part enterprises actually need, and it's rarely the part that DIY harnesses get right. If you're already inside Google Cloud, you can wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is largely answered.

The new risk this introduces is memory drift. The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Governing memory banks is not obviously simpler than governing microservices, and the failure mode is more insidious because it looks like the system working correctly. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. The auditing question shifts from "what are my agents doing?" to "what are my agents remembering, and how is that changing their behavior over time?"

The ambient processing use case is worth calling out separately. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision for these is different. You want durable triggering, not just a loop. Cloud Scheduler plus Agent Runtime is one clean answer.

Key takeaways:

Memory Bank solves the persistence problem at platform scale, but introduces memory drift as a new failure mode
Agent Identity and Agent Registry are the production-readiness pieces DIY harnesses consistently skip
Ambient processing agents require durable event-triggered architecture, not just looping execution
The managed stack trades control for out-of-the-box observability and identity management

Why do I care: Memory drift is the failure mode I'm watching most carefully. A model that silently learns a wrong shortcut and applies it consistently is harder to detect than a model that crashes. You get confident, coherent, incorrect behavior, which in a production agent fleet is genuinely dangerous. The identity and audit infrastructure Google is shipping is the right response, but it only helps if teams actually use it to ask the right questions about what their agents are remembering.

Long-running Agents

What to Actually Do This Week: A Practical Decision Tree

TLDR: The author distills the entire landscape into a clear decision tree: if you're extending your IDE workflow, start with Claude Code and commit progress often. If you're building a hosted agent product, pick a managed runtime. If you're doing autonomous operational work, Memory Bank-style persistence is what you need.

Summary: After surveying Anthropic, Google, and Cursor's approaches, the newsletter lands on a genuinely useful decision tree. If you're using AI for engineering work and want longer sessions without rebuilding infrastructure, the right move is to use Claude Code with a good AGENTS.md file and commit progress every meaningful unit of work. That's where most of the leverage is right now and the entry cost is low.

If you're building a product where agents run on behalf of users, don't build the runtime. The three real options today are Google's Agent Platform, Claude Managed Agents, or rolling something on top of ADK or the Anthropic SDK yourself. Managed gets you the brain/hands/session split, observability, identity, and an audit trail. Self-hosted gets you control and the ability to use unconventional models for specific roles. For most teams the right starting point is a managed runtime with custom loop logic on top.

If you're doing something autonomous and operational, monitoring, research sweeps, inbox triage, Memory Bank-style persistence is what you want and it's the part that doesn't exist in Claude Code. The cleanest stack for "agent runs every N hours, accumulates state, alerts on a threshold" is ADK plus Memory Bank plus Cloud Run plus Cloud Scheduler.

A few things matter regardless of which path you take. Write down the done-condition before the agent starts. Separate the evaluator from the generator. Commit state after each meaningful unit of work. Treat credentials as strictly outside the sandbox, not inside it. And recognize that verification and auditing 24 hours of autonomous activity is a real human-time problem that structured artifacts, PRs, commits, test runs, make tractable. Without them, you're scrolling logs and you'll miss what matters.

The newsletter closes with a point I think is underappreciated: defining work crisply enough that an agent can run on it for a day is harder than doing the work yourself. The skill that's appreciating in value isn't writing code. It's writing specs that leave no room for creative reinterpretation.

Key takeaways:

For IDE work: Claude Code with a solid AGENTS.md and frequent commits is the highest-leverage starting point
For agent products: use a managed runtime rather than building your own session and persistence infrastructure
For operational agents: Memory Bank-style persistence plus event-driven scheduling is the right architecture
Writing unambiguous done-conditions is harder than it looks and matters more than any other single practice

Why do I care: The spec-writing observation at the end is the one that will age best. We've spent decades teaching engineers to communicate with compilers. The next competency is communicating with agents that are smart enough to fill gaps creatively but not smart enough to fill them correctly. The precision required is higher, not lower, than what we're used to. That's going to be uncomfortable for a lot of people who are used to handwaving requirements.

Long-running Agents