Your Agent's Reasoning Is Fine—Its Memory Isn't: Building Production Engineer Agents with GraphRAG

Your Agent's Reasoning Is Fine—Its Memory Isn't

TLDR: The bottleneck in production incident response isn't fixing things—it's finding the context scattered across Slack threads, dashboards, and documentation. This article presents a comprehensive architecture for a Production Engineer agent that uses GraphRAG to provide structural memory about services, teams, dependencies, and incidents.

The pager goes off at 02:13. One service is down. Then another ten follow. And here's the moment that separates theory from practice: someone pastes a dashboard link into Slack, someone else replies with "Didn't we see this two weeks ago?"—but no ticket is linked, no postmortem is found. The incident feels known, but undocumented.

This is how enterprise systems decay. Not through broken code, but through forgotten understanding.

The author presents a fascinating observation about what actually slows down incident response. Most issues are fixable. What kills teams is figuring out what is really happening and how far it reaches before taking action. The fix itself is often straightforward once you have clarity. Production engineers carry entire system histories in their heads—which alert is real, which needs a manual nudge, where each dependency lies, which rollback will make things worse. This works until it doesn't: when the wrong person is asleep, when someone leaves, when the system grows just large enough that no single mind can hold it all anymore.

The proposed solution is a Production Engineer agent that reacts to alerts by identifying affected services and teams, understanding how issues propagate through the system, and surfacing the context that engineers usually spend hours reconstructing. But here's the key insight: the agent's superpower isn't reasoning—it's GraphRAG. The difference matters because traditional RAG retrieves semantically similar text chunks, while GraphRAG retrieves connected knowledge by traversing relationships in a graph.

The architecture is thoughtfully decomposed into five components: an alerting system (Prometheus firing webhooks), an agent component orchestrating the flow, a GraphRAG component serving as long-term structured memory, MCP servers providing real-time external context, and an observability layer tracking the agent's behavior. The separation of concerns is clean—the graph holds structure and history, while MCP servers provide what's happening right now.

The schema design for the knowledge graph is particularly practical. It models nodes for Services, Teams, Incidents, Runbooks, Docs, and Releases/PRs, connected by relationships like DEPENDS_ON, OWNED_BY, AFFECTED, and RESPONDED_BY. Each node carries a vector embedding derived from its LLM-generated summary. This gives two retrieval modes: semantic search over embeddings to find entry points, and graph traversal to expand through dependencies and ownership.

For architects and technical leads, the maintenance strategy is worth noting: build the graph once by scraping documentation sources, then update daily with scheduled jobs. Real-time data comes through MCP servers. The explicit priority ordering is elegant—MCP servers show current state, graph provides historical patterns, and discrepancies are flagged explicitly in reports. The system prompt even instructs the agent to note when graph data conflicts with fresh MCP data.

What's missing from this otherwise comprehensive treatment is an honest assessment of the bootstrapping problem. Building a useful knowledge graph requires organizational knowledge that's often exactly what's missing—service ownership, dependency maps, and operational context. The article assumes you can "scrape your documentation sources" but in many organizations, that documentation doesn't exist or is hopelessly outdated. The agent might need to discover and build this knowledge over time rather than starting with it.

Your Agent's Reasoning Is Fine—Its Memory Isn't

This article was generated from the Decoding AI Substack newsletter. While I've done my best to capture the essence of this piece, I encourage you to read the original article for the full architecture diagrams and implementation details.

Your Agent's Reasoning Is Fine—Its Memory Isn't

This is how enterprise systems decay. Not through broken code, but through forgotten understanding.