Home
News
Articles
Me
    1. Home
    2. News
    3. Your Agent's Reasoning Is Fine—Its Memory Isn't: Building Production Engineer Agents with GraphRAG

    Your Agent's Reasoning Is Fine—Its Memory Isn't: Building Production Engineer Agents with GraphRAG

    Published on 20.01.2026

    #substack
    #ai
    #agents

    Your Agent's Reasoning Is Fine—Its Memory Isn't

    TLDR: The bottleneck in production incident response isn't fixing things—it's finding the context scattered across Slack threads, dashboards, and documentation. This article presents a comprehensive architecture for a Production Engineer agent that uses GraphRAG to provide structural memory about services, teams, dependencies, and incidents.

    The pager goes off at 02:13. One service is down. Then another ten follow. And here's the moment that separates theory from practice: someone pastes a dashboard link into Slack, someone else replies with "Didn't we see this two weeks ago?"—but no ticket is linked, no postmortem is found. The incident feels known, but undocumented.

    This is how enterprise systems decay. Not through broken code, but through forgotten understanding.

    The author presents a fascinating observation about what actually slows down incident response. Most issues are fixable. What kills teams is figuring out what is really happening and how far it reaches before taking action. The fix itself is often straightforward once you have clarity. Production engineers carry entire system histories in their heads—which alert is real, which needs a manual nudge, where each dependency lies, which rollback will make things worse. This works until it doesn't: when the wrong person is asleep, when someone leaves, when the system grows just large enough that no single mind can hold it all anymore.

    The proposed solution is a Production Engineer agent that reacts to alerts by identifying affected services and teams, understanding how issues propagate through the system, and surfacing the context that engineers usually spend hours reconstructing. But here's the key insight: the agent's superpower isn't reasoning—it's GraphRAG. The difference matters because traditional RAG retrieves semantically similar text chunks, while GraphRAG retrieves connected knowledge by traversing relationships in a graph.

    The architecture is thoughtfully decomposed into five components: an alerting system (Prometheus firing webhooks), an agent component orchestrating the flow, a GraphRAG component serving as long-term structured memory, MCP servers providing real-time external context, and an observability layer tracking the agent's behavior. The separation of concerns is clean—the graph holds structure and history, while MCP servers provide what's happening right now.

    The schema design for the knowledge graph is particularly practical. It models nodes for Services, Teams, Incidents, Runbooks, Docs, and Releases/PRs, connected by relationships like DEPENDS_ON, OWNED_BY, AFFECTED, and RESPONDED_BY. Each node carries a vector embedding derived from its LLM-generated summary. This gives two retrieval modes: semantic search over embeddings to find entry points, and graph traversal to expand through dependencies and ownership.

    For architects and technical leads, the maintenance strategy is worth noting: build the graph once by scraping documentation sources, then update daily with scheduled jobs. Real-time data comes through MCP servers. The explicit priority ordering is elegant—MCP servers show current state, graph provides historical patterns, and discrepancies are flagged explicitly in reports. The system prompt even instructs the agent to note when graph data conflicts with fresh MCP data.

    What's missing from this otherwise comprehensive treatment is an honest assessment of the bootstrapping problem. Building a useful knowledge graph requires organizational knowledge that's often exactly what's missing—service ownership, dependency maps, and operational context. The article assumes you can "scrape your documentation sources" but in many organizations, that documentation doesn't exist or is hopelessly outdated. The agent might need to discover and build this knowledge over time rather than starting with it.

    Key takeaways:

    • The incident response bottleneck is context retrieval, not problem-solving ability
    • GraphRAG provides structural memory that traditional vector search cannot—traversing relationships instead of just finding similar text
    • Separate historical/structural knowledge (graph) from real-time state (MCP servers) with clear priority ordering
    • Daily graph updates are sufficient for production topology; real-time data comes through MCP layer
    • Make the agent explicit about data source conflicts and discrepancies

    Tradeoffs:

    • Graph databases provide relational querying power but require upfront schema design and ongoing maintenance
    • Explicit agent orchestration gives predictable debugging but sacrifices the flexibility of framework abstractions
    • Daily batch updates simplify operations but introduce potential staleness for fast-changing systems

    Link: Your Agent's Reasoning Is Fine—Its Memory Isn't


    This article was generated from the Decoding AI Substack newsletter. While I've done my best to capture the essence of this piece, I encourage you to read the original article for the full architecture diagrams and implementation details.

    External Links (1)

    Your Agent's Reasoning Is Fine—Its Memory Isn't

    decodingai.com

    Sign in to bookmark these links
    Previous
    This Week In React #265: Agent Skills, Firefox 147, View Transitions, Node.js 25.4, React Native Harmony
    Next
    AI Report Nuggets and Commentary: The Uncomfortable Truths of Early 2026
    Grzegorz Motyl

    © 2026 Grzegorz Motyl. Raising the bar of professional software development.

    GitHubTwitterEmail