Agent Infrastructure and Production RAG
Published on 07.05.2026
Agentic Harness Engineering
TLDR: A raw LLM is just a next-token predictor. To turn it into a reliable agent you wrap it in a harness that controls execution, manages memory, and enforces safety boundaries.
Summary: Most teams discover the hard way that prompting alone does not produce a reliable agent. The model needs scaffolding around it, and that scaffolding has a name now: the harness. The harness is the layer that decides what tools the model can call, how it remembers prior steps, what counts as a safe action, and when execution should stop. Without it, you get a clever autocomplete that goes off the rails the moment the task gets longer than a few turns.
The harness is also where most of the engineering effort actually lives. Picking the right model is increasingly a small decision compared to wiring up the runtime around it. Tool routing, retries, partial failures, structured memory, output validation, none of this is glamorous, and none of it is optional if you want the agent to behave on day thirty rather than just during the demo.
I think the framing matters because it pushes the conversation away from prompt tricks and toward systems design. You are building a small operating system for the model, with policies, schedulers, and guardrails. Treat it like that and the failure modes become familiar instead of mysterious.
Key takeaways:
- An LLM alone is not an agent; the harness provides execution control, memory, and safety
- Most agent reliability problems are infrastructure problems, not prompt problems
- Designing a harness is closer to systems engineering than to prompt engineering
Why do I care: As a senior frontend dev shipping AI features, I have stopped treating the model as the product. The harness is the product. If I cannot describe the tool boundary, the memory store, and the stop conditions on a whiteboard, I have no business deploying the thing to a real user.
Recursive Language Models
TLDR: When agents need to spawn sub-agents or break work down recursively, you need explicit design patterns to keep context coherent. Otherwise sub-tasks blow past the parent's intent and lose state.
Summary: Recursion sounds simple in code and gets ugly fast in agent land. The parent agent has a goal, it decomposes that goal into sub-tasks, and now each sub-agent has its own context window, its own memory, and its own opinions about what success looks like. Without a deliberate pattern for passing intent down and folding results back up, you end up with sub-agents solving problems the parent never asked about.
The patterns described here are about scoping. Each child gets a tightly defined slice of the problem, a budget, and a contract for what it returns. The parent stays the orchestrator and does not delegate judgment, only execution. That separation is what keeps the tree from collapsing into noise.
I have seen plenty of "let agents call agents" demos, and they almost always look great until the second hop. The interesting work is in the plumbing: how you serialize state at the boundary, how you bound recursion depth, how you let a child fail without killing the parent. Get that right and recursion becomes useful instead of decorative.
Key takeaways:
- Sub-agents need bounded scope, budgets, and clear return contracts
- The parent agent should orchestrate, not delegate judgment
- Boundary serialization and depth limits are what make recursion safe in practice
Why do I care: I am increasingly composing tool-using agents inside larger workflows. Treating each sub-agent as a function with a typed input and output, rather than a free-form collaborator, is the only way I have made these flows debuggable.
RALPH Loops
TLDR: RALPH loops are explicit execution patterns that force agents to reason reliably, finish tasks, and stop when they should. They are the antidote to wandering, never-terminating agents.
Summary: Anyone who has watched an agent loop forever, oscillating between two near-identical tool calls, knows the failure mode. The model wants to keep going because every step locally looks reasonable, but globally it has lost the plot. RALPH loops are a structured way to fix that, by making the reasoning step, the action step, and the termination check first-class parts of the runtime.
The loop forces the agent to reflect on progress at every iteration, not just on the next move. That sounds expensive, and it is, but the cost is what buys you stop conditions that actually fire. You trade a few extra tokens per turn for an agent that does not run for an hour and burn fifty dollars writing the same paragraph.
What I like about this framing is that it is opinionated. Most agent loops in the wild are ad hoc. Naming the pattern and prescribing the steps gives teams something to argue about and improve, rather than each engineer reinventing a slightly worse version every Monday.
Key takeaways:
- Most agent failures are termination failures, not reasoning failures
- Forcing a reflection step at every iteration is what makes stop conditions reliable
- Named, opinionated patterns beat ad hoc loops you wrote on Tuesday
Why do I care: Termination is the boring problem nobody talks about until production. I have spent more time on stop conditions and budget enforcement in agent code than on the actual reasoning logic. Patterns like this turn that into a checklist instead of a daily fire drill.
How I Automated 91% of My Business
TLDR: A transparent walkthrough of the workflows, agents, and pipelines used to run most of a one-person business on autopilot. The interesting bits are the seams between human and machine, not the models themselves.
Summary: Automation pieces written by founders tend to fall into two camps: vague and self-congratulatory, or so detailed they read like a Zapier manual. This one leans toward the detailed end, walking through the actual pipelines that handle content, ops, and customer touchpoints. The headline number, ninety-one percent, is less interesting than where the remaining nine percent sits, which is exactly at the points where judgment and reputation are on the line.
The pattern that keeps showing up is "agent drafts, human approves." Drafts are cheap, approvals are not, and pushing the boundary of what an agent can confidently submit without review is where the leverage lives. The author is essentially running a content factory with a quality gate, and the gate is the founder.
What stands out is how unremarkable the individual tools are. There is no exotic stack here, just a careful composition of off-the-shelf parts with strong opinions about handoffs. That is probably the most useful lesson, automation wins come from connecting boring tools well, not from finding a magic one.
Key takeaways:
- "Agent drafts, human approves" is the workhorse pattern for solo automation
- The unautomated nine percent is usually where reputation lives
- Most leverage comes from boring tools wired together with clear handoffs
Why do I care: I keep getting asked by clients to "use AI to automate this." The honest answer is usually that the model is the easy part, and the workflow design is where the work is. Examples like this give me a concrete reference instead of waving at vibes.
How I Automated 91% of My Business
The Ultimate LLM Knowledge Base
TLDR: A personal "second brain" built by stitching Obsidian, Readwise, and NotebookLM into a retrieval system that synthesizes information instead of merely storing it.
Summary: Every knowledge worker eventually accumulates a graveyard of saved articles, notes, and highlights that they never look at again. The promise of an LLM-backed second brain is that the pile becomes searchable in a way that feels like asking a colleague rather than grepping a folder. The setup described here uses Obsidian for structured notes, Readwise for capture, and NotebookLM for the retrieval and synthesis layer.
The interesting choice is treating NotebookLM as the synthesis engine rather than building a custom RAG stack. It saves a tremendous amount of plumbing, and it forces the user to lean on document quality rather than embedding tricks. Garbage in, garbage out is still the rule, even with a fancy summarizer on top.
I am skeptical that any single tool combination is "ultimate," but the underlying principle is sound. Capture once, structure incrementally, and put a synthesis layer on top so the system answers questions instead of just remembering them. That is a workflow worth copying even if you swap out every tool.
Key takeaways:
- A second brain is only useful if it synthesizes, not just stores
- Off-the-shelf tools like NotebookLM remove most of the RAG plumbing for personal use
- Document quality matters more than retrieval cleverness
Why do I care: I run a similar setup with different tools, and I keep relearning that the synthesis layer is what makes it stick. If I cannot ask my notes a question and get a paragraph back, I will not use them, and they will rot.
The Ultimate LLM Knowledge Base
Agentic GraphRAG
TLDR: Standard RAG hits walls around context rot, data fragmentation, and unified memory. GraphRAG addresses these by treating retrieval as a data modeling problem with an explicit ontology, not a pure algorithm problem.
Summary: Plain vector RAG looks great in a demo and starts cracking under real workloads. As the context window fills up, the signal-to-noise ratio drops and the model starts hallucinating from its own retrieved chunks. Data lives in too many silos to be neatly chunked, and tracking how facts evolve over time is something a flat embedding store was never designed to do. GraphRAG steps in by modeling entities and relationships explicitly.
The argument the author makes is that this is a data modeling problem. You sit down and define an ontology: what entities exist, how they relate, how facts are versioned. Then retrieval becomes graph traversal informed by semantic search, rather than a hopeful nearest-neighbor lookup. The complexity is real, but it is the kind of complexity software engineers already know how to manage.
I think the deeper point is that retrieval is finally being treated like the rest of the data layer. We learned this lesson with relational databases decades ago, and we are now relearning it for AI memory. If your data has structure, model it. The fact that the consumer is an LLM does not change that.
Key takeaways:
- Plain RAG breaks on context rot, fragmented data, and time-evolving facts
- GraphRAG reframes retrieval as a data modeling problem requiring an ontology
- Graph traversal plus semantic search beats blind nearest-neighbor lookup at scale
Why do I care: I have been burned by RAG demos that look great on five documents and fall apart on five thousand. Modeling entities explicitly, even crudely, has consistently been worth the effort, and this article finally puts a name on what I have been hacking together.
Opik Case Studies: Shipping RAG with Weave CLI
TLDR: A walkthrough with Michael Maximilien on building Weave CLI and shipping production RAG to enterprise customers. The headline finding: a small open-source embedding model ranked around #130 on MTEB beat OpenAI by 11% on real customer data, because of a continuous stitch-evaluate-iterate loop.
Summary: Leaderboards are seductive and mostly useless if you are shipping to actual customers. Maximilien spent a year learning that the hard way, and the punchline is that a tiny open-source embedder outperformed a name-brand commercial one by eleven percent on real-world tasks. The reason was not the model, it was the loop. They stitched components together, evaluated against customer data, iterated, and repeated.
The case study walks through using Weave CLI to ship a production-ready RAG pipeline with tracing, observation, and evaluation built in from the start. The pitch is straightforward: do not deploy blind. If you cannot trace a request through retrieval and generation, and if you cannot evaluate the result against your actual data, you are not running a system, you are running a hope.
I appreciate how anti-hype this take is. The interesting part of production RAG is not the embedding model du jour, it is the boring observability layer. Build that first and your model choices become testable hypotheses instead of religious arguments.
Key takeaways:
- MTEB rankings often do not survive contact with your customer data
- A tight stitch-evaluate-iterate loop beats picking the "best" model on paper
- Tracing and evaluation should be in the pipeline from day one, not bolted on later
Why do I care: Every RAG project I have shipped has lived or died on observability. If I cannot see what was retrieved, why, and how the model used it, I cannot debug regressions, and the whole thing becomes a black box my team is afraid to touch.