How to (Almost) Fry Your AI Agent and Your Mac Mini: Six Hard Lessons

How to (Almost) Fry Your AI Agent (and Your Mac Mini)

TLDR: Running a 35B local model as a third agentic harness on a 16GB Mac Mini that already hosts two heavy cloud-agent harnesses and a dozen background daemons will bring the machine to its knees. This post is an honest catalog of six mistakes from building a real, production-ish personal AI agent stack, and what actually fixed them.

Summary:

The author runs what is, by most standards, a genuinely sophisticated personal AI agent stack on a Mac Mini M4 with 16GB of RAM. The stack combines Claude Code with Opus and Sonnet, Codex with GPT-5.4 and 5.5, and local models via Ollama and llama.cpp. On paper, it sounds like a lot. In practice, it had been working for months, handling research, content drafting, iMessage triage, Discord bots, cron orchestration, and a handful of quiet business automations.

The first and most expensive mistake was trying to add a 35B local model as a third agentic harness. The model had been running under llama.cpp with a memory-mapped flag that keeps most weights on SSD, paging them in during inference. That trick works. On a quiet machine. The Mac Mini was not a quiet machine. Between the Claude Code and Codex harnesses (which run locally in terms of context management, file watching, repo indexing, and subprocess spawning even if the model itself is remote), the headless desktop session required for AppleScript bridges and vision passes, and the long tail of cron jobs, iMessage watchers, Discord listeners, and memory consolidation loops, the disk and CPU were already spoken for. Layering a 35B model doing continuous agentic loops on top of that created a perfect storm where every process was starving every other process. The machine started restarting silently. The fix was to remove the 35B daemon entirely and route local inference only to Qwen 9B and 4B models served by Ollama, which stay inside Metal GPU memory and evict cleanly. The lesson, put plainly: treat your machine's resource budget the way you treat a short to-do list. If adding a new always-on layer would push you below your floor under realistic concurrent load, it does not go on.
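The budget rule in the paragraph above can be sketched as a simple admission check. This is a minimal illustration, not the author's tooling: the Layer type, the fits_budget function, and every number in the usage lines are hypothetical placeholders, and it models memory only, although the same check would apply to disk I/O and CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One always-on workload and its assumed peak memory use."""
    name: str
    peak_ram_gb: float  # hypothetical estimate under realistic concurrent load


def fits_budget(total_gb, floor_gb, running, candidate):
    """Return True if admitting `candidate` still leaves `floor_gb` free at peak."""
    used = sum(layer.peak_ram_gb for layer in running)
    return total_gb - used - candidate.peak_ram_gb >= floor_gb


# Illustrative numbers only; these are not measurements from the post.
running = [Layer("cloud harnesses", 6.0), Layer("daemons", 4.0)]
print(fits_budget(16.0, 2.0, running, Layer("large local model", 8.0)))
print(fits_budget(16.0, 2.0, running, Layer("small local model", 4.0)))
```

The point of the design is that the check uses peak figures for everything already running, so a layer that only fits on a quiet machine is rejected.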

The second mistake was a configuration migration that was only half done. After a benchmark comparison between Gemma 4 and Qwen 3.5, the author swapped Gemma in as the primary local model tier. The LiteLLM-routed paths got updated. The smoke tests passed. The documentation got updated. The agent's memory got updated. And then, quietly, the iMessage triage script, a couple of cron jobs, the embeddings helper, and the local-fallback chain kept calling Qwen directly via hardcoded URLs. The Gemma weights sat on disk for weeks, 17GB on a tight drive, serving zero tokens. The fix is two habits: after any config swap, grep the entire repo for the old endpoint and model names, and run a daily audit that compares which models live processes actually call against what the agent's memory says is in production. Documentation drifts faster than systems. Migrations are almost never done when you think they are.
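The "grep the entire repo" habit can be sketched in a few lines. This is an assumption-laden illustration rather than the author's script: find_stale_references, the suffix list, and the needle strings in the usage line are all hypothetical and would need to match the real old endpoint and model names.

```python
from pathlib import Path

# Hypothetical set of file types worth scanning; adjust to the repo.
TEXT_SUFFIXES = (".py", ".sh", ".yaml", ".yml", ".json", ".toml", ".plist")


def find_stale_references(root, needles, suffixes=TEXT_SUFFIXES):
    """Return (path, line_number, needle) for every line that mentions an old name."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the audit
        for lineno, line in enumerate(text.splitlines(), start=1):
            for needle in needles:
                if needle in line:
                    hits.append((str(path), lineno, needle))
    return hits


# Hypothetical needles: the old model tag and a hardcoded local endpoint.
for hit in find_stale_references(".", ["old-model-name", "localhost:11434"]):
    print(hit)
```

A non-empty result after a migration is the signal that something still calls the old path directly.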

The third mistake was a subtle state management bug that caused three separate agent sessions to each rotate the same Stripe API key within six hours of each other. The visible behavior was alarming. The actual cause was narrow: a state write that was supposed to mark a key rotation as complete updated the visible task board but not the internal intents memory. The daily shift read the intents memory, saw the rotation still pending, and queued it again. A separate iMessage wake handler read the same stale intents and queued it a third time. The fix is transactional state writes where a task that lives in two places must close in two places atomically, plus a daily shift that cross-checks intents memory against the visible task board before acting. The broader rule is one worth keeping: any autonomous loop that reads stored intents should treat tasks older than a week as suspicious by default, because the world has probably already handled them.
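One way to get the "close in two places atomically" behavior is to keep both surfaces in one transactional store. The sketch below uses SQLite as a stand-in; the post does not say how the author's task board or intents memory are stored, so the table names, columns, and function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE task_board (task_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE intents (task_id TEXT PRIMARY KEY, status TEXT NOT NULL,
                      created_at TEXT NOT NULL);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_task(conn, task_id, created_at):
    with conn:  # commits on success, rolls back on any exception
        conn.execute("INSERT INTO task_board VALUES (?, 'pending')", (task_id,))
        conn.execute("INSERT INTO intents VALUES (?, 'pending', ?)",
                     (task_id, created_at.isoformat()))


def close_task(conn, task_id):
    """Mark the task done on both surfaces, or on neither."""
    with conn:
        for table in ("task_board", "intents"):
            cur = conn.execute(
                f"UPDATE {table} SET status = 'done' WHERE task_id = ?", (task_id,))
            if cur.rowcount != 1:
                raise LookupError(f"{task_id} missing from {table}")


def actionable_intents(conn, now, max_age=timedelta(days=7)):
    """Split pending intents into (ready, suspicious) before acting on them."""
    rows = conn.execute(
        "SELECT i.task_id, i.created_at, b.status FROM intents i "
        "LEFT JOIN task_board b ON b.task_id = i.task_id "
        "WHERE i.status = 'pending' ORDER BY i.task_id").fetchall()
    ready, suspicious = [], []
    for task_id, created_at, board_status in rows:
        too_old = now - datetime.fromisoformat(created_at) > max_age
        # Disagreement with the board, or age over a week, means "review first".
        (suspicious if board_status != "pending" or too_old else ready).append(task_id)
    return ready, suspicious


conn = open_store()
now = datetime.now(timezone.utc)
add_task(conn, "rotate-key", now)
close_task(conn, "rotate-key")
print(actionable_intents(conn, now))
```

If the two surfaces live in different systems, a single database transaction is not available, and the cross-check in actionable_intents becomes the main line of defense.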

The fourth mistake lived in a model switcher that routed work between Claude and Codex. The Codex path could hang silently, producing no output for 26 minutes before an outer timeout killed it. The Claude fallback was wired to recognizable failure signatures like OAuth errors and network errors. A blind hang does not match those signatures, so the fallback never fired. The fix was a three-second preflight: a cheap, hard-bounded health check before committing to the expensive path. That three seconds versus a thirty-minute wake budget is a 600x safety margin. The general principle is sharp and transfers broadly: any time a small decision layer routes to a larger processing layer, the small layer must fail fast enough that a silent hang in the large layer does not simply swallow the fallback window whole.
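A hard-bounded preflight of this kind can be sketched with a subprocess timeout. This is an illustration under assumptions: the post does not show the author's health check, so preflight, choose_path, and the health-check command in the usage lines are hypothetical stand-ins.

```python
import subprocess
import sys


def preflight(cmd, timeout_s=3.0):
    """Return True only if `cmd` exits with status 0 within `timeout_s` seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # a hang or a missing binary both count as unhealthy
    return result.returncode == 0


def choose_path(health_cmd, primary="codex", fallback="claude", timeout_s=3.0):
    """Commit to the expensive primary path only after a cheap check passes."""
    return primary if preflight(health_cmd, timeout_s) else fallback


# Hypothetical health check: a trivial command standing in for a real probe.
print(choose_path([sys.executable, "-c", "pass"]))
```

The key property is that a silent hang is converted into an explicit failure, so it no longer needs to match a known error signature for the fallback to fire.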

The fifth mistake was a shell command allowlist that did prefix matching. The run_command tool checked whether a command started with an allowlisted binary name. A command beginning with curl followed by a semicolon and something destructive would have passed. The fix is a forbidden-metacharacter check on top of the prefix check, rejecting anything containing semicolons, double ampersands, pipes, backticks, subshell syntax, redirects, or newlines before the command reaches shell execution. The rule that follows from this: a prefix-based allowlist is a polite suggestion. Real safety means parsing what would actually execute and checking every piece of it, not just the first word.
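The layered check described above might look like the following. It is a sketch, not the author's run_command code: ALLOWED_BINARIES and is_command_allowed are hypothetical names, the allowlist contents are placeholders, and it matches the first parsed token exactly rather than by string prefix.

```python
import shlex

ALLOWED_BINARIES = {"curl", "ls", "cat"}  # hypothetical allowlist

# Metacharacters named in the post: semicolons, double ampersands, pipes,
# backticks, subshell syntax, redirects, and newlines.
FORBIDDEN = (";", "&&", "|", "`", "$(", ">", "<", "\n")


def is_command_allowed(command):
    """Reject chaining and redirection first, then check the binary itself."""
    if any(token in command for token in FORBIDDEN):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    # Exact match on the parsed first token, so "curlx" does not pass as "curl".
    return bool(argv) and argv[0] in ALLOWED_BINARIES


print(is_command_allowed("curl https://example.com"))
print(is_command_allowed("curl https://example.com; rm -rf /tmp/x"))
```

A single `&` is also a shell control operator, so a stricter variant could forbid it as well, at the cost of rejecting URLs with query strings. Executing the parsed argument list directly, without a shell, removes this class of problem more thoroughly than any character blacklist.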

The sixth mistake is the most interesting one from an agent design perspective. The agent was replying to voice memos with "sorry, I cannot transcribe audio from this channel" because a broken model ID in the transcription tool caused it to fail silently. The agent was replying to requests about livestreams with "I cannot watch live streams" when a screenshot plus a vision pass would have worked. The fix was doctrinal: rewriting the agent's identity file to treat "find a way" as the default and "I cannot" as the failure mode, plus adding a daily scanner that flags outgoing messages smelling of defeatism for review. The result was fewer apologies, and the ones that remain concern things that are genuinely impossible.
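A daily scanner for defeatist replies could be as simple as a pattern match over the outgoing message log. The post does not describe how the author's scanner works, so the patterns and the flag_defeatist function below are hypothetical, and a real version would likely need a broader phrase list.

```python
import re

# Hypothetical phrases that suggest the agent gave up without trying.
DEFEATIST_PATTERNS = [
    r"\bI\s+(?:cannot|can[’']?t)\b",
    r"\bI(?:\s+am|[’']m)\s+(?:unable|not able)\s+to\b",
]
_DEFEATIST = re.compile("|".join(DEFEATIST_PATTERNS), re.IGNORECASE)


def flag_defeatist(messages):
    """Return the outgoing messages that should be reviewed by a human."""
    return [message for message in messages if _DEFEATIST.search(message)]


outgoing = [
    "sorry, I cannot transcribe audio from this channel",
    "Here is the transcript you asked for.",
]
for message in flag_defeatist(outgoing):
    print("review:", message)
```

Flagged messages are only candidates for review: each one is then checked to see whether a tool was broken or the agent declined without attempting the task.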

The connective tissue between all six mistakes is the same. Each one started from an assumption that had stopped being true. The 35B model was light last time it was checked. Memory said Gemma because at some point Gemma was current. Codex hung in a specific recognizable way the last time it hung. The shell check was safe because no one had thought about chaining. The agent said "I cannot" because last week that path was actually broken. An autonomous system is not something you build once. It is something whose internal map of itself you have to actively keep honest against a world that quietly rearranges underneath it.

Key takeaways:

Running two cloud-agent harnesses (Claude Code, Codex) on a 16GB machine already consumes significant local resources for context management, file watching, and subprocesses, even though the models themselves run remotely. Adding a 35B local agentic model on top is hardware overcommitment.
Config migrations are almost never complete at the point when you think they are. Always grep the full repo for old endpoint names after any swap, and run a live-process audit to verify what is actually calling what.
Multi-surface state writes in autonomous agents must be transactional. If a task exists in two places, it must close in two places atomically, or stale-state-driven duplicate work is a when, not an if.
Router and switcher layers need hard-bounded preflights. A silent hang in a downstream process will silently swallow your fallback window unless the decision layer can detect and route around it within a few seconds.
Prefix-based command allowlists are not allowlists. Parse the full command including shell metacharacters before deciding what is safe to execute.
Treat every "I cannot" reply from an autonomous agent as a hypothesis to test, not a verdict to accept. If the tool failed, fix the tool. If the model declined without trying, fix the doctrine.

Why do I care: This post is unusually honest about the failure modes of personal AI agent infrastructure, and honest failure analysis is rare. The six mistakes map directly onto problems that show up in production systems too, not just home lab experiments. The state management issue with duplicate work triggered by stale intents memory is a well-documented failure mode in multi-agent systems. The shell injection risk from prefix-only allowlists is a classic. The silent hang eating the fallback window is something anyone building resilient distributed systems has hit. What makes this interesting is that the author is running all of this on a single Mac Mini, which means the failure modes are compressed and visible in ways that distributed systems obscure. If you are building any kind of agentic loop, the specific patterns here (transactional state writes, hard-bounded preflights, full-command parsing, and live-process audits against documented config) are directly applicable.
