AI Is Eating the Grid, RAG Evaluation Is Still Broken, and SEO Is Dead (Long Live GEO)

Published on 01.06.2026

AI & AGENTS

The Great American Grid, Rebooted

TLDR: America's power grid was not built for 2026 demand. AI data centers, EV adoption, and wildfire liability have collided into a genuine infrastructure crisis, and the author argues this year is the last exit ramp before things get genuinely bad.

Summary:

There is a version of this story that sounds like a think-piece about The Future. This is not that. The grid problem is immediate and it is compounding in ways that interact badly with each other. Data centers running AI compute are now responsible for a measurable, non-trivial chunk of new power demand, and that demand does not arrive gradually. When a hyperscaler builds a new campus, it lands on the regional grid like a dropped anchor. The infrastructure was designed for slow industrial growth and residential load curves. It does not handle anchor drops.

At the same time, EV adoption is bending the evening demand curve into shapes that utilities genuinely were not modeling five years ago. People plug in when they get home. That is the same window when the grid is already stressed from commercial cooling loads coming off peak. These two curves are now fighting for the same electrons in the same hour, and transmission capacity has not kept up.

The wildfire liability piece is the one I find most technically interesting and most underreported. In western states, utilities face massive legal exposure when their equipment starts fires. The rational response, from a board-level risk perspective, is to not invest in new transmission. More lines means more liability surface area. So the same states that need the most grid modernization for renewable integration are also the states where the investment incentives are structurally broken. That is not a technology problem. It is a legal and insurance problem that is wearing technology clothes.

Nuclear comes up as one serious candidate for a solution, and I think that framing is right. The math on renewables plus storage gets hard when you are talking about the baseload requirements of a major AI training cluster. Nuclear does not care about cloud cover or wind speed. The revival interest is real and it is being driven by people who are looking at power purchase agreements and doing arithmetic, not by ideology.

Key takeaways:

  • AI data center power demand is no longer a future projection, it is a present grid stress that transmission infrastructure was not designed to handle
  • Wildfire liability in western states is actively suppressing utility investment in new lines, creating a perverse dynamic where modernization is legally risky
  • Nuclear energy is being reconsidered on purely pragmatic energy density grounds, divorced from the political noise that surrounded it a decade ago

Why do I care: I spend most of my time in software, but software runs on electricity and electricity comes from somewhere. When I see cloud providers building their own nuclear deals and data center campuses being sited based on proximity to reliable power rather than fiber, I understand that the physical layer is now a real constraint again. That changes how I think about latency assumptions, about region selection, about what "always on" actually means when the grid underneath a region is stressed. This is infrastructure that every distributed system is quietly depending on, and right now it is not keeping pace.

The Great American Grid - Rebooted


What Production-Grade RAG Evaluation Should Look Like

TLDR: Most RAG evaluation pipelines are measuring the wrong things on toy datasets and giving teams false confidence. Tahir Nawaz lays out what evaluation actually requires in production: faithfulness scoring, groundedness checks, and self-correcting retrieval when the pipeline fails.

Summary:

I have reviewed a lot of RAG implementations over the past two years, and the pattern is one I recognize immediately. The demo works. The recall numbers on the test set look reasonable. You ship it. Then three months later someone sends you a screenshot of the system confidently asserting something that is factually wrong but grammatically perfect, and the attribution chain points to a retrieved chunk that was only tangentially related to the question. That is a retrieval faithfulness failure, and recall@k on a curated benchmark would never have caught it.

The distinction Nawaz draws between retrieval evaluation and generation evaluation is the one that most teams collapse too early. Retrieval evaluation asks whether the right chunks came back. Generation evaluation asks whether the model actually used those chunks faithfully, or whether it papered over gaps with hallucinated content that sounds plausible. These are different failure modes and they require different measurement approaches. You cannot catch a hallucination problem with a retrieval metric.

LLM-as-a-judge for faithfulness scoring is the approach that has the most traction right now, and the article covers why. You can ask a capable model to assess whether a generated answer is grounded in the retrieved context, and that judgment correlates reasonably well with human evaluation. It is not perfect, but it is far better than no generation evaluation at all, which is the state most production systems are in.

The corrective RAG section is where I think the real engineering payoff sits. The idea is that when your evaluation pipeline detects a low-confidence or ungrounded answer, the system triggers additional retrieval rather than returning the bad answer. It is a feedback loop that treats evaluation as a runtime mechanism, not just an offline metric. That is a meaningful architectural shift from how most teams have built these systems.

The companion article on chunking strategies covers the upstream half of this problem. Semantic chunking and recursive splitting both have different tradeoffs depending on content type, and getting that wrong upstream means your faithfulness problems downstream are partly structural. Both articles together give a more complete picture than either one alone.

Key takeaways:

  • Recall-based retrieval metrics on toy datasets do not expose generation faithfulness failures, which are the failure mode that actually damages user trust in production
  • Faithfulness scoring using LLM-as-a-judge approaches gives you a practical way to measure whether the model is staying grounded in retrieved context
  • Corrective RAG turns evaluation from an offline quality check into a runtime recovery mechanism, triggering additional retrieval when confidence is low

Why do I care: If you are building anything with RAG right now, and most of us are, the gap between "it works in the demo" and "it is trustworthy in production" is almost always an evaluation gap. I have seen teams spend months on chunking strategies and embedding model comparisons while running zero generation evaluation. The retrieval pipeline is only half the system. Getting serious about faithfulness scoring is not optional if you care about whether your users can actually trust the answers.

What Production-Grade RAG Evaluation Should Look Like


I Spent Six Months Watching AI Dismantle Our SEO Assumptions

TLDR: Jon Kelly ran a six-month experiment tracking content performance across AI search tools and found that traditional keyword ranking metrics became nearly irrelevant while citation-based discovery and direct answer optimization emerged as the new surface area that actually drives traffic.

Summary:

I want to be honest about where I was on this six months ago: I thought the "SEO is dead" takes were mostly content marketing hype. I was wrong, and this article has the data to back it up. Kelly did the thing that most hot takes skip, which is actually instrumenting the change over time and reporting what the numbers showed rather than what the narrative required. That is worth reading for the methodology alone, before you even get to the conclusions.

The core shift is structural. Google AI Overviews, Perplexity, and ChatGPT search are all doing something fundamentally different from the ten blue links model. They are synthesizing answers and citing sources. That changes what "being found" means. Your ranking for a keyword is less important than whether your content gets pulled into the synthesis layer, and those are governed by different signals. A page that ranks fourth for a query might get cited in every AI overview while the first-ranked page gets nothing, because the cited page has a clearer, more directly answerable structure.

GEO, Generative Engine Optimization, is the term that is sticking for the practice of structuring content so AI systems will actually use it as a source. It is not keyword stuffing with different vocabulary. It requires writing in a way that makes claims verifiable, answers questions directly without burying the lede in preamble, and builds the kind of structural credibility that citation systems can evaluate. The concrete data in the article about traffic sources shifting from organic search to AI-citation referrals over the six-month period is the most useful part.

What I find genuinely interesting is that GEO is, in some ways, forcing better writing. When your metric is "did an AI system trust this enough to cite it," you cannot get away with content that is technically ranked but substantively thin. The old playbook of hitting keyword density targets while wrapping minimal information in maximum word count does not survive citation evaluation. I think that is probably good, even if the transition is painful for publishers who built their traffic on the old model.

Key takeaways:

  • Traditional keyword ranking metrics became significantly less predictive of traffic over six months as AI search tools changed the discovery layer
  • Citation-based discovery is now a real traffic channel, meaning content structure and direct answerability matter more than keyword optimization
  • GEO (Generative Engine Optimization) requires fundamentally different content architecture optimized for synthesis and citation rather than ranking

Why do I care: I maintain documentation, write technical content, and occasionally think about whether what I publish is findable by people who need it. The answer used to be "make sure it ranks." Now the answer is more like "make sure it is the kind of thing an AI synthesis system would trust enough to cite." That is a different editorial discipline, and the sooner technical writers and developer advocates internalize it, the better. The traffic cliff that publishers are hitting is a preview of what happens when you optimize for a system that no longer exists.

I Spent Six Months Watching AI Dismantle Our SEO Assumptions