AI Models Under the Hood: Smart Contract Audits, Agent Load Imbalance, and the Codex vs Claude Showdown
Published on 15.05.2026
Can Claude Audit Smart Contracts? Zero-Shot Vulnerability Detection Across Five SWC Classes
TLDR: Claude Sonnet 4.6 was fed five known-vulnerable Solidity contracts from the SmartBugs Curated benchmark, one zero-shot prompt per contract, with no fine-tuning and no chain-of-thought scaffolding. It caught all five vulnerabilities. The catch is that it flagged every finding as Critical severity, which makes the signal hard to act on in practice.
Summary: Dilrabo Orziqulova, a third-year university student, ran a disciplined experiment: take five contracts from the SmartBugs benchmark, each representing a different Smart Contract Weakness Classification category, then send each one to Claude Sonnet 4.6 with a consistent structured prompt and record whether the model caught the bug and what severity it assigned. The five SWC classes covered are some of the nastiest patterns in Ethereum history, including reentrancy, which sits behind the infamous DAO hack, as well as integer overflow, access control failures, unprotected self-destruct, and timestamp dependence.
The CEI invariant, which stands for Checks-Effects-Interactions and describes the safe ordering of state changes before any external calls, is the standard defense against reentrancy. When a contract places an interaction before updating its own internal state, it opens a window where a malicious caller can re-enter the function and drain funds before the balance is decremented. Claude identified this pattern correctly. In fact it identified all five patterns correctly, which is genuinely impressive for a zero-shot approach with no domain-specific tooling.
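To make the ordering concrete, here is a toy Python sketch, not Solidity and not from the article, that models the same dynamic: if the payout callback (the Interaction) runs before the balance update (the Effect), a malicious callback can re-enter and pull out more than it deposited. All names and amounts are illustrative.

```python
# Toy model of a CEI violation: the external callback fires before the
# contract's own state is updated, so a re-entrant caller withdraws repeatedly.

class VulnerableVault:
    def __init__(self):
        self.balances = {}
        self.treasury = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.treasury += amount

    def withdraw(self, who, receive_callback):
        amount = self.balances.get(who, 0)
        if amount == 0:                     # Check
            return
        self.treasury -= amount
        receive_callback(self, who)         # Interaction runs first: the bug
        self.balances[who] = 0              # Effect runs too late
        # The CEI-safe ordering would zero the balance before the callback.


calls = {"n": 0}

def malicious_receiver(vault, who):
    # Re-enters withdraw() while the attacker's balance is still non-zero.
    if vault.treasury > 0 and calls["n"] < 3:
        calls["n"] += 1
        vault.withdraw(who, malicious_receiver)


vault = VulnerableVault()
vault.deposit("victim", 90)
vault.deposit("attacker", 10)
vault.withdraw("attacker", malicious_receiver)
print(vault.treasury)  # 60: four payouts of 10 went out against a 10 deposit
```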
The problem is severity calibration. Every vulnerability came back labeled Critical. That is not a triage list; it is an alarm that never stops ringing. A useful audit output needs to distinguish between a contract that will definitely get drained by the next block and one that has a theoretical timestamp manipulation risk worth monitoring but not losing sleep over. When everything is Critical, nothing is actionable.
The author frames this through a formal notation describing the CEI invariant and presents Algorithm 1, a structured evaluation rubric for assessing zero-shot LLM audit performance. The conclusion is cautious and honest: LLMs can detect known vulnerability classes without any specialized training, but they are not ready to replace dedicated static analysis tools like Slither or Mythril that come with calibrated severity heuristics and years of tuning on real exploit data.
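The article's Algorithm 1 is not reproduced here, but the shape of such an evaluation loop is easy to sketch. The Python below is a hedged approximation under assumed data shapes, not the author's rubric; `ask_model` is a stand-in for whatever client call sends the structured prompt to the model and returns parsed findings, and the filenames and labels are illustrative.

```python
# Sketch of a zero-shot audit evaluation loop in the spirit of the article's
# rubric (not the author's Algorithm 1). `ask_model` is a placeholder for the
# actual model call; filenames and class labels below are illustrative.

BENCHMARK = [
    # (contract file, vulnerability class the benchmark labels it with)
    ("reentrancy.sol",     "reentrancy"),
    ("overflow.sol",       "integer overflow"),
    ("access_control.sol", "access control"),
    ("selfdestruct.sol",   "unprotected selfdestruct"),
    ("timestamp.sol",      "timestamp dependence"),
]

PROMPT = "List every vulnerability in this contract, with a class label and a severity."

def evaluate(ask_model):
    results = []
    for path, expected in BENCHMARK:
        source = open(path).read()
        findings = ask_model(PROMPT, source)   # e.g. [{"label": "...", "severity": "Critical"}]
        detected = any(expected in f["label"].lower() for f in findings)
        severities = sorted({f["severity"] for f in findings})
        results.append({"contract": path, "detected": detected, "severities": severities})
    detection_rate = sum(r["detected"] for r in results) / len(results)
    return detection_rate, results
```

Recording the set of severities per contract is what surfaces the calibration problem: a run where every severity list collapses to a single Critical label is exactly the failure mode described above.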
Key takeaways:
- Claude Sonnet 4.6 achieved 100% detection across five SWC vulnerability classes in zero-shot conditions
- All findings were rated Critical regardless of actual exploitability, which limits practical usefulness
- The SmartBugs Curated benchmark provides a reproducible evaluation set for comparing LLM audit performance
- Zero-shot LLM auditing is a promising first pass, not a replacement for specialized tools
- The CEI pattern (Checks-Effects-Interactions) remains the primary reentrancy defense and was correctly identified
Why do I care: If you are building anything that touches on-chain contracts, this result is both exciting and sobering. An LLM catching all five known patterns in a structured benchmark is a real capability you can use in a code review pipeline, maybe as a first filter before a human or a dedicated tool digs deeper. But the severity inflation problem is a signal that these models are not yet calibrated for the nuanced risk-ranking that a real audit requires. I would treat Claude as a very smart rubber duck for contract review, not as your security team.
A Few AI Agents End Up Doing 80% of the Work: That Is a Design Problem
TLDR: When AI agents in a multi-agent system are allowed to select their own collaborators, work concentrates in a small number of highly connected hub agents, following the same power-law dynamics you see in human organizations. The author argues this is a predictable design outcome, not a random failure, and shows how to build systems that distribute load more intentionally.
Summary: The piece, published by the account @pilotprotocol, takes a data-driven look at task distribution patterns across multi-agent pipelines. The core observation is that if agents can freely route subtasks to whichever peer they prefer, they will reliably converge on the same few high-performing or fast-responding agents. The result is a small cluster of overloaded hub agents handling the majority of the work while most agents in the system sit idle.
This is not surprising if you have spent any time reading about scale-free networks. The same preferential attachment dynamic that makes a few websites absorb most web traffic and a few open-source maintainers absorb most of the pull request review load also shows up in AI agent graphs when routing is left unconstrained. The math is the same; only the domain is newer.
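A ten-line toy simulation, not from the article, makes the concentration effect visible: route each task to an agent with probability proportional to the work that agent has already absorbed, and a handful of agents ends up with most of the load.

```python
import random

# Toy preferential-attachment routing: each new task goes to an agent with
# probability proportional to how many tasks it has already handled (+1 so
# idle agents keep a nonzero chance). Purely illustrative.

random.seed(0)
tasks_handled = [0] * 50
for _ in range(10_000):
    weights = [n + 1 for n in tasks_handled]
    chosen = random.choices(range(50), weights=weights)[0]
    tasks_handled[chosen] += 1

tasks_handled.sort(reverse=True)
top5_share = sum(tasks_handled[:5]) / sum(tasks_handled)
print(f"top 5 of 50 agents handled {top5_share:.0%} of all tasks")
```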
The practical consequence for anyone building agent orchestration is that you cannot assume workload will spread evenly just because you provisioned many agents. Without deliberate load-balancing logic, your system will develop bottlenecks that look like individual agent slowdowns but are actually structural problems in how the graph is shaped. Adding more agents does not fix a concentration problem; it often just adds more idle peers that the routing continues to ignore.
The author presents the data directly and walks through three design strategies: explicit load caps on individual agents, randomized routing with quality thresholds, and reputation decay so that heavily used agents become less attractive over time. None of these are exotic, but they require you to think about your agent network as a graph with dynamics rather than a pool of interchangeable workers.
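Here is what those three strategies could look like combined in a single routing function. This is a sketch under assumed data shapes, not code from the article; the Agent fields, thresholds, and decay factor are all illustrative.

```python
import random

class Agent:
    def __init__(self, name, quality):
        self.name = name
        self.quality = quality      # fixed quality score in [0, 1]
        self.load = 0               # tasks currently assigned
        self.reputation = 1.0       # decays as the agent absorbs more work


def route(agents, load_cap=10, quality_floor=0.6, decay=0.95):
    # 1. Load caps: agents at capacity are simply not eligible.
    eligible = [a for a in agents if a.load < load_cap]
    if not eligible:
        raise RuntimeError("all agents at capacity; queue the task or scale out")
    # 2. Randomized routing with a quality floor: pick among agents that are
    #    good enough instead of always picking the single best performer.
    good_enough = [a for a in eligible if a.quality >= quality_floor] or eligible
    chosen = random.choices(good_enough, weights=[a.reputation for a in good_enough])[0]
    # 3. Reputation decay: every assignment makes this agent a bit less attractive.
    chosen.load += 1
    chosen.reputation *= decay
    return chosen
```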
Key takeaways:
- Agent workload distribution in unconstrained systems follows power-law concentration, not uniform spread
- A few hub agents doing 80% of work is a design outcome, not a random failure
- Adding more agents to a concentrated system often fails to fix the underlying bottleneck
- Load caps, randomized routing with quality floors, and reputation decay are three practical mitigations
- Treating an agent network as a dynamic graph rather than a static pool changes how you debug performance issues
Why do I care: This resonates with every distributed system problem I have ever seen, from microservices with a hot database shard to React component trees with a single context provider doing all the heavy lifting. The instinct to add more capacity is almost always wrong when the real issue is routing. If you are building anything with multiple AI agents, read this before you scale up. The fix is in the graph topology, not the instance count.
Codex 5.3 vs Claude Opus 4.6 on a Real Java Monolith
TLDR: Author @ngirchev ran both Codex 5.3 and Claude Opus 4.6 against a production Java monolith, testing streaming bug fixes, test generation, code review, and what the author calls vibe-coding risk, meaning how likely the model is to produce confident-sounding but subtly wrong output. The comparison is personal and opinionated, which makes it more useful than a benchmark table.
Summary: First-person comparisons of AI coding assistants on real codebases are rare because most benchmarks use toy problems or synthetic repositories. This piece takes a different approach, putting both models through a real Java monolith that presumably has the accumulated complexity, legacy patterns, and inconsistent naming conventions that characterize actual production code rather than a textbook example.
The comparison covers streaming bugs, where a model is asked to identify issues in a stream of code changes being fed incrementally. It covers test generation, where the model has to understand the existing architecture well enough to write tests that are actually useful rather than tests that pass by mocking everything into irrelevance. Code review is the third dimension, asking the model to find real problems in pull request diffs. Finally, vibe-coding risk asks a practical question: when does each model produce fluent, confident output that is actually wrong in ways that are hard to catch without deep domain knowledge?
Both models performed well on structured tasks like test generation, but diverged meaningfully on the open-ended review and streaming tasks. The article is honest about the fact that Java monoliths present specific challenges around dependency graphs and framework conventions that LLMs trained primarily on greenfield code samples handle with varying degrees of grace. The vibe-coding risk framing is worth adopting as a concept because it names something that every developer using these tools has felt but not always articulated: the model sounds more certain than it should be, and catching the error requires knowing the answer before you ask the question.
Key takeaways:
- Real-world Java monolith complexity exposes model differences that synthetic benchmarks miss
- Both Codex 5.3 and Claude Opus 4.6 perform well on structured tasks like test generation
- Streaming bug detection and open-ended code review showed the most meaningful divergence
- Vibe-coding risk describes the gap between model confidence and model correctness, a practical concern for production use
- Framework-specific knowledge and legacy pattern recognition remain weak spots for both models
Why do I care: Anyone doing AI-assisted development on a large codebase needs this kind of head-to-head, not because the winner is the right choice for every team, but because the failure modes differ. Knowing that Model A drifts on streaming reviews while Model B inflates certainty on legacy API patterns tells you where to put your human review energy. I want more of this kind of grounded comparison and fewer synthetic leaderboards.
Workflows, Agents, and Multi-Agent Systems Are Not the Same Thing
TLDR: Engineering Manager Raju Dandigam, with over 13 years of experience moving from frontend to leadership, wrote a beginner-friendly guide that draws sharp distinctions between three terms that get used interchangeably in AI marketing copy but describe genuinely different system architectures with different trade-offs and failure modes.
Summary: The terminology problem in AI tooling is real. Workflow, agent, and multi-agent system are used interchangeably in blog posts, conference talks, and product documentation in ways that make it hard to reason clearly about what you are actually building or buying. This piece takes a structured approach to separating them.
A workflow in this framing is a deterministic sequence of steps where the control flow is defined ahead of time. You might call an LLM at one or more steps in that sequence, but the LLM is a component in a pipeline, not the thing deciding what happens next. Think of it as a function with an AI-powered step inside it. The output is predictable given the same input, and the failure modes are mostly about the LLM step producing unexpected text, not about the system making unexpected choices.
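A minimal illustration of that framing, with a hypothetical `summarize_with_llm` standing in for the model client: the steps and their order are fixed in code, and the LLM is just one of them.

```python
# Workflow: control flow is decided at design time; the LLM call is a component.

def triage_ticket(ticket_text, summarize_with_llm):
    cleaned = ticket_text.strip()                                   # deterministic step
    summary = summarize_with_llm(cleaned)                           # LLM-powered step
    priority = "high" if "outage" in cleaned.lower() else "normal"  # deterministic step
    return {"summary": summary, "priority": priority}
```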
An agent is something different: it has access to tools, it can decide which tool to use based on the current state of a task, and it loops until it determines the task is done or hits a stopping condition. The control flow is not fixed at design time. This makes agents more flexible and more unpredictable. They can handle tasks you did not anticipate when you wrote the code, and they can also get stuck in loops or take expensive wrong turns that a deterministic workflow would never take.
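Contrast that with an agent loop, sketched here assuming a hypothetical `llm_decide` function that returns either a tool call or a final answer: the model, not the code, decides what happens next, and the step budget is the stopping condition.

```python
# Agent: the model chooses the next tool at run time and loops until done.

def run_agent(task, llm_decide, tools, max_steps=10):
    history = [("task", task)]
    for _ in range(max_steps):
        decision = llm_decide(history)               # model picks the next move
        if decision["type"] == "final_answer":
            return decision["answer"]
        tool = tools[decision["tool"]]               # e.g. "search", "run_tests"
        result = tool(**decision["arguments"])
        history.append((decision["tool"], result))   # feed the observation back in
    return "stopped: step budget exhausted"
```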
A multi-agent system is a collection of agents that can delegate to each other, which is where the load-concentration problem from the previous article comes back into focus. The coordination layer, meaning how agents discover each other, route tasks, and handle failures, is the hard part that most introductions to multi-agent systems wave past. Dandigam uses real-world analogies and, according to the newsletter description, actual code examples to make the distinctions concrete, though the scrape did not capture the full code content.
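The multi-agent case is mostly that coordination layer. A hedged sketch of it, with illustrative names, is below: the registry handles discovery, delegate() handles routing, and the except branch is the failure handling that introductions tend to wave past.

```python
# Multi-agent coordination sketch: discovery via a registry, routing via
# delegate(), failure handling by falling through to the next candidate.

class Coordinator:
    def __init__(self):
        self.registry = {}                       # capability -> list of agents

    def register(self, agent, capabilities):
        for cap in capabilities:
            self.registry.setdefault(cap, []).append(agent)

    def delegate(self, capability, task):
        for agent in self.registry.get(capability, []):
            try:
                # Agents receive the delegate hook so they can hand off subtasks.
                return agent.handle(task, delegate=self.delegate)
            except Exception:
                continue                         # try the next candidate on failure
        raise RuntimeError(f"no agent available for capability {capability!r}")
```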
Key takeaways:
- Workflows are deterministic pipelines where LLMs are components, not decision-makers
- Agents have tool access and self-directed control flow, making them flexible but less predictable
- Multi-agent systems introduce coordination complexity that is the primary engineering challenge
- Conflating these three concepts leads to incorrect architecture choices and mismatched expectations
- Understanding failure modes requires understanding which category your system falls into
Why do I care: I cannot count the number of conversations I have had where someone says "we built an agent" and describes what is actually a hardcoded prompt chain with no tool use and no feedback loop. Getting this taxonomy right matters because the testing strategy, the observability tooling, and the failure recovery approach are all different depending on which category you are in. This article is the kind of foundational explainer I would send to a team starting an AI project.