Vercel Launches eve Agent Framework, AI Breaking UIs, and Building Agents the Hard Way

Published on 18.06.2026

motyl.dev<div></div></>FRONTEND

Introducing eve: Vercel's Open-Source Agent Framework

TLDR: Vercel released eve, an open-source TypeScript framework for building production-ready agents. It uses a filesystem-first convention where an agent is just a directory of files, and ships with durable execution, sandboxed compute, human-in-the-loop approvals, and evals built in.

Summary: Vercel's argument with eve is blunt: agents today are where web development was before frameworks, with every team handrolling the same infrastructure before their agent can do anything useful. eve is their answer to that. The mental model is simple. An agent is a directory. There is an agent.ts for model configuration, an instructions.md that becomes the system prompt, a tools folder for capabilities, a skills folder for domain knowledge in markdown, and a subagents folder for delegation. The filename is the definition, and eve wires everything together without boilerplate.

What makes this genuinely interesting rather than just another framework announcement is the production story. Every conversation in eve is a durable workflow, checkpointed at each step using Vercel's open-source Workflow SDK. Sessions survive crashes and deploys. An agent mid-task when you push code finishes on the version it started on. That is not a small detail. That is the kind of thing teams spend weeks building themselves, usually badly.

The sandbox story is also worth paying attention to. Every agent gets an isolated environment for shell commands and file operations, running in a separate security context from the harness. Locally it uses Docker or microsandbox; on Vercel it uses Vercel Sandbox. The backend is an adapter, so the same code runs everywhere. Human-in-the-loop approvals are built in too: any tool call can require a human to approve it, and the agent pauses indefinitely without consuming compute while waiting.

Vercel claims they are already running over a hundred agents in production on eve, and the numbers they share are concrete. Their internal data analyst agent handles over 30,000 questions a month in Slack. Their autonomous SDR agent costs around five thousand dollars a year and returns 32 times that. The support agent solves 92% of tickets on its own. These are not demo numbers. Whether those ROI figures are fully accurate is impossible to verify from outside, but the specificity at least suggests real production use rather than marketing theater.

What the announcement underplays is the lock-in angle. At launch, eve deploys to Vercel. Other platforms are "on the way." Tools like Vercel Connect handle OAuth consent and token refresh. Vercel Cron runs schedules. The sandbox in production is Vercel Sandbox. This is a framework that abstracts away infrastructure, but the abstraction maps very closely onto Vercel's own product surface. That is not inherently bad, but it is worth naming clearly before you restructure your agent architecture around it.

Key takeaways:

  • eve uses a filesystem-first convention where an agent is a directory of typed TypeScript files and markdown, and the framework handles all wiring automatically
  • Production features like durable execution, sandboxed compute, human approval flows, and OpenTelemetry tracing ship with the framework rather than requiring manual integration
  • Vercel runs over 100 internal agents on eve including a data analyst handling 30,000 monthly queries, an autonomous SDR with 32x ROI, and a support agent resolving 92% of tickets autonomously
  • At launch, eve is tightly coupled to Vercel's infrastructure stack; multi-platform support is promised but not yet available

Why do I care: I've been watching the agent framework space long enough to notice the pattern. Everyone ships the plumbing, calls it a framework, and the real question is whether the abstractions hold at the seams. What eve gets right is the filesystem-first approach. Convention over configuration worked for Rails and Next.js, and it has a real shot at working for agents too. The thing I keep coming back to is the approval flow design. Pausing a durable session indefinitely without burning compute for a human to click a button is surprisingly hard to get right, and it sounds like they have. My concern is the Vercel-shaped hole underneath everything. If you are already on Vercel, this is compelling. If you are not, watch how multi-platform support develops before committing.

Introducing eve


AI Code Reviews with CodeRabbit: Adding a Quality Gate to Your AI Workflow

TLDR: CodeRabbit is an AI code review tool that integrates at the PR stage and in your IDE, with codebase-awareness, linked issue context, and 40+ linters under the hood. It positions itself specifically as a quality gate for teams using AI to generate code faster.

Summary: The premise behind CodeRabbit is honest about something most AI coding tools are not. When you are using an AI agent to write code at speed, the review bottleneck gets worse, not better. You can generate a thousand-line diff in minutes, but a human reviewer still needs time to understand it. CodeRabbit's pitch is that it handles the heavy lifting of that review so humans can focus on the final ten percent that actually requires judgment.

The technical approach is more interesting than average. Rather than just pattern-matching on the diff, CodeRabbit pulls in context from multiple sources. It uses a Codegraph to understand dependencies across files, pulls linked issues from Jira and Linear, and runs over 40 linters and security scanners. The claim is that codebase-awareness is table stakes, and other tools stop there. CodeRabbit brings in external context on top of that. The customization story is YAML-based, which is the right call. You can encode your team's coding guidelines, style preferences, and custom pre-merge checks in natural language.

The learning loop is worth noting. You can leave feedback on reviews, and that feedback creates Learnings that the tool uses to improve future reviews. That is the right kind of flywheel if it works as advertised. The risk is that it also means early reviews for a new codebase can be noisy until the tool has calibrated.

Jensen Huang's quote is prominently featured. I am skeptical of vendor testimonials, especially ones from founders of other large companies. The more meaningful signal is the claimed 15,000 customers and six million repositories. That is a scale that suggests real adoption, even if the numbers cannot be independently verified. The "75 million defects found" metric is the kind of number that sounds impressive but is difficult to interpret without knowing how defects are defined and whether they include noise that was later dismissed.

Key takeaways:

  • CodeRabbit integrates at the PR stage and in IDE and CLI, reviewing code with codebase-aware context plus linked issue data from Jira and Linear
  • Custom pre-merge checks can be written in natural language, and the tool learns from reviewer feedback over time to reduce noise
  • The tool targets AI-assisted teams specifically, positioning itself as the quality gate that prevents speed from becoming sloppiness
  • Security scanning (SOC 2 Type II certified, SSL encrypted, zero data retention post-review) is built into the pipeline rather than being a separate step

Why do I care: Every team I have seen adopt AI coding tools hits the same wall: the code comes out faster, but PR review becomes the new bottleneck, and reviewers get fatigued by the volume. A tool that handles the obvious issues and surfaces the hard ones is genuinely useful in that context. My practical concern is that AI-reviewed AI-generated code creates a feedback loop where the same failure modes in reasoning get missed by both the generator and the reviewer. You still need a human who understands the system to catch the subtle architectural mistakes. CodeRabbit handles surface area well. It cannot replace the reviewer who knows why a particular design decision was made six months ago and why this diff subtly violates it.

AI Code Reviews | CodeRabbit


AI is Slowly Breaking Your UI

TLDR: A JavaScript Mastery video essay arguing that AI-generated interfaces are degrading UI quality in ways that are hard to notice at first, with a focus on how AI tools optimize for visual output over structural soundness.

Summary: The title is deliberately provocative, and the observation behind it is real. When you use an AI tool to generate UI components or layouts, the output tends to look right at a glance. It renders, it approximates the design intent, and it passes a quick visual check. The problem is that AI-generated UI often fails on the dimensions that are harder to see: accessibility, semantic structure, performance characteristics, and maintainability over time.

This connects to a broader pattern in AI-assisted development. The tools are optimized for immediate visual fidelity. They are trained on enormous amounts of frontend code, much of which was written without attention to accessibility or semantic HTML. The output reflects that training data. You get components that look like the thing you asked for while being structurally compromised in ways that accumulate over time.

The specific failure mode that matters most for frontend engineers is the loss of intent. A human developer writing a navigation component makes decisions about ARIA roles, keyboard handling, and focus management. An AI generating the same component reproduces the visual structure and skips the reasoning. The result passes visual QA but fails a screen reader. It looks fine in Chrome but breaks on a slow network. These issues do not show up in a code review that is already overwhelmed by volume.

There is also a compounding effect. When developers accept AI-generated UI without scrutiny, they are also learning less. The junior developer who would have struggled through the CSS layout problem and understood it is now accepting the generated output, shipping it, and moving on. Over time, the team's ability to reason about what the AI is doing degrades. This is the slow break the title is pointing at.

Key takeaways:

  • AI-generated UI components tend to optimize for visual correctness over semantic structure, accessibility, and long-term maintainability
  • The volume of AI-generated code can overwhelm review processes in ways that let structural issues through consistently
  • Teams that accept AI output without scrutiny risk losing the practical understanding needed to catch the AI's recurring failure modes
  • Frontend quality metrics need to expand beyond visual correctness to catch what AI tools systematically miss

Why do I care: This is the conversation that is not happening loudly enough in frontend engineering. We are measuring productivity in features shipped per sprint, not in accessibility compliance, semantic correctness, or long-term maintainability. The AI tools are optimized for the metric we are measuring. That is not the AI's failure. It is ours. The way I think about this: AI is a junior developer who can produce a lot of output very fast and has never once thought about accessibility. You would not ship a junior developer's code without review. Shipping AI-generated UI without structural review is the same mistake at a larger scale.

AI is slowly breaking your UI


scrcpy Isn't Just for Screen Mirroring: It's Part of My React Native Workflow Now

TLDR: A developer's account of integrating scrcpy, the open-source Android screen mirroring tool, into a React Native development workflow, using it for faster testing feedback loops and debugging directly from a desktop environment.

Summary: scrcpy is one of those tools that most Android developers know exists but fewer have thought to integrate deeply into a development workflow. It mirrors and controls an Android device from a desktop over USB or TCP/IP with very low latency, no installation required on the device, and no cloud relay. For React Native development specifically, that set of properties turns out to be genuinely useful.

The workflow argument is about iteration speed. React Native development involves constant switching between the code editor, the simulator, and the physical device. The simulator covers most cases, but there are real device behaviors that only show up on hardware, things like exact touch latency, camera access, notification handling, and platform-specific rendering differences. Without scrcpy, testing on a physical device means reaching for the phone, unlocking it, navigating to the app, and performing the interaction. With scrcpy, the device is a window on the desktop and the whole interaction happens without moving your hands.

The debugging application is the more interesting angle. scrcpy can record the screen and capture input. That means you can reproduce a reported bug by recording a session of the problematic interaction and playing it back while instrumenting the app. You can also use it alongside React Native Debugger without juggling windows across two screens. For teams doing remote pair programming on mobile apps, screen sharing a scrcpy window gives the partner a real device view without them needing physical access.

What is missing from this framing is the iOS story. scrcpy is Android-only. React Native developers working across both platforms still need a separate solution for iPhone testing, which limits how much workflow consolidation is actually possible. The article is also written from a solo developer perspective, and team setups with multiple target devices at different API levels add friction that the described workflow does not address.

Key takeaways:

  • scrcpy enables low-latency device control from desktop, reducing the physical interruption cost of testing on real Android hardware during React Native development
  • Screen recording and input capture built into scrcpy supports bug reproduction workflows without additional tooling
  • Integration with React Native Debugger in a single desktop window reduces context switching during debugging sessions
  • The workflow applies to Android only; iOS testing on physical devices requires separate tooling

Why do I care: Anything that reduces friction in the inner loop of mobile development is worth paying attention to. The context switch from keyboard to phone is small but it happens dozens of times a day, and the accumulated interruption cost is real. I appreciate that this is a practical workflow post rather than a framework announcement. It is also a good reminder that the best tools in a workflow are sometimes the ones that have been around for years but that nobody thought to use in that context.

scrcpy isn't just for screen mirroring, it's part of my React Native workflow now


Building an AI Agent in 6 Weeks (and Finally Understanding How They Work)

TLDR: Jeff Haemer, a long-career software developer, joined a Python Agentic AI cohort and built a multi-interface agent application with nearly 250 unit tests and 100% coverage in six weeks. The real outcome was a genuine mental model of how agents work, not just a working artifact.

Summary: This is the kind of learning story that is worth reading carefully, because it runs against the prevailing narrative that AI development is fast and easy. Jeff came in as an experienced developer who felt AI was a "big undifferentiated cloud of things I didn't know." He set expectations low and signed on as a beta tester deliberately. Six weeks later he had a working agent with a web UI, a command line interface, and a Telegram bot, plus a test suite rigorous enough to include mutation testing.

The software architecture point is the most valuable thing in the story. The cohort builds the reasoning core first, then wraps interfaces around it as independent units. This sounds obvious but it is not how most AI tutorials work. Most tutorials start with the interface and treat the underlying logic as something to plug in later. When the Telegram integration hit a wall during beta testing, Jeff could move on to the next week because the interface was independent of the core. The core kept working. This is not an AI-specific lesson; it is a design lesson that AI development frequently ignores in its rush to ship demos.

The mocking section is unexpectedly honest. Jeff avoided it until week four or five because he did not understand it, even after asking a test automation expert who admitted he did not fully understand it either. The breakthrough was being forced to think clearly about the boundary between code under test and outside services. That clarity then propagated back through his understanding of his own code's structure. This is exactly how learning programming works, and it is exactly what gets skipped when you let AI generate the test suite for you.

The distinction Jeff makes between using AI as a teaching assistant versus using it as an autopilot is the clearest articulation I have seen of a sustainable approach to AI-assisted learning. He used it to bridge the gap between conceptual understanding and Python syntax, not to generate code he would then have to maintain without understanding. The result is an artifact he can reason about, which is the test that matters.

The gap nobody talks about in this discussion: a cohort like this, however well designed, is not accessible to everyone. The structure, mentorship, and accountability it provides are hard to replicate solo. Most developers trying to learn AI development do not have that support structure, and the self-directed path through YouTube tutorials and documentation is considerably less effective.

Key takeaways:

  • Building the reasoning core independently of interfaces allowed the team to work around a broken Telegram integration without losing progress, validating a design principle that most AI tutorials ignore
  • Mocking, the practice of replacing external services with stand-ins for testing, was the hardest concept to internalize, but mastering it produced a fundamental change in how Jeff thought about his code's structure
  • Using AI as a targeted teaching assistant for syntax gaps, rather than as a code generator, resulted in code Jeff could understand and maintain
  • Six weeks of structured cohort learning produced not just a working artifact but a durable mental model that Jeff could apply to unfamiliar agent code he encountered in the wild

Why do I care: The framing of "finally understanding how they work" is exactly the right goal and it is undervalued in most developer education. We produce a lot of people who can use AI tools and not nearly enough who understand what those tools are doing. Jeff's story is evidence that the understanding is achievable through deliberate practice, but it requires resisting the shortcut of letting the AI do the reasoning for you. The 250 unit tests with 100% coverage is not the impressive number. The impressive number is that he understood why each test existed.

Building an AI Agent in 6 Weeks (and Finally Understanding How They Work)