AI News: Recursive Self-Improvement Becomes Real Lab Strategy, Agent Benchmarks Get Harder

AINews: Not Much Happened Today (But a Few Things Did)

TLDR: The AI community debated Claude Mythos benchmark regressions while Sakana AI formalized recursive self-improvement as an explicit research program. Agent benchmarks are evolving from quick tasks to long-horizon, economically meaningful challenges where current models perform poorly.

The title "not much happened today" is doing real rhetorical work here. The issue covers a fairly dense range of developments, so "not much" is relative to whatever the community's current baseline for significant news is, which is itself an interesting signal about the pace of the field.

The Claude Mythos discussion illustrates the benchmark reliability problem well. Users calling outputs "next level" for specific workflows while others point to regressions on the LLM Debate Benchmark in the same 24-hour window. Single-model evaluation is noisy, and the AI community's Twitter-native discourse amplifies both extremes. The practical takeaway is that model capabilities are task-specific in ways that aggregate benchmarks miss, and the only way to know how a model performs on your use case is to evaluate it on your use case.

The Anthropic result on NMR analysis is worth noting separately. Opus 4.7 matching or beating dedicated NMR software on some tasks is a concrete science application claim that's either transformative for researchers in that domain or overstated, and the distinction matters. But it illustrates the pattern of AI systems reaching into specialized domain competence in ways that weren't expected on the previous model generation.

The recursive self-improvement formalization is the most significant structural news. Sakana AI launching a dedicated RSI Lab in Tokyo, tying together The AI Scientist, Darwin Gödel Machine, and ShinkaEvolve projects, represents RSI moving from theoretical blog post discussion to staffed research program with explicit roadmaps. The emphasis on sample efficiency as a design constraint is interesting - the argument that self-improving systems can work under compute constraints rather than requiring hyperscale is a meaningful differentiator from the "you need a data center to do anything interesting" narrative.

The benchmark evolution section is where the signal is clearest. Classic SWE-bench-style evaluation measures task completion in small, well-defined windows. Agents' Last Exam evaluates against a thousand-plus economically meaningful tasks mapped to actual occupational taxonomy, with the hardest tier at a 2.6% full pass rate. SWE-Marathon tests coherence over billion-token budgets building substantial software projects from scratch. The gap between demo performance and this kind of sustained, meaningful evaluation is stark and suggests current reliability assessments are substantially overestimated.

Key takeaways:

Sakana AI formalized recursive self-improvement as an explicit research program, not just theoretical framing
New agent benchmarks use economically meaningful, long-horizon tasks where current models score in the low single digits
Model capability debates on social media are noisy; task-specific evaluation is the only reliable signal
Frontier model reliability in production settings is likely lower than benchmark scores suggest

Why do I care: The shift in agent benchmarks from task snippets to long-horizon economically meaningful work is the development I'm watching most closely. If your team is building AI agents for real workflows, the SWE-Marathon and ALE benchmarks are more relevant for setting realistic expectations than most of the headline numbers you'll see in model release announcements. The 2.6% hard-tier pass rate on ALE is a useful reality check against the "agents are ready for autonomous deployment" narrative.

[AINews] not much happened today