AI Engineer World's Fair Wave 2 and the Week Open Weights Got Loud

TLDR: AINews announces the Wave 2 Call for Speakers for the AI Engineer World's Fair, with new tracks for Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI. The same issue runs a long news roundup: Grok 4.3 ships at lower prices with mixed evals, DeepSeek V4 Pro lands as a credible open-weight coding agent, Codex keeps eating UX ground from Claude Code, and a wave of agent runtime work signals that the bottleneck is moving from model IQ to harness design.

Summary: The top of the issue is housekeeping for the conference circuit. The team is pushing into Moscone West, doubling capacity for the third year, and opening submissions through Sessionize. The new tracks read like a map of where they think AI engineering is heading next year. Autoresearch and Tokenmaxxing speak to internal AI-native team productivity without Goodharting. Memory and World Models name the two long-running gaps people keep tripping over. Agentic Commerce and Vertical AI (Law, Healthcare, GTM, and Finance) are the bets on which industries will actually deploy. There is also a Startup Battlefield pitch event and free expo space for robotics demos, with a polite note that humanoids must be accompanied.

The model news is where it gets interesting. xAI shipped Grok 4.3 with roughly 40% lower input and 60% lower output pricing and a 4-point bump on the Intelligence Index, but reception was split. GDPval-AA jumped 321 Elo to 1500, hinting at stronger real-world agent task performance, while non-hallucination dropped 8 points and the cost story may be subsidized by uneven hardware utilization. DeepSeek V4 Pro is the bigger headline for engineers who care about open weights. The hands-on take is that it is the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding, with 1M context, hybrid CSA/HCA attention, KV cache reduced to 10%, and nearly 4x lower inference FLOPs at long context. The leading open trio of Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro now sits at 52 to 54 on the Intelligence Index, against 57 for Gemini 3.1 Pro Preview and Claude Opus 4.7 and 60 for GPT-5.5. The remaining gap is concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy Omniscience.
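The Grok 4.3 price cut is quoted separately for input and output, so the saving a team actually sees depends on its token mix. A minimal sketch of that arithmetic follows. The 40% and 60% cuts are the rounded figures from the issue, while the `Workload` and `Pricing` shapes and any baseline prices passed in are hypothetical and are not xAI's published rates.

```typescript
// Fractional price cuts as reported in the issue ("roughly" 40% and 60%).
const INPUT_CUT = 0.4;
const OUTPUT_CUT = 0.6;

// Hypothetical workload shape: token counts for one billing period.
interface Workload {
  inputTokens: number;
  outputTokens: number;
}

// Prices per million tokens. Any concrete values are illustrative assumptions.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Total spend for a workload under a given price sheet.
function cost(w: Workload, p: Pricing): number {
  return (
    (w.inputTokens / 1e6) * p.inputPerMillion +
    (w.outputTokens / 1e6) * p.outputPerMillion
  );
}

// Apply the reported cuts to an old price sheet.
function applyCuts(old: Pricing): Pricing {
  return {
    inputPerMillion: old.inputPerMillion * (1 - INPUT_CUT),
    outputPerMillion: old.outputPerMillion * (1 - OUTPUT_CUT),
  };
}

// Fraction of spend saved for this workload, between 0.4 and 0.6.
function blendedSavings(w: Workload, old: Pricing): number {
  const before = cost(w, old);
  if (before <= 0) {
    throw new Error("workload must have a positive baseline cost");
  }
  return 1 - cost(w, applyCuts(old)) / before;
}
```

An input-heavy workload lands near the 40% end and an output-heavy one near the 60% end. A workload whose old bill was split evenly between input and output would save about 50%.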

The other big thread is the agent runtime arms race. Codex is winning on product velocity, with a device toolbar for responsive testing, faster browser-use, CI status in chat, and yes, a viral pets feature that proves shipping cohesive UX matters more than another benchmark point. The community debate over Codex versus Claude Code reads like a taste fight: GPT-5.5 is described as smarter and unblocking, while Opus 4.7 has better intent but wanders. Devin, Hermes, and a new TypeScript framework called Flue all converged on the same primitives in the same week. Subagents, browser-use, durable state, compaction, skills, and feedback loops are now table stakes. LangChain and LangGraph leaned into multi-user deployment with data isolation, delegated credentials, operator RBAC, durable pause and resume, and a HITL mode where a human reply lands as a tool result. Cloudflare announced Dynamic Workflows. The pattern is clear. The runtime is the moat now, not the model.
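To make "durable pause and resume" and "a human reply lands as a tool result" concrete, here is a minimal sketch of the pattern. It is not the LangChain or LangGraph API: `RunState`, `CheckpointStore`, `pauseForHuman`, and `resumeWithReply` are hypothetical names, and the in-memory map stands in for a real durable store.

```typescript
// Hypothetical message shapes, loosely modeled on chat-style transcripts.
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; toolCallId: string; content: string };

interface PendingCall {
  toolCallId: string;
  question: string;
}

interface RunState {
  runId: string;
  status: "running" | "paused" | "done";
  messages: Message[];
  pending?: PendingCall;
}

// Stand-in for a durable store. Round-tripping through JSON mimics
// state that must survive a process restart.
class CheckpointStore {
  private rows = new Map<string, string>();

  save(state: RunState): void {
    this.rows.set(state.runId, JSON.stringify(state));
  }

  load(runId: string): RunState {
    const raw = this.rows.get(runId);
    if (raw === undefined) throw new Error(`unknown run ${runId}`);
    return JSON.parse(raw) as RunState;
  }
}

// The agent needs a human: record the pending call, checkpoint, and stop.
function pauseForHuman(
  store: CheckpointStore,
  state: RunState,
  question: string,
): RunState {
  const toolCallId = `ask_human_${state.messages.length}`;
  const paused: RunState = {
    ...state,
    status: "paused",
    messages: [
      ...state.messages,
      { role: "assistant", content: `ask_human: ${question}` },
    ],
    pending: { toolCallId, question },
  };
  store.save(paused);
  return paused;
}

// Later, possibly in another process: reload the checkpoint and append
// the human's reply as an ordinary tool result tied to the pending call.
function resumeWithReply(
  store: CheckpointStore,
  runId: string,
  reply: string,
): RunState {
  const state = store.load(runId);
  if (state.status !== "paused" || !state.pending) {
    throw new Error(`run ${runId} is not waiting on a human`);
  }
  const resumed: RunState = {
    runId: state.runId,
    status: "running",
    messages: [
      ...state.messages,
      { role: "tool", toolCallId: state.pending.toolCallId, content: reply },
    ],
  };
  store.save(resumed);
  return resumed;
}
```

The design point is that the model never needs a special "human" channel. From its side, asking a person is just a slow tool call, and the checkpoint is what lets the run wait minutes or days without holding a process open.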

On the research side, ReaLM-Retrieve argues reasoning models should retrieve during inference, not before, and reports a 10.1% F1 improvement over RAG with 47% fewer retrieval calls. OCR-Memory stores long trajectories as images with indexed anchors and reports SOTA on Mind2Web and AppWorld under tight context. Meta FAIR has a self-improving pretraining method where a strong post-trained model rewrites pretraining suffixes and judges rollouts during RL-style pretraining, with reported 36.2% relative gains in factuality and 18.5% in safety. Microsoft built 1,000 synthetic computers with realistic files for 8-hour, 2,000-turn agent simulations. The local-LLM corner is busy too: PFlash claims a 10x prefill speedup over llama.cpp at 128K on a 3090 (with healthy skepticism in the comments), and Qwen-Scope dropped Sparse Autoencoders for Qwen 3.5 models from 2B up to a 35B MoE, which may be the largest open-source interpretability tool yet for dense models.

Key takeaways:

The competitive frontier in coding agents is shifting from raw model quality to harness UX, with Codex setting the pace and open-weight models like DeepSeek V4 Pro now in the same conversation.
Durable execution, multi-user data isolation, delegated credentials, and explicit HITL hooks are becoming first-class runtime concerns rather than afterthoughts.
Open-weight trillion-parameter MoE models are 3 to 8 points behind frontier closed models on the Intelligence Index, with the gap concentrated in hardest-task and hallucination benchmarks.

Why do I care: For a senior frontend engineer or architect, the LangChain multi-user agent guidance is the part to bookmark. Data isolation, delegated credentials, and operator RBAC are exactly the questions you need answered before any agent feature ships into a real product. The Codex versus Claude Code framing is also useful when picking a coding harness for a team. If you ship fast and iterate, taste matters less than time-to-first-token and tool-call economy. If you're heads-down on consultancy work, the DeepSeek V4 Pro report is the strongest hint yet that self-hosted coding agents are about to stop being a downgrade. The research papers are more relevant for ML and infra audiences, but the OCR-Memory idea of storing agent trajectories as images is the kind of thing a frontend architect should keep in their back pocket for any long-running assistant feature.

https://www.latent.space/p/ainews-ai-engineer-worlds-fair-autoresearch