Agents for Everything Else: Codex for Knowledge Work, Claude for Creative Work

On the Unsupervised Learning pod we discussed the thesis that "coding agents are breaking containment", and that talk is published today.

Both Claude and Codex had very big weeks. Today's big Codex update was "Codex for Work", basically a landing page that pitches Codex for Knowledge Work (not just coding). But it is more than a landing page update: the latest Codex now has 42% faster computer use (CUA), a responsive browser, /chronicle, /goal, and onboarding that encourages you to plug into the Microsoft/Google/Salesforce suites.

In short: Tibo says "Codex now available for non-coders", Greg says "Codex is for everyone, for any task done with a computer", and Sam says "try it for non-coding computer work."

Anthropic launched Claude Security, a repo vulnerability scanner. But probably the bigger Claude news this week was new support for creative tools such as Blender, Autodesk, Adobe Creative Cloud, Ableton, Splice, Canva Affinity, and more.

GPT-5.5 is now credibly in the top tier for long-horizon cyber tasks: the UK AI Security Institute reported that GPT-5.5 became the second model to complete one of its multi-step cyber-attack simulations end-to-end, with an average pass rate of 71.4% versus 68.6% for Mythos.

Codex is moving beyond coding into general computer work: OpenAI shipped a substantial Codex update framed explicitly as "for everyone, for any task done with a computer," with role-based onboarding, app connections, and workflows spanning docs, slides, spreadsheets, research, and planning.

Qwen3.6 27B looks like the most important open-weight release: Artificial Analysis ranked Qwen3.6 27B as the new open-weights leader under 150B parameters with an Intelligence Index score of 46, ahead of Gemma 4 31B and prior Qwen variants. Key details: Apache 2.0, 262K context, native multimodal input, and BF16 weights small enough to fit on a single H100.
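The single-H100 claim follows from simple arithmetic, sketched below. This is a rough estimate under stated assumptions, not a deployment guide: it assumes BF16 stores 2 bytes per parameter, an 80 GB H100 variant, and it counts only the weights (no KV cache, activations, or runtime overhead). The function name `weights_gb` is ours, for illustration only.

```python
# Back-of-the-envelope check: do 27B BF16 weights fit on one H100?
# Assumptions (not from the release notes): 80 GB H100, weights only.

BYTES_PER_BF16_PARAM = 2   # BF16 is a 16-bit format
H100_MEMORY_GB = 80        # assumed 80 GB variant


def weights_gb(params_billions: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate weight size in GB (10^9 bytes) for a dense model."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


size_gb = weights_gb(27)
headroom_gb = H100_MEMORY_GB - size_gb

print(f"weights: ~{size_gb} GB, headroom on one H100: ~{headroom_gb} GB")
```

The remaining headroom is what has to hold the KV cache and activations, so how much of the 262K context is usable on a single card depends on the serving setup.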

xAI's Grok 4.3 improved sharply on agentic benchmarks while getting cheaper: Artificial Analysis measured Grok 4.3 at 53 on the Intelligence Index, up four points from Grok 4.20 v2, with approximately 40% lower input price and 60% lower output price than the prior version.

DeepSeek's multimodal direction appears tightly coupled to computer-use agents: DeepSeek trains vision into V4-Flash by having the model directly output bounding boxes and point coordinates during reasoning, which has been interpreted as a computer-use-oriented design rather than generic VLM work.

There is a clear shift from model-centric bragging to harness-centric engineering: Cursor published a strong note on how it tests and tunes its agent harness, focusing on runtime, evals, degradation repair, and model-specific customization.

Open-source package compromise remains an acute operational risk: Socket reported that the popular PyPI package lightning was compromised in versions 2.6.2 and 2.6.3, with malicious code executing on import, downloading Bun, and running an 11 MB obfuscated JavaScript payload aimed at credential theft.
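Since the payload reportedly runs on import, a quick exposure check should avoid importing the package at all. Below is a minimal sketch that reads the installed version from package metadata instead. The version set comes from the Socket report as summarized above; the helper names `installed_version` and `is_compromised` are ours, and this is no substitute for a proper audit or credential rotation if a bad version was ever installed.

```python
# Check for the compromised lightning releases WITHOUT importing the package.
# importlib.metadata reads installed distribution metadata only, so the
# package's own code is not executed by this lookup.
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Versions named in the report summarized above.
COMPROMISED_VERSIONS = {"2.6.2", "2.6.3"}


def installed_version(package: str = "lightning") -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_compromised(ver: Optional[str]) -> bool:
    """True if the given version is one of the reported bad releases."""
    return ver in COMPROMISED_VERSIONS


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("lightning is not installed in this environment")
    elif is_compromised(found):
        print(f"lightning {found} is a reported compromised version")
    else:
        print(f"lightning {found} is not in the reported compromised set")
```

This only inspects the current environment; other virtualenvs, containers, and CI caches need the same check.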

Security scanners are becoming first-class AI products: Anthropic rolled out Claude Security, described as a repo vulnerability scanner that validates findings and suggests fixes, powered by Opus 4.7.
