GPT-5.5 Tops the Charts, Kimi K2.6 Challenges Open-Weights, and AI's Climate Debt Keeps Growing
Published on 01.05.2026
TLDR
GPT-5.5 is OpenAI's most capable model yet and also its most confidently wrong one. Moonshot AI shipped Kimi K2.6, a 1-trillion-parameter open-weights model that can run autonomous coding sessions for days. And the AI industry's energy consumption is making a mess of every sustainability pledge the big tech companies ever made.
GPT-5.5: The Best and the Most Overconfident
OpenAI released GPT-5.5 this week, its latest vision-language model built for agentic coding, computer use, and knowledge work. GPT-5.5 Pro is the same model with parallel reasoning during inference. API pricing landed at roughly double GPT-5.4's rates.
Here's what's interesting about the numbers: GPT-5.5 tops the Artificial Analysis Intelligence Index with a score of 60 points, beating Claude Opus 4.7 and Gemini 3.1 Pro Preview (both at 57). On ARC-AGI-2 visual puzzles, it hit 85.0 percent at $1.87 per task, beating Gemini 3 Deep Think's previous record of 84.6 percent at $13.62 per task. That cost difference matters. Getting better results at roughly a seventh of the price is not incremental progress.
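A quick sanity check on that cost comparison, using the two per-task figures from the benchmark results:

```python
# Per-task cost on ARC-AGI-2: GPT-5.5 vs Gemini 3 Deep Think's previous record.
gpt_cost, gemini_cost = 1.87, 13.62  # USD per task, as reported

ratio = gemini_cost / gpt_cost
print(round(ratio, 1))  # 7.3 -- GPT-5.5 costs roughly a seventh as much per task
```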
But here's the catch that I keep coming back to: on AA-Omniscience Accuracy, GPT-5.5 posted the highest factual recall at 57 percent. On AA-Omniscience Index, which also penalizes confident wrong answers, it fell to third place with 20 points. The model knows a lot and doesn't always know what it doesn't know. That gap between raw accuracy and calibrated accuracy is exactly the thing that bites you in production. You can catch a model that says "I don't know." You can't easily catch one that confidently invents.
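The accuracy-versus-index gap falls out of how an abstention-aware score works. Here's a minimal sketch under one plausible scoring rule (correct +1, confident wrong −1, "I don't know" 0); the actual AA-Omniscience formula may differ, and the answer mix below is invented to reproduce the article's two numbers:

```python
# Raw accuracy counts only correct answers; a calibration-aware index
# (in the spirit of AA-Omniscience) also penalizes confident wrong answers
# and treats abstentions as neutral.
def raw_accuracy(answers):
    return sum(a == "correct" for a in answers) / len(answers)

def calibrated_index(answers):
    score = {"correct": 1, "wrong": -1, "abstain": 0}  # assumed scoring rule
    return sum(score[a] for a in answers) / len(answers)

# Hypothetical mix: 57 correct, 37 confidently wrong, 6 abstentions
answers = ["correct"] * 57 + ["wrong"] * 37 + ["abstain"] * 6

print(raw_accuracy(answers))     # 0.57 -- highest raw recall
print(calibrated_index(answers)) # 0.2  -- (57 - 37) / 100, much lower index
```

The same answer sheet scores well on one metric and poorly on the other; a model that abstained more would trade a little accuracy for a much better index.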
The specs are solid: up to 1 million tokens via API, five reasoning levels (xhigh through none), tool search that loads tools on demand rather than all at once, and a Fast mode in Codex that generates tokens 1.5 times faster at 2.5 times the price. GPT-5.5 is the fourth flagship launch in rapid succession from OpenAI right now. The pace is dizzying.
Kimi K2.6: Multi-Day Autonomy from an Open Model
Moonshot AI updated their Kimi line with K2.6, a 1-trillion-parameter mixture-of-experts vision-language model. It's designed around a plan-write-test-debug loop that can run for days, and it can instantiate hundreds of agents working on a single task simultaneously. It produces fewer hallucinations than its predecessor.
The architecture: 1 trillion total parameters, 32 billion active per token, a MoonViT vision encoder, native INT4 quantization, and a "preserve thinking" mode. Input handles text, images, and video up to 256,000 tokens. The weights are free to download from Hugging Face under a modified MIT license that allows commercial use but requires attribution from companies with more than 100 million monthly active users or $20 million in monthly revenue. API pricing is $0.95 per million input tokens and $4.00 per million output tokens.
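Those per-million rates make back-of-the-envelope cost estimates easy. A small sketch, with the token volumes purely hypothetical:

```python
# Cost estimate at the listed K2.6 API rates.
INPUT_PER_M = 0.95   # USD per million input tokens
OUTPUT_PER_M = 4.00  # USD per million output tokens

def session_cost(input_tokens, output_tokens):
    """Total API cost in USD for one session."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. a long agentic session re-reading lots of context:
# 200M input tokens, 5M output tokens (hypothetical volumes)
print(round(session_cost(200e6, 5e6), 2))  # 210.0
```

For multi-day agent runs, input tokens dominate because the context gets re-read on every step, which is why the low input rate matters more than the output rate here.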
On benchmarks, K2.6 runs neck and neck with Qwen3.6 Max Preview and the newly released DeepSeek V4, sitting just behind the top closed models. What I find more interesting than the benchmark position is the autonomy arc. Moonshot started with short reasoning traces, then multi-step tool use, then multi-hour coding sessions, and now multi-day projects. Each extension reduces how often a human needs to check in. That's the actual product direction, and it's happening fast.
The "We're thinking" note from The Batch makes a good point: sustained autonomy and low hallucination rates are related, but the link is loosening. An agent that can check its own work, find mistakes, and fix them has a different error profile than one that runs straight through. The correction loop matters more than the initial accuracy.
AI's Climate Debt: Tech's Net-Zero Pledges Are Fraying
The Batch pulled together sustainability report data from Alphabet, Amazon, Meta, and Microsoft this week, and the picture is not flattering.
- Alphabet called its 2030 net-zero goal a "moonshot" in its most recent Environmental Report. Total greenhouse-gas emissions increased 54 percent between 2019 and 2024, even as emissions per unit of computation dropped. It's building data centers partially powered by natural gas in North Texas while nuclear and geothermal investments sit waiting to scale.
- Amazon's total carbon emissions are up 33 percent since 2019. It's building natural-gas plants in Mississippi and Indiana.
- Meta's emissions grew over 60 percent between 2020 and 2024 while data center electricity consumption nearly tripled. It's building a private 5-gigawatt gas-powered plant in rural Louisiana for what will be its largest facility ever.
- Microsoft's emissions are up 23 percent since 2020, and it recently signed an agreement with Chevron to build a natural-gas plant, even after its 20-year purchase agreement to restart Three Mile Island.
The pattern is consistent. Every company invested in wind, solar, geothermal, and nuclear. Every company saw their absolute emissions rise because AI demand outpaced efficiency gains and clean energy deployment. Microsoft described their current goal as a "marathon, not a sprint." That's an honest reframe. These are hard problems to solve at this scale and this speed. But it's also worth naming directly: the industry is emitting more now than when it made most of these pledges.
LLMs Play Rock-Paper-Scissors Better Than You Do
Caroline Wang and colleagues at the University of Texas at Austin and Google studied how humans and LLMs behave in rock-paper-scissors. The finding: LLMs sometimes model their opponents with greater sophistication than people do.
In the research setup, they pitted individual LLMs against each other and against humans in sequential rounds. They looked at how well each player tracked and predicted the other's strategy. The result is a bit counterintuitive given all the discussion about LLMs lacking true reasoning. A model can, apparently, encode a gaming strategy more systematically than the average human.
The editorial take is right here too: it's tempting to assume LLMs mimic human behavior because they train on human-generated text. Finding that they can out-strategize humans in adversarial games points to something different happening. They're not just pattern-matching to what humans did. They're doing something that, at least in this narrow context, looks like better strategic modeling. Whether that scales to other domains is the question. But it complicates the "just stochastic parrots" narrative considerably.
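The flavor of opponent modeling at stake can be illustrated with the simplest possible version, a frequency counter that predicts the opponent's most common move and plays its counter. The study's actual analysis is more sophisticated than this; the sketch just makes "tracking the other player's strategy" concrete:

```python
from collections import Counter

# Which move beats which in rock-paper-scissors.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def counter_move(opponent_history):
    """Predict the opponent's most frequent past move and play its counter."""
    if not opponent_history:
        return "rock"  # arbitrary opening move
    predicted = Counter(opponent_history).most_common(1)[0][0]
    return BEATS[predicted]

# Opponent has thrown rock twice and paper once: predict rock, answer with paper.
print(counter_move(["rock", "rock", "paper"]))  # paper
```

A human who notices this strategy can exploit it, of course; the interesting result is how consistently the models climb this modeling ladder compared with people.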
Andrew Ng's New Prompting Course
Andrew Ng launched "AI Prompting for Everyone" this week, aimed at helping people use LLMs the way they actually work in 2026 rather than 2022. The course covers deep research mode, providing rich context with documents and images, extended thinking for important decisions, and using AI for image generation, data analysis, and simple app building. No technical background required.
The framing is honest: most people still use LLMs by asking short questions. The models have moved well past that. This course is trying to close the gap.
Key Takeaways
- GPT-5.5 leads major benchmarks but is more likely to confidently produce wrong answers than competing models at the same capability tier
- Kimi K2.6's 1-trillion-parameter open-weights model can run autonomous coding sessions for days with hundreds of parallel agents, available free under a modified MIT license
- Alphabet, Amazon, Meta, and Microsoft all saw absolute emissions grow significantly since 2019-2020 despite clean energy investments, as AI energy demand outpaced efficiency gains
- Research from UT Austin and Google found LLMs can model opponent strategy in rock-paper-scissors with greater sophistication than many human players
- The AI frontier is moving at a pace where four flagship model launches in rapid succession is becoming normal