Published on 03.02.2026
TLDR: When Kilo Code made Kimi K2.5 free for a week, usage exploded to 3x projections. Developers loved its architectural reasoning capabilities, but the model's verbosity and expensive output tokens undermined promised cost savings from context caching.
Last week, Kilo Code made a bold move: they gave developers free access to Kimi K2.5, the latest reasoning model from Moonshot AI. The results surprised even optimistic forecasters. Within hours, the floodgates opened. Usage surged past 50 billion tokens per day on OpenRouter, triple the initial projections. This wasn't idle experimentation. Developers integrated K2.5 into actual production workflows, ran it through complex architectural challenges, and stress-tested its reasoning against real-world problems.
The story here matters because it reveals something important about the developer mindset: when you provide access to genuinely powerful tools, especially at no cost, developers will find ways to leverage them immediately. The enthusiasm wasn't manufactured hype. It was raw, pragmatic demand.
Perhaps the most telling metric was how quickly K2.5 ascended in Architect mode, Kilo's specialized context for system design and architectural planning. Within days, it became one of the top-performing models for this task category. Even exceptional models normally take weeks to climb these rankings, and K2.5 didn't coast on novelty: developers praised its ability to reason through complex codebases, suggest meaningful refactoring strategies, and maintain coherent context across massive projects.
This matters for architects and engineering teams because it signals a shift in what's possible with AI-assisted design. You can now use K2.5 to explore architectural decisions at scale, stress-test your assumptions against a model that understands system complexity, and get back reasoning that accounts for dependencies and tradeoffs across your entire codebase.
The bigger picture emerging from this launch confirms something the industry has been watching quietly: the gap between open-source models and enterprise-grade proprietary solutions keeps shrinking. Kimi K2.5 demonstrates this convergence clearly. Open-source models are handling production-level complexity, maintaining reliability under load, and delivering results that would have been locked behind premium pricing tiers just a year ago.
Moonshot AI has been pushing hard on visual understanding, and now it's making genuine leaps in reasoning and tool calling. The implication for teams building AI-powered development tools is significant: you have real options now. The binary choice between "free but limited" and "expensive but capable" is breaking down.
Here's where the story gets more nuanced. Kimi K2.5's headline feature is automatic context caching. The marketing promise is compelling: cut input costs from $0.60 per million tokens to $0.10 per million for cached tokens, a reduction of more than 80%. In theory, this makes K2.5 incredibly cost-effective for applications that reuse context across multiple requests.
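To see what those list prices mean in practice, here's a back-of-envelope sketch. The per-token prices are the figures quoted above; the cache hit rate and request size are hypothetical, and actual billing depends on how Moonshot counts cache hits.

```python
# Back-of-envelope input-cost model for K2.5's automatic caching.
# Prices are the list figures quoted above; the cache hit rate and
# request size are hypothetical.
UNCACHED_PER_M = 0.60  # $ per million input tokens (cache miss)
CACHED_PER_M = 0.10    # $ per million input tokens (cache hit)

def input_cost(input_tokens: int, cache_hit_rate: float) -> float:
    """Dollar cost of one request's input tokens, split by cache status."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (cached * CACHED_PER_M + uncached * UNCACHED_PER_M) / 1_000_000

# A long-context session that reuses a 200k-token prompt prefix:
print(input_cost(200_000, cache_hit_rate=0.9))  # ~$0.03, vs $0.12 fully uncached
```

On the input side alone, the savings are real: a 90% hit rate cuts the bill by roughly three quarters.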
The reality is more complicated. The caching works exactly as advertised—it's automatic, requires zero configuration, and genuinely reduces cached input token costs. But the model's behavior patterns undermine those savings faster than you'd expect. According to Artificial Analysis benchmarks, Kimi K2's reasoning mode consumed 140 million tokens to complete their Intelligence Index evaluations. That's roughly 2.5 times more tokens than DeepSeek-V3.2 generated, and double what GPT-5 Codex needed.
The root cause isn't broken caching. It's verbosity. K2.5 generates extensive internal reasoning tokens, and in agent mode it can execute up to 1,500 tool calls per task, which makes output tokens your real cost driver. When output tokens cost around $3.00 per million but cached inputs only cost $0.10 per million, the projected input savings evaporate quickly. As one Hacker News observer bluntly put it, K2.5 is "10x the price per output token" compared to competitors.
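A rough per-task model makes the imbalance concrete. The prices are the article's figures; the workload shape (number of tool calls, cached prefix size, output per call) is entirely hypothetical:

```python
# Per-task cost under the article's list prices. The workload shape
# (tool calls, cached prefix, output per call) is entirely hypothetical.
PRICES = {"input": 0.60, "cached_input": 0.10, "output": 3.00}  # $ per M tokens

def task_cost(cached_in: int, fresh_in: int, out: int) -> float:
    """Total dollars for one agent task, given token counts by category."""
    return (cached_in * PRICES["cached_input"]
            + fresh_in * PRICES["input"]
            + out * PRICES["output"]) / 1_000_000

# 50 tool calls, each reusing a 100k-token cached prefix, adding 2k fresh
# input tokens, and emitting 4k reasoning/output tokens:
cached_in, fresh_in, out = 50 * 100_000, 50 * 2_000, 50 * 4_000
print(task_cost(cached_in, fresh_in, out))  # 1.16: $0.50 cached + $0.06 fresh + $0.60 output
```

Even with a 98% cache hit rate doing exactly what it promises, output is the single largest line item, and doubling reasoning verbosity doubles it.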
For teams on subscription plans with OpenAI or Anthropic, the math gets even less compelling. You can use your GPT-5 Codex subscription directly in Kilo now, which might be a better economic choice depending on your usage patterns.
Here's the inconvenient truth emerging across reasoning models: better thinking produces more tokens, no matter how efficiently the underlying system is engineered. Extensive reasoning requires extensive output, and that applies equally to the other reasoning models hitting the market. The question isn't whether K2.5 is "efficient" (it is); it's whether its superior reasoning justifies the higher output token costs for your specific workloads.
For context, Kilo recently tested MiniMax M2.1 against GLM 4.7 and found that M2.1 offers a better cost-to-benefit ratio for many workflows. It's less capable than K2.5 in some domains but dramatically more efficient in others. That's why M2.1 became the default model for Kilo CLI 1.0, and it's completely free to use.
There were also latency issues during launch week, though those appear to have been resolved once peak traffic subsided.
If you're evaluating K2.5 for production use, the core question is straightforward: does the quality improvement over cheaper alternatives justify the output cost delta? For architectural planning and complex system design reasoning, many teams will answer yes. For straightforward coding tasks, other models might serve you better economically.
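One way to frame that evaluation is cost per successful task rather than cost per call. Every number below is a placeholder to be replaced with your own measurements:

```python
# Break-even framing: compare cost per *successful* task, not per call.
# Every number below is a placeholder for your own measurements.
def cost_per_success(cost_per_task: float, success_rate: float) -> float:
    """Expected spend to get one accepted result, retrying on failure."""
    return cost_per_task / success_rate

expensive = cost_per_success(1.16, success_rate=0.85)  # ~$1.36 per success
cheaper = cost_per_success(0.25, success_rate=0.55)    # ~$0.45 per success
print(expensive, cheaper)
```

On raw spend the cheaper model wins here, but if each failed attempt also burns reviewer time, fold that into the per-task cost before deciding; for high-stakes architectural work, the gap can close fast.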
The bigger implication is that teams now need to be sophisticated about model selection. The future might involve spending $100K per developer annually on AI tools, and understanding cost dynamics across different models and use cases isn't premature optimization; it's responsible engineering. Tools like Kilo Pass are emerging to help teams stay flexible by trying different models and finding the right fit for specific workflows.