The Free Tier Is a Data Collection Strategy
Published on April 30, 2026
TL;DR
The public internet is nearly exhausted as a training data source, and every major AI lab knows it. Free coding assistants, billion-dollar acquisitions, and distillation attacks on competitors are all different answers to the same question: where do we get data that actually differentiates our next model? If you're a developer using these tools, you're part of the answer.
The Data Wall Is Real
Epoch AI estimated that around 300 trillion tokens of public human-generated text exist on the internet. At current training rates, labs will burn through all of it somewhere between 2026 and 2032. Factor in overtraining, and that estimate moves up to roughly now. In practical terms, we're already there.
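To see how fast a fixed stock runs out under exponential growth, here's a back-of-the-envelope sketch. The 300-trillion-token stock is the estimate cited above; the per-run token usage and yearly growth rate are illustrative assumptions of mine, not Epoch AI's figures:

```python
# Back-of-the-envelope: when does a growing training run outstrip the
# public-text stock? STOCK is the ~300T-token estimate cited above;
# the usage and growth numbers are illustrative assumptions only.
STOCK = 300e12   # tokens of public human-generated text
usage = 15e12    # assumed tokens consumed by a frontier run today
growth = 2.5     # assumed yearly multiplier on training-set size

year = 2026
while usage < STOCK:
    usage *= growth
    year += 1

print(year)  # first year the assumed run needs more tokens than exist
```

Under these toy numbers the wall arrives around 2030, comfortably inside the 2026–2032 window; the point is how insensitive the answer is to the starting assumptions, since exponential growth eats a fixed stock in a handful of doublings either way.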
Every major lab has already ingested Wikipedia, Reddit, Stack Overflow, most of GitHub, most published books. What's left is either behind paywalls, low quality, or already contaminated with AI-generated content feeding back into the training pool. That last one is genuinely uncomfortable to think about. Models trained on synthetic content from earlier models, in a loop, degrading quietly.
The practical consequence is that frontier models are converging. The gap between Claude, GPT-4, and Gemini on most coding tasks has shrunk to near-negligible. API pricing dropped 60-80% between early 2025 and now. When everyone trains on the same internet, you get roughly the same model. The product is commoditizing faster than most people realize.
Three Ways Labs Are Responding
Here's what I keep thinking about: the strategies the labs are running to solve this are all essentially the same bet, just dressed up differently.
Give away the tools, collect the signal. Google Gemini CLI is free and open source. GitHub Copilot has a free tier. OpenAI is handing out Codex credits. Anthropic gives Claude Code users far more than a subscription's worth of tokens. None of this is charity. When you use an AI coding assistant, you generate training signal that doesn't exist anywhere on the public internet: what problem you were actually solving, what you tried first, what the model suggested, what you accepted, what you rewrote. A Stack Overflow answer shows correct code. Your coding session shows correct code in context. That's worth a lot more.
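To make concrete what "training signal that doesn't exist on the public internet" looks like, here's a hypothetical sketch of the kind of record an assistant could log per interaction. The schema and field names are my illustration, not any vendor's actual telemetry:

```python
from dataclasses import dataclass

# Hypothetical schema for one assistant interaction -- illustrative only,
# not any vendor's real telemetry format. Each field is a kind of signal
# a public Stack Overflow answer simply cannot capture.
@dataclass
class InteractionEvent:
    task_context: str       # what the developer was actually working on
    prompt: str             # what they asked, including failed phrasings
    suggestion: str         # what the model proposed
    accepted: bool          # did the suggestion land as-is?
    final_code: str         # what actually got committed after edits
    edit_distance: int = 0  # how far the final code drifted from the suggestion

event = InteractionEvent(
    task_context="fixing a race in the job queue",
    prompt="make dequeue thread-safe",
    suggestion="with self._lock: ...",
    accepted=True,
    final_code="with self._lock: item = self._q.popleft()",
    edit_distance=28,
)
print(event.accepted)
```

The `accepted` / `final_code` pair is the valuable part: it's implicit human preference labeling, produced for free, at scale, by exactly the experts whose judgment the labs want to learn.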
Buy companies that already have the data. SpaceX's reported $60 billion option to acquire Cursor makes little sense as a product play; the valuation can't be justified on subscription revenue. It makes sense as the acquisition of a data pipeline: millions of senior engineers writing production code, generating continuous high-quality interaction data. SpaceX's own framing gave this away, describing Cursor's value as its "distribution to expert software engineers." That's a description of a data asset.
Same logic with OpenAI's $3 billion bid for Windsurf, and Google's $2.4 billion licensing deal after that fell apart. They didn't need another IDE. They needed Windsurf's users, and the workflows those users generate every day.
Distill from competitors. Anthropic accused DeepSeek, Moonshot AI, and MiniMax of running large-scale distillation campaigns: roughly 24,000 accounts, over 16 million exchanges, using commercial proxy services to get around China access restrictions. MiniMax alone accounted for 13 million of those exchanges. OpenAI made similar claims to U.S. legislators. Distilling your own models to build smaller, cheaper versions is normal practice. Systematically extracting a competitor's capabilities through a fake account network is a different thing.
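For readers unfamiliar with the "normal practice" version: distillation trains a small student model to match a larger teacher's softened output distribution, in the style of Hinton et al. A minimal sketch, with toy logits standing in for real model outputs:

```python
import math

# Minimal sketch of soft-label distillation: the student is trained to
# match the teacher's temperature-softened output distribution.
# Logits here are toy numbers for illustration, not real model outputs.

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]  # toy teacher logits for one token
student = [2.5, 1.2, 0.3]  # toy student logits
loss = distillation_loss(teacher, student)
print(loss)  # small positive value; zero would mean a perfect match
```

Done against your own models, this is how every lab ships cheap fast variants. The accusation above is the same mechanism pointed outward: harvest a competitor's outputs at scale through fake accounts, then use them as the teacher signal.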
What This Means If You're a Developer
You're the resource. That's not a reason to panic, but it is a reason to be deliberate. The companies building your tools need your workflows as much as you need their models. You trade real data, produced by real expertise, for AI assistance that's subsidized well below cost.
I use these tools constantly. Honestly, my productivity without them would take a serious hit. But I've started paying more attention to which tools I use and what exactly their terms say about training. A tool that runs locally, routes to whichever model you choose, and doesn't have its own model to train has a very different data relationship with you than one that processes everything through a single vendor's cloud.
The Kilo framing in this newsletter is self-serving, but the underlying point is fair. Architecture choices matter. Where your prompts go, who sees them, and whether they end up in a future training run are questions worth asking before you type your internal codebase context into an autocomplete box.
The Commodity Floor
Models will fully commoditize within the next year or two. I'd bet on it. The differentiation will shift to execution speed, privacy guarantees, tool integration, and how well the assistant fits into your actual workflow. Raw model intelligence is already a weak differentiator for most day-to-day coding tasks.
The labs will keep hunting for new data sources, moving beyond coding into design, writing, data analysis, anywhere experts make decisions that generate signal. The arms race doesn't stop; it just migrates.
For now, the play is straightforward. Use the free tools while they're genuinely useful. Understand the trade-off. Don't get locked into any single vendor's ecosystem. The subsidies exist because you have something they want. That dynamic is worth keeping in mind.
Key Takeaways
- Public internet training data is nearly exhausted; frontier models are converging as a result
- Free AI coding tools (Gemini CLI, Copilot free tier, Claude Code, Codex credits) are primarily data collection mechanisms, not acts of generosity
- Billion-dollar acquisitions of Cursor and Windsurf are data moat plays, not product acquisitions
- Distillation attacks (24,000+ fake accounts against Anthropic, similar claims vs OpenAI) show the desperation for new training signal
- API costs dropped 60-80% in 2025 alone; model intelligence is becoming a commodity
- Developers should read ToS carefully, prefer tools with local execution or model-agnostic routing, and avoid single-vendor lock-in