Gemini 3.1 Pro Seizes Benchmarks, SaaSpocalypse Shakes Software Stocks, and Local AI Gains Ground

Published on 27.02.2026

AI & AGENTS

Gemini 3.1 Pro Takes the Lead on Benchmarks While Costing Less

TLDR: Google launched Gemini 3.1 Pro Preview, a mixture-of-experts transformer that topped the Artificial Analysis Intelligence Index and achieved state-of-the-art results on multiple benchmarks including ARC-AGI-2, GPQA Diamond, and Humanity's Last Exam. The kicker: it did so at the same price as its predecessor, suggesting model quality improvements rather than brute-force inference spending.

Summary:

Google just released Gemini 3.1 Pro Preview, and the benchmark results are turning heads. On the Artificial Analysis Intelligence Index, a weighted average of 10 benchmarks focused on economically useful work, Gemini 3.1 Pro Preview with reasoning scored 57 points at a cost of $892. Compare that to Claude Opus 4.6 at max reasoning hitting 53 points for $2,486, or GPT-5.2 at xhigh reasoning managing 51 points for $2,304. The open-weights GLM-5 scored 50 points at $547. So Google is delivering better results at significantly lower cost than the competition.

The architecture is a sparse mixture-of-experts transformer, pretrained on text, code, images, audio, and video from web scrapes, licensed materials, Google user data, and synthetic data. It was fine-tuned via reinforcement learning on datasets covering multi-step reasoning, problem solving, and theorem proving. The model accepts up to 1 million tokens of input across text, images, PDFs, audio, and video, and can produce up to 64,000 tokens of output at 108.6 tokens per second. It supports tool use including Google Search, Python code execution, file search, and function calling, plus structured outputs and three adjustable reasoning levels.

On ARC-AGI-2 visual logic puzzles, the model achieved 77.1 percent at just $0.96 per task, more than doubling its predecessor's 31.1 percent and outperforming Claude Opus 4.6 at high reasoning, which hit 69.2 percent at $3.47 per task. However, it is not all roses. Gemini 3.1 Pro Preview placed only seventh in coding on Arena, which ranks models by user preference in blind comparisons. And on Artificial Analysis' GDPval-AA agentic benchmark, it managed only 40 percent, trailing Claude Sonnet 4.6 at max reasoning with 57 percent and GLM-5 at 45 percent.

What is genuinely interesting here is the economics. The performance gains appear to stem from improved model quality rather than burning more tokens at inference. Gemini 3.1 Pro consumed roughly the same number of tokens as its predecessor while scoring significantly higher. That tells us there is still headroom in model refinement without inflating costs. The comparison between Gemini 3.1 Pro Preview and Gemini 3.1 Deep Think is also revealing: the Pro model reaches within 10 percent of Deep Think's 85 percent on ARC-AGI-2 but costs 13 times less per task. That is a strong incentive to reserve the heavy reasoning mode for truly hard problems.

What the article avoids discussing is the elephant in the room: Google disclosed virtually nothing about parameter count, architecture details, or training methods. The model card literally points to the previous version's card. For a model claiming state-of-the-art results, the lack of reproducibility information is notable. Also missing is any discussion of how these benchmark results translate to real production workloads, where the agentic benchmark weakness could be far more consequential than any number of academic test scores.

Key takeaways:

  • Gemini 3.1 Pro Preview topped the Artificial Analysis Intelligence Index at roughly one-third the cost of Claude Opus 4.6 and GPT-5.2
  • On ARC-AGI-2, it more than doubled its predecessor's score and beat Claude Opus 4.6 while costing less than a dollar per task
  • Performance gains came from model quality improvements, not increased inference compute
  • Weak spots remain in agentic tasks and user-preference rankings, suggesting benchmark leadership does not tell the whole story
  • API pricing is $2 per million input tokens for contexts under 200K tokens, rising to $4 for larger contexts

Google launches Gemini 3.1 Pro Preview


Global AI Summit Shifts From Fear to Practical Benefit

TLDR: The fourth global AI summit in New Delhi marked a decisive shift from obsessing over theoretical AI catastrophes to focusing on spreading AI's practical benefits worldwide. More than 85 countries endorsed the non-binding New Delhi Declaration, while major tech companies expanded their presences in India with billions in commitments.

Summary:

The AI Impact Summit in New Delhi from February 16 to 20 was billed as the first global AI summit hosted in the global south, and it represented a meaningful tonal shift from earlier gatherings. Where the inaugural 2023 summit was dominated by sci-fi doomsday scenarios, this year the conversation centered on realistic issues: ensuring access to compute, improving connectivity, and encouraging market competition. That is a welcome maturation.

India positioned itself as a counterweight to the US and China in the AI race. Prime Minister Modi promoted India's supply of tech talent, public technology infrastructure, and startup ecosystem, which is the third largest globally. The country backed up the rhetoric with action: as the conference opened, India allocated $1.1 billion to fund startups in AI and other high-tech fields, and the government is separately funding domestic startups to build models that can process all 22 of India's officially recognized languages while running on relatively modest compute budgets.

The corporate commitments were substantial. Anthropic and OpenAI opened offices in Bengaluru and Mumbai. OpenAI announced an agreement to use Tata Consultancy Services data centers for its ChatGPT Enterprise service. Google committed to building an AI hub in Visakhapatnam and promised additional subsea cables between India, the US, and other countries. Amazon plans $35 billion by 2030, Microsoft $17.5 billion over four years, and Google $15 billion over five years. These are not trivial numbers.

More than 85 countries, including the US and China, endorsed the New Delhi Declaration on AI Impact, a non-binding agreement emphasizing seven principles: democratization of AI resources, social empowerment, economic growth, secure and trusted AI, AI for science, nurturing of skills and education, and sustainable AI systems. However, critics had valid complaints. Amnesty International called the summit "largely irrelevant and ineffective at advancing binding rights protections" and pointed out that AI in India already contributes to "systems of mass surveillance in an already pernicious context of civil rights abuses." Also worth noting: Chinese participation was "almost nonexistent" as the schedule overlapped with Chinese New Year.

What is missing from the optimistic framing is any honest accounting of how non-binding declarations translate to actual governance. We have now had four of these summits, and the concrete regulatory output remains essentially zero. The corporate commitments to India are real, but they serve those companies' growth strategies first and foremost. The summit's focus on practical benefit is better than theoretical hand-wringing, but the gap between aspirational principles and enforceable protections remains enormous.

Key takeaways:

  • Over 85 countries endorsed the non-binding New Delhi Declaration on AI Impact
  • India positioned itself as a major AI player with $1.1 billion in startup funding and investments from Google, Microsoft, Amazon, Anthropic, and OpenAI
  • The summit's tone shifted from theoretical AI risks to practical deployment and access questions
  • Human rights organizations criticized the lack of binding governance commitments
  • Chinese participation was minimal due to scheduling conflicts with Chinese New Year

AI Impact Summit - New Delhi


The SaaSpocalypse: Investors Panic as AI Agents Target Enterprise Software

TLDR: The S&P Software and Services Index lost 25 percent of its value after Anthropic launched Claude Cowork and a series of open-source plugins targeting enterprise work functions. The panic reflects a genuine but overstated fear that AI agents will replicate or bypass traditional SaaS products, though Anthropic's subsequent partnership announcements stabilized the market somewhat.

Summary:

Investors had a collective panic attack over the past six weeks. Between January 12, when Anthropic introduced Claude Cowork, and February 23, the S&P Software and Services Index shed 25 percent of its value. Individual stocks got hammered: LegalZoom tumbled nearly 20 percent, Thomson Reuters dropped 16 percent. Jefferies strategist Jeffrey Favuzza gave the selloff its perfect name: the SaaSpocalypse.

The triggers were sequential and cumulative. On January 30, Anthropic released 11 open-source plugins, each targeting a specific white-collar job function: calendar management, document search, sales, financial analysis, data queries and visualizations, legal review, marketing, customer support, product management, biology research, and a meta-plugin that creates other plugins. Independent developers quickly contributed additional plugins replicating other software functionality. Four days later, the S&P Software and Services Index dropped another 4 percent, wiping out $285 billion in market capitalization. Then on February 20, Anthropic unveiled Claude Code Security, a cybersecurity application for detecting and patching vulnerabilities, triggering a further selloff in security software stocks.

The intellectual argument behind the panic has some merit. AI coding assistants could potentially replicate SaaS functionality. Lee Robinson at Cursor documented how his company replaced their content management system, Sanity, with a custom setup built from scratch, saving tens of thousands in recurring fees. The broader concern is that AI agents could end the "lock-in" effect where customers stay with vendors because they do not want to learn new interfaces. If an agent operates the interface for you, vendor lock-in loses its power.

But then Anthropic extended an olive branch. On February 24, they announced integrations with several of the companies Cowork threatens, including Docusign, FactSet, Google Gmail, Intuit, and Salesforce. Rather than bypassing their applications, new Cowork plugins connect to them directly. SaaS stocks jumped, though they did not recover their earlier losses. As Fintool CEO Nicolas Bustamante explained in a sharp analysis, agents can navigate complex processes and collapse expertise across multiple areas, but systems built on proprietary data, regulatory compliance, network effects, or embedded transactions remain beyond what LLMs can easily replicate.

The real takeaway is not that software is dead. It is that small teams can now build competitive products rapidly. The products with staying power will be those built on moats that LLMs cannot cross. And the Sanity response was telling: their spokesman noted that content management serves collaboration purposes that cannot easily be replicated by a git-based workflow. The SaaSpocalypse is not an extinction event. It is a forcing function for SaaS companies to articulate and defend their actual value propositions.

Key takeaways:

  • The S&P Software and Services Index lost 25 percent of its value in about six weeks following Anthropic's Cowork and plugin launches
  • Anthropic released 11 open-source plugins targeting core enterprise functions plus a meta-plugin for creating new plugins
  • Claude Code Security further rattled cybersecurity software stocks
  • Anthropic partially reversed course by partnering with SaaS companies like Docusign, Salesforce, and Intuit
  • LLMs threaten SaaS products with weak moats but cannot easily replace those built on proprietary data, compliance, or network effects

Tradeoffs:

  • AI agents can eliminate vendor lock-in by operating interfaces on users' behalf, but this trades familiar workflows for agent reliability risks
  • Building custom replacements with AI coding saves subscription costs but creates maintenance burden and loses collaborative features
  • Anthropic's dual strategy of competing with and partnering with SaaS companies creates strategic ambiguity for enterprise buyers

Did AI Really Kill SaaS?


Can Local AI Stand In for the Cloud? Intelligence Per Watt Says Not Yet, But Getting Close

TLDR: Stanford and Together AI researchers introduced the metric "intelligence per watt" to compare local versus cloud AI systems. While cloud still wins on efficiency, local systems have improved 5.3 times between 2023 and 2025, and can already handle about 89 percent of queries correctly while consuming dramatically less power in hybrid scenarios.

Summary:

A research team from Stanford and Together AI tackled a genuinely important question: as demand for LLM inference explodes and data centers multiply, could smaller models running on local hardware meaningfully lighten the load? Their answer, framed through a new metric called "intelligence per watt," is nuanced: not yet, but the trend line is compelling.

The insight is elegant in its simplicity. Cloud systems are typically more energy-efficient per user than local ones, but smaller, high-performance models are closing the gap. The researchers draw a historical parallel to the shift from mainframes to personal computers, which happened when PCs could perform well enough at comparable or lower energy costs. Intelligence per watt is defined as accuracy on a given task divided by power consumed. If local and cloud systems achieve similar accuracy, the one with higher intelligence per watt is the more efficient choice.

The methodology was thorough. The team ran various open-weights models on both laptop and data-center hardware, spanning both recent models like Qwen3, GPT-OSS, Gemma3, and IBM Granite 4.0, and older models like Mixtral-8x7B and Llama 3.1-8B. Hardware ranged from the Apple M4 Max laptop chip and Nvidia H100 server chip to the 2018-vintage Nvidia Quadro RTX 6000. They tested against over a million queries from real-world conversations, science, and academic disciplines, using GPT-4o for evaluating open-ended responses.

The results paint a clear picture with encouraging trends. Cloud computing systems running smaller models on the Nvidia B200 achieved at least 1.4 times higher intelligence per watt than the same models on local chips. But intelligence per watt for local systems rose 5.3 times between 2023 and 2025, driven by both algorithmic advances and hardware improvements. In single-turn chat and reasoning scenarios, local systems running smaller models answered approximately 88.7 percent of queries correctly relative to cloud systems while consuming substantially less power. In simulated hybrid routing scenarios where each query goes to whichever system answers correctly at lowest power, energy savings exceeded 80 percent.

There is a significant caveat the researchers acknowledged: they did not assess the intelligence per watt of proprietary models like GPT-5 because their power consumption is unknown. And when comparing accuracy alone, the most accurate local model, Qwen3-14B, trailed GPT-5, Gemini-2.5-Pro, and Claude Sonnet 4.5 by 11 to 13 percent. That accuracy gap matters in production. The paper also does not address multi-turn conversations, complex agentic workflows, or tasks requiring very large context windows, where the gap between local and cloud would likely widen substantially.

What is genuinely exciting about this research is that it reframes the local AI conversation. Privacy has traditionally been the driving argument for local inference. The prospect of rising intelligence per watt creates a compelling economic argument too. As models get more efficient and local hardware improves, the case for distributing compute away from massive data centers becomes about energy savings and cost, not just data sovereignty.

Key takeaways:

  • Intelligence per watt, a new metric measuring accuracy divided by power consumed, enables systematic comparison of local versus cloud AI
  • Local AI intelligence per watt improved 5.3 times from 2023 to 2025
  • Local systems can handle approximately 89 percent of queries relative to cloud while using dramatically less energy
  • Hybrid routing scenarios showed power savings exceeding 80 percent
  • The best local model still trails top proprietary models by 11 to 13 percent accuracy
  • The economic argument for local AI complements the traditional privacy argument

Tradeoffs:

  • Local inference offers privacy and potential energy savings but sacrifices accuracy on harder tasks
  • Hybrid routing optimizes for efficiency but adds complexity in deciding which queries run locally versus in the cloud
  • Smaller models close the power gap but remain behind proprietary models on absolute accuracy, especially for complex reasoning

Intelligence Per Watt: Local vs Cloud AI