GLM-5 vs MiniMax M2.5: Which Open-Weight Coding Model Wins at Real Tasks?

Published on 27.02.2026

AI & AGENTS

GLM-5 vs MiniMax M2.5: Three Coding Tasks, Two Models, One Clear Takeaway

TLDR: GLM-5 scored 90.5 out of 100 and MiniMax M2.5 scored 88.5 across three autonomous coding tasks covering bug hunting, legacy refactoring, and greenfield API implementation. GLM-5 builds more comprehensively and tests more thoroughly, while MiniMax M2.5 follows instructions more carefully and finishes in half the time.

Summary:

Here is a question that matters a lot more than people realize: when you hand off a coding task entirely to an AI model and walk away for twenty minutes, what do you actually get back? The Kilo Code team designed three TypeScript-based challenges to find out, pitting GLM-5 against MiniMax M2.5 in fully autonomous runs with zero human intervention. Both models score impressively on SWE-bench Verified, with MiniMax at 80.2 percent and GLM-5 at 77.8 percent, putting them within striking distance of GPT-5.2 and Claude Opus 4.6 at dramatically lower cost. But benchmarks are one thing. Shipping code is something else entirely.

The first test was a bug hunt: eight planted bugs across eleven files in a Hono and Prisma task management API, including race conditions, SQL injection vulnerabilities, JWT algorithm confusion, pagination off-by-one errors, and memory leaks. Both models found all eight bugs. The difference came down to discipline. MiniMax M2.5 followed the instruction to make minimal changes more carefully, preserving existing code behavior and documenting every fix with comments. GLM-5 fixed the race condition by adding a transaction but also removed an intentional delay, changing behavior beyond what was asked. This is the kind of subtle overreach that can bite you in production. When you tell a model "fix this, nothing more," you want it to actually mean that.

The second test asked both models to modernize a callback-heavy Express e-commerce API. GLM-5 delivered objectively better architecture here: it used express-validator for input validation with proper format checks and MongoDB ID verification, created custom error classes applied consistently across routes, and produced a cleaner overall structure. But it also broke one endpoint path, changing the API contract. MiniMax M2.5 built a simpler custom validation system that missed some edge cases but kept every single endpoint path intact. This is a real tradeoff that architects face constantly: do you want the model that builds the better solution or the one that does not break your existing contracts?

The third and most telling test was building 27 endpoints from an OpenAPI specification. GLM-5 scored a perfect 35 out of 35. It wrote 94 test cases, used proper composite primary keys for join tables, created reusable authorization middleware, and produced zero bugs. MiniMax M2.5 implemented all endpoints but had a critical authorization bug in the attachments handler, used a non-standard schema pattern with unnecessary ID fields, and wrote only 13 test cases. The difference in test coverage alone is staggering. When you are building from scratch, GLM-5 delivers codebases you could ship without modification. But it took 44 minutes compared to MiniMax M2.5's 21 minutes.

For teams evaluating these models, the lesson is not that one is universally better. It is that models have default personalities that matter for task selection. GLM-5 is the model you want for greenfield development where completeness and test coverage justify the extra time. MiniMax M2.5 is the model you want for working inside existing codebases where respecting constraints and finishing quickly matters more than architectural perfection. The real sophistication lies in knowing when to deploy which model, and potentially in auto-routing systems that make this choice for you. What the benchmark does not address, and what would be genuinely useful to see next, is how these models perform when given corrective feedback mid-task or when working on codebases with tens of thousands of lines rather than purpose-built test projects.

Key takeaways:

  • GLM-5 scores 90.5 out of 100 overall with superior architecture, testing (94 test cases vs 13), and zero bugs in greenfield implementation
  • MiniMax M2.5 scores 88.5 but finishes in half the time (21 vs 44 minutes) and follows "minimal changes" instructions more carefully
  • Both models found all eight planted bugs autonomously, but differed in fix quality and adherence to constraints
  • GLM-5 broke an API endpoint path during refactoring, while MiniMax M2.5 preserved all contracts, highlighting the tension between better architecture and safer changes
  • Open-weight models at 77 to 80 percent on SWE-bench Verified are approaching frontier model performance at significantly lower cost

Tradeoffs:

  • GLM-5 delivers comprehensive solutions with full test coverage but takes twice as long and occasionally overreaches beyond stated requirements
  • MiniMax M2.5 preserves existing code contracts and finishes faster but produces less thorough test coverage and can miss validation edge cases
  • Auto-routing between models based on task type gains efficiency but sacrifices the consistency of always using the same model

MiniMax 2.5 vs. GLM-5 across 3 Coding Tasks [Benchmark & Results]