MiniMax-M2.5: specs, benchmarks, and how to run it on Lyceum
A 230B MoE model from MiniMax built for SOTA coding and agentic workflows.
Maximilian Niroomand
June 21, 2026 · CTO & Co-Founder at Lyceum Technology
MiniMax-M2.5 is an open-weights large language model developed by MiniMax, featuring a 230-billion parameter Mixture-of-Experts (MoE) architecture with 10 billion active parameters. Designed for complex coding, agentic tool use, and real-world productivity, it rivals proprietary frontier models on key benchmarks like SWE-Bench. Lyceum Technology serves MiniMax-M2.5 via our OpenAI-compatible Serverless Inference API. Engineering teams can leverage this highly capable model through a single drop-in endpoint, served from our us-central1 region with pay-per-token pricing and no idle or base fees.
Get started: call MiniMax-M2.5 on Lyceum
To integrate MiniMax-M2.5 into your application, you can use the standard OpenAI Python SDK. Our platform provides a drop-in replacement API, meaning you do not need to rewrite your application logic or learn a new framework. By updating the base URL and providing your Lyceum API key, you can route requests directly to our managed infrastructure.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for MiniMax-M2.5
When you call this model through the Lyceum Serverless Inference API, you are billed strictly on a per-token basis with no minimum commitments or idle compute charges. MiniMax-M2.5 is categorized under our Standard tier, which prioritizes high-capability execution for complex reasoning tasks.
The pricing for MiniMax-M2.5 is $0.30 per million input tokens and $1.20 per million output tokens. For this specific catalogue model, the hosting region is us-central1. This setup allows engineering teams to access a massive 230-billion parameter Mixture-of-Experts model without the overhead of provisioning and managing the underlying hardware. You pay only for the exact compute you consume during inference.
What MiniMax-M2.5 is good at
Frontier-level coding and agentic workflows
MiniMax-M2.5 is a 230-billion parameter Mixture-of-Experts model developed by MiniMax. During inference, it activates only 10 billion parameters per token, allowing it to maintain high throughput while delivering reasoning capabilities that rival the largest proprietary models. The model was extensively trained using reinforcement learning across hundreds of thousands of complex environments, making it highly effective for software engineering and agentic tool use.
One of the defining characteristics of MiniMax-M2.5 is its emergent Architect Mindset. Unlike standard coding models that immediately begin generating scripts based on a prompt, M2.5 proactively decomposes the task. It plans the project structure, feature requirements, and user interface design before writing any code. This architectural approach significantly reduces logical errors in complex, multi-file software projects and makes the model exceptionally well-suited for autonomous agent workflows.
Multilingual programming and web research
The model demonstrates strong proficiency across more than ten programming languages, including Python, Rust, Go, C++, and TypeScript. It handles the entire development lifecycle, from initial environment setup to system development and debugging. Furthermore, MiniMax-M2.5 excels at web research and tool calling. It can navigate complex browser environments, manage long contexts, and execute precise search iterations to gather necessary information before synthesizing a final response. This makes it a strong candidate for backend automation and data-heavy research pipelines.
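As a rough illustration of the tool-calling workflow described above, the sketch below defines one tool in the OpenAI `tools` schema, a local dispatcher that executes a tool call, and a helper that builds the request arguments. The `count_lines` tool, the `REGISTRY` mapping, and the helper names are hypothetical and exist only for this example. It also assumes the Lyceum endpoint accepts the standard `tools` parameter for this model, which this article does not confirm.

```python
import json

# Tool schema in the OpenAI "tools" format.
# The count_lines tool is hypothetical and used only for illustration.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "count_lines",
            "description": "Count the lines in a block of source code.",
            "parameters": {
                "type": "object",
                "properties": {"source": {"type": "string"}},
                "required": ["source"],
            },
        },
    }
]


def count_lines(source: str) -> int:
    """Local implementation backing the hypothetical count_lines tool."""
    return len(source.splitlines())


# Map tool names to the local callables that implement them.
REGISTRY = {"count_lines": count_lines}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return a JSON string for the tool message."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": REGISTRY[name](**args)})


def build_request(messages: list[dict]) -> dict:
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": messages,
        "tools": TOOLS,  # assumption: the endpoint supports tool calling
        "max_tokens": 256,
    }
```

With an OpenAI SDK client pointed at the Lyceum base URL, you would call `client.chat.completions.create(**build_request(messages))`, pass each entry of `response.choices[0].message.tool_calls` to `run_tool_call(call.function.name, call.function.arguments)`, and append the result as a `{"role": "tool", "tool_call_id": call.id, "content": ...}` message before requesting the next turn.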
Benchmarks and how it compares
MiniMax-M2.5 benchmark results
MiniMax-M2.5 has been evaluated across several industry benchmarks, with reported results in the same range as current frontier models. It is particularly strong in software engineering and agentic web navigation.
| Benchmark | MiniMax-M2.5 | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% |
| Multi-SWE-Bench | 51.3% | Not reported | Not reported |
| BrowseComp | 76.3% | Not reported | Not reported |
| MATH-500 | 96.8% | Not reported | Not reported |
Source: MiniMax Official Release and Hugging Face Model Card.
On the SWE-Bench Verified evaluation, which tests a model's ability to resolve real-world GitHub issues, MiniMax-M2.5 scores 80.2%. This places it in the same tier as Anthropic's Claude Opus 4.6 and slightly ahead of OpenAI's GPT-5.2. In multilingual coding tasks measured by Multi-SWE-Bench, the model achieves an industry-leading 51.3%.
Compared with its predecessor, MiniMax-M2.1, MiniMax-M2.5 is also faster: it completes complex agentic evaluations 37% faster. For engineering teams evaluating models for autonomous coding agents, M2.5 pairs accuracy in the range of much larger models with the inference efficiency of a sparse Mixture-of-Experts architecture.
Using it in production
Production configuration for MiniMax-M2.5
When deploying MiniMax-M2.5 in production, engineering teams must account for its context limit and pricing structure. The model supports a maximum context window of 204,800 tokens. This capacity lets you include large portions of a codebase, extensive API documentation, or large datasets in a single prompt. However, to maintain good inference speeds, we recommend using prompt caching strategies and keeping routine requests well below the maximum limit.
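A minimal sketch of a pre-flight budget check against the 204,800-token window follows. The four-characters-per-token ratio is a crude assumption, not the model's real tokenizer, and the sketch also assumes that the prompt and the reserved output share the same window. The function names are hypothetical.

```python
CONTEXT_WINDOW = 204_800  # maximum context window stated in this article


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt_tokens: int, max_tokens: int,
                 window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the prompt plus reserved output fits in the window."""
    return prompt_tokens + max_tokens <= window


def trim_to_budget(chunks: list[str], max_tokens: int,
                   window: int = CONTEXT_WINDOW) -> list[str]:
    """Keep leading chunks until the estimated prompt budget is used up."""
    budget = window - max_tokens
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept
```

For accurate counts, replace `estimate_tokens` with the model's actual tokenizer, or read the `usage` field returned with each response and adjust the budget from observed values.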
On the Lyceum Technology platform, MiniMax-M2.5 operates under the Standard tier. This tier is designed for high-capability models that require significant compute resources for complex reasoning, as opposed to the Fast tier which prioritizes cost-efficiency for smaller models. The API endpoint for this model is routed through the us-central1 region.
To forecast your infrastructure costs, consider a typical agentic workflow. If an autonomous coding agent processes 2 million input tokens while reading repository files and generates 500,000 output tokens while writing new features, the cost calculation is straightforward. The input tokens cost $0.60 (at $0.30 per million), and the output tokens cost $0.60 (at $1.20 per million). The total cost for this extensive task is $1.20. This per-token pricing model ensures you only pay for active compute, eliminating the financial risk of maintaining idle GPU clusters. For more details on how this architecture scales, read our guide on serverless GPU inference explained.
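The arithmetic above can be sketched as a small helper. The prices are the ones quoted in this article. The function and constant names are hypothetical, and `Decimal` is used here only to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

# Per-million-token prices quoted in this article for MiniMax-M2.5.
INPUT_PRICE_PER_M = Decimal("0.30")
OUTPUT_PRICE_PER_M = Decimal("1.20")
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost in USD for one workload."""
    input_cost = Decimal(input_tokens) / MILLION * INPUT_PRICE_PER_M
    output_cost = Decimal(output_tokens) / MILLION * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# The agentic workflow from the text: 2M input tokens, 500K output tokens.
print(f"${estimate_cost(2_000_000, 500_000):.2f}")  # $1.20
```

Feeding the token counts from each response's `usage` field into a helper like this gives a running cost estimate per agent run.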
Why run MiniMax-M2.5 on Lyceum
Lyceum Technology gives AI teams a managed way to run frontier models without standing up their own GPU fleet. The MiniMax-M2.5 catalogue endpoint is served from our us-central1 region in the US, and you reach it through a standard OpenAI-compatible interface. Point your existing OpenAI SDK at our base URL, swap in your API key, and the model becomes a drop-in replacement, so there is no new framework to learn and no application logic to rewrite.
Managing a 230-billion parameter model on your own hardware is a complex engineering challenge. It requires advanced orchestration, continuous monitoring, and specialized inference engines like vLLM or NVIDIA Dynamo. By using our Serverless Inference API, you offload this operational burden. Our platform handles the underlying GPU provisioning, load balancing, and auto-scaling, and our open-stack approach keeps the serving layer transparent rather than a black box. You can also burst to per-second dedicated GPUs when a workload needs guaranteed throughput.
We operate our own GPU infrastructure rather than renting capacity from hyperscalers. This structural advantage lets us offer highly competitive per-token pricing with no base fees, no minimum commitments, no idle charges, and zero egress fees, all on a single unified bill. Whether you are running batch jobs or serving real-time requests, you scale from zero to thousands of concurrent requests instantly. For a deeper look at how this architecture works, read our guide on A100 vs H100 for LLM inference.