MiniMax-M2.5: specs, benchmarks, and how to run it on Lyceum

A 230B MoE model from MiniMax built for SOTA coding and agentic workflows.

Maximilian Niroomand

June 21, 2026 · CTO & Co-Founder at Lyceum Technology

MiniMax-M2.5 is an open-weights large language model developed by MiniMax, featuring a 230-billion parameter Mixture-of-Experts (MoE) architecture with 10 billion active parameters. Designed for complex coding, agentic tool use, and real-world productivity, it rivals proprietary frontier models on key benchmarks like SWE-Bench. Lyceum Technology serves MiniMax-M2.5 via our OpenAI-compatible Serverless Inference API. Engineering teams can leverage this highly capable model through a single drop-in endpoint, served from our us-central1 region with pay-per-token pricing and no idle or base fees.

Get started: call MiniMax-M2.5 on Lyceum

To integrate MiniMax-M2.5 into your application, you can use the standard OpenAI Python SDK. Our platform provides a drop-in replacement API, meaning you do not need to rewrite your application logic or learn a new framework. By updating the base URL and providing your Lyceum API key, you can route requests directly to our managed infrastructure.

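from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request to MiniMax-M2.5.
response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,  # upper bound on generated (billed output) tokens
)

# Print the text of the first completion choice.
print(response.choices[0].message.content)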

Pricing and region for MiniMax-M2.5

When you deploy this model through the Lyceum Serverless Inference API, you are billed strictly on a per-token basis with no minimum commitments or idle compute charges. The MiniMax-M2.5 model is categorized under our Standard tier, which prioritizes high-capability execution for complex reasoning tasks.

The pricing for MiniMax-M2.5 is $0.30 per million input tokens and $1.20 per million output tokens. For this specific catalogue model, the hosting region is us-central1. This setup allows engineering teams to access a massive 230-billion parameter Mixture-of-Experts model without the overhead of provisioning and managing the underlying hardware. You pay only for the exact compute you consume during inference.

What MiniMax-M2.5 is good at

Frontier-level coding and agentic workflows

MiniMax-M2.5 is a 230-billion parameter Mixture-of-Experts model developed by MiniMax. During inference, it activates only 10 billion parameters per token, allowing it to maintain high throughput while delivering reasoning capabilities that rival the largest proprietary models. The model was extensively trained using reinforcement learning across hundreds of thousands of complex environments, making it highly effective for software engineering and agentic tool use.

One of the defining characteristics of MiniMax-M2.5 is its emergent Architect Mindset. Unlike standard coding models that immediately begin generating scripts based on a prompt, M2.5 proactively decomposes the task. It plans the project structure, feature requirements, and user interface design before writing any code. This architectural approach significantly reduces logical errors in complex, multi-file software projects and makes the model exceptionally well-suited for autonomous agent workflows.

Multilingual programming and web research

The model demonstrates strong proficiency across more than ten programming languages, including Python, Rust, Go, C++, and TypeScript. It handles the entire development lifecycle, from initial environment setup to system development and debugging. Furthermore, MiniMax-M2.5 excels at web research and tool calling. It can navigate complex browser environments, manage long contexts, and execute precise search iterations to gather necessary information before synthesizing a final response. This makes it a strong candidate for backend automation and data-heavy research pipelines.
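As a rough illustration of the tool-calling workflow described above, the sketch below builds the keyword arguments for a chat completion request that offers the model a single tool. It uses the `tools` schema from the OpenAI chat-completions format. The `web_search` tool, its parameters, and the helper `build_research_request` are hypothetical names introduced for this example, and it assumes the Lyceum endpoint passes `tools` through for this model, which you should verify before relying on it.

```python
import json

MODEL_ID = "MiniMaxAI/MiniMax-M2.5"  # model name used earlier in this article

# Hypothetical tool definition, written in the OpenAI chat-completions "tools" format.
WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool name
        "description": "Search the web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
            },
            "required": ["query"],
        },
    },
}


def build_research_request(question: str, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEB_SEARCH_TOOL],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Inspect the payload without sending any request.
    request = build_research_request("Summarize recent changes to our API docs.")
    print(json.dumps(request, indent=2))
```

With the client from the getting-started snippet, the payload could be sent as `client.chat.completions.create(**request)`. Your application would then execute any tool calls the model returns and send the results back in a follow-up message.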

Benchmarks and how it compares

MiniMax-M2.5 benchmark results

MiniMax-M2.5 has been evaluated on several widely used industry benchmarks, with reported results that are competitive with current frontier models. It is particularly strong in software engineering and agentic web navigation.

| Benchmark | MiniMax-M2.5 | Claude Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- |
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% |
| Multi-SWE-Bench | 51.3% | Not reported | Not reported |
| BrowseComp | 76.3% | Not reported | Not reported |
| MATH-500 | 96.8% | Not reported | Not reported |

Source: MiniMax Official Release and Hugging Face Model Card.

On the SWE-Bench Verified evaluation, which tests a model's ability to resolve real-world GitHub issues, MiniMax-M2.5 scores 80.2%. This places it in the same tier as Anthropic's Claude Opus 4.6 and slightly ahead of OpenAI's GPT-5.2. In multilingual coding tasks measured by Multi-SWE-Bench, the model achieves an industry-leading 51.3%.

When compared to sibling models in the open-weights ecosystem, MiniMax-M2.5 offers a distinct advantage in reasoning speed. It completes complex agentic evaluations 37% faster than its predecessor, MiniMax-M2.1. For engineering teams evaluating models for autonomous coding agents, M2.5 provides the accuracy of a massive dense model with the inference efficiency of a sparse Mixture-of-Experts architecture.

Using it in production

Production configuration for MiniMax-M2.5

When deploying MiniMax-M2.5 in production, engineering teams must account for its specific context limits and pricing structure. The model supports a maximum context window of 204,800 tokens. This massive capacity allows you to input entire codebases, extensive API documentation, or large datasets in a single prompt. However, to maintain optimal inference speeds, we recommend utilizing prompt caching strategies and keeping routine requests well below the maximum limit.
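A simple pre-flight check can catch oversized requests before they are sent. The sketch below assumes that prompt tokens and requested output tokens share the 204,800-token window, which is the usual convention but is not confirmed here for this endpoint. The helper names are hypothetical, and the prompt token count should come from a real tokenizer for the model.

```python
MAX_CONTEXT_TOKENS = 204_800  # context window stated in this article


def remaining_budget(prompt_tokens: int, limit: int = MAX_CONTEXT_TOKENS) -> int:
    """Tokens left for generation after the prompt (never negative)."""
    return max(limit - prompt_tokens, 0)


def fits_context(prompt_tokens: int, max_output_tokens: int,
                 limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """True if the prompt plus the requested output fits in the window.

    Assumption: input and output tokens count against the same limit.
    """
    return prompt_tokens + max_output_tokens <= limit


if __name__ == "__main__":
    prompt_tokens = 200_000  # e.g. a large repository dump, counted with a tokenizer
    print(fits_context(prompt_tokens, max_output_tokens=4_096))
    print(remaining_budget(prompt_tokens))
```

If the check fails, trimming the prompt or lowering `max_tokens` is cheaper than discovering the limit from a rejected request.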

On the Lyceum Technology platform, MiniMax-M2.5 operates under the Standard tier. This tier is designed for high-capability models that require significant compute resources for complex reasoning, as opposed to the Fast tier which prioritizes cost-efficiency for smaller models. The API endpoint for this model is routed through the us-central1 region.

To forecast your infrastructure costs, consider a typical agentic workflow. If an autonomous coding agent processes 2 million input tokens while reading repository files and generates 500,000 output tokens while writing new features, the cost calculation is straightforward. The input tokens cost $0.60 (at $0.30 per million), and the output tokens cost $0.60 (at $1.20 per million). The total cost for this extensive task is $1.20. This per-token pricing model ensures you only pay for active compute, eliminating the financial risk of maintaining idle GPU clusters. For more details on how this architecture scales, read our guide on serverless GPU inference explained.
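The same arithmetic can be wrapped in a small helper for forecasting. This is a minimal sketch that hard-codes the prices quoted in this article. The function name `estimate_cost_usd` is hypothetical, and `Decimal` is used so that currency amounts are not affected by floating-point rounding.

```python
from decimal import Decimal

# Prices quoted in this article, in USD per million tokens.
INPUT_USD_PER_MILLION = Decimal("0.30")
OUTPUT_USD_PER_MILLION = Decimal("1.20")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the per-token charge in USD for one workload."""
    input_cost = INPUT_USD_PER_MILLION * input_tokens / TOKENS_PER_MILLION
    output_cost = OUTPUT_USD_PER_MILLION * output_tokens / TOKENS_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # The agentic workflow from the example above:
    # 2 million input tokens and 500,000 output tokens.
    print(estimate_cost_usd(2_000_000, 500_000))
```

For the workload above, the helper reproduces the $1.20 total: $0.60 for input plus $0.60 for output.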

Why run MiniMax-M2.5 on Lyceum

Lyceum Technology gives AI teams a managed way to run frontier models without standing up their own GPU fleet. The MiniMax-M2.5 catalogue endpoint is served from our us-central1 region in the US, and you reach it through a standard OpenAI-compatible interface. Point your existing OpenAI SDK at our base URL, swap in your API key, and the model becomes a drop-in replacement, so there is no new framework to learn and no application logic to rewrite.

Managing a 230-billion parameter model on your own hardware is a complex engineering challenge. It requires advanced orchestration, continuous monitoring, and specialized inference engines like vLLM or NVIDIA Dynamo. By using our Serverless Inference API, you offload this operational burden. Our platform handles the underlying GPU provisioning, load balancing, and auto-scaling, and our open-stack approach keeps the serving layer transparent rather than a black box. You can also burst to per-second dedicated GPUs when a workload needs guaranteed throughput.

We operate our own GPU infrastructure rather than renting capacity from hyperscalers. This structural advantage lets us offer highly competitive per-token pricing with no base fees, no minimum commitments, no idle charges, and zero egress fees, all on a single unified bill. Whether you are running batch jobs or serving real-time requests, you scale from zero to thousands of concurrent requests instantly. For a deeper look at how this architecture works, read our guide on A100 vs H100 for LLM inference.

Frequently Asked Questions

What is the context window for MiniMax-M2.5?

MiniMax-M2.5 supports a massive context window of 204,800 tokens. This extensive capacity allows engineering teams to input entire code repositories, large financial datasets, or comprehensive API documentation in a single prompt, making it highly effective for complex, data-heavy agentic workflows and document analysis.

How much does the MiniMax-M2.5 API cost on Lyceum?

On our platform, the MiniMax-M2.5 model costs $0.30 per million input tokens and $1.20 per million output tokens. This per-token billing model ensures you only pay for the exact compute you consume, with no minimum commitments, base fees, or idle hardware costs.

How do I migrate to Lyceum's OpenAI-compatible API?

Migrating to our platform is straightforward. Because our Serverless Inference API is OpenAI-compatible, you only need to change the base URL in your existing SDK to https://api.lyceum.technology/api/v2/external/serverless, update your API key, and set the model name to MiniMaxAI/MiniMax-M2.5. No application code rewrites or architectural changes are required.

Where is the MiniMax-M2.5 model hosted?

The MiniMax-M2.5 catalogue model is hosted in Lyceum's us-central1 region, which runs in the United States. You reach it through the same OpenAI-compatible Serverless Inference API as every other catalogue model, with per-token pricing and no idle or base fees.

Is MiniMax-M2.5 better than GPT-5.2 for coding?

On the reported numbers, the two are close. MiniMax-M2.5 scores 80.2% on the SWE-Bench Verified evaluation, which slightly edges out GPT-5.2 at 80.0%. Its emergent Architect Mindset helps it plan complex project structures before writing code, which reduces logical errors in multi-file projects.

What is the architecture of MiniMax-M2.5?

MiniMax-M2.5 uses a Mixture-of-Experts architecture with a total of 230 billion parameters, of which only 10 billion are active for any given token during inference. This sparse activation allows the model to deliver frontier-level reasoning capabilities while maintaining high throughput and low latency in production.

Related Resources

/magazine/glm-5-2; /magazine/llama-3-3-70b; /magazine/gpt-oss-120b