gpt-oss-120b: specs, benchmarks, and how to run it on Lyceum
OpenAI's 117B open-weight reasoning model with configurable effort.
Caspar Lehmkühler
June 18, 2026 · Head of Product at Lyceum Technology
gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model developed by OpenAI. The model features configurable reasoning effort, full chain-of-thought transparency, and native tool-calling capabilities. It is licensed under Apache 2.0 and marks a major shift in OpenAI's strategy, bringing near-frontier reasoning performance to open-weight deployment. Lyceum Technology serves gpt-oss-120b via our OpenAI-compatible Serverless Inference API, so European teams can deploy this reasoning model on GDPR-compliant, EU-hosted infrastructure by changing only the base URL and API key.
Get started: call gpt-oss-120b on Lyceum
Access gpt-oss-120b through Lyceum Technology's OpenAI-compatible API. Migrating existing reasoning workflows requires only updating the base URL and providing an API key, allowing teams to switch infrastructure providers without rewriting application logic.
```python
from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

# Print the text of the first (and only) completion choice.
print(response.choices[0].message.content)
```
Pricing and region for gpt-oss-120b
On Lyceum, gpt-oss-120b is served in the Standard tier, which is specifically optimized for high-capability reasoning tasks where accuracy is paramount. The model is hosted in the eu-north1 region, ensuring full data residency within Europe for compliance-sensitive applications. Pricing is strictly pay-per-token at $0.15 per million input tokens and $0.60 per million output tokens. There are no base fees, no minimum commitments, no egress fees, and no idle costs. You only pay for the exact compute your application consumes, making it highly efficient for bursty workloads.
What gpt-oss-120b is good at
Configurable reasoning and chain-of-thought
Unlike models with a fixed reasoning budget, gpt-oss-120b lets developers set the reasoning effort (low, medium, or high) to match latency and complexity requirements. It also provides full chain-of-thought (CoT) visibility, giving complete access to the model's internal reasoning process. This transparency makes debugging complex agentic workflows significantly easier and increases trust in the final outputs, because engineers can inspect how the model arrived at a specific conclusion.
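As a minimal sketch of selecting a reasoning level: the gpt-oss model card describes setting the level with a system-prompt line such as "Reasoning: high". The helper `build_messages` below is hypothetical, and it is an assumption that Lyceum's endpoint forwards the system message to the model unchanged.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_messages(user_prompt: str, effort: str = "medium") -> list[dict]:
    """Prepend a system message that requests a reasoning level.

    Hypothetical helper: assumes the endpoint passes the system
    message through to the model as written.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Is 2027 a prime number?", effort="high")
print(messages[0]["content"])  # prints "Reasoning: high"
```

The resulting list can be passed as `messages=` to `client.chat.completions.create`. Lower effort typically trades answer depth for fewer reasoning tokens and lower latency.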
Agentic capabilities and tool use
OpenAI optimized gpt-oss-120b specifically for agentic workflows. It features native support for function calling, web browsing, Python code execution, and Structured Outputs. This makes it an exceptionally strong candidate for building autonomous agents that need to interact with external APIs, query databases, or execute multi-step logic. The model's ability to reliably output structured JSON ensures that downstream systems can parse its responses without brittle regex workarounds.
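The function-calling flow can be sketched as follows. The tool name `get_order_status`, its schema, and the `run_tool_call` dispatcher are all hypothetical. The schema follows the OpenAI Chat Completions `tools` format, and it is an assumption that Lyceum's endpoint accepts the `tools` parameter for this model.

```python
import json

# Hypothetical tool definition in the OpenAI Chat Completions "tools" format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]


def get_order_status(order_id: str) -> dict:
    # Stand-in for a real database or API lookup.
    return {"order_id": order_id, "status": "shipped"}


# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_order_status": get_order_status}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one requested tool call and return a JSON string result."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    args = json.loads(arguments_json)  # arguments arrive as a JSON string
    return json.dumps(REGISTRY[name](**args))


# Simulated call, shaped like message.tool_calls[0].function in a response.
print(run_tool_call("get_order_status", '{"order_id": "A-1001"}'))
```

In a live request you would pass `tools=TOOLS` to `client.chat.completions.create`, run each entry in `message.tool_calls` through the dispatcher, and send the result back as a `tool` role message.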
Hardware efficiency via sparse MoE
Despite having 117 billion total parameters, gpt-oss-120b uses a sparse Mixture-of-Experts (MoE) architecture. During inference, it activates only about 5.1 billion parameters per token, which is roughly 4.4 percent of the total network. Combined with MXFP4 quantization applied during post-training, this architectural choice allows the model to deliver near-frontier performance while fitting on a single 80GB GPU such as an NVIDIA H100; it also runs on a single AMD MI300X, which has 192GB of memory. This efficiency translates directly into lower inference costs and a shorter time to first token.
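The figures above can be checked with a short back-of-the-envelope calculation. This is a rough sketch, not a sizing guide: it assumes every weight is stored in MXFP4, whereas a real checkpoint keeps some tensors at higher precision and also needs room for the KV cache and activations.

```python
TOTAL_PARAMS = 117e9   # total parameters (from the text)
ACTIVE_PARAMS = 5.1e9  # parameters active per token (from the text)

# MXFP4 (OCP Microscaling format): blocks of 32 four-bit values share one
# 8-bit scale, so the average cost is (32 * 4 + 8) / 32 = 4.25 bits per weight.
MXFP4_BITS_PER_WEIGHT = (32 * 4 + 8) / 32

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: all weights in MXFP4. Treat the result as a rough lower bound.
weights_gb = TOTAL_PARAMS * MXFP4_BITS_PER_WEIGHT / 8 / 1e9

print(f"active fraction: {active_fraction:.1%}")       # prints "active fraction: 4.4%"
print(f"approx. weight memory: {weights_gb:.0f} GB")   # prints "approx. weight memory: 62 GB"
```

Under these assumptions the weights occupy roughly 62 GB, which is consistent with the model fitting on one 80GB card with headroom left for the KV cache.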
Benchmarks and how it compares
gpt-oss-120b benchmark results
In independent evaluations, gpt-oss-120b demonstrates performance approaching OpenAI's proprietary o4-mini model, particularly in reasoning and math tasks. The model's architecture allows it to punch significantly above its active parameter count.
| Metric / Benchmark | gpt-oss-120b | Source |
|---|---|---|
| Artificial Analysis Intelligence Index | 4 / 4 units | Artificial Analysis |
| Math Index | 93 | Opper AI |
| Coding Index | 29 | Opper AI |
| Output Speed (Tokens/sec) | ~345.6 | Artificial Analysis |
When compared to its smaller sibling, gpt-oss-20b, the 120B model offers significantly higher reasoning capabilities at the cost of increased VRAM requirements (80GB versus 16GB). Against other open-weight models in the 100B+ class, gpt-oss-120b stands out for its remarkable token efficiency. Artificial Analysis noted that the model used only 21 million tokens to complete their entire benchmark suite. This is roughly a quarter of the tokens required by o4-mini operating in high-reasoning mode, and half the tokens used by o3. This efficiency means that even when the model is "thinking" through complex problems, it wastes fewer tokens, directly reducing your overall inference costs in production environments.
Using it in production
Production configuration for gpt-oss-120b
When deploying gpt-oss-120b, managing its massive 131,072-token context window is critical for cost control. Because it is a reasoning model, it generates internal chain-of-thought tokens before producing the final answer. You must account for these reasoning tokens in your output budget, as they contribute to the total tokens billed per request.
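One way to size the output budget is sketched below. The helper `output_budget` and the example allowances are hypothetical. It assumes that reasoning tokens count against `max_tokens` and that the prompt and output share the 131,072-token window.

```python
CONTEXT_WINDOW = 131_072  # tokens (from the text)


def output_budget(prompt_tokens: int, answer_tokens: int, reasoning_tokens: int) -> int:
    """Return a max_tokens value that covers reasoning plus the final answer.

    The result is capped at whatever room the prompt leaves in the window.
    """
    if min(prompt_tokens, answer_tokens, reasoning_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(answer_tokens + reasoning_tokens, remaining)


print(output_budget(10_000, 500, 1_000))   # prints 1500
print(output_budget(130_000, 500, 1_000))  # prints 1072 (capped by the window)
```

The returned value can be passed as `max_tokens`. If responses are cut off mid-answer, the reasoning allowance is the first number to raise.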
On Lyceum Technology, the model runs in the Standard tier, hosted in the eu-north1 region. This tier is designed for high-capability models where complex reasoning takes priority over raw throughput. The Standard tier ensures that the underlying GPU infrastructure provides the necessary memory bandwidth to handle the model's sparse MoE routing efficiently.
Consider a production workload processing complex document analysis. If you send a 10,000-token input prompt and the model generates 1,500 output tokens (including its reasoning chain), the cost math is straightforward:
- Input cost: 10,000 tokens × ($0.15 / 1,000,000) = $0.0015
- Output cost: 1,500 tokens × ($0.60 / 1,000,000) = $0.0009
- Total cost per request: $0.0024
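The same arithmetic can be wrapped in a small helper for estimating spend at other request sizes. The function name `request_cost` is hypothetical. The prices are the per-million-token rates quoted above.

```python
INPUT_USD_PER_MTOK = 0.15   # USD per million input tokens (from the text)
OUTPUT_USD_PER_MTOK = 0.60  # USD per million output tokens (from the text)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request.

    output_tokens must include reasoning tokens, since they are billed as output.
    """
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


cost = request_cost(10_000, 1_500)
print(f"${cost:.4f}")  # prints "$0.0024"
print(f"${cost * 1_000_000:,.2f} per million requests")
```

Scaling the per-request figure this way gives a quick monthly estimate once you know your expected request volume.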
Because Lyceum bills strictly per token, you only pay for the tokens your application actually processes. There are no idle costs when your application is not serving traffic, making this setup highly economical for bursty agentic workflows that experience variable demand throughout the day.
Running gpt-oss-120b on EU-sovereign infrastructure
Why run gpt-oss-120b on Lyceum
For European AI startups and enterprise teams, data residency is often a strict requirement. Routing sensitive data across the Atlantic introduces compliance risks. Lyceum Technology provides an EU-native alternative, hosting gpt-oss-120b entirely within our eu-north1 region. This ensures that your proprietary data and customer prompts never leave European borders.
By running this model on Lyceum, you benefit from our owned GPU infrastructure, which provides a structural cost advantage over API providers who simply rent compute from hyperscalers. This allows us to offer highly competitive per-token pricing without sacrificing performance or reliability. Furthermore, because our platform is built on a transparent, open stack (vLLM and NVIDIA Dynamo), you avoid the vendor lock-in associated with black-box proprietary inference engines.
To learn more about how we secure your data and maintain regulatory alignment, read our comprehensive guide on GDPR-compliant LLM inference in Europe. Lyceum provides the reasoning power of gpt-oss-120b via an OpenAI-compatible API, on infrastructure that meets European data residency requirements.