gpt-oss-120b: specs, benchmarks, and how to run it on Lyceum

OpenAI's 117B open-weight reasoning model with configurable effort.

Caspar Lehmkühler

June 18, 2026 · Head of Product at Lyceum Technology

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model developed by OpenAI. It offers configurable reasoning effort, full chain-of-thought transparency, and native tool calling. Released under the Apache 2.0 license, it marks a major shift in OpenAI's strategy by offering frontier-level performance for open-source deployment. Lyceum Technology serves gpt-oss-120b via our OpenAI-compatible Serverless Inference API, so European teams can run this reasoning model on GDPR-compliant, EU-hosted infrastructure by changing only the base URL, API key, and model name.

Get started: call gpt-oss-120b on Lyceum

Access gpt-oss-120b through Lyceum Technology's OpenAI-compatible API. Migrating existing reasoning workflows requires only updating the base URL and providing an API key, allowing teams to switch infrastructure providers without rewriting application logic.

```python
from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single chat message and cap the generated output.
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Pricing and region for gpt-oss-120b

On Lyceum, gpt-oss-120b is served in the Standard tier, which is specifically optimized for high-capability reasoning tasks where accuracy is paramount. The model is hosted in the eu-north1 region, ensuring full data residency within Europe for compliance-sensitive applications. Pricing is strictly pay-per-token at $0.15 per million input tokens and $0.60 per million output tokens. There are no base fees, no minimum commitments, no egress fees, and no idle costs. You only pay for the exact compute your application consumes, making it highly efficient for bursty workloads.

What gpt-oss-120b is good at

Configurable reasoning and chain-of-thought

Unlike models with a fixed reasoning budget, gpt-oss-120b lets developers set the reasoning effort (low, medium, or high) to match latency and complexity requirements. It also exposes its full chain-of-thought (CoT), giving complete access to the model's internal reasoning process. This transparency makes complex agentic workflows much easier to debug and increases trust in the final outputs, because engineers can inspect how the model arrived at a specific conclusion.
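A minimal sketch of how a reasoning level might be chosen per request. It assumes the endpoint honors the OpenAI SDK's `reasoning_effort` parameter for this model, which this article does not confirm. `build_request` is a hypothetical helper, and the snippet makes no network call.

```python
from typing import Any

ALLOWED_EFFORTS = ("low", "medium", "high")


def build_request(prompt: str, effort: str = "medium", max_tokens: int = 1024) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create()."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")
    return {
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server maps this field to the model's reasoning level.
        "reasoning_effort": effort,
        "max_tokens": max_tokens,
    }


# Pick a higher effort for a harder question; nothing is sent over the network.
kwargs = build_request("Is 2^31 - 1 prime?", effort="high")
print(kwargs["reasoning_effort"])
```

With a configured client, the arguments could be passed as `client.chat.completions.create(**kwargs)`. Higher effort levels generally produce longer reasoning chains, so `max_tokens` may need to grow with them.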

Agentic capabilities and tool use

OpenAI optimized gpt-oss-120b specifically for agentic workflows. It features native support for function calling, web browsing, Python code execution, and Structured Outputs. This makes it an exceptionally strong candidate for building autonomous agents that need to interact with external APIs, query databases, or execute multi-step logic. The model's ability to reliably output structured JSON ensures that downstream systems can parse its responses without brittle regex workarounds.
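As an illustration of function calling, the sketch below defines one tool in the OpenAI chat-completions `tools` format and dispatches a simulated tool call locally. `get_order_status`, `REGISTRY`, and `dispatch_tool_call` are hypothetical names, and whether the Lyceum endpoint accepts the `tools` parameter for this model is an assumption.

```python
import json

# Tool definition in the OpenAI chat-completions "tools" format.
# get_order_status is a hypothetical function used only for illustration.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the shipping status of an order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]


def get_order_status(order_id: str) -> dict:
    # Stand-in for a real database or API lookup.
    return {"order_id": order_id, "status": "shipped"}


REGISTRY = {"get_order_status": get_order_status}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return its result as a JSON string."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    arguments = json.loads(arguments_json)  # arguments arrive as a JSON string
    return json.dumps(REGISTRY[name](**arguments))


# Simulated tool call, shaped like message.tool_calls[0].function in a response.
print(dispatch_tool_call("get_order_status", '{"order_id": "A-1001"}'))
```

In a real loop, `TOOLS` would be passed as `tools=TOOLS` to `client.chat.completions.create`, and the dispatcher's result would be appended to the conversation as a message with role `tool` before the next model call.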

Hardware efficiency via sparse MoE

Despite having 117 billion total parameters, gpt-oss-120b uses a sparse Mixture-of-Experts (MoE) architecture. During inference, it activates only about 5.1 billion parameters per token, roughly 4.4 percent of the total network. Combined with MXFP4 quantization applied during post-training, this lets the model deliver near-frontier performance while fitting on a single 80GB GPU such as an NVIDIA H100, and on larger-memory accelerators such as the AMD MI300X. This efficiency translates directly into lower inference costs and a shorter time to first token.
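The figures above can be checked with a few lines of arithmetic. This is a rough sketch: it makes the simplifying assumption that every weight is stored in MXFP4 (4-bit values plus one shared 8-bit scale per block of 32), and it ignores the KV cache and activations, which need additional memory at inference time.

```python
TOTAL_PARAMS = 117e9   # total parameters, from the article
ACTIVE_PARAMS = 5.1e9  # parameters activated per token, from the article

# MXFP4 stores 4-bit values plus one shared 8-bit scale per block of 32 values.
BITS_PER_PARAM_MXFP4 = 4 + 8 / 32  # 4.25 bits on average

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Simplifying assumption: all weights are MXFP4; real checkpoints may keep
# some layers in higher precision.
approx_weight_gb = TOTAL_PARAMS * BITS_PER_PARAM_MXFP4 / 8 / 1e9

print(f"active fraction: {active_fraction:.1%}")
print(f"approx. weight footprint: {approx_weight_gb:.0f} GB")
```

Under these assumptions the weights come to roughly 62 GB, which is consistent with the claim that the model fits within 80GB of GPU memory.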

Benchmarks and how it compares

gpt-oss-120b benchmark results

In independent evaluations, gpt-oss-120b demonstrates performance approaching OpenAI's proprietary o4-mini model, particularly in reasoning and math tasks. The model's architecture allows it to punch significantly above its active parameter count.

| Metric / Benchmark | gpt-oss-120b | Source |
|---|---|---|
| Artificial Analysis Intelligence Index | 4 / 4 units | Artificial Analysis |
| Math Index | 93 | Opper AI |
| Coding Index | 29 | Opper AI |
| Output Speed (tokens/sec) | ~345.6 | Artificial Analysis |

When compared to its smaller sibling, gpt-oss-20b, the 120B model offers significantly higher reasoning capabilities at the cost of increased VRAM requirements (80GB versus 16GB). Against other open-weight models in the 100B+ class, gpt-oss-120b stands out for its remarkable token efficiency. Artificial Analysis noted that the model used only 21 million tokens to complete their entire benchmark suite. This is roughly a quarter of the tokens required by o4-mini operating in high-reasoning mode, and half the tokens used by o3. This efficiency means that even when the model is "thinking" through complex problems, it wastes fewer tokens, directly reducing your overall inference costs in production environments.

Using it in production

Production configuration for gpt-oss-120b

When deploying gpt-oss-120b, managing its massive 131,072-token context window is critical for cost control. Because it is a reasoning model, it generates internal chain-of-thought tokens before producing the final answer. You must account for these reasoning tokens in your output budget, as they contribute to the total tokens billed per request.
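One way to keep the output budget honest is to compute how many tokens remain for generation after the prompt. The sketch below assumes that the prompt, the reasoning tokens, and the final answer all share the 131,072-token window; `max_output_budget` is a hypothetical helper, not part of any SDK.

```python
CONTEXT_WINDOW = 131_072  # tokens, from the article


def max_output_budget(prompt_tokens: int, reserve: int = 0) -> int:
    """Largest max_tokens value that still fits in the context window.

    `reserve` holds back tokens for later turns, such as tool results.
    """
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("token counts must be non-negative")
    return max(CONTEXT_WINDOW - prompt_tokens - reserve, 0)


# A 10,000-token prompt leaves this much room for reasoning plus the answer.
print(max_output_budget(10_000))
```

In practice `max_tokens` is usually set well below this ceiling, since every generated token, reasoning included, is billed at the output rate.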

On Lyceum Technology, the model runs in the Standard tier, hosted in the eu-north1 region. This tier is designed for high-capability models where complex reasoning takes priority over raw throughput. The Standard tier ensures that the underlying GPU infrastructure provides the necessary memory bandwidth to handle the model's sparse MoE routing efficiently.

Consider a production workload processing complex document analysis. If you send a 10,000-token input prompt and the model generates 1,500 output tokens (including its reasoning chain), the cost math is straightforward:

  • Input cost: 10,000 tokens × ($0.15 / 1,000,000) = $0.0015
  • Output cost: 1,500 tokens × ($0.60 / 1,000,000) = $0.0009
  • Total cost per request: $0.0024
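The same calculation as a small helper, using the prices quoted in this article. `request_cost` is a hypothetical function name. In a live integration the token counts would typically come from the `usage` field of the API response.

```python
INPUT_PRICE_PER_M = 0.15   # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.60  # USD per million output tokens (reasoning tokens included)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the per-token prices above."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


cost = request_cost(10_000, 1_500)
print(f"${cost:.4f}")  # prints $0.0024
```

At this rate, 1,000 requests of the same shape would cost about $2.40.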

Because Lyceum bills strictly per token, you pay only for the tokens your application actually processes. There are no idle costs when your application is not serving traffic, which makes this setup economical for bursty agentic workflows with variable demand throughout the day.

Running gpt-oss-120b on EU-sovereign infrastructure

Why run gpt-oss-120b on Lyceum

For European AI startups and enterprise teams, data residency is often a strict requirement. Routing sensitive data across the Atlantic introduces compliance risks. Lyceum Technology provides an EU-native alternative, hosting gpt-oss-120b entirely within our eu-north1 region. This ensures that your proprietary data and customer prompts never leave European borders.

By running this model on Lyceum, you benefit from our owned GPU infrastructure, which provides a structural cost advantage over API providers who simply rent compute from hyperscalers. This allows us to offer highly competitive per-token pricing without sacrificing performance or reliability. Furthermore, because our platform is built on open-stack transparency, utilizing vLLM and NVIDIA Dynamo, you avoid the vendor lock-in associated with black-box proprietary inference engines.

To learn more about how we secure your data and maintain regulatory alignment, read our comprehensive guide on GDPR-compliant LLM inference in Europe. Lyceum provides the reasoning power of gpt-oss-120b via an OpenAI-compatible API, on infrastructure that meets European data standards.

Frequently Asked Questions

How much does gpt-oss-120b cost on Lyceum?

gpt-oss-120b is priced at $0.15 per million input tokens and $0.60 per million output tokens. Billing is strictly pay-per-token with no base fees, making it highly cost-effective for both low-volume testing and high-scale production workloads.

What is the context window for gpt-oss-120b?

The model features a massive 131,072-token context window (often referred to as 128k). This allows you to input hundreds of pages of text, extensive codebases, or large datasets for the model to analyze and reason over in a single request.

Where is my data processed when using this model?

When you call gpt-oss-120b on Lyceum Technology, your requests are processed entirely within our eu-north1 region. We guarantee strict EU data residency and GDPR compliance, ensuring your sensitive prompts and data never leave European borders.

How do I migrate to Lyceum's gpt-oss-120b API?

Migration to Lyceum requires no changes to your application logic because the API is OpenAI-compatible. Update your OpenAI SDK client to use our base URL ([removed]), insert your Lyceum API key, and set the model string to openai/gpt-oss-120b.

How does gpt-oss-120b compare to o4-mini?

gpt-oss-120b was designed to achieve near-parity with OpenAI's proprietary o4-mini model on core reasoning, math, and coding benchmarks. However, as an open-weight model, it allows for greater transparency, including full visibility into its chain-of-thought reasoning process.

What license does gpt-oss-120b use?

OpenAI released gpt-oss-120b under the highly permissive Apache 2.0 license. This allows developers and enterprises to build, experiment, and deploy the model commercially without copyleft restrictions or patent risks.

Related Resources

/magazine/glm-5-2; /magazine/llama-3-3-70b; /magazine/glm-5-1