These flexible open-weight reasoning models are designed for developers, researchers, and enterprises that need transparency and customization while retaining the advanced reasoning capabilities of chain-of-thought processing. Both GPT-OSS models are trained to think step-by-step before responding, and they excel at complex reasoning tasks such as coding, mathematics, planning, puzzles, and agent workflows. They feature adjustable reasoning effort levels, letting you balance performance against computational cost.

How to Use the GPT-OSS API

These models are only available to Build Tier 1 or higher users. Because reasoning models produce longer responses with chain-of-thought processing, we recommend streaming tokens for a better user experience:
from together import Together

client = Together()  # pass your API key via api_key=... or set the TOGETHER_API_KEY environment variable

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {
            "role": "user",
            "content": "Solve this logic puzzle: If all roses are flowers and some flowers are red, can we conclude that some roses are red?",
        }
    ],
    stream=True,  # stream tokens as they are generated
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
This produces output that contains both the chain-of-thought reasoning and the final answer:
<think>
The user is asking about a logical deduction problem. Let me break this down:

Given premises:
1. All roses are flowers
2. Some flowers are red

Question: Can we conclude that some roses are red?

Let me think about this carefully. We know that all roses belong to the set of flowers. We also know that some subset of flowers are red. However, the question is whether the subset of red flowers necessarily overlaps with the subset of roses.

The red flowers could be entirely separate from roses - for example, all the red flowers could be tulips, carnations, or other non-rose flowers. There's no logical requirement that the red flowers must include any roses.

So we cannot definitively conclude that some roses are red based solely on these premises.
</think>

**Answer:** No, we cannot definitively conclude that some roses are red based on the given premises alone. While all roses are flowers and some flowers are red, the red flowers could be entirely non-rose flowers (like tulips, carnations, etc.). The premises don't establish any necessary connection between roses and the red subset of flowers.
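
If you want to display or log the reasoning separately from the answer, you can split the response on the <think> tags. A minimal sketch, assuming the reasoning always arrives in a single <think>...</think> block as shown above:

import re

def split_reasoning(text: str):
    """Separate the <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if match is None:
        return None, text.strip()  # no reasoning block found
    return match.group(1).strip(), text[match.end():].strip()
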
Current Limitations

The following features are not yet supported but will be added soon:
  • The GPT-OSS 20B model is not currently supported
  • Some sampling parameters are not supported: repetition penalty
  • Function calling is not yet available
  • JSON mode is not yet supported

Available Models

Two flexible open-weight models address different deployment needs:

GPT-OSS 120B:
  • Model String: openai/gpt-oss-120b
  • Hardware Requirements: Fits on a single 80 GB GPU
  • Architecture: Mixture-of-Experts (MoE) with token-choice routing
  • Context Length: 128k tokens with RoPE
  • Best for: Enterprise applications requiring maximum reasoning performance
GPT-OSS 20B (not yet supported on the API; see Current Limitations above):
  • Model String: openai/gpt-oss-20b
  • Hardware Requirements: Runs within a much smaller GPU memory footprint
  • Architecture: Optimized MoE for efficiency
  • Context Length: 128k tokens with RoPE
  • Best for: Research, development, and cost-efficient deployments

GPT-OSS Best Practices

Reasoning models like GPT-OSS should be used differently from standard instruct models to get optimal results.

Recommended Parameters (see the example call after this list):
  • Reasoning Effort: Use the adjustable reasoning effort levels to control computational cost vs. accuracy
  • Temperature: Use 0.5–0.7 to balance creativity and coherence
  • No System Prompts: Provide all instructions directly in the user query
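
Reusing the client from the earlier example, a call that follows these recommendations might look like this (a sketch; the prompt is illustrative):

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        # no system prompt: all instructions live in the user message
        {
            "role": "user",
            "content": "Review this function for correctness and suggest improvements: ...",
        }
    ],
    temperature=0.7,  # within the recommended 0.5-0.7 range
)
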
Prompting Best Practices: Think of GPT-OSS as a senior problem-solver; provide high-level objectives and let it determine the methodology:
  • Strengths: Excels at open-ended reasoning, multi-step logic, and inferring unstated requirements
  • Avoid over-prompting: Micromanaging steps can limit its advanced reasoning capabilities
  • Provide clear objectives: Balance clarity with flexibility for optimal results

GPT-OSS Use Cases

  • Code Review & Analysis: Comprehensive code analysis across large codebases with detailed improvement suggestions
  • Strategic Planning: Multi-stage planning with reasoning about optimal approaches and resource allocation
  • Complex Document Analysis: Processing legal contracts, technical specifications, and regulatory documents
  • AI Model Evaluation & Benchmarking: Sophisticated evaluation of other AI systems' responses with contextual understanding, particularly useful in critical validation scenarios
  • Scientific Research: Multi-step reasoning for hypothesis generation and experimental design
  • Academic Analysis: Deep analysis of research papers and literature reviews
  • Information Extraction & RAG: Efficiently extracts and synthesizes relevant data from large volumes of unstructured information, ideal for retrieval-augmented generation systems
  • Agent Workflows: Building sophisticated AI agents with complex reasoning capabilities
  • Problem Solving & Ambiguity Resolution: Handling ambiguous requirements, inferring unstated assumptions, and interpreting unclear instructions effectively

Managing Context and Costs

Reasoning Effort Control:

GPT-OSS features adjustable reasoning effort levels to optimize for your specific use case:
  • Low effort: Faster responses for simpler tasks with reduced reasoning depth
  • Medium effort: Balanced performance for most use cases (recommended default)
  • High effort: Maximum reasoning for complex problems requiring deep analysis; pair this setting with a max_tokens value of roughly 30,000
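
A sketch of selecting an effort level follows. This assumes the effort level is exposed as a reasoning_effort parameter on chat completions; the parameter name is an assumption, so check the API reference for the exact spelling:

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Design a database schema for a multi-tenant SaaS app."}],
    reasoning_effort="high",  # assumed parameter name; levels: "low" | "medium" | "high"
    max_tokens=30000,  # leave ample room for reasoning tokens at high effort
)
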

Token Management:

When working with reasoning models, it’s crucial to maintain adequate space in the context window:
  • Use max_tokens parameter to control response length and costs
  • Monitor reasoning token usage vs. output tokens - reasoning tokens can vary from hundreds to tens of thousands based on complexity
  • Consider reasoning effort level based on task complexity and budget constraints
  • Simpler problems may only require a few hundred reasoning tokens, while complex challenges could generate extensive reasoning
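
One way to keep an eye on consumption is to inspect the usage object on a non-streaming response (a sketch; reasoning tokens are emitted as part of the response content, so they count toward completion tokens):

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the key risks in this contract: ..."}],
    max_tokens=4096,  # cap total generation to control cost
)

usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")  # includes <think> reasoning tokens
print(f"total tokens:      {usage.total_tokens}")
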

Cost/Latency Optimization:

  • Implement limits on total token generation using the max_tokens parameter
  • Balance thorough reasoning with resource utilization based on your specific requirements
  • Consider using lower reasoning effort for routine tasks and higher effort for critical decisions
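
To make the routine-vs-critical split concrete, a small helper can pick the settings per request (the reasoning_effort parameter name is the same assumption as above, and the budgets are illustrative):

def reasoning_settings(critical: bool) -> dict:
    """Pick effort level and token budget based on task importance."""
    if critical:
        return {"reasoning_effort": "high", "max_tokens": 30000}
    return {"reasoning_effort": "low", "max_tokens": 2048}

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Draft a one-line commit message for a typo fix."}],
    **reasoning_settings(critical=False),
)
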

Technical Architecture

Model Architecture:

  • MoE Design: Token-choice Mixture-of-Experts with SwiGLU activations for improved performance
  • Expert Selection: Softmax-after-topk routing, where the softmax over expert weights is computed only on the top-k selected experts' logits (sketched after this list)
  • Attention Mechanism: RoPE (Rotary Position Embedding) with 128k context length
  • Attention Patterns: Alternating between full-context attention and a sliding 128-token window for efficiency
  • Attention Sink: A learned per-head attention sink that adds an extra additive term to the softmax denominator
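
To make the routing and sink mechanisms concrete, here is a minimal NumPy sketch of softmax-after-topk expert selection and a sink-augmented attention softmax (the shapes, k, and single-head layout are illustrative, not the model's actual configuration):

import numpy as np

def softmax_after_topk(router_logits: np.ndarray, k: int = 4):
    """Select the top-k experts per token, then softmax over only those logits."""
    topk_idx = np.argsort(router_logits, axis=-1)[:, -k:]               # (tokens, k)
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)  # (tokens, k)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # normalize over selected experts only
    return topk_idx, weights

def attention_with_sink(q, k, v, sink_logit: float):
    """Attention softmax whose denominator includes a learned sink term."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (q_len, k_len)
    sink = np.full((scores.shape[0], 1), sink_logit)  # learned per-head scalar
    logits = np.concatenate([scores, sink], axis=-1)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, :-1] @ v  # the sink absorbs probability mass but attends to nothing
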

Tokenization:

  • Standard Compatibility: Uses the same tokenizer as GPT-4o
  • Broad Support: Ensures seamless integration with existing applications and tools
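
For client-side token estimates, the o200k_base encoding that tiktoken ships for GPT-4o should therefore apply (this assumes the tokenizers match as described above; server-side counts are authoritative):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding
tokens = enc.encode("Solve this logic puzzle: If all roses are flowers...")
print(len(tokens))  # rough client-side token count for budgeting
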

Context Handling:

  • 128k Context Window: Large context capacity for processing extensive documents
  • Efficient Patterns: Optimized attention patterns for long-context scenarios
  • Memory Optimization: GPT-OSS 120B is designed to fit efficiently within 80 GB of GPU memory