
How to use GPT-OSS API
These models are only available to Build Tier 1 or higher users. Since reasoning models produce longer responses with chain-of-thought processing, we recommend streaming tokens for a better user experience; the chain-of-thought is returned separately from the final answer in the `reasoning` property:
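A minimal streaming sketch using the OpenAI-compatible Python client. The endpoint URL and API key are placeholders, and surfacing the chain-of-thought as a `reasoning` field on each streamed delta is an assumption; check your provider's response schema.

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your provider's values.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain RoPE in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some providers send a final usage-only chunk
        continue
    delta = chunk.choices[0].delta
    # Chain-of-thought tokens (assumed `reasoning` field on the delta).
    if getattr(delta, "reasoning", None):
        print(delta.reasoning, end="", flush=True)
    # Final answer tokens.
    if delta.content:
        print(delta.content, end="", flush=True)
```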
Available Models
Two flexible open-weight models are available to meet different deployment needs:
GPT-OSS 120B:
- Model String: `openai/gpt-oss-120b`
- Hardware Requirements: Fits on a single 80GB GPU
- Architecture: Mixture-of-Experts (MoE) with token-choice routing
- Context Length: 128k tokens with RoPE
- Best for: Enterprise applications requiring maximum reasoning performance
GPT-OSS 20B:
- Model String: `openai/gpt-oss-20b`
- Hardware Requirements: Runs within 16GB of GPU memory
- Architecture: Optimized MoE for efficiency
- Context Length: 128k tokens with RoPE
- Best for: Research, development, and cost-efficient deployments
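To illustrate the model strings above, here is a sketch of a basic (non-streaming) call; the endpoint, key, and the `need_max_quality` flag are placeholders for your own selection logic.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

# 120B for maximum reasoning performance; 20B for cost-efficient deployments.
need_max_quality = True
model = "openai/gpt-oss-120b" if need_max_quality else "openai/gpt-oss-20b"

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "List three risks in this rollout plan."}],
)
print(response.choices[0].message.content)
```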
GPT-OSS Best Practices
Reasoning models like GPT-OSS should be used differently than standard instruct models to get optimal results.
Recommended Parameters:
- Reasoning Effort: Use the adjustable reasoning effort levels to control the trade-off between computational cost and accuracy.
- Temperature: Use 1.0 for maximum creativity and diverse reasoning approaches.
- Top-p: Use 1.0 to allow the full vocabulary distribution for optimal reasoning exploration.
- System Prompt: Provide instructions for the model and its available function tools in a `developer` message.
- System Message: It's recommended not to modify the `system` message, which specifies reasoning effort, meta information such as the knowledge cutoff, and built-in tools.
Prompting Guidance:
- Strengths: Excels at open-ended reasoning, multi-step logic, and inferring unstated requirements
- Avoid over-prompting: Micromanaging steps can limit its advanced reasoning capabilities
- Provide clear objectives: Balance clarity with flexibility for optimal results
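The sketch below puts the recommended sampling parameters together with a `developer` message. Whether your endpoint accepts a `developer` role directly (rather than folding it into the system prompt) is provider-specific, so treat the role name as an assumption.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        # Instructions and tool context go in the developer message; leave the
        # system message (reasoning effort, knowledge cutoff, built-in tools)
        # to the serving stack.
        {"role": "developer", "content": "You review Python code for security issues."},
        {"role": "user", "content": "Audit this snippet: eval(user_input)"},
    ],
    temperature=1.0,  # recommended: diverse reasoning approaches
    top_p=1.0,        # recommended: full vocabulary distribution
)
print(response.choices[0].message.content)
```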
GPT-OSS Use Cases
- Code Review & Analysis: Comprehensive code analysis across large codebases with detailed improvement suggestions
- Strategic Planning: Multi-stage planning with reasoning about optimal approaches and resource allocation
- Complex Document Analysis: Processing legal contracts, technical specifications, and regulatory documents
- AI Model Evaluation: Sophisticated evaluation and benchmarking of other LLMs' responses with contextual understanding, particularly useful in critical validation scenarios
- Scientific Research: Multi-step reasoning for hypothesis generation and experimental design
- Academic Analysis: Deep analysis of research papers and literature reviews
- Information Extraction: Efficiently extracts relevant data from large volumes of unstructured information, ideal for RAG systems
- Agent Workflows: Building sophisticated AI agents with complex reasoning capabilities
- RAG Systems: Enhanced information extraction and synthesis from large knowledge bases
- Problem Solving: Handles ambiguous requirements, infers unstated assumptions, and interprets unclear instructions effectively, seeking clarification when needed
Managing Context and Costs
Reasoning Effort Control:
GPT-OSS features adjustable reasoning effort levels to optimize for your specific use case:
- Low effort: Faster responses for simpler tasks with reduced reasoning depth
- Medium effort: Balanced performance for most use cases (recommended default)
- High effort: Maximum reasoning for complex problems requiring deep analysis. You should also specify a `max_tokens` of ~30,000 with this setting, as shown in the sketch below.
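A sketch of requesting high effort with a matching token budget. Some OpenAI-compatible endpoints expose a `reasoning_effort` parameter while others expect the level in the system message; the parameter name here is an assumption, so check your provider's documentation.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Design a zero-downtime database migration plan."}],
    reasoning_effort="high",  # assumed parameter name; levels: "low" / "medium" / "high"
    max_tokens=30_000,        # leave room for a long chain-of-thought
)
```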
Token Management:
When working with reasoning models, it's crucial to maintain adequate space in the context window:
- Use the `max_tokens` parameter to control response length and costs
- Monitor reasoning token usage vs. output tokens; reasoning tokens can vary from hundreds to tens of thousands depending on complexity (see the sketch after this list)
- Consider the reasoning effort level based on task complexity and budget constraints
- Simpler problems may only require a few hundred reasoning tokens, while complex challenges can generate extensive reasoning
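A sketch of reading token counts from the response. The `completion_tokens_details.reasoning_tokens` field mirrors OpenAI's usage schema and is an assumption here; your provider may report reasoning tokens differently.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=8_000,  # cap total generation (reasoning + answer) to control cost
)

usage = response.usage
print("prompt tokens:    ", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
details = getattr(usage, "completion_tokens_details", None)  # assumed field
if details is not None:
    print("reasoning tokens: ", details.reasoning_tokens)
```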
Cost/Latency Optimization:
- Implement limits on total token generation using the `max_tokens` parameter
- Balance thorough reasoning with resource utilization based on your specific requirements
- Consider using lower reasoning effort for routine tasks and higher effort for critical decisions; a routing sketch follows this list
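One way to operationalize this is a small routing table from task complexity to effort level and token budget; the tiers and numbers below are illustrative, not prescriptive.

```python
def effort_settings(complexity: str) -> tuple[str, int]:
    """Map task complexity to (reasoning_effort, max_tokens). Illustrative values."""
    return {
        "routine":  ("low",     2_000),  # quick lookups, formatting, simple Q&A
        "standard": ("medium",  8_000),  # typical analysis and drafting
        "critical": ("high",   30_000),  # deep multi-step reasoning
    }[complexity]

effort, budget = effort_settings("critical")
```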
Technical Architecture
Model Architecture:
- MoE Design: Token-choice Mixture-of-Experts with SwiGLU activations for improved performance
- Expert Selection: Softmax-after-topk approach for calculating MoE weights, ensuring optimal expert utilization
- Attention Mechanism: RoPE (Rotary Position Embedding) with 128k context length
- Attention Patterns: Alternating between full context and sliding 128-token window for efficiency
- Attention Sink: Learned attention sink per-head with additional additive value in the softmax denominator
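To make the routing step concrete, here is an illustrative sketch of softmax-after-top-k weighting: the router keeps the top-k expert logits per token and applies softmax over only those k, rather than softmaxing over all experts and renormalizing. The shapes and k below are illustrative, not the model's actual configuration.

```python
import torch

def topk_softmax_weights(router_logits: torch.Tensor, k: int = 4):
    """router_logits: (num_tokens, num_experts) -> per-token weights over k experts."""
    top_vals, top_idx = torch.topk(router_logits, k, dim=-1)
    weights = torch.softmax(top_vals, dim=-1)  # softmax AFTER the top-k cut
    return weights, top_idx

logits = torch.randn(8, 32)                  # 8 tokens, 32 experts (illustrative)
weights, experts = topk_softmax_weights(logits)
assert torch.allclose(weights.sum(-1), torch.ones(8))  # weights sum to 1 per token
```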
Tokenization:
- Standard Compatibility: Uses the `o200k_harmony` tokenizer, a superset of the `o200k_base` tokenizer used by GPT-4o
- Broad Support: Ensures seamless integration with existing applications and tools
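A quick token-counting sketch. GPT-4o's public encoding in tiktoken is `o200k_base`; given the compatibility note above, counts from it should closely track GPT-OSS usage, since the chat-formatted encoding adds special role and formatting tokens on top of the same vocabulary.

```python
import tiktoken

# o200k_base is GPT-4o's encoding; o200k_harmony extends it with the
# special tokens used by GPT-OSS's chat format.
enc = tiktoken.get_encoding("o200k_base")
prompt = "Summarize the attention-sink mechanism in one sentence."
print(len(enc.encode(prompt)), "tokens")
```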
Context Handling:
- 128k Context Window: Large context capacity for processing extensive documents
- Efficient Patterns: Optimized attention patterns for long-context scenarios
- Memory Optimization: GPT-OSS 120B is designed to fit efficiently within 80GB of GPU memory