reasoning output field, and then output a final answer in the content field.
Supported models
| Model | Model ID | Type | Context | Tool Calling |
|---|---|---|---|---|
| DeepSeek-R1 | deepseek-ai/DeepSeek-R1 | Reasoning only | 164K | No |
| DeepSeek V3.1 | deepseek-ai/DeepSeek-V3.1 | Hybrid (off by default) | 164K | Non-reasoning only |
| Qwen3.5 397B A17B | Qwen/Qwen3.5-397B-A17B | Hybrid (on by default) | 128K | No |
| MiniMax M2.5 | MiniMaxAI/MiniMax-M2.5 | Reasoning only | 228.7K | No |
| Kimi K2 Thinking | moonshotai/Kimi-K2-Thinking | Reasoning only | 256K | Yes |
| Kimi K2.5 | moonshotai/Kimi-K2.5 | Hybrid (on by default) | 256K | Yes |
| GLM-5 | zai-org/GLM-5 | Hybrid (on by default) | 200K | Yes |
| GPT-OSS 120B | openai/gpt-oss-120b | Reasoning only (adjustable effort) | 128K | No |
| GPT-OSS 20B | openai/gpt-oss-20b | Reasoning only (adjustable effort) | 128K | No |
- Reasoning only: Always produces reasoning tokens. Cannot be toggled off.
- Hybrid: Supports both reasoning and non-reasoning modes via `reasoning={"enabled": True/False}`.
- Adjustable effort: Supports the `reasoning_effort` parameter to control reasoning depth (`"low"`, `"medium"`, `"high"`).
Quickstart
Most reasoning models return a separate `reasoning` field alongside `content` in the response. Since reasoning models produce longer outputs, we recommend streaming:
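A minimal sketch of consuming such a stream, assuming each streamed delta carries optional `reasoning` and `content` fields; the simulated chunks below stand in for a live API call, and their exact shape is an assumption:

```python
def accumulate(chunks):
    """Collect streamed deltas into separate reasoning and content strings."""
    reasoning, content = [], []
    for delta in chunks:
        if delta.get("reasoning"):   # thinking tokens arrive in their own field
            reasoning.append(delta["reasoning"])
        if delta.get("content"):     # the final answer streams separately
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)

# Simulated stream; a real client would yield these deltas from the API.
stream = [
    {"reasoning": "The user asks 2+2. "},
    {"reasoning": "Addition gives 4."},
    {"content": "2 + 2 = "},
    {"content": "4"},
]
thinking, answer = accumulate(stream)
```

In a real application you would render each delta as it arrives rather than waiting for the full message.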
DeepSeek-R1 uses a different format: it outputs reasoning inside `<think>` tags within the content field rather than a separate `reasoning` field. See Handling reasoning tokens for details.
Enabling and disabling reasoning
Hybrid models let you toggle reasoning on or off using the `reasoning` parameter. This is useful when you want reasoning for complex queries but want faster, cheaper responses for simple ones.
Pass the toggle via `chat_template_kwargs` as `reasoning={"enabled": True/False}`. Supported models:
- `deepseek-ai/DeepSeek-V3.1`
- `Qwen/Qwen3.5-397B-A17B` (on by default)
- `moonshotai/Kimi-K2.5` (on by default)
- `zai-org/GLM-5` (on by default)
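The toggle can be sketched as a request body; the exact nesting under `chat_template_kwargs` is an assumption based on the fragmentary parameter description above:

```python
def build_request(model, messages, enable_reasoning):
    """Sketch of a chat-completion request body for a hybrid model.

    The reasoning toggle is assumed here to travel inside
    chat_template_kwargs; check your provider's reference for the
    exact location.
    """
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"reasoning": {"enabled": enable_reasoning}},
    }

req = build_request(
    "zai-org/GLM-5",
    [{"role": "user", "content": "What is the capital of France?"}],
    enable_reasoning=False,  # fast path for a simple factual query
)
```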
For DeepSeek V3.1, function calling only works in non-reasoning mode (`reasoning={"enabled": False}`).
Reasoning effort
GPT-OSS models support a `reasoning_effort` parameter that controls how much computation the model spends on reasoning. This lets you balance accuracy against cost and latency.
"low": Faster responses for simpler tasks with reduced reasoning depth."medium": Balanced performance for most use cases (recommended default)."high": Maximum reasoning for complex problems. Setmax_tokensto ~30,000 with this setting.
Controlling reasoning depth via prompting
For models that don’t support a `reasoning_effort` parameter, you can influence how much the model thinks by including instructions in your prompt. This is a simple way to reduce token usage and latency when the problem doesn’t warrant deep reasoning.
Ask the model to keep its thinking concise:
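For example, a user message that sets a thinking budget; the exact wording is illustrative, not a required phrase:

```python
# Any phrasing that asks for brief thinking works; this is one example.
messages = [
    {
        "role": "user",
        "content": (
            "Please keep your thinking concise -- a few sentences at most.\n\n"
            "Which is larger, 17**3 or 3**17?"
        ),
    },
]
```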
This technique works across all reasoning models. The model won’t hit an exact word count, but it reliably produces shorter or longer reasoning chains in response to the guidance. Combine it with `max_tokens` for a hard ceiling on total output.
Thinking modes
GLM-5 supports advanced thinking modes that control how reasoning integrates with tool calling and multi-turn conversations.
Interleaved thinking
The default mode. The model reasons between tool calls and after receiving tool results, enabling complex step-by-step reasoning where it interprets each tool output before deciding what to do next.
Preserved thinking
The model retains reasoning content from previous assistant turns in the conversation context, improving reasoning continuity and cache hit rates. This is ideal for coding agents and multi-turn agentic workflows. Enable preserved thinking by setting `clear_thinking` to false:
With `clear_thinking` set to false, the model receives its reasoning from previous turns back in the conversation:
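A sketch of such a request; placing `clear_thinking` under `chat_template_kwargs` is an assumption, modeled on the reasoning toggle described earlier:

```python
# Multi-turn history, including a prior assistant turn whose reasoning
# we want the model to keep seeing.
history = [
    {"role": "user", "content": "Refactor utils.py to remove duplication."},
    {"role": "assistant", "content": "Done. I extracted two shared helpers."},
    {"role": "user", "content": "Now add unit tests for those helpers."},
]

request = {
    "model": "zai-org/GLM-5",
    "messages": history,
    # Assumed location of the flag; False preserves prior-turn reasoning.
    "chat_template_kwargs": {"clear_thinking": False},
}
```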
Turn-level thinking
Control reasoning on a per-turn basis within the same session. Enable thinking for hard turns (planning, debugging) and disable it for simple ones (facts, rewording) to save cost. For a complete tool-calling example with GLM-5 thinking modes, see the GLM-5 Quickstart.
Handling reasoning tokens
There are two patterns for accessing reasoning tokens depending on the model.
Separate reasoning field
Most models (Kimi K2 Thinking, Kimi K2.5, GLM-5, DeepSeek V3.1, GPT-OSS) return reasoning in a dedicated `reasoning` field on the response message or streaming delta:
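A sketch of reading that field from a response-shaped dict; the field values are invented for illustration:

```python
# A response shaped like the JSON these models return (values invented).
response = {
    "choices": [
        {
            "message": {
                "reasoning": "The user wants a summary, so I should be brief.",
                "content": "Here is the answer.",
            }
        }
    ]
}

message = response["choices"][0]["message"]
thinking = message.get("reasoning")  # may be absent when reasoning is off
answer = message["content"]
```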
<think> tags in content
DeepSeek-R1 embeds reasoning directly in the content field using <think> tags:
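A small helper for separating the two parts; the sample string is invented, and the regex assumes a single well-formed `<think>` block at the start of the content:

```python
import re

def split_think(content):
    """Separate <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
    if not match:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer

raw = "<think>2+2 is basic arithmetic; the sum is 4.</think>\n2 + 2 = 4"
thinking, answer = split_think(raw)
```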
Prompting best practices
Reasoning models should be prompted differently than standard models. Here are consolidated recommendations:
| Tip | Details |
|---|---|
| Use the right temperature | DeepSeek-R1: 0.6. Kimi K2 Thinking / Kimi K2.5 (thinking) / GLM-5: 1.0. GPT-OSS: 1.0. Kimi K2.5 (instant): 0.6. |
| System prompts vary by model | DeepSeek-R1: omit system prompts entirely. Kimi models: use "You are Kimi, an AI assistant created by Moonshot AI." GPT-OSS: use the developer role message. |
| Don’t add chain-of-thought instructions | These models already reason step-by-step. Telling them to “think step by step” is unnecessary and can hurt performance. |
| Avoid few-shot examples | Few-shot prompting can degrade performance. Describe the task and desired output format instead. |
| Think in goals, not steps | Provide high-level objectives (e.g., “Analyze this data and identify trends”) and let the model determine the methodology. Over-prompting limits reasoning ability. |
| Structure your prompt | Use XML tags, markdown formatting, or labeled sections to separate different parts of your prompt. |
| Set generous max_tokens | Reasoning tokens can number in the tens of thousands for complex problems. Ensure your max_tokens accommodates both reasoning and content. |
When not to use reasoning
Non-reasoning models are a better fit when:
- Latency is critical: Real-time voice agents, instant-response chatbots, or other applications that need fast responses.
- Tasks are straightforward: Simple classification, basic text generation, factual lookups, or quick summaries don’t benefit from extended reasoning.
- Cost is the priority: High-volume pipelines processing many simple queries. Reasoning tokens significantly increase per-query costs.
Managing costs and latency
Reasoning tokens can vary from a few hundred for simple problems to tens of thousands for complex challenges. Here are strategies to manage costs:
- Use `max_tokens`: Set a token limit to cap total output. This reduces costs but may truncate reasoning on complex problems; find the right balance for your use case.
- Toggle reasoning on hybrid models: Use `reasoning={"enabled": False}` for simple queries and only enable it when the task benefits from deeper analysis.
- Use reasoning effort levels: On GPT-OSS, use `reasoning_effort="low"` for routine tasks and `"high"` for critical decisions.
- Use turn-level thinking: On GLM-5, disable thinking for simple turns and enable it only for complex ones within the same session.
- Prompt for shorter reasoning: Include instructions like “Please be succinct in your thinking” to reduce reasoning token usage on simpler problems. See Controlling reasoning depth via prompting.
- Stream responses: Since reasoning models produce longer outputs, streaming with `stream=True` provides a better user experience by showing partial results as they arrive.