Skip to main content
Reasoning models are trained to think step-by-step before responding with an answer. Given an input prompt, they first produce a chain of thought, visible as tokens that show up in the reasoning output field, and then output a final answer in the content field.

Supported models

ModelModel IDTypeContextTool Calling
DeepSeek-R1deepseek-ai/DeepSeek-R1Reasoning only164KNo
DeepSeek V3.1deepseek-ai/DeepSeek-V3.1Hybrid (off by default)164KNon-reasoning only
Qwen3.5 397B A17BQwen/Qwen3.5-397B-A17BHybrid (on by default)128KNo
Minimax M2.5MiniMaxAI/MiniMax-M2.5Reasoning only228.7KNo
Kimi K2 Thinkingmoonshotai/Kimi-K2-ThinkingReasoning only256KYes
Kimi K2.5moonshotai/Kimi-K2.5Hybrid (on by default)256KYes
GLM-5zai-org/GLM-5Hybrid (on by default)200KYes
GPT-OSS 120Bopenai/gpt-oss-120bReasoning only (adjustable effort)128KNo
GPT-OSS 20Bopenai/gpt-oss-20bReasoning only (adjustable effort)128KNo
Type definitions:
  • Reasoning only: Always produces reasoning tokens. Cannot be toggled off.
  • Hybrid: Supports both reasoning and non-reasoning modes via reasoning={"enabled": True/False}.
  • Adjustable effort: Supports reasoning_effort parameter to control reasoning depth ("low", "medium", "high").

Quickstart

Most reasoning models return a separate reasoning field alongside content in the response. Since reasoning models produce longer outputs, we recommend streaming:
from together import Together

client = Together()

stream = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": "Which number is bigger, 9.11 or 9.9?",
        }
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices:
        delta = chunk.choices[0].delta

        # Show reasoning tokens if present
        if hasattr(delta, "reasoning") and delta.reasoning:
            print(delta.reasoning, end="", flush=True)

        # Show content tokens if present
        if hasattr(delta, "content") and delta.content:
            print(delta.content, end="", flush=True)
The response contains both the model’s reasoning process and the final answer:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "9.9 is bigger than 9.11.",
        "reasoning": "Let me compare 9.11 and 9.9. Both have 9 as the integer part, so I need to compare the decimal parts: 0.11 vs 0.9. Since 0.9 = 0.90, and 0.90 > 0.11, we know 9.9 > 9.11."
      }
    }
  ]
}
DeepSeek-R1 uses a different format, it outputs reasoning inside <think> tags within the content field rather than a separate reasoning field. See Handling reasoning tokens for details.

Enabling and disabling reasoning

Hybrid models let you toggle reasoning on or off using the reasoning parameter. This is useful when you want reasoning for complex queries but want faster, cheaper responses for simple ones.
from together import Together

client = Together()

# Enable reasoning
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": "Prove that the square root of 2 is irrational.",
        }
    ],
    reasoning={"enabled": True},
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta

    if hasattr(delta, "reasoning") and delta.reasoning:
        print(delta.reasoning, end="", flush=True)

    if hasattr(delta, "content") and delta.content:
        print(delta.content, end="", flush=True)
Alternatively, you can enable or disable reasoning using chat_template_kwargs:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[
        {
            "role": "user",
            "content": "Prove that the square root of 2 is irrational.",
        }
    ],
    chat_template_kwargs={
        "thinking": True,
        # or use "enable_thinking": True
    },
    stream=True,
)
GLM-5 has thinking enabled by default. Pass reasoning={"enabled": False} to disable it for simple tasks where reasoning overhead isn’t needed.
The following models support reasoning={"enabled": True/False}:
  • deepseek-ai/DeepSeek-V3.1
  • Qwen/Qwen3.5-397B-A17B (on by default)
  • moonshotai/Kimi-K2.5 (on by default)
  • zai-org/GLM-5 (on by default)
For DeepSeek V3.1, function calling only works in non-reasoning mode (reasoning={"enabled": False}).

Reasoning effort

GPT-OSS models support a reasoning_effort parameter that controls how much computation the model spends on reasoning. This lets you balance accuracy against cost and latency.
  • "low": Faster responses for simpler tasks with reduced reasoning depth.
  • "medium": Balanced performance for most use cases (recommended default).
  • "high": Maximum reasoning for complex problems. Set max_tokens to ~30,000 with this setting.
from together import Together

client = Together()

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {
            "role": "user",
            "content": "Solve: If all roses are flowers and some flowers are red, can we conclude that some roses are red?",
        }
    ],
    temperature=1.0,
    top_p=1.0,
    reasoning_effort="medium",
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Controlling reasoning depth via prompting

For models that don’t support a reasoning_effort parameter, you can influence how much the model thinks by including instructions in your prompt. This is a simple way to reduce token usage and latency when the problem doesn’t warrant deep reasoning. Ask the model to keep its thinking concise:
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {
            "role": "user",
            "content": "Please be succinct in your thinking.\n\nWhat is the derivative of x^3 + 2x?",
        }
    ],
    stream=True,
)
You can also suggest an approximate budget for the reasoning process:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {
            "role": "user",
            "content": "Please use around 1000 words to think, but do not literally count each one.\n\nExplain why quicksort has O(n log n) average-case complexity.",
        }
    ],
    stream=True,
)
This technique works across all reasoning models. The model won’t hit an exact word count, but it reliably produces shorter or longer reasoning chains in response to the guidance. Combine it with max_tokens for a hard ceiling on total output.

Thinking modes

GLM-5 supports advanced thinking modes that control how reasoning integrates with tool calling and multi-turn conversations.

Interleaved thinking

The default mode. The model reasons between tool calls and after receiving tool results, enabling complex step-by-step reasoning where it interprets each tool output before deciding what to do next.
from together import Together

client = Together()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "What's the weather in Paris and Tokyo?"}
    ],
    tools=tools,
)

print(
    json.dumps(
        response.choices[0].message.model_dump()["tool_calls"],
        indent=2,
    )
)
In this mode, the model will reason about which tool to call first, interpret the result, then reason again before making the next call.

Preserved thinking

The model retains reasoning content from previous assistant turns in the conversation context, improving reasoning continuity and cache hit rates. This is ideal for coding agents and multi-turn agentic workflows. Enable preserved thinking by setting clear_thinking to false:
response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=messages,
    tools=tools,
    stream=True,
    chat_template_kwargs={
        "clear_thinking": False,  # Preserved Thinking
    },
)
When using preserved thinking, include the unmodified reasoning from previous turns back in the conversation:
messages.append(
    {
        "role": "assistant",
        "content": content,
        "reasoning": reasoning,  # Return reasoning content faithfully
        "tool_calls": tool_calls,
    }
)
When using preserved thinking, all consecutive reasoning blocks must exactly match the original sequence generated by the model. Do not reorder or edit these blocks — otherwise performance may degrade and cache hit rates will be affected.

Turn-level thinking

Control reasoning on a per-turn basis within the same session. Enable thinking for hard turns (planning, debugging) and disable it for simple ones (facts, rewording) to save cost. For a complete tool-calling example with GLM-5 thinking modes, see the GLM-5 Quickstart.

Handling reasoning tokens

There are two patterns for accessing reasoning tokens depending on the model.

Separate reasoning field

Most models (Kimi K2 Thinking, Kimi K2.5, GLM-5, DeepSeek V3.1, GPT-OSS) return reasoning in a dedicated reasoning field on the response message or streaming delta:
from together import Together

client = Together()

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": "Say test 10 times",
        }
    ],
)

print("Reasoning:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)

<think> tags in content

DeepSeek-R1 embeds reasoning directly in the content field using <think> tags:
<think>
Let me compare 9.11 and 9.9 by looking at their decimal parts...
0.11 vs 0.9 — since 0.9 is larger, 9.9 > 9.11.
</think>

**Answer:** 9.9 is bigger.
To extract the reasoning and answer separately:
import re

content = response.choices[0].message.content

think_match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)

reasoning = think_match.group(1).strip() if think_match else ""

answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

Prompting best practices

Reasoning models should be prompted differently than standard models. Here are consolidated recommendations:
TipDetails
Use the right temperatureDeepSeek-R1: 0.6. Kimi K2 Thinking / Kimi K2.5 (thinking) / GLM-5: 1.0. GPT-OSS: 1.0. Kimi K2.5 (instant): 0.6.
System prompts vary by modelDeepSeek-R1: omit system prompts entirely. Kimi models: use "You are Kimi, an AI assistant created by Moonshot AI." GPT-OSS: use the developer role message.
Don’t add chain-of-thought instructionsThese models already reason step-by-step. Telling them to “think step by step” is unnecessary and can hurt performance.
Avoid few-shot examplesFew-shot prompting can degrade performance. Describe the task and desired output format instead.
Think in goals, not stepsProvide high-level objectives (e.g., “Analyze this data and identify trends”) and let the model determine the methodology. Over-prompting limits reasoning ability.
Structure your promptUse XML tags, markdown formatting, or labeled sections to separate different parts of your prompt.
Set generous max_tokensReasoning tokens can number in the tens of thousands for complex problems. Ensure your max_tokens accommodates both reasoning and content.

When not to use reasoning

Non-reasoning models are a better fit when:
  • Latency is critical: Real-time voice agents, instant-response chatbots, or other applications that need fast responses.
  • Tasks are straightforward: Simple classification, basic text generation, factual lookups, or quick summaries don’t benefit from extended reasoning.
  • Cost is the priority: High-volume pipelines processing many simple queries. Reasoning tokens significantly increase per-query costs.
For these use cases, consider models like Kimi K2, DeepSeek V3, or Llama 4.

Managing costs and latency

Reasoning tokens can vary from a few hundred for simple problems to tens of thousands for complex challenges. Here are strategies to manage costs:
  • Use max_tokens: Set a token limit to cap total output. This reduces costs but may truncate reasoning on complex problems — find the right balance for your use case.
  • Toggle reasoning on hybrid models: Use reasoning={"enabled": False} for simple queries and only enable it when the task benefits from deeper analysis.
  • Use reasoning effort levels: On GPT-OSS, use reasoning_effort="low" for routine tasks and "high" for critical decisions.
  • Use turn-level thinking: On GLM-5, disable thinking for simple turns and enable it only for complex ones within the same session.
  • Prompt for shorter reasoning: Include instructions like “Please be succinct in your thinking” to reduce reasoning token usage on simpler problems. See Controlling reasoning depth via prompting.
  • Stream responses: Since reasoning models produce longer outputs, streaming with stream=True provides a better user experience by showing partial results as they arrive.