reasoning output field, and then output a final answer in the content field.
Supported models
| Model | Model ID | Type | Context | Tool Calling |
|---|---|---|---|---|
| DeepSeek-R1 | deepseek-ai/DeepSeek-R1 | Reasoning only | 164K | No |
| DeepSeek V3.1 | deepseek-ai/DeepSeek-V3.1 | Hybrid (off by default) | 164K | Non-reasoning only |
| Qwen3.5 397B A17B | Qwen/Qwen3.5-397B-A17B | Hybrid (on by default) | 128K | No |
| MiniMax M2.5 | MiniMaxAI/MiniMax-M2.5 | Reasoning only | 228.7K | No |
| Kimi K2 Thinking | moonshotai/Kimi-K2-Thinking | Reasoning only | 256K | Yes |
| Kimi K2.5 | moonshotai/Kimi-K2.5 | Hybrid (on by default) | 256K | Yes |
| GLM-5 | zai-org/GLM-5 | Hybrid (on by default) | 200K | Yes |
| GPT-OSS 120B | openai/gpt-oss-120b | Reasoning only (adjustable effort) | 128K | No |
| GPT-OSS 20B | openai/gpt-oss-20b | Reasoning only (adjustable effort) | 128K | No |
- Reasoning only: Always produces reasoning tokens. Cannot be toggled off.
- Hybrid: Supports both reasoning and non-reasoning modes via `reasoning={"enabled": True/False}`.
- Adjustable effort: Supports the `reasoning_effort` parameter to control reasoning depth (`"low"`, `"medium"`, `"high"`).
Quickstart
Most reasoning models return a separate `reasoning` field alongside `content` in the response. Since reasoning models produce longer outputs, we recommend streaming:
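A minimal sketch of consuming such a stream, assuming each streamed delta carries optional `reasoning` and `content` fields; the simulated chunks below stand in for a live API call, and their exact shape is an assumption:

```python
def accumulate(chunks):
    """Collect streamed deltas into separate reasoning and content strings."""
    reasoning, content = [], []
    for delta in chunks:
        if delta.get("reasoning"):   # thinking tokens arrive in their own field
            reasoning.append(delta["reasoning"])
        if delta.get("content"):     # the final answer streams separately
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)

# Simulated stream; a real client would yield these deltas from the API.
stream = [
    {"reasoning": "The user asks 2+2. "},
    {"reasoning": "Addition gives 4."},
    {"content": "2 + 2 = "},
    {"content": "4"},
]
thinking, answer = accumulate(stream)
```

In a real application you would render each delta as it arrives rather than waiting for the full message.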
DeepSeek-R1 uses a different format: it outputs reasoning inside `<think>` tags within the content field rather than a separate `reasoning` field. See Handling reasoning tokens for details.
Enabling and disabling reasoning
Hybrid models let you toggle reasoning on or off using the `reasoning` parameter. This is useful when you want reasoning for complex queries but want faster, cheaper responses for simple ones.
Pass the toggle via `chat_template_kwargs` as `reasoning={"enabled": True/False}`. Supported models:
- `deepseek-ai/DeepSeek-V3.1`
- `Qwen/Qwen3.5-397B-A17B` (on by default)
- `moonshotai/Kimi-K2.5` (on by default)
- `zai-org/GLM-5` (on by default)
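The toggle can be sketched as a request body; the exact nesting under `chat_template_kwargs` is an assumption based on the fragmentary parameter description above:

```python
def build_request(model, messages, enable_reasoning):
    """Sketch of a chat-completion request body for a hybrid model.

    The reasoning toggle is assumed here to travel inside
    chat_template_kwargs; check your provider's reference for the
    exact location.
    """
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"reasoning": {"enabled": enable_reasoning}},
    }

req = build_request(
    "zai-org/GLM-5",
    [{"role": "user", "content": "What is the capital of France?"}],
    enable_reasoning=False,  # fast path for a simple factual query
)
```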
For DeepSeek V3.1, function calling only works in non-reasoning mode (`reasoning={"enabled": False}`).
Reasoning effort
GPT-OSS models support a `reasoning_effort` parameter that controls how much computation the model spends on reasoning. This lets you balance accuracy against cost and latency.
"low": Faster responses for simpler tasks with reduced reasoning depth."medium": Balanced performance for most use cases (recommended default)."high": Maximum reasoning for complex problems. Setmax_tokensto ~30,000 with this setting.
Controlling reasoning depth via prompting
For models that don’t support a `reasoning_effort` parameter, you can influence how much the model thinks by including instructions in your prompt. This is a simple way to reduce token usage and latency when the problem doesn’t warrant deep reasoning.
Ask the model to keep its thinking concise:
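For example, a user message that sets a thinking budget; the exact wording is illustrative, not a required phrase:

```python
# Any phrasing that asks for brief thinking works; this is one example.
messages = [
    {
        "role": "user",
        "content": (
            "Please keep your thinking concise -- a few sentences at most.\n\n"
            "Which is larger, 17**3 or 3**17?"
        ),
    },
]
```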
This technique works across all reasoning models. The model won’t hit an exact word count, but it reliably produces shorter or longer reasoning chains in response to the guidance. Combine it with `max_tokens` for a hard ceiling on total output.
Thinking modes
GLM-5 supports advanced thinking modes that control how reasoning integrates with tool calling and multi-turn conversations.
Interleaved thinking
The default mode. The model reasons between tool calls and after receiving tool results, enabling complex step-by-step reasoning where it interprets each tool output before deciding what to do next.
Preserved thinking
The model retains reasoning content from previous assistant turns in the conversation context, improving reasoning continuity and cache hit rates. This is ideal for coding agents and multi-turn agentic workflows. Enable preserved thinking by setting `clear_thinking` to false:
With `clear_thinking` set to false, the model receives its reasoning from previous turns back in the conversation:
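A sketch of such a request; placing `clear_thinking` under `chat_template_kwargs` is an assumption, modeled on the reasoning toggle described earlier:

```python
# Multi-turn history, including a prior assistant turn whose reasoning
# we want the model to keep seeing.
history = [
    {"role": "user", "content": "Refactor utils.py to remove duplication."},
    {"role": "assistant", "content": "Done. I extracted two shared helpers."},
    {"role": "user", "content": "Now add unit tests for those helpers."},
]

request = {
    "model": "zai-org/GLM-5",
    "messages": history,
    # Assumed location of the flag; False preserves prior-turn reasoning.
    "chat_template_kwargs": {"clear_thinking": False},
}
```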
Turn-level thinking
Control reasoning on a per-turn basis within the same session. Enable thinking for hard turns (planning, debugging) and disable it for simple ones (facts, rewording) to save cost. For a complete tool-calling example with GLM-5 thinking modes, see the GLM-5 Quickstart.
Handling reasoning tokens
There are two patterns for accessing reasoning tokens depending on the model.
Separate reasoning field
Most models (Kimi K2 Thinking, Kimi K2.5, GLM-5, DeepSeek V3.1, GPT-OSS) return reasoning in a dedicated `reasoning` field on the response message or streaming delta:
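A sketch of reading that field from a response-shaped dict; the field values are invented for illustration:

```python
# A response shaped like the JSON these models return (values invented).
response = {
    "choices": [
        {
            "message": {
                "reasoning": "The user wants a summary, so I should be brief.",
                "content": "Here is the answer.",
            }
        }
    ]
}

message = response["choices"][0]["message"]
thinking = message.get("reasoning")  # may be absent when reasoning is off
answer = message["content"]
```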
<think> tags in content
DeepSeek-R1 embeds reasoning directly in the content field using <think> tags:
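A small helper for separating the two parts; the sample string is invented, and the regex assumes a single well-formed `<think>` block at the start of the content:

```python
import re

def split_think(content):
    """Separate <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
    if not match:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer

raw = "<think>2+2 is basic arithmetic; the sum is 4.</think>\n2 + 2 = 4"
thinking, answer = split_think(raw)
```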
Prompting best practices
Reasoning models should be prompted differently than standard models. Here are consolidated recommendations:
| Tip | Details |
|---|---|
| Use the right temperature | DeepSeek-R1: 0.6. Kimi K2 Thinking / Kimi K2.5 (thinking) / GLM-5: 1.0. GPT-OSS: 1.0. Kimi K2.5 (instant): 0.6. |
| System prompts vary by model | DeepSeek-R1: omit system prompts entirely. Kimi models: use "You are Kimi, an AI assistant created by Moonshot AI." GPT-OSS: use the developer role message. |
| Don’t add chain-of-thought instructions | These models already reason step-by-step. Telling them to “think step by step” is unnecessary and can hurt performance. |
| Avoid few-shot examples | Few-shot prompting can degrade performance. Describe the task and desired output format instead. |
| Think in goals, not steps | Provide high-level objectives (e.g., “Analyze this data and identify trends”) and let the model determine the methodology. Over-prompting limits reasoning ability. |
| Structure your prompt | Use XML tags, markdown formatting, or labeled sections to separate different parts of your prompt. |
| Set generous max_tokens | Reasoning tokens can number in the tens of thousands for complex problems. Ensure your max_tokens accommodates both reasoning and content. |
When not to use reasoning
Non-reasoning models are a better fit when:
- Latency is critical: Real-time voice agents, instant-response chatbots, or other applications that need fast responses.
- Tasks are straightforward: Simple classification, basic text generation, factual lookups, or quick summaries don’t benefit from extended reasoning.
- Cost is the priority: High-volume pipelines processing many simple queries. Reasoning tokens significantly increase per-query costs.
Managing costs and latency
Reasoning tokens can vary from a few hundred for simple problems to tens of thousands for complex challenges. Here are strategies to manage costs:
- Use `max_tokens`: Set a token limit to cap total output. This reduces costs but may truncate reasoning on complex problems; find the right balance for your use case.
- Toggle reasoning on hybrid models: Use `reasoning={"enabled": False}` for simple queries and only enable it when the task benefits from deeper analysis.
- Use reasoning effort levels: On GPT-OSS, use `reasoning_effort="low"` for routine tasks and `"high"` for critical decisions.
- Use turn-level thinking: On GLM-5, disable thinking for simple turns and enable it only for complex ones within the same session.
- Prompt for shorter reasoning: Include instructions like “Please be succinct in your thinking” to reduce reasoning token usage on simpler problems. See Controlling reasoning depth via prompting.
- Stream responses: Since reasoning models produce longer outputs, streaming with `stream=True` provides a better user experience by showing partial results as they arrive.