reasoning or reasoning_content alongside the final assistant response, you shape how the model thinks through problems before producing an answer.
This page covers the reasoning data shape, supported models, and launch parameters.
Supported models
| Organization | Model | API ID |
|---|---|---|
| Qwen | Qwen 3 (0.6B, 1.7B, 4B, 8B, 14B, 32B, 30B-A3B, 235B-A22B) and Base variants | Qwen/Qwen3-* |
| Qwen | Qwen 3 Next 80B A3B Thinking | Qwen/Qwen3-Next-80B-A3B-Thinking |
| Qwen | Qwen 3.5 (0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, 397B-A17B) | Qwen/Qwen3.5-* |
| Qwen | Qwen 3.6 35B A3B | Qwen/Qwen3.6-35B-A3B |
| Z.ai | GLM 4.6, GLM 4.7, GLM 5, GLM 5.1 | zai-org/GLM-* |
| Gemma 4 31B IT, Gemma 4 26B A4B IT | google/gemma-4-* | |
| OpenAI | GPT-OSS 20B, GPT-OSS 120B | openai/gpt-oss-* |
Prepare your data
Prepare data in a JSONL file. Each assistant message should carry the chain of thought in areasoning (or reasoning_content) field and the final answer in content.
Conversational format
When fine-tuning reasoning models on conversational data, only the last assistant message is trained on by default. For multi-turn reasoning, split the conversation so each assistant message is the final message in its own example.
Preference format
For preference fine-tuning, both outputs carryreasoning. See preference tuning for the broader DPO workflow.
Validate and upload
Upload your data using the Together Python/TypeScript SDK or the Together CLI:Launch the job
LoRA is the default. Passlora=False for full fine-tuning.
Watch and deploy
Reasoning jobs use the same lifecycle as text jobs:- Poll the job with the SDK or CLI. Expect 10 to 30 minutes for a LoRA job on an 8B model with a few thousand examples.
- Deploy the result on a dedicated endpoint.
- Call the endpoint with the same chat-completions shape. The model emits
reasoning_contentalongsidecontentfor clients that surface it. See Inference → Reasoning for details.