How to use GLM-5
Get started with this model in just a few lines of code. The model ID is zai-org/GLM-5, and it supports a 200K context window with up to 128K output tokens. Thinking is enabled by default, so you'll receive both reasoning tokens and content tokens.
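A minimal sketch, assuming an OpenAI-compatible endpoint (the base URL below is a placeholder; use your provider's actual endpoint and key):

```python
from openai import OpenAI

# Placeholder endpoint: substitute whichever provider hosts GLM-5.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[{"role": "user", "content": "Explain B-trees in two sentences."}],
)

# Thinking is enabled by default, so the reply may include reasoning tokens
# in addition to the answer. Many OpenAI-compatible servers surface them as
# `reasoning_content`; fall back gracefully if the field is absent.
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))
print(message.content)
```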
Thinking Modes
GLM-5 has thinking enabled by default and supports multiple thinking modes for different scenarios:
- Interleaved Thinking (default): The model thinks between tool calls and after receiving tool results, enabling complex step-by-step reasoning; it interprets each tool output before deciding what to do next.
- Preserved Thinking: The model retains reasoning content from previous assistant turns in the context, improving reasoning continuity and cache hit rates. Ideal for coding agents and agentic workflows.
- Turn-level Thinking: Control reasoning on a per-turn basis within the same session, enabling thinking for hard turns and disabling it for simple ones (see the sketch after this list).
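How these modes map onto requests depends on your serving stack. As a rough sketch, assuming an OpenAI-compatible server that reads a chat_template_kwargs dict and reusing the client from the quick-start sketch (this guide confirms clear_thinking; enable_thinking is an assumed name for the per-turn toggle, so check your provider's docs):

```python
# Interleaved Thinking is the default: no extra parameters required.

# Preserved Thinking: keep reasoning from prior assistant turns in context.
preserved = {"chat_template_kwargs": {"clear_thinking": False}}

# Turn-level Thinking: toggle reasoning per request within a session.
# "enable_thinking" is an assumed flag name, not confirmed by this guide.
thinking_off = {"chat_template_kwargs": {"enable_thinking": False}}

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[{"role": "user", "content": "Reword this sentence politely: ..."}],
    extra_body=thinking_off,  # simple turn, skip the reasoning tokens
)
```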
Recommended Thinking Mode by Use Case
| Scenario | Mode | Rationale |
|---|---|---|
| General chat | Interleaved Thinking (default) | Step-by-step reasoning between tool calls |
| Coding agents (e.g., Claude Code, Roo Code) | Interleaved + Preserved Thinking | Retains reasoning across turns for continuity |
| Simple factual queries | Thinking disabled | Faster responses, lower cost |
Disabling Thinking
For lightweight tasks where you don't need the model to reason, disable thinking for the request:
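A minimal sketch, reusing the client from the quick-start sketch and the assumed enable_thinking flag from the Thinking Modes sketch above:

```python
response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[{"role": "user", "content": "What is the capital of Australia?"}],
    # "enable_thinking" is an assumed flag name; only chat_template_kwargs
    # itself is confirmed by this guide.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```

Tool Calling with Interleaved and Preserved Thinking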
GLM-5 excels at multi-turn tool calling with reasoning interleaved between each step. The model thinks about each tool result before deciding what to do next, enabling sophisticated agentic workflows. GLM-5 also supports streaming tool calls: set stream=True to receive tool call parameters in real time as they're generated, rather than waiting for the complete function call.
For agentic workflows, we recommend enabling Preserved Thinking so the model retains reasoning from previous turns. Set "clear_thinking": false in chat_template_kwargs to keep reasoning content in context.
The example below demonstrates a multi-turn conversation where the model:
- Reasons about the user’s request and calls a weather tool
- Receives the tool result, reasons about it, and responds naturally
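A minimal sketch of that flow, assuming an OpenAI-compatible client, a hypothetical get_weather function standing in for a real weather API, and the "clear_thinking": false setting described above:

```python
import json

from openai import OpenAI

# Placeholder endpoint: substitute whichever provider hosts GLM-5.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Hypothetical stand-in for a real weather API call.
    return json.dumps({"city": city, "temp_c": 18, "condition": "cloudy"})

preserved = {"chat_template_kwargs": {"clear_thinking": False}}  # Preserved Thinking
messages = [{"role": "user", "content": "What's the weather in Paris right now?"}]

# Turn 1: the model reasons about the request and emits a tool call.
first = client.chat.completions.create(
    model="zai-org/GLM-5", messages=messages, tools=tools, extra_body=preserved,
)
assistant_msg = first.choices[0].message

# Append the assistant message unmodified, reasoning content included, so
# Preserved Thinking can reuse it on the next turn.
messages.append(assistant_msg)

# Execute each requested tool call and append its result.
for call in assistant_msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": get_weather(**args),
    })

# Turn 2: the model reasons about the tool result and answers naturally.
second = client.chat.completions.create(
    model="zai-org/GLM-5", messages=messages, tools=tools, extra_body=preserved,
)
print(second.choices[0].message.content)
```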
When using Preserved Thinking, all consecutive reasoning blocks must exactly match the original sequence generated by the model. Do not reorder or edit these blocks; otherwise, performance may degrade and cache hit rates will drop.

Use Cases
GLM-5 excels in scenarios requiring deep reasoning and autonomous, multi-step execution:
- Complex Systems Engineering: Tackle multi-component system design, architecture decisions, and integration challenges that require reasoning through dependencies and trade-offs
- Long-Horizon Agentic Workflows: Build autonomous agents that maintain coherent goal-directed behavior across extended sequences of tool calls — stable across 200+ sequential invocations
- Coding & Debugging: Solve complex software engineering tasks (SWE-bench, Terminal Bench), generate patches, debug intricate issues, and reason through large codebases
- Multi-Step Research & Analysis: Automate research workflows using tools and APIs with interleaved reasoning between each step
- STEM Problem-Solving: Advanced math, logic puzzles, and scientific reasoning with transparent chain-of-thought processing
- Tool Orchestration: Build agents that chain multiple tool calls with reasoning steps, making finer-grained decisions based on intermediate results
Prompting Tips
| Tip | Rationale |
|---|---|
| Temperature = 1.0, top_p = 0.95 | Recommended defaults for most tasks. Avoid tuning both simultaneously — pick one to adjust. |
| Temperature = 0.7 for SWE tasks | Use lower temperature with top_p=1.0 for SWE-bench and Terminal Bench tasks. |
| Temperature = 0 for Tau2-Bench | Use temperature=0 with max_tokens=16384 for multi-turn agentic benchmarks. |
| Think in goals, not steps | GLM-5 is agentic — give high-level objectives and let it orchestrate sub-tasks and tool calls. |
| Use Preserved Thinking for agents | Set "clear_thinking": false in chat_template_kwargs for coding agents and multi-turn agentic workflows to maintain reasoning continuity. |
| Return reasoning content faithfully | When using Preserved Thinking, always return the unmodified reasoning from previous turns back to the API. |
| Use Turn-level Thinking to save cost | Disable thinking on simple turns (facts, rewording) and enable it on complex turns (planning, debugging) within the same session. |
| Set generous max tokens | GLM-5 supports up to 128K output tokens. Default max_tokens of 131072 accommodates deep reasoning. For SWE and agentic benchmark tasks, 16384 is sufficient. |
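As a quick reference, the sampling recommendations above expressed as request parameters (a sketch, reusing the client from the quick-start example):

```python
# Recommended defaults for most tasks; adjust temperature or top_p, not both.
general = dict(temperature=1.0, top_p=0.95)

# SWE-bench and Terminal Bench style coding tasks.
swe = dict(temperature=0.7, top_p=1.0, max_tokens=16384)

# Multi-turn agentic benchmarks such as Tau2-Bench.
agentic = dict(temperature=0.0, max_tokens=16384)

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[{"role": "user", "content": "Fix the failing unit test in this repo."}],
    **swe,
)
```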
General Limitations
GLM-5 is optimized for deep reasoning and agentic tasks, but there are scenarios where other models may be a better fit:
- Latency-sensitive applications: The reasoning process generates additional tokens, making GLM-5 slower than non-reasoning models. For real-time voice agents or instant-response scenarios, consider a non-reasoning model.
- Simple, direct tasks: For straightforward classification, basic text generation, or quick factual lookups, the reasoning overhead adds unnecessary cost and latency — disable thinking or use a faster model.
- Cost-sensitive high-volume pipelines: Reasoning tokens increase output volume. If you’re processing many simple queries at scale, consider using Turn-level Thinking to selectively enable reasoning only where it adds value.