zai-org/GLM-5.2. Pricing is $1.40 per 1M input tokens, $4.40 per 1M output tokens, and $0.26 per 1M cached input tokens.
How to use GLM-5.2
Get started with this model in a few lines of code. Since thinking is on by default, stream the response and handle both thereasoning and content fields on each delta.
Reasoning effort
GLM-5.2 accepts two effort levels through thereasoning_effort parameter when thinking is enabled:
"high": enhanced reasoning. Use for most coding and reasoning tasks."max": deep reasoning, the default. Use for the hardest planning, architecture, and multi-step agentic problems. Setmax_tokensgenerously, since"max"mode can produce long chains of thought.
Thinking modes
GLM-5.2 has thinking enabled by default and supports multiple thinking modes for different scenarios:- Interleaved thinking (default): the model thinks between tool calls and after receiving tool results, interpreting each tool output before deciding what to do next.
- Preserved thinking: the model retains reasoning content from previous assistant turns in the context, improving reasoning continuity and cache hit rates. Ideal for coding agents and agentic workflows.
- Turn-level thinking: control reasoning on a per-turn basis within the same session. Enable thinking for hard turns, disable it for simple ones.
Recommended thinking mode by use case
| Scenario | Mode | Rationale |
|---|---|---|
| General chat | Interleaved thinking (default) | Step-by-step reasoning between tool calls. |
| Coding agents (e.g., Claude Code, Codex) | Interleaved and preserved thinking | Retains reasoning across turns for continuity. |
| Simple factual queries | Thinking disabled | Faster responses, lower cost. |
Disabling thinking
For lightweight tasks where you don’t need the model to reason, passreasoning={"enabled": False}. This disables the chain of thought and returns only the final answer.
Preserved thinking
For agentic workflows, enable preserved thinking so the model retains reasoning from previous turns. Set"clear_thinking": false in chat_template_kwargs to keep reasoning content in context across turns.
Python
When using preserved thinking, all consecutive
reasoning blocks must exactly match the original sequence generated by the model. Don’t reorder or edit these blocks, otherwise performance may degrade and cache hit rates will be affected.Function calling and streaming tool calls
GLM-5.2 supports tool calling with reasoning interleaved between each step. Define tools in the standard OpenAI-compatible schema and pass them viatools.
To stream tool calls, set stream=True. The model emits tool call parameters incrementally, so concatenate the arguments fragments from each delta to rebuild the full call. Together does not use a separate tool_stream parameter.
Structured outputs
GLM-5.2 supports structured outputs. Pass a JSON schema throughresponse_format to constrain the response to a fixed shape.
Use cases
GLM-5.2 is strongest in scenarios that require deep reasoning and autonomous, multi-step execution:- Project-level codebase work: Hold an entire codebase in a single reasoning workflow, retaining module boundaries, API contracts, and architectural constraints.
- Long-horizon refactoring: Run cross-file, multi-step engineering tasks end to end, from planning through verification.
- Production-grade standards: Follow team engineering standards (lint rules, build commands, commit conventions) consistently across long sessions.
- Coding and debugging: Solve complex software engineering tasks, generate patches, and debug intricate issues across large codebases.
- Front-end and design: Build polished UIs from a prompt or mockup, producing clean component code and well-structured layouts with strong attention to visual detail.
- Tool orchestration: Chain multiple tool calls with reasoning between steps, making finer-grained decisions based on intermediate results.
- Research reproduction: Turn a paper’s architecture, loss functions, and data pipelines into a runnable project with consistency across files.
Usage tips
| Tip | Rationale |
|---|---|
| Temperature = 1.0, top_p = 0.95 | Recommended defaults. Adjust only one of the two, not both at once. |
| Think in goals, not steps | GLM-5.2 is agentic. Give high-level objectives and let it orchestrate sub-tasks and tool calls. |
Use reasoning_effort="max" for hard tasks | Deep reasoning for complex planning, architecture, and multi-step agentic problems. |
| Use preserved thinking for agents | Set "clear_thinking": false in chat_template_kwargs for coding agents to maintain reasoning continuity. |
| Return reasoning content faithfully | When using preserved thinking, return the unmodified reasoning from previous turns back to the API. |
| Use turn-level thinking to save cost | Disable thinking on simple turns and enable it on complex turns within the same session. |
| Set generous max tokens | GLM-5.2 supports up to 128K output tokens. Raise max_tokens for deep reasoning, and keep it lower for short agentic turns. |
| State constraints explicitly | For engineering tasks, spell out hard constraints (no new dependencies, no API changes, run the tests) so the model holds the line. |