February 2026
Serverless Model Bring Ups
The following models have been added:
- `zai-org/GLM-5`
Dedicated Container Inference Launch
Together AI has officially launched Dedicated Container Inference (DCI), formerly known as BYOC. DCI empowers users to containerize, deploy, and scale custom models on Together AI with ease.
Python SDK v2.0 General Availability
Together AI is releasing the Python SDK v2.0, a new, type-safe, OpenAPI-driven client designed to be faster, easier to maintain, and ready for everything we're building next.
- Install: `pip install together` or `uv add together`
- Migration Guide: A detailed Python SDK Migration Guide covers API-by-API changes, type updates, and troubleshooting tips
- Code and Docs: Access the Together Python v2 repo and reference docs with code examples
- Main Goal: Replace the legacy v1 Python SDK with a modern, strongly-typed, OpenAPI-generated client that matches the API surface more closely and stays in lock-step with new features
- **Net New:** All new features will be built in version 2 moving forward. This first version already includes beta APIs for our Instant Clusters! A short usage sketch follows below.
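A minimal sketch of the v2 client in action, using the `chat.completions` surface that stays drop-in compatible with v1; it assumes, as in v1, that the client reads `TOGETHER_API_KEY` from the environment, and the model ID is just an example:

```python
from together import Together

client = Together()  # assumes TOGETHER_API_KEY is set in the environment

# Typed, keyword-only call on the v2 chat completions surface.
response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```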
Model Deprecations
The following models have been deprecated and are no longer available:
- `togethercomputer/m2-bert-80M-32k-retrieval`
- `Salesforce/Llama-Rank-V1`
- `togethercomputer/Refuel-Llm-V2`
- `togethercomputer/Refuel-Llm-V2-Small`
- `Qwen/Qwen3-235B-A22B-fp8-tput`
- `qwen-qwen2-5-14b-instruct-lora`
- `meta-llama/Llama-4-Scout-17B-16E-Instruct`
- `Qwen/Qwen2.5-72B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo`
- `BAAI/bge-large-en-v1.5`
Serverless Model Bring Ups
The following models have been added:
- `Qwen/Qwen3-Coder-Next-FP8`
Model Deprecations
The following models have been deprecated and are no longer available:
- `deepseek-ai/DeepSeek-R1-0528-tput`
January 2026
Model Redirects
The following models are now being automatically redirected to their upgraded versions. See our Model Lifecycle Policy for details.
These are same-lineage upgrades with compatible behavior. If you need the original version, deploy it as a Dedicated Endpoint.
| Original Model | Redirects To |
|---|---|
| `mistralai/Mistral-7B-Instruct-v0.3` | `mistralai/Ministral-3-14B-Instruct-2512` |
| `zai-org/GLM-4.6` | `zai-org/GLM-4.7` |
Serverless Model Bring Ups
The following models have been added:
- `moonshotai/Kimi-K2.5`
Model Redirect
The following model is now being automatically redirected to its upgraded version. See our Model Lifecycle Policy for details.
This is a same-lineage upgrade with compatible behavior. If you need the original version, deploy it as a Dedicated Endpoint.
| Original Model | Redirects To |
|---|---|
| `DeepSeek-V3-0324` | `DeepSeek-V3.1` |
Prompt Caching Now Enabled by Default for Dedicated Endpoints
Prompt caching is now automatically enabled for all newly created Dedicated Endpoints. This change improves performance and reduces costs by default.
What's changing:
- The `disable_prompt_cache` field (API), `--no-prompt-cache` flag (CLI), and related SDK parameters are now deprecated.
- Prompt caching will always be enabled; the field is accepted but ignored after deprecation.
- Now: Field is deprecated; setting it has no effect (prompt caching is always on).
- February 2026: Field will be removed.
- `--no-prompt-cache` in CLI commands has no effect. You can remove it.
- `disable_prompt_cache` in API requests has no effect. You can remove it.
- SDK calls that set this parameter have no effect; you can remove it, as in the sketch below.
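As an illustration, here is a sketch of creating a Dedicated Endpoint without the deprecated parameter; the `endpoints.create` argument names and the hardware ID are assumptions for the example, not a reference signature:

```python
from together import Together

client = Together()

# Prompt caching is always on for new dedicated endpoints, so the old
# opt-out is simply omitted. Argument names and the hardware ID below
# are illustrative assumptions; check the endpoints reference.
endpoint = client.endpoints.create(
    model="zai-org/GLM-4.7",
    hardware="1x_nvidia_h100_80gb_sxm",
    min_replicas=1,
    max_replicas=1,
    # disable_prompt_cache=True,  # deprecated: accepted but ignored, removed Feb 2026
)
print(endpoint.id)
```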
Serverless Model Bring Ups
The following models have been added:
- `zai-org/GLM-4.7`
Model Deprecations
The following models have been deprecated and are no longer available:
- `Qwen/Qwen2.5-VL-72B-Instruct`
December 2025
Model Deprecations
The following models have been deprecated and are no longer available:
- `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`
- `meta-llama/Meta-Llama-3-70B-Instruct-Turbo`
- `black-forest-labs/FLUX.1-schnell-free`
- `meta-llama/Meta-Llama-Guard-3-8B`
Model Redirects Now Active
The following models are now being automatically redirected to their upgraded versions. See our Model Lifecycle Policy for details.
These are same-lineage upgrades with compatible behavior. If you need the original version, deploy it as a Dedicated Endpoint.
| Original Model | Redirects To |
|---|---|
| `Kimi-K2` | `Kimi-K2-0905` |
| `DeepSeek-V3` | `DeepSeek-V3-0324` |
| `DeepSeek-R1` | `DeepSeek-R1-0528` |
Python SDK v2.0 Release Candidate
Together AI is releasing the Python SDK v2.0 Release Candidate, a new, OpenAPI-generated, strongly-typed client that replaces the legacy v1.0 package and brings the SDK into lock-step with the latest platform features.
- Install: `pip install together==2.0.0a9`
- RC Period: The v2.0 RC window starts today and will run for approximately 1 month. During this time we'll iterate quickly based on developer feedback and may make a few small, well-documented breaking changes before GA.
- Type-Safe, Modern Client: Stronger typing across parameters and responses, keyword-only arguments, explicit `NOT_GIVEN` handling for optional fields, and rich `together.types.*` definitions for chat messages, eval parameters, and more.
- Redesigned Error Model: Replaces `TogetherException` with a new `TogetherError` hierarchy, including `APIStatusError` and specific HTTP status code errors such as `BadRequestError` (400), `AuthenticationError` (401), `RateLimitError` (429), and `InternalServerError` (5xx), plus transport (`APIConnectionError`, `APITimeoutError`) and validation (`APIResponseValidationError`) errors; see the example after this list.
- New Jobs API: Adds first-class support for the Jobs API (`client.jobs.*`) so you can create, list, and inspect asynchronous jobs directly from the SDK without custom HTTP wrappers.
- New Hardware API: Adds the Hardware API (`client.hardware.*`) to discover available hardware, filter by model compatibility, and compute effective hourly pricing from `cents_per_minute`.
- Raw Response & Streaming Helpers: New `.with_raw_response` and `.with_streaming_response` helpers make it easier to debug, inspect headers and status codes, and stream completions via context managers with automatic cleanup.
- Code Interpreter Sessions: Adds session management for the Code Interpreter (`client.code_interpreter.sessions.*`), enabling multi-step, stateful code-execution workflows that were not possible in the legacy SDK.
- High Compatibility for Core APIs: Most core usage patterns, including `chat.completions`, `completions`, `embeddings`, `images.generate`, audio transcription/translation/speech, `rerank`, `fine_tuning.create/list/retrieve/cancel`, and `models.list`, are designed to be drop-in compatible between v1 and v2.
- Targeted Breaking Changes: Some APIs (Files, Batches, Endpoints, Evals, Code Interpreter, select fine-tuning helpers) have updated method names, parameters, or response shapes; these are fully documented in the Python SDK Migration Guide and Breaking Changes notes.
- Migration Resources: A dedicated Python SDK Migration Guide is available with API-by-API before/after examples, a feature parity matrix, and troubleshooting tips to help teams smoothly transition from v1 to v2 during the RC period.
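For example, a short sketch of the new error hierarchy in use; it assumes the error classes are exported from the top-level `together` package and that `APIStatusError` carries a `status_code` attribute:

```python
import together
from together import Together

client = Together()

try:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3.1",
        messages=[{"role": "user", "content": "ping"}],
    )
    print(response.choices[0].message.content)
except together.RateLimitError:
    print("429: back off and retry later")
except together.APIStatusError as err:
    # Any other non-2xx status (BadRequestError, AuthenticationError, ...)
    print(f"API returned status {err.status_code}")
except together.APIConnectionError:
    # Transport-level failure, grouped with APITimeoutError in the docs
    print("Could not reach the API")
```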
Serverless Model Bring Ups
The following models have been added:
- `mistralai/Ministral-3-14B-Instruct-2512`
November 2025
Serverless Model Bring Ups
The following models have been added:
- `zai-org/GLM-4.6`
- `moonshotai/Kimi-K2-Thinking`
Enhanced Audio Capabilities: Real-time Text-to-Speech and Speech-to-Text
Together AI expands audio capabilities with real-time streaming for both TTS and STT, new models, and speaker diarization.
- Real-time Text-to-Speech: WebSocket API for lowest-latency interactive applications
- New TTS Models: Orpheus 3B (`canopylabs/orpheus-3b-0.1-ft`) and Kokoro 82M (`hexgrad/Kokoro-82M`) supporting REST, streaming, and WebSocket endpoints (a TTS sketch follows after this list)
- Real-time Speech-to-Text: WebSocket streaming transcription with Whisper for live audio applications
- Voxtral Model: New Mistral AI speech recognition model (`mistralai/Voxtral-Mini-3B-2507`) for audio transcriptions
- Speaker Diarization: Identify and label different speakers in audio transcriptions with a free `diarize` flag
- TTS WebSocket endpoint: `/v1/audio/speech/websocket`
- STT WebSocket endpoint: `/v1/realtime`
- Check out the Text-to-Speech guide and Speech-to-Text guide
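As a quick illustration, a REST text-to-speech sketch with Kokoro 82M; the `audio.speech.create` call mirrors the SDK's existing audio surface, while the voice name and the `stream_to_file` helper are assumptions for the example:

```python
from together import Together

client = Together()

# One-shot REST synthesis; the WebSocket endpoint (/v1/audio/speech/websocket)
# is the lower-latency option for interactive use.
speech = client.audio.speech.create(
    model="hexgrad/Kokoro-82M",
    input="Together AI now supports real-time text-to-speech.",
    voice="af_heart",  # assumption: an example Kokoro voice
)
speech.stream_to_file("hello.mp3")  # assumption: v1-style response helper
```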
October 2025
Model Deprecations
The following image models have been deprecated and are no longer available:
- `black-forest-labs/FLUX.1-pro` (calls to FLUX.1-pro will now redirect to FLUX.1.1-pro)
- `black-forest-labs/FLUX.1-Canny-pro`
Video Generation API & 40+ New Image and Video Models
Together AI expands into multimedia generation with comprehensive video and image capabilities. Read more
- New Video Generation API: Create high-quality videos with models like OpenAI Sora 2, Google Veo 3.0, and Minimax Hailuo
- 40+ Image & Video Models: Including Google Imagen 4.0 Ultra, Gemini Flash Image 2.5 (Nano Banana), ByteDance SeeDream, and specialized editing tools
- Unified Platform: Combine text, image, and video generation through the same APIs, authentication, and billing
- Production-Ready: Serverless endpoints with transparent per-model pricing and enterprise-grade infrastructure
- Video endpoints: `/videos/create` and `/videos/retrieve` (see the sketch after this list)
- Image endpoint: `/images/generations`
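A rough sketch of the submit-then-poll flow against these endpoints over plain HTTP; only the endpoint paths come from the list above, while the payload and response field names (`id`, `status`) are assumptions:

```python
import os
import time

import requests

BASE = "https://api.together.xyz/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

# Submit an async video generation job. Payload fields are illustrative;
# substitute a real video model ID from the models page.
job = requests.post(
    f"{BASE}/videos/create",
    headers=HEADERS,
    json={"model": "<video-model-id>", "prompt": "A timelapse of a city skyline at dusk"},
).json()

# Poll until the job reaches a terminal state. Field names assumed.
while True:
    status = requests.get(
        f"{BASE}/videos/retrieve", headers=HEADERS, params={"id": job["id"]}
    ).json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)
print(status)
```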
September 2025
Improved Batch Inference API: Enhanced UI, Expanded Model Support, and Rate Limit Increase
What's New
- Streamlined UI: Create and track batch jobs in an intuitive interface — no complex API calls required.
- Universal Model Access: The Batch Inference API now supports all serverless models and private deployments, so you can run batch workloads on exactly the models you need.
- Massive Scale Jump: Rate limits are up from 10M to 30B enqueued tokens per model per user, a 3000× increase. Need more? We’ll work with you to customize.
- Lower Cost: For most serverless models, the Batch Inference API runs at 50% the cost of our real-time API, making it the most economical way to process high-throughput workloads.
Qwen3-Next-80B Models Release
New Qwen3-Next-80B models now available for both thinking and instruction tasks.
- Model ID: `Qwen/Qwen3-Next-80B-A3B-Thinking`
- Model ID: `Qwen/Qwen3-Next-80B-A3B-Instruct`
Fine-Tuning Platform Upgrades
Enhanced fine-tuning capabilities with expanded model support and increased context lengths. Read more
Enabled fine-tuning for new large models:
- `openai/gpt-oss-120b`
- `deepseek-ai/DeepSeek-V3.1`
- `deepseek-ai/DeepSeek-V3.1-Base`
- `deepseek-ai/DeepSeek-R1-0528`
- `deepseek-ai/DeepSeek-R1`
- `deepseek-ai/DeepSeek-V3-0324`
- `deepseek-ai/DeepSeek-V3`
- `deepseek-ai/DeepSeek-V3-Base`
- `Qwen/Qwen3-Coder-480B-A35B-Instruct`
- `Qwen/Qwen3-235B-A22B` (context length 32,768 for SFT and 16,384 for DPO)
- `Qwen/Qwen3-235B-A22B-Instruct-2507` (context length 32,768 for SFT and 16,384 for DPO)
- `meta-llama/Llama-4-Maverick-17B-128E`
- `meta-llama/Llama-4-Maverick-17B-128E-Instruct`
- `meta-llama/Llama-4-Scout-17B-16E`
- `meta-llama/Llama-4-Scout-17B-16E-Instruct`
Increased maximum supported context length (per model and variant):
DeepSeek Models
- DeepSeek-R1-Distill-Llama-70B: SFT: 8,192 → 24,576, DPO: 8,192 → 8,192
- DeepSeek-R1-Distill-Qwen-14B: SFT: 8,192 → 65,536, DPO: 8,192 → 12,288
- DeepSeek-R1-Distill-Qwen-1.5B: SFT: 8,192 → 131,072, DPO: 8,192 → 16,384
Gemma Models
- gemma-3-1b-it: SFT: 16,384 → 32,768, DPO: 16,384 → 12,288
- gemma-3-1b-pt: SFT: 16,384 → 32,768, DPO: 16,384 → 12,288
- gemma-3-4b-it: SFT: 16,384 → 131,072, DPO: 16,384 → 12,288
- gemma-3-4b-pt: SFT: 16,384 → 131,072, DPO: 16,384 → 12,288
- gemma-3-12b-pt: SFT: 16,384 → 65,536, DPO: 16,384 → 8,192
- gemma-3-27b-it: SFT: 12,288 → 49,152, DPO: 12,288 → 8,192
- gemma-3-27b-pt: SFT: 12,288 → 49,152, DPO: 12,288 → 8,192
Qwen Models
- Qwen3-0.6B / Qwen3-0.6B-Base: SFT: 8,192 → 32,768, DPO: 8,192 → 24,576
- Qwen3-1.7B / Qwen3-1.7B-Base: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen3-4B / Qwen3-4B-Base: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen3-8B / Qwen3-8B-Base: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen3-14B / Qwen3-14B-Base: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen3-32B: SFT: 8,192 → 24,576, DPO: 8,192 → 4,096
- Qwen2.5-72B-Instruct: SFT: 8,192 → 24,576, DPO: 8,192 → 8,192
- Qwen2.5-32B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 12,288
- Qwen2.5-32B: SFT: 8,192 → 49,152, DPO: 8,192 → 12,288
- Qwen2.5-14B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2.5-14B: SFT: 8,192 → 65,536, DPO: 8,192 → 16,384
- Qwen2.5-7B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2.5-7B: SFT: 8,192 → 131,072, DPO: 8,192 → 16,384
- Qwen2.5-3B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2.5-3B: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2.5-1.5B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2.5-1.5B: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2-72B-Instruct / Qwen2-72B: SFT: 8,192 → 32,768, DPO: 8,192 → 8,192
- Qwen2-7B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2-7B: SFT: 8,192 → 131,072, DPO: 8,192 → 16,384
- Qwen2-1.5B-Instruct: SFT: 8,192 → 32,768, DPO: 8,192 → 16,384
- Qwen2-1.5B: SFT: 8,192 → 131,072, DPO: 8,192 → 16,384
Llama Models
- Llama-3.3-70B-Instruct-Reference: SFT: 8,192 → 24,576, DPO: 8,192 → 8,192
- Llama-3.2-3B-Instruct: SFT: 8,192 → 131,072, DPO: 8,192 → 24,576
- Llama-3.2-1B-Instruct: SFT: 8,192 → 131,072, DPO: 8,192 → 24,576
- Meta-Llama-3.1-8B-Instruct-Reference: SFT: 8,192 → 131,072, DPO: 8,192 → 16,384
- Meta-Llama-3.1-8B-Reference: SFT: 8,192 → 131,072, DPO: 8,192 → 16,384
- Meta-Llama-3.1-70B-Instruct-Reference: SFT: 8,192 → 24,576, DPO: 8,192 → 8,192
- Meta-Llama-3.1-70B-Reference: SFT: 8,192 → 24,576, DPO: 8,192 → 8,192
Mistral Models
- mistralai/Mistral-7B-v0.1: SFT: 8,192 → 32,768, DPO: 8,192 → 32,768
- teknium/OpenHermes-2p5-Mistral-7B: SFT: 8,192 → 32,768, DPO: 8,192 → 32,768
Enhanced Hugging Face integrations:
- Fine-tune any < 100B parameter CausalLM from Hugging Face Hub (a launch sketch follows after this list)
- Support for DPO variants such as LN-DPO, DPO+NLL, and SimPO
- Support for fine-tuning with the maximum batch size
- Public `fine-tunes/models/limits` and `fine-tunes/models/supported` endpoints
- Automatic filtering of sequences with no trainable tokens (e.g., if a sequence prompt is longer than the model's context length, the completion is pushed outside the window)
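As a sketch, launching a fine-tuning job from the SDK; `fine_tuning.create` is part of the core surface, while the exact argument names (`training_file`, `n_epochs`) follow the v1-style API and should be checked against the current reference:

```python
from together import Together

client = Together()

# Upload JSONL training data, then launch a fine-tuning job.
# Argument names follow the v1-style API and are assumptions here.
train_file = client.files.upload(file="training_data.jsonl")

job = client.fine_tuning.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    training_file=train_file.id,
    n_epochs=1,
)
print(job.id, job.status)
```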
Together Instant Clusters General Availability
Self-service NVIDIA GPU clusters with API-first provisioning. Read more
- New API endpoints for cluster management (see the sketch after this list):
  - `/v1/gpu_cluster`: Create and manage GPU clusters
  - `/v1/shared_volume`: High-performance shared storage
  - `/v1/regions`: Available data center locations
- Support for NVIDIA Blackwell (HGX B200) and Hopper (H100, H200) GPUs
- Scale from single-node (8 GPUs) to hundreds of interconnected GPUs
- Pre-configured with Kubernetes, Slurm, and networking components
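For example, a minimal sketch listing available regions over plain HTTP; the endpoint path comes from the list above, and nothing is assumed about the response beyond it being JSON:

```python
import os

import requests

# Query the regions endpoint to see where clusters can be provisioned.
resp = requests.get(
    "https://api.together.xyz/v1/regions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
)
resp.raise_for_status()
print(resp.json())  # inspect the schema before building on it
```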
Serverless LoRA and Dedicated Endpoints Support for Evaluations
You can now run evaluations:
- Using Serverless LoRA models, including supported LoRA fine-tuned models
- Using Dedicated Endpoints, including fine-tuned models deployed via dedicated endpoints
Kimi-K2-Instruct-0905 Model Release
Upgraded version of Moonshot's 1 trillion parameter MoE model with enhanced performance. Read more
- Model ID: `moonshotai/Kimi-K2-Instruct-0905`
August 2025
DeepSeek-V3.1 Model Release
Upgraded version of DeepSeek-R1-0528 and DeepSeek-V3-0324. Read more
- Dual Modes: Fast mode for quick responses, thinking mode for complex reasoning
- 671B total parameters with 37B active parameters
- Model ID: `deepseek-ai/DeepSeek-V3.1`
Model Deprecations
The following models have been deprecated and are no longer available:
- `meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo`
- `black-forest-labs/FLUX.1-canny`
- `meta-llama/Llama-3-8b-chat-hf`
- `black-forest-labs/FLUX.1-redux`
- `black-forest-labs/FLUX.1-depth`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
- `NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO`
- `meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo`
- `meta-llama-llama-3-3-70b-instruct-lora`
- `Qwen/Qwen2.5-14B`
- `meta-llama/Llama-Vision-Free`
- `Qwen/Qwen2-72B-Instruct`
- `google/gemma-2-27b-it`
- `meta-llama/Meta-Llama-3-8B-Instruct`
- `perplexity-ai/r1-1776`
- `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF`
- `Qwen/Qwen2-VL-72B-Instruct`
GPT-OSS Models Fine-Tuning Support
Fine-tune OpenAI's open-source models to create domain-specific variants. Read more
- Supported models: `gpt-oss-20B` and `gpt-oss-120B`
- Supports 16K-context SFT and 8K-context DPO
OpenAI GPT-OSS Models Now Available
OpenAI's first open-weight models now accessible through Together AI. Read more
- Model IDs: `openai/gpt-oss-20b`, `openai/gpt-oss-120b`
July 2025
VirtueGuard Model Release
Enterprise-grade guard model for safety monitoring with 8ms response time. Read more
- Real-time content filtering and bias detection
- Prompt injection protection
- Model ID: `VirtueAI/VirtueGuard-Text-Lite`
Together Evaluations Framework
Benchmarking platform using LLM-as-a-judge methodology for model performance assessment. Read more
- Create custom LLM-as-a-Judge evaluation suites for your domain
- Supports `compare`, `classify`, and `score` functionality
- Enables comparing models, prompts, and LLM configs, and scoring and classifying LLM outputs
Qwen3-Coder-480B Model Release
Agentic coding model with top SWE-Bench Verified performance. Read more
- 480B total parameters with 35B active (MoE architecture)
- 256K context length for entire codebase handling
- Leading SWE-Bench scores on software engineering benchmarks
- Model ID: `Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8`
NVIDIA HGX B200 Hardware Support
Record-breaking serverless inference speed for DeepSeek-R1-0528 using NVIDIA's Blackwell architecture. Read more
- Dramatically improved throughput and lower latency
- Same API endpoints and pricing
- Model ID: `deepseek-ai/DeepSeek-R1`
Kimi-K2-Instruct Model Launch
Moonshot AI's 1 trillion parameter MoE model with frontier-level performance. Read more
- Excels at tool use and multi-step tasks, with strong multilingual support
- Great agentic and function calling capabilities
- Model ID: `moonshotai/Kimi-K2-Instruct`
Whisper Speech-to-Text APIs
High-performance audio transcription that's 15× faster than OpenAI with support for files over 1 GB. Read more
- Multiple audio formats with timestamp generation
- Speaker diarization and language detection
- Use the `/audio/transcriptions` and `/audio/translations` endpoints (a transcription sketch follows below)
- Model ID: `openai/whisper-large-v3`
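To close, a minimal transcription sketch against this API; the `audio.transcriptions.create` call mirrors the SDK's audio surface, and the `language` hint is an assumption for the example:

```python
from together import Together

client = Together()

# Transcribe a local audio file with Whisper large-v3.
with open("interview.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
        language="en",  # assumption: optional language hint
    )
print(transcript.text)
```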