Serverless models
Chat models
In the table below, models marked "Turbo" are quantized to FP8 and models marked "Lite" are quantized to INT4. All other models run at full precision (FP16).
If you're not sure which chat model to use, we currently recommend Llama 3.3 70B Turbo (meta-llama/Llama-3.3-70B-Instruct-Turbo) to get started.
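For example, here's a minimal chat completion sketch, assuming the official Python SDK (pip install together) and a TOGETHER_API_KEY environment variable; the prompt text is illustrative:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Give me three fun facts about space."}],  # illustrative prompt
)
print(response.choices[0].message.content)
```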
Organization | Model Name | API Model String | Context length | Quantization |
---|---|---|---|---|
Qwen | Qwen3 235B A22B Throughput | Qwen/Qwen3-235B-A22B-fp8-tput | 128000 | FP8 |
Meta | Llama 4 Maverick (17Bx128E) | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 1048576 | FP8 |
Meta | Llama 4 Scout (17Bx16E) | meta-llama/Llama-4-Scout-17B-16E-Instruct | 1048576 | FP16 |
DeepSeek | DeepSeek-R1 | deepseek-ai/DeepSeek-R1 | 128000 | FP8 |
DeepSeek | DeepSeek-V3-0324 | deepseek-ai/DeepSeek-V3 | 16384 | FP8 |
DeepSeek | DeepSeek R1 Distill Llama 70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 131072 | FP16 |
DeepSeek | DeepSeek R1 Distill Qwen 1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 131072 | FP16 |
DeepSeek | DeepSeek R1 Distill Qwen 14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 131072 | FP16 |
mistralai | Mistral Small 3 Instruct (24B) | mistralai/Mistral-Small-24B-Instruct-2501 | 32768 | FP16 |
Meta | Llama 3.1 8B Instruct Turbo | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | 131072 | FP8 |
Meta | Llama 3.3 70B Instruct Turbo | meta-llama/Llama-3.3-70B-Instruct-Turbo | 131072 | FP8 |
Nvidia | Llama 3.1 Nemotron 70B | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 32768 | FP16 |
Qwen | Qwen 2.5 7B Instruct Turbo | Qwen/Qwen2.5-7B-Instruct-Turbo | 32768 | FP8 |
Qwen | Qwen 2.5 72B Instruct Turbo | Qwen/Qwen2.5-72B-Instruct-Turbo | 32768 | FP8 |
Qwen | Qwen2.5 Vision Language 72B Instruct | Qwen/Qwen2.5-VL-72B-Instruct | 32768 | FP8 |
Qwen | Qwen 2.5 Coder 32B Instruct | Qwen/Qwen2.5-Coder-32B-Instruct | 32768 | FP16 |
Qwen | QwQ-32B | Qwen/QwQ-32B | 32768 | FP16 |
Qwen | Qwen 2 Instruct (72B) | Qwen/Qwen2-72B-Instruct | 32768 | FP16 |
Qwen | Qwen2 VL 72B Instruct | Qwen/Qwen2-VL-72B-Instruct | 32768 | FP16 |
Meta | Llama 3.1 405B Instruct Turbo | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 130815 | FP8 |
Meta | Llama 3 8B Instruct Turbo | meta-llama/Meta-Llama-3-8B-Instruct-Turbo | 8192 | FP8 |
Meta | Llama 3.2 3B Instruct Turbo | meta-llama/Llama-3.2-3B-Instruct-Turbo | 131072 | FP16 |
Meta | Llama 3 8B Instruct Lite | meta-llama/Meta-Llama-3-8B-Instruct-Lite | 8192 | INT4 |
Meta | Llama 3 8B Instruct Reference | meta-llama/Llama-3-8b-chat-hf | 8192 | FP16 |
Meta | Llama 3 70B Instruct Reference | meta-llama/Llama-3-70b-chat-hf | 8192 | FP16 |
Google | Gemma 2 27B | google/gemma-2-27b-it | 8192 | FP16 |
Google | Gemma 2 9B | google/gemma-2-9b-it* | 8192 | FP16 |
Google | Gemma Instruct (2B) | google/gemma-2b-it* | 8192 | FP16 |
Gryphe | MythoMax-L2 (13B) | Gryphe/MythoMax-L2-13b* | 4096 | FP16 |
mistralai | Mistral (7B) Instruct | mistralai/Mistral-7B-Instruct-v0.1 | 8192 | FP16 |
mistralai | Mistral (7B) Instruct v0.2 | mistralai/Mistral-7B-Instruct-v0.2 | 32768 | FP16 |
mistralai | Mistral (7B) Instruct v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | 32768 | FP16 |
mistralai | Mixtral-8x7B Instruct (46.7B) | mistralai/Mixtral-8x7B-Instruct-v0.1* | 32768 | FP16 |
mistralai | Mixtral-8x22B Instruct (141B) | mistralai/Mixtral-8x22B-Instruct-v0.1* | 65536 | FP16 |
NousResearch | Nous Hermes 2 - Mixtral 8x7B-DPO (46.7B) | NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | 32768 | FP16 |
*Deprecated model, see Deprecations for more details
Chat Model Examples
- PDF to Chat App - Chat with your PDFs (blogs, textbooks, papers)
- Open Deep Research Notebook - Generate long form reports using a single prompt
- RAG with Reasoning Models Notebook - RAG with DeepSeek-R1
- Fine-tuning Chat Models Notebook - Tune Language models for conversation
- Building Agents - Agent workflows with language models
Image models
Use our Images endpoint for Image Models.
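As a rough sketch, assuming the Python SDK and a TOGETHER_API_KEY environment variable (the prompt is illustrative, and we assume the response carries a base64-encoded image):

```python
import base64

from together import Together

client = Together()

response = client.images.generate(
    model="black-forest-labs/FLUX.1-schnell",
    prompt="an astronaut riding a horse, watercolor style",  # illustrative prompt
    steps=4,  # the default for FLUX.1 [schnell] (Turbo), per the table below
    n=1,
)

# Assumption: the first result is returned as base64 in data[0].b64_json.
with open("image.png", "wb") as f:
    f.write(base64.b64decode(response.data[0].b64_json))
```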
Organization | Model Name | Model String for API | Default steps |
---|---|---|---|
Black Forest Labs | Flux.1 [schnell] (free)* | black-forest-labs/FLUX.1-schnell-Free | N/A |
Black Forest Labs | Flux.1 [schnell] (Turbo) | black-forest-labs/FLUX.1-schnell | 4 |
Black Forest Labs | Flux.1 Dev | black-forest-labs/FLUX.1-dev | 28 |
Black Forest Labs | Flux.1 Canny | black-forest-labs/FLUX.1-canny | 28 |
Black Forest Labs | Flux.1 Depth | black-forest-labs/FLUX.1-depth | 28 |
Black Forest Labs | Flux.1 Redux | black-forest-labs/FLUX.1-redux | 28 |
Black Forest Labs | Flux1.1 [pro] | black-forest-labs/FLUX.1.1-pro | - |
Black Forest Labs | Flux.1 [pro] | black-forest-labs/FLUX.1-pro | - |
Note: Due to high demand, FLUX.1 [schnell] Free has a model-specific rate limit of 10 images/min. Flux Pro 1 and Flux Pro 1.1 are limited to users at Build Tier 2 and above. Flux models can only be used with credits; users cannot call Flux with a zero or negative balance.
*The free model has reduced rate limits and performance compared to our paid Turbo endpoint for Flux Schnell, black-forest-labs/FLUX.1-schnell
Image Model Examples
- Blinkshot.io - A realtime AI image playground built with Flux Schnell
- Logo Creator - A logo generator that creates professional logos in seconds using Flux Pro 1.1
- PicMenu - A menu visualizer that takes a restaurant menu and generates appealing images for each dish
- Flux LoRA Inference Notebook - Using LoRA fine-tuned image generation models
How FLUX pricing works
For FLUX models (except the Pro versions), pricing is based on the size of the generated images (in megapixels) and the number of steps used, when that number exceeds the default.
- Default pricing: The listed per-megapixel prices apply at the default number of steps.
- Using more or fewer steps: Costs are adjusted only when you go above the default number of steps. If you use more steps, the cost increases proportionally per the formula below. If you use fewer steps, the cost does not decrease; you are billed at the default rate.
Here’s a formula to calculate cost:
Cost = MP × Price per MP × (Steps ÷ Default Steps)
Where:
- MP = (Width × Height ÷ 1,000,000)
- Price per MP = Cost for generating one megapixel at the default steps
- Steps = The number of steps used for the image generation. This is only factored in if going above default steps.
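Putting the pieces together, here's a small sketch of that arithmetic (the price per MP below is a placeholder, not an actual rate):

```python
def flux_cost(width: int, height: int, price_per_mp: float,
              steps: int, default_steps: int) -> float:
    """Estimated cost; steps below the default are still billed at the default rate."""
    mp = (width * height) / 1_000_000
    step_multiplier = max(steps, default_steps) / default_steps
    return mp * price_per_mp * step_multiplier

# A 1024x1024 FLUX.1 Dev image (default 28 steps) rendered with 56 steps:
# 1.048576 MP at twice the default-step rate.
print(flux_cost(1024, 1024, price_per_mp=0.025, steps=56, default_steps=28))  # placeholder rate
```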
Vision models
If you're not sure which vision model to use, we currently recommend Llama 3.2 11B Turbo (meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo) to get started. For model-specific rate limits, navigate here.
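A minimal sketch, assuming the Python SDK; the image URL is a placeholder:

```python
from together import Together

client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # Placeholder URL; substitute a real, publicly reachable image.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```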
Organization | Model Name | API Model String | Context length |
---|---|---|---|
Meta | Llama 4 Maverick (17Bx128E) | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 524288 |
Meta | Llama 4 Scout (17Bx16E) | meta-llama/Llama-4-Scout-17B-16E-Instruct | 327680 |
Meta | (Free) Llama 3.2 11B Vision Instruct Turbo* | meta-llama/Llama-Vision-Free | 131072 |
Meta | Llama 3.2 11B Vision Instruct Turbo | meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo | 131072 |
Meta | Llama 3.2 90B Vision Instruct Turbo | meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo | 131072 |
Qwen | Qwen2 Vision Language 72B Instruct | Qwen/Qwen2-VL-72B-Instruct | 32768 |
Qwen | Qwen2.5 Vision Language 72B Instruct | Qwen/Qwen2.5-VL-72B-Instruct | 32768 |
*The free model has reduced rate limits compared to the paid version of Llama 3.2 Vision 11B, meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo
Vision Model Examples
- LlamaOCR - A tool that takes documents (like receipts) and outputs markdown
- Wireframe to Code - A wireframe-to-app tool that takes in a UI mockup of a site and gives you React code.
- Extracting Structured Data from Images - Extract information from images as JSON
Audio models
Use our Audio endpoint for audio models.
Organization | Model Name | Model String for API |
---|---|---|
Cartesia | Cartesia Sonic 2 | Cartesia/Sonic-2 |
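A minimal text-to-speech sketch, assuming the Python SDK's audio speech endpoint; the voice name is illustrative and should be swapped for one of the available voices:

```python
from together import Together

client = Together()

response = client.audio.speech.create(
    model="Cartesia/Sonic-2",
    input="Today is a wonderful day to build something people love!",  # illustrative text
    voice="laidback woman",  # assumed voice name; pick one from the available voices
)
response.stream_to_file("speech.mp3")  # write the returned audio to disk
```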
Audio Model Examples
- PDF to Podcast Notebook - Generate a NotebookLM style podcast given a PDF
- Audio Podcast Agent Workflow - Agent workflow to generate audio files given input content
Code models
Use our Completions endpoint for Code Models.
Organization | Model Name | Model String for API | Context length |
---|---|---|---|
Qwen | Qwen 2.5 Coder 32B Instruct | Qwen/Qwen2.5-Coder-32B-Instruct | 32768 |
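A minimal completion sketch, assuming the Python SDK; note that the Completions endpoint continues raw prompt text rather than applying a chat template:

```python
from together import Together

client = Together()

# The model continues the prompt text directly.
response = client.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    prompt="# A Python function that checks whether a number is prime\ndef is_prime(n):",
    max_tokens=128,
    stop=["\n\n"],  # illustrative stop sequence
)
print(response.choices[0].text)
```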
Code Model Examples
- LlamaCoder - An open source app to generate small apps with one prompt. Powered by Llama 3.1 405B.
- Code Generation Agent - An agent workflow to generate and iteratively improve code.
Embedding models
Model Name | Model String for API | Model Size | Embedding Dimension | Context Window |
---|---|---|---|---|
M2-BERT-80M-2K-Retrieval | togethercomputer/m2-bert-80M-2k-retrieval* | 80M | 768 | 2048 |
M2-BERT-80M-8K-Retrieval | togethercomputer/m2-bert-80M-8k-retrieval* | 80M | 768 | 8192 |
M2-BERT-80M-32K-Retrieval | togethercomputer/m2-bert-80M-32k-retrieval | 80M | 768 | 32768 |
UAE-Large-v1 | WhereIsAI/UAE-Large-V1* | 326M | 1024 | 512 |
BGE-Large-EN-v1.5 | BAAI/bge-large-en-v1.5 | 326M | 1024 | 512 |
BGE-Base-EN-v1.5 | BAAI/bge-base-en-v1.5 | 102M | 768 | 512 |
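A minimal embedding sketch, assuming the Python SDK; the input text is illustrative:

```python
from together import Together

client = Together()

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Our solar system orbits the Milky Way galaxy.",  # illustrative text
)
vector = response.data[0].embedding
print(len(vector))  # 1024, matching the embedding dimension in the table above
```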
Embedding Model Examples
- Contextual RAG - An open source implementation of contextual RAG by Anthropic
- Code Generation Agent - An agent workflow to generate and iteratively improve code
- Multimodal Search and Image Generation - Search for images and generate more similar ones
- Visualizing Embeddings - Visualizing and clustering vector embeddings
Rerank models
Our Rerank API has built-in support for the following models, which we host via our serverless endpoints.
Organization | Model Name | Model Size | Model String for API | Max Doc Size (tokens) | Max Docs |
---|---|---|---|---|---|
Salesforce | LlamaRank | 8B | Salesforce/Llama-Rank-v1 | 8192 | 1024 |
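A minimal rerank sketch, assuming the Python SDK's rerank endpoint; the query and documents are illustrative:

```python
from together import Together

client = Together()

response = client.rerank.create(
    model="Salesforce/Llama-Rank-v1",
    query="What animals live in the Amazon rainforest?",  # illustrative query
    documents=[
        "The Amazon is home to jaguars, sloths, and river dolphins.",
        "The Sahara desert reaches temperatures above 50 degrees Celsius.",
        "Capybaras are the largest living rodents.",
    ],
    top_n=2,
)
# Each result carries the index of the original document and its relevance score.
for result in response.results:
    print(result.index, result.relevance_score)
```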
Rerank Model Examples
- Search and Reranking - Simple semantic search pipeline improved using a reranker
- Implementing Hybrid Search Notebook - Implementing semantic + lexical search along with reranking
Language models
Use our Completions endpoint for Language Models.
Organization | Model Name | Model String for API | Context length |
---|---|---|---|
Meta | LLaMA-2 (70B) | meta-llama/Llama-2-70b-hf | 4096 |
mistralai | Mixtral-8x7B (46.7B) | mistralai/Mixtral-8x7B-v0.1 | 32768 |
Moderation models
Use our Completions endpoint to run a moderation model as a standalone classifier, or use it alongside any of the other models above as a filter to safeguard responses from 100+ models by specifying the parameter "safety_model": "MODEL_API_STRING" (see the sketch after the table below).
Organization | Model Name | Model String for API | Context length |
---|---|---|---|
Meta | Llama Guard (8B) | meta-llama/Meta-Llama-Guard-3-8B | 8192 |
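A minimal sketch of the filter usage, assuming the Python SDK forwards the safety_model parameter alongside a chat completion; the prompt is illustrative:

```python
from together import Together

client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Tell me about your safety policies."}],  # illustrative prompt
    safety_model="meta-llama/Meta-Llama-Guard-3-8B",  # Llama Guard screens the response
)
print(response.choices[0].message.content)
```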