Serverless models

Chat models

In the table below, models marked as "Turbo" are quantized to FP8 and those marked as "Lite" are INT4. All our other models are at full precision (FP16).

If you're not sure which chat model to use, we currently recommend Llama 3.3 70B Turbo (meta-llama/Llama-3.3-70B-Instruct-Turbo) to get started.

Organization	Model Name	API Model String	Context length	Quantization
Qwen	Qwen3 235B A22B Throughput	Qwen/Qwen3-235B-A22B-fp8-tput	40960	FP8
Meta	Llama 4 Maverick (17Bx128E)	meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8	1048576	FP8
Meta	Llama 4 Scout (17Bx16E)	meta-llama/Llama-4-Scout-17B-16E-Instruct	1048576	FP16
DeepSeek	DeepSeek-R1-0528	deepseek-ai/DeepSeek-R1	163839	FP8
Perplexity AI	Perplexity AI R1-1776	perplexity-ai/r1-1776	163840	FP16
Mistral AI	Magistral Small 2506 API	mistralai/Magistral-Small-2506	40960	BF16
DeepSeek	DeepSeek-R1-0528 Throughput	deepseek-ai/DeepSeek-R1-0528-tput	163839	FP8
DeepSeek	DeepSeek-V3-0324	deepseek-ai/DeepSeek-V3	163839	FP8
DeepSeek	DeepSeek R1 Distill Llama 70B	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	131072	FP16
DeepSeek	DeepSeek R1 Distill Qwen 1.5B	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B*	131072	FP16
DeepSeek	DeepSeek R1 Distill Qwen 14B	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	131072	FP16
Marin Community	Marin 8B Instruct	marin-community/marin-8b-instruct	4096	FP16
Mistral AI	Mistral Small 3 Instruct (24B)	mistralai/Mistral-Small-24B-Instruct-2501	32768	FP16
Meta	Llama 3.1 8B Instruct Turbo	meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo	131072	FP8
Meta	Llama 3.3 70B Instruct Turbo	meta-llama/Llama-3.3-70B-Instruct-Turbo	131072	FP8
Meta	Llama 3.3 70B Instruct Turbo (Free)*	meta-llama/Llama-3.3-70B-Instruct-Turbo-Free	131072	FP8
Nvidia	Llama 3.1 Nemotron 70B	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	32768	FP16
Qwen	Qwen 2.5 7B Instruct Turbo	Qwen/Qwen2.5-7B-Instruct-Turbo	32768	FP8
Qwen	Qwen 2.5 72B Instruct Turbo	Qwen/Qwen2.5-72B-Instruct-Turbo	32768	FP8
Qwen	Qwen2.5 Vision Language 72B Instruct	Qwen/Qwen2.5-VL-72B-Instruct	32768	FP8
Qwen	Qwen 2.5 Coder 32B Instruct	Qwen/Qwen2.5-Coder-32B-Instruct	32768	FP16
Qwen	QwQ-32B	Qwen/QwQ-32B	32768	FP16
Qwen	Qwen 2 Instruct (72B)	Qwen/Qwen2-72B-Instruct	32768	FP16
Qwen	Qwen2 VL 72B Instruct	Qwen/Qwen2-VL-72B-Instruct	32768	FP16
Arcee	Arcee AI Virtuoso Medium	arcee-ai/virtuoso-medium-v2	128000	-
Arcee	Arcee AI Coder-Large	arcee-ai/coder-large	32768	-
Arcee	Arcee AI Virtuoso-Large	arcee-ai/virtuoso-large	128000	-
Arcee	Arcee AI Maestro	arcee-ai/maestro-reasoning	128000	-
Arcee	Arcee AI Caller	arcee-ai/caller	32768	-
Arcee	Arcee AI Blitz	arcee-ai/arcee-blitz	32768	-
Meta	Llama 3.1 405B Instruct Turbo	meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo	130815	FP8
Meta	Llama 3.2 3B Instruct Turbo	meta-llama/Llama-3.2-3B-Instruct-Turbo	131072	FP16
Meta	Llama 3 8B Instruct Lite	meta-llama/Meta-Llama-3-8B-Instruct-Lite	8192	INT4
Meta	Llama 3 8B Instruct Reference	meta-llama/Llama-3-8b-chat-hf*	8192	FP16
Meta	Llama 3 70B Instruct Reference	meta-llama/Llama-3-70b-chat-hf	8192	FP16
Google	Gemma 2 27B	google/gemma-2-27b-it	8192	FP16
Google	Gemma Instruct (2B)	google/gemma-2b-it*	8192	FP16
Gryphe	MythoMax-L2 (13B)	Gryphe/MythoMax-L2-13b*	4096	FP16
Mistral AI	Mistral (7B) Instruct	mistralai/Mistral-7B-Instruct-v0.1	8192	FP16
Mistral AI	Mistral (7B) Instruct v0.2	mistralai/Mistral-7B-Instruct-v0.2	32768	FP16
Mistral AI	Mistral (7B) Instruct v0.3	mistralai/Mistral-7B-Instruct-v0.3	32768	FP16
NousResearch	Nous Hermes 2 - Mixtral 8x7B-DPO (46.7B)	NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO*	32768	FP16

* The Free version of Llama 3.3 70B Instruct Turbo has a reduced rate limit of 6 requests/minute for users on the free tier and 10 requests/minute for all other build tiers.

*Deprecated model, see Deprecations for more details

Chat Model Examples

PDF to Chat App - Chat with your PDFs (blogs, textbooks, papers)
Open Deep Research Notebook - Generate long form reports using a single prompt
RAG with Reasoning Models Notebook - RAG with DeepSeek-R1
Fine-tuning Chat Models Notebook - Tune Language models for conversation
Building Agents - Agent workflows with language models

Image models

Use our Images endpoint for Image Models.

Organization	Model Name	Model String for API	Default steps
Black Forest Labs	Flux.1 [schnell] (free)*	black-forest-labs/FLUX.1-schnell-Free	N/A
Black Forest Labs	Flux.1 [schnell] (Turbo)	black-forest-labs/FLUX.1-schnell	4
Black Forest Labs	Flux.1 Dev	black-forest-labs/FLUX.1-dev	28
Black Forest Labs	Flux.1 Canny	black-forest-labs/FLUX.1-canny*	28
Black Forest Labs	Flux.1 Depth	black-forest-labs/FLUX.1-depth*	28
Black Forest Labs	Flux.1 Redux	black-forest-labs/FLUX.1-redux*	28
Black Forest Labs	Flux1.1 [pro]	black-forest-labs/FLUX.1.1-pro	-
Black Forest Labs	Flux.1 [pro]	black-forest-labs/FLUX.1-pro	28
Black Forest Labs	Flux .1 Kontext [pro]	black-forest-labs/FLUX.1-kontext-pro	28
Black Forest Labs	Flux .1 Kontext [max]	black-forest-labs/FLUX.1-kontext-max	28
Black Forest Labs	Flux .1 Kontext [dev]	black-forest-labs/FLUX.1-kontext-dev	28

Note: Due to high demand, FLUX.1 [schnell] Free has a model specific rate limit of 10 img/min. Flux Pro 1, Flux Pro 1.1, Flux .1 Kontext [pro], and Flux .1 Kontext [max] are limited to users Build Tier 2 and above. Flux models can also only be used with credits. Users are unable to call Flux with a zero or negative balance.

*Free model has reduced rate limits and performance compared to our paid Turbo endpoint for Flux Shnell named black-forest-labs/FLUX.1-schnell

Image Model Examples

Blinkshot.io - A realtime AI image playground built with Flux Schnell
Logo Creator - An logo generator that creates professional logos in seconds using Flux Pro 1.1
PicMenu - A menu visualizer that takes a restaurant menu and generates nice images for each dish.
Flux LoRA Inference Notebook - Using LoRA fine-tuned image generations models

How FLUX pricing works
For FLUX models (except for pro) pricing is based on the size of generated images (in megapixels) and the number of steps used (if the number of steps exceed the default steps).

Default pricing: The listed per megapixel prices are for the default number of steps.
Using more or fewer steps: Costs are adjusted based on the number of steps used only if you go above the default steps. If you use more steps, the cost increases proportionally using the formula below. If you use fewer steps, the cost does not decrease and is based on the default rate.

Here’s a formula to calculate cost:

Cost = MP × Price per MP × (Steps ÷ Default Steps)

Where:

MP = (Width × Height ÷ 1,000,000)
Price per MP = Cost for generating one megapixel at the default steps
Steps = The number of steps used for the image generation. This is only factored in if going above default steps.

Vision models

If you're not sure which vision model to use, we currently recommend Llama 4 Scout (meta-llama/Llama-4-Scout-17B-16E-Instruct) to get started. For model specific rate limits, navigate here.

Organization	Model Name	API Model String	Context length
Meta	Llama 4 Maverick (17Bx128E)	meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8	524288
Meta	Llama 4 Scout (17Bx16E)	meta-llama/Llama-4-Scout-17B-16E-Instruct	327680
Meta	(Free) Llama 3.2 11B Vision Instruct Turbo*	meta-llama/Llama-Vision-Free	131072
Meta	Llama 3.2 11B Vision Instruct Turbo	meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo*	131072
Meta	Llama 3.2 90B Vision Instruct Turbo	meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo*	131072
Qwen	Qwen2 Vision Language 72B Instruct	Qwen/Qwen2-VL-72B-Instruct	32768
Qwen	Qwen2.5 Vision Language 72B Instruct	Qwen/Qwen2.5-VL-72B-Instruct	32768
Arcee	Arcee AI Spotlight	arcee_ai/arcee-spotlight	128000

*Free model has reduced rate limits compared to paid version of Llama 3.2 Vision 11B named Llama-3.2-11B-Vision-Instruct-Turbo

Vision Model Examples

LlamaOCR - A tool that takes documents (like receipts) and outputs markdown
Wireframe to Code - A wireframe to app tool that takes in a UI mockup of a site and give you React code.
Extracting Structured Data from Images - Extract information from images as JSON

Audio models

Use our Audio endpoint for audio models.

Organization	Model Name	Model String for API
Cartesia	Cartesia Sonic 2	cartesia/sonic-2
Cartesia	Cartesia Sonic	cartesia/sonic

Audio Model Examples

PDF to Podcast Notebook - Generate a NotebookLM style podcast given a PDF
Audio Podcast Agent Workflow - Agent workflow to generate audio files given input content

Code models

Use our Completions endpoint for Code Models.

Organization	Model Name	Model String for API	Context length
Qwen	Qwen 2.5 Coder 32B Instruct	Qwen/Qwen2.5-Coder-32B-Instruct	32768

Code Model Examples

LlamaCoder - An open source app to generate small apps with one prompt. Powered by Llama 3 405B.
Code Generation Agent - An agent workflow to generate and iteratively improve code.

Embedding models

Model Name	Model String for API	Model Size	Embedding Dimension	Context Window
M2-BERT-80M-32K-Retrieval	togethercomputer/m2-bert-80M-32k-retrieval	80M	768	32768
BGE-Large-EN-v1.5	BAAI/bge-large-en-v1.5	326M	1024	512
BGE-Base-EN-v1.5	BAAI/bge-base-en-v1.5	102M	768	512
GTE-Modernbert-base	Alibaba-NLP/gte-modernbert-base	149M	768	8192
Multilingual-e5-large-instruct	intfloat/multilingual-e5-large-instruct	560M	1024	514

Embedding Model Examples

Contextual RAG - An open source implementation of contextual RAG by Anthropic
Code Generation Agent - An agent workflow to generate and iteratively improve code
Multimodal Search and Image Generation - Search for images and generate more similar ones
Visualizing Embeddings - Visualizing and clustering vector embeddings

Rerank models

Our Rerank API has built-in support for the following models, that we host via our serverless endpoints.

Organization	Model Name	Model Size	Model String for API	Max Doc Size (tokens)	Max Docs
Salesforce	LlamaRank	8B	Salesforce/Llama-Rank-v1	8192	1024
MixedBread	Rerank Large	1.6B	mixedbread-ai/Mxbai-Rerank-Large-V2	32768	-

Rerank Model Examples

Search and Reranking - Simple semantic search pipeline improved using a reranker
Implementing Hybrid Search Notebook - Implementing semantic + lexical search along with reranking

Language models

Use our Completions endpoint for Language Models.

Organization	Model Name	Model String for API	Context length
Meta	LLaMA-2 (70B)	meta-llama/Llama-2-70b-hf	4096
mistralai	Mixtral-8x7B (46.7B)	mistralai/Mixtral-8x7B-v0.1	32768

Moderation models

Use our Completions endpoint to run a moderation model as a standalone classifier, or use it alongside any of the other models above as a filter to safeguard responses from 100+ models, by specifying the parameter "safety_model": "MODEL_API_STRING"

Organization	Model Name	Model String for API	Context length
Meta	Llama Guard (8B)	meta-llama/Meta-Llama-Guard-3-8B	8192
Meta	Llama Guard 4 (12B)	meta-llama/Llama-Guard-4-12B	1048576