Serverless models

Chat models

In the table below, models marked as "Turbo" are quantized to FP8 and those marked as "Lite" are INT4. All our other models are at full precision (FP16).

If you're not sure which chat model to use, we currently recommend Llama 3.1 8B Turbo (meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) to get started.

OrganizationModel NameAPI Model StringContext lengthQuantization
MetaLlama 3.3 70B Instruct Turbometa-llama/Llama-3.3-70B-Instruct-Turbo131072FP8
MetaLlama 3.1 8B Instruct Turbometa-llama/Meta-Llama-3.1-8B-Instruct-Turbo131072FP8
MetaLlama 3.1 70B Instruct Turbometa-llama/Meta-Llama-3.1-70B-Instruct-Turbo131072FP8
MetaLlama 3.1 405B Instruct Turbometa-llama/Meta-Llama-3.1-405B-Instruct-Turbo130815FP8
MetaLlama 3 8B Instruct Turbometa-llama/Meta-Llama-3-8B-Instruct-Turbo8192FP8
MetaLlama 3 70B Instruct Turbometa-llama/Meta-Llama-3-70B-Instruct-Turbo8192FP8
MetaLlama 3.2 3B Instruct Turbometa-llama/Llama-3.2-3B-Instruct-Turbo131072FP16
MetaLlama 3 8B Instruct Litemeta-llama/Meta-Llama-3-8B-Instruct-Lite8192INT4
MetaLlama 3 70B Instruct Litemeta-llama/Meta-Llama-3-70B-Instruct-Lite8192INT4
MetaLlama 3 8B Instruct Referencemeta-llama/Llama-3-8b-chat-hf8192FP16
MetaLlama 3 70B Instruct Referencemeta-llama/Llama-3-70b-chat-hf8192FP16
NvidiaLlama 3.1 Nemotron 70Bnvidia/Llama-3.1-Nemotron-70B-Instruct-HF32768FP16
QwenQwen 2.5 Coder 32B InstructQwen/Qwen2.5-Coder-32B-Instruct32768FP16
QwenQwQ-32B-PreviewQwen/QwQ-32B-Preview32768FP16
MicrosoftWizardLM-2 8x22Bmicrosoft/WizardLM-2-8x22B65536FP16
GoogleGemma 2 27Bgoogle/gemma-2-27b-it8192FP16
GoogleGemma 2 9Bgoogle/gemma-2-9b-it8192FP16
databricksDBRX Instructdatabricks/dbrx-instruct32768FP16
DeepSeekDeepSeek LLM Chat (67B)deepseek-ai/deepseek-llm-67b-chat4096FP16
DeepSeekDeepSeek-V3deepseek-ai/DeepSeek-V3131072FP8
GoogleGemma Instruct (2B)google/gemma-2b-it8192FP16
GrypheMythoMax-L2 (13B)Gryphe/MythoMax-L2-13b4096FP16
MetaLLaMA-2 Chat (13B)meta-llama/Llama-2-13b-chat-hf4096FP16
mistralaiMistral (7B) Instructmistralai/Mistral-7B-Instruct-v0.18192FP16
mistralaiMistral (7B) Instruct v0.2mistralai/Mistral-7B-Instruct-v0.232768FP16
mistralaiMistral (7B) Instruct v0.3mistralai/Mistral-7B-Instruct-v0.332768FP16
mistralaiMixtral-8x7B Instruct (46.7B)mistralai/Mixtral-8x7B-Instruct-v0.132768FP16
mistralaiMixtral-8x22B Instruct (141B)mistralai/Mixtral-8x22B-Instruct-v0.165536FP16
NousResearchNous Hermes 2 - Mixtral 8x7B-DPO (46.7B)NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO32768FP16
QwenQwen 2.5 7B Instruct TurboQwen/Qwen2.5-7B-Instruct-Turbo32768FP8
QwenQwen 2.5 72B Instruct TurboQwen/Qwen2.5-72B-Instruct-Turbo32768FP8
QwenQwen 2 Instruct (72B)Qwen/Qwen2-72B-Instruct32768FP16
QwenQwen2 VL 72B InstructQwen/Qwen2-VL-72B-Instruct32768FP16
upstageUpstage SOLAR Instruct v1 (11B)upstage/SOLAR-10.7B-Instruct-v1.04096FP16

Image models

Use our Images endpoint for Image Models.

OrganizationModel NameModel String for APIDefault steps
Black Forest LabsFlux.1 [schnell] (free)*black-forest-labs/FLUX.1-schnell-FreeN/A
Black Forest LabsFlux.1 [schnell] (Turbo)black-forest-labs/FLUX.1-schnell4
Black Forest LabsFlux.1 Devblack-forest-labs/FLUX.1-dev28
Black Forest LabsFlux.1 Cannyblack-forest-labs/FLUX.1-canny28
Black Forest LabsFlux.1 Depthblack-forest-labs/FLUX.1-depth28
Black Forest LabsFlux.1 Reduxblack-forest-labs/FLUX.1-redux28
Black Forest LabsFlux1.1 [pro]black-forest-labs/FLUX.1.1-pro-
Black Forest LabsFlux.1 [pro]black-forest-labs/FLUX.1-pro-
Stability AIStable Diffusion XL 1.0stabilityai/stable-diffusion-xl-base-1.0-

*Free model has reduced rate limits and performance compared to our paid Turbo endpoint for Flux Shnell named black-forest-labs/FLUX.1-schnell

How FLUX pricing works
For FLUX models (except for pro) pricing is based on the size of generated images (in megapixels) and the number of steps used (if the number of steps exceed the default steps).

  • Default pricing: The listed per megapixel prices are for the default number of steps.
  • Using more or fewer steps: Costs are adjusted based on the number of steps used only if you go above the default steps. If you use more steps, the cost increases proportionally using the formula below. If you use fewer steps, the cost does not decrease and is based on the default rate.

Here’s a formula to calculate cost:

Cost = MP × Price per MP × (Steps ÷ Default Steps)

Where:

  • MP = (Width × Height ÷ 1,000,000)
  • Price per MP = Cost for generating one megapixel at the default steps
  • Steps = The number of steps used for the image generation. This is only factored in if going above default steps.

Vision models

If you're not sure which vision model to use, we currently recommend Llama 3.2 11B Turbo (meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo) to get started. For model specific rate limits, navigate here.

OrganizationModel NameAPI Model StringContext length
Meta(Free) Llama 3.2 11B Vision Instruct Turbo*meta-llama/Llama-Vision-Free131072
MetaLlama 3.2 11B Vision Instruct Turbometa-llama/Llama-3.2-11B-Vision-Instruct-Turbo131072
MetaLlama 3.2 90B Vision Instruct Turbometa-llama/Llama-3.2-90B-Vision-Instruct-Turbo131072

*Free model has reduced rate limits compared to paid version of Llama 3.2 Vision 11B named Llama-3.2-11B-Vision-Instruct-Turbo

Code models

Use our Completions endpoint for Code Models.

OrganizationModel NameModel String for APIContext length
QwenQwen 2.5 Coder 32B InstructQwen/Qwen2.5-Coder-32B-Instruct32768

Language models

Use our Completions endpoint for Language Models.

OrganizationModel NameModel String for APIContext length
MetaLLaMA-2 (70B)meta-llama/Llama-2-70b-hf4096
mistralaiMistral (7B)mistralai/Mistral-7B-v0.18192
mistralaiMixtral-8x7B (46.7B)mistralai/Mixtral-8x7B-v0.132768

Moderation models

Use our Completions endpoint to run a moderation model as a standalone classifier, or use it alongside any of the other models above as a filter to safeguard responses from 100+ models, by specifying the parameter "safety_model": "MODEL_API_STRING"

OrganizationModel NameModel String for APIContext length
MetaLlama Guard (7B)Meta-Llama/Llama-Guard-7b4096

Embedding models

Model NameModel String for APIModel SizeEmbedding DimensionContext Window
M2-BERT-80M-2K-Retrievaltogethercomputer/m2-bert-80M-2k-retrieval80M7682048
M2-BERT-80M-8K-Retrievaltogethercomputer/m2-bert-80M-8k-retrieval80M7688192
M2-BERT-80M-32K-Retrievaltogethercomputer/m2-bert-80M-32k-retrieval80M76832768
UAE-Large-v1WhereIsAI/UAE-Large-V1326M1024512
BGE-Large-EN-v1.5BAAI/bge-large-en-v1.5326M1024512
BGE-Base-EN-v1.5BAAI/bge-base-en-v1.5102M768512
Sentence-BERTsentence-transformers/msmarco-bert-base-dot-v5110M768512
BERTbert-base-uncased110M768512

Rerank models

Our Rerank API has built-in support for the following models, that we host via our serverless endpoints.

OrganizationModel NameModel SizeModel String for APIMax Doc Size (tokens)Max Docs
SalesforceLlamaRank8BSalesforce/Llama-Rank-v181921024