Inference

POST /inference
curl --request POST \
  --url https://api.together.xyz/inference \
  --header 'Authorization: Bearer <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1",
  "prompt": "The capital of France is",
  "max_tokens": 1,
  "temperature": 0.7,
  "top_p": 0.7,
  "top_k": 50,
  "repetition_penalty": 1
}'
"{\n  \"status\": \"finished\",\n  \"prompt\": [\n    \"The capital of France is \"\n  ],\n  \"model\": \"togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1\",\n  \"model_owner\": \"\",\n  \"tags\": {},\n  \"num_returns\": 1,\n  \"args\": {\n    \"model\": \"togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1\",\n    \"prompt\": \"The capital of France is \",\n    \"temperature\": 0.8,\n    \"top_p\": 0.7,\n    \"top_k\": 50,\n    \"max_tokens\": 1\n  },\n  \"subjobs\": [],\n  \"output\": {\n    \"choices\": [\n      {\n        \"finish_reason\": \"length\",\n        \"index\": 0,\n        \"text\": \" Paris\"\n      }\n    ],\n    \"raw_compute_time\": 0.06382315792143345,\n    \"result_type\": \"language-model-inference\"\n  }\n}"

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <api-key>, where <api-key> is your Together API key.

Body

application/json
model
string
default:mistralai/Mixtral-8x7B-Instruct-v0.1
required

The name of the model to query.

prompt
string
default:<s>[INST] What is the capital of France? [/INST]
required

A string providing context for the model to complete.

max_tokens
integer
default:512
required

The maximum number of tokens to generate.

stop
string[]

A list of string sequences that will stop generation when produced in the output. For example, "</s>" stops generation as soon as the model emits that token. See the sketch below.
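For instance, reusing the payload dict from the Python sketch above (the exact sequences are illustrative):

payload["stop"] = ["</s>", "\n\n"]  # generation halts at whichever sequence appears first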

temperature
number
default:0.7

A decimal number that determines the degree of randomness in the response. Lower values (e.g. 0.2) result in more consistent outputs, while higher values (e.g. 1.0) generate more diverse and creative results. We generally recommend altering this or top_p but not both.

top_p
number
default:0.7

The top_p (nucleus) parameter dynamically adjusts the number of choices for each predicted token based on cumulative probabilities: it keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold and filters out the rest. This technique helps to maintain diversity while generating fluent, natural-sounding text.
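As a rough sketch of the idea, here is the standard nucleus-sampling filter in Python (illustrative only, not necessarily the exact implementation behind the API):

import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize over the survivors.
    order = np.argsort(probs)[::-1]                  # most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest prefix covering top_p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()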

top_k
integer
default:50

The top_k parameter is used to limit the number of choices for the next predicted word or token. It specifies the maximum number of tokens to consider at each step, based on their probability of occurrence. This technique helps to speed up the generation process and can improve the quality of the generated text by focusing on the most likely options.
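A matching sketch for the top_k filter (again illustrative; it reuses numpy from the snippet above):

def top_k_filter(probs: np.ndarray, top_k: int) -> np.ndarray:
    # Zero out everything outside the top_k most likely tokens, renormalize.
    keep = np.argsort(probs)[::-1][:top_k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()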

repetition_penalty
number
default:1

A number that controls the diversity of generated text by reducing the likelihood of repeated sequences. Values greater than 1 penalize repetition; a value of 1 applies no penalty.
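One common formulation is the CTRL-style penalty sketched below; the API's exact scheme is not documented here, so treat this as an assumption:

def apply_repetition_penalty(logits: np.ndarray, seen_tokens: set, penalty: float) -> np.ndarray:
    # Push down the logits of tokens that were already generated
    # (assumed CTRL-style scheme, not confirmed for this API).
    out = logits.copy()
    for tok in seen_tokens:
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out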

stream_tokens
boolean

If true, stream tokens as Server-Sent Events as the model generates them instead of waiting for the full model response. If false, return a single JSON object containing the results.
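A sketch of consuming the stream in Python, reusing url, headers, and payload from the first example and assuming OpenAI-style "data:" framing with a "[DONE]" sentinel (verify the framing against a live response):

import json

payload["stream_tokens"] = True
with requests.post(url, headers=headers, json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)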

safety_model
string

The name of a moderation model to run as an LLM-based input-output safeguard on top of the queried model.
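For example (the model name below is illustrative; consult the list of available safety models):

payload["safety_model"] = "Meta-Llama/Llama-Guard-7b"  # illustrative safeguard model name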

n
integer
default:1

How many completions to generate for each prompt.

Response

200

status
string
Example: "finished"

prompt
string[]
model
string
Example: "togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1"

model_owner
string
Example: ""

tags
object
num_returns
integer
default:0
Example: 1

args
object
subjobs
any[]
output
object
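Putting the schema together, a minimal sketch of pulling the generated text out of a non-streaming response (reusing response from the first example; with n > 1, output.choices holds one entry per completion):

result = response.json()
if result["status"] == "finished":
    for choice in result["output"]["choices"]:
        print(choice["finish_reason"], choice["text"])  # e.g. "length" and " Paris"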