Authorizations
Body
The name of the model to query.
A string providing context for the model to complete.
The maximum number of tokens to generate.
A list of string sequences that will truncate (stop) inference text output. For example, "</s>" stops generation as soon as the model produces that sequence.
A decimal number that determines the degree of randomness in the response. Lower values (e.g. 0.2) result in more consistent outputs, while higher values (e.g. 1.0) generate more diverse and creative results. We generally recommend altering this or top_p, but not both.
The top_p (nucleus) parameter is used to dynamically adjust the number of choices for each predicted token based on the cumulative probabilities. It specifies a probability threshold, below which all less likely tokens are filtered out. This technique helps to maintain diversity and generate more fluent and natural-sounding text.
The top_k parameter is used to limit the number of choices for the next predicted word or token. It specifies the maximum number of tokens to consider at each step, based on their probability of occurrence. This technique helps to speed up the generation process and can improve the quality of the generated text by focusing on the most likely options.
A number that controls the diversity of generated text by reducing the likelihood of repeated sequences. Higher values decrease repetition.
If true, stream tokens as Server-Sent Events as the model generates them instead of waiting for the full model response. If false, return a single JSON object containing the results. (A sketch of consuming the streamed events follows the parameter list.)
Run an LLM-based input-output safeguard model on top of any model.
How many completions to generate for each prompt.
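The sketch below shows how these body parameters fit together in a non-streaming request. The endpoint URL, model name, API-key environment variable, and exact field names are assumptions for illustration; substitute the values and schema from your provider's reference.

```python
# Minimal completion request sketch (non-streaming). URL, model name, and
# field names are assumptions, not the provider's confirmed schema.
import os
import requests

API_URL = "https://api.example.com/v1/completions"  # hypothetical endpoint
headers = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

payload = {
    "model": "example/model-name",   # name of the model to query (assumed identifier)
    "prompt": "Write a haiku about rivers:",
    "max_tokens": 64,                # maximum number of tokens to generate
    "stop": ["</s>"],                # sequences that stop generation
    "temperature": 0.7,              # randomness; alter this or top_p, not both
    "top_p": 0.9,                    # nucleus sampling threshold
    "top_k": 50,                     # consider only the k most likely tokens
    "repetition_penalty": 1.1,       # higher values decrease repetition
    "stream": False,                 # return a single JSON object
    "n": 1,                          # completions to generate per prompt
}

response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```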
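When stream is true, tokens arrive as Server-Sent Events. The following is a rough consumption sketch that assumes the common "data: ..." / "data: [DONE]" framing and a "choices[0].text" field; the actual event format and field names depend on the provider.

```python
# Streaming consumption sketch. The SSE framing and response fields below are
# assumptions; check the provider's streaming documentation for the real schema.
import json
import os
import requests

API_URL = "https://api.example.com/v1/completions"  # hypothetical endpoint
headers = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

payload = {
    "model": "example/model-name",
    "prompt": "Write a haiku about rivers:",
    "max_tokens": 64,
    "stream": True,  # tokens are sent as Server-Sent Events
}

with requests.post(API_URL, headers=headers, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip keep-alive blank lines and non-data fields
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        event = json.loads(data)
        # "choices[0].text" as the token payload is an assumption here.
        print(event.get("choices", [{}])[0].get("text", ""), end="", flush=True)
```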