Parameters

Learn more about the parameters you can configure when running inference.

Max tokens

The maximum number of tokens to generate. Lower values cap the length of the response, and shorter completions return faster.
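
For example, here is a minimal request sketch (using the same endpoint and model as the full example below, with a hypothetical prompt) that caps the completion at a single token:

import requests

# Minimal sketch: cap the completion length with max_tokens.
# The prompt here is hypothetical; endpoint and model match the example below.
res = requests.post('https://api.together.xyz/inference', json={
    "model": "togethercomputer/RedPajama-INCITE-7B-Instruct",
    "prompt": "Q: What is the capital of France?\nA:",
    "max_tokens": 1,  # generate at most one token
}, headers={"Authorization": "Bearer <YOUR_API_KEY>"})
print(res.json()['output']['choices'][0]['text'])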

Stop words

In some cases, when you know the answer should be a single word, you can set max_tokens to 1 so that the output contains only one token. For a more complicated query (e.g., information extraction), however, you may want more than one token while still keeping the output short and relevant. In this case you can use stop words, as in the example below:

import requests

endpoint = 'https://api.together.xyz/inference'

# Few-shot prompt: two worked examples followed by the review we want summarized.
res = requests.post(endpoint, json={
    "model": "togethercomputer/RedPajama-INCITE-7B-Instruct",
    "prompt": """\
Given a review from Amazon's food products, the task is to generate a short summary of the given review in the input.

Input: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
Output: Good Quality Dog Food

Input: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as 'Jumbo'.
Output: Not as Advertised

Input: My toddler loves this game to a point where he asks for it. That's a big thing for me. Secondly, no glitching unlike one of their competitors (PlayShifu). Any tech I don’t have to reach out to support for help is a good tech for me. I even enjoy some of the games and activities in this. Overall, this is a product that shows that the developers took their time and made sure people would not be asking for refund. I’ve become bias regarding this product and honestly I look forward to buying more of this company’s stuff. Please keep up the great work.
Output:""",
    "top_p": 1,
    "top_k": 40,
    "temperature": 0.8,
    "max_tokens": 100,        # allow up to 100 tokens for the summary
    "repetition_penalty": 1,
    "stop": "\n\n"            # stop as soon as the model emits a blank line
}, headers={
    "Authorization": "Bearer <YOUR_API_KEY>",
    "User-Agent": "<YOUR_APP_NAME>"
})
print(res.json())
print(res.json()['output']['choices'][0]['text'])

Here we set max_tokens to 100, since we expect a summary of several words, and we set stop to \n\n so that the model stops as soon as it generates this stop word, which semantically marks the end of the answer. In this case, we receive a response like the following (only the output field is shown; other fields are omitted):

'output': {
	'choices': [
		{'finish_reason': 'length', 'index': 0, 'text': ' A Good Game'}
	], 
	'raw_compute_time': 1.0364392050541937, 
	'result_type': 'language-model-inference'
}

Temperature

A decimal number that determines the degree of randomness in the response. A value of 0 will always yield the same output for the same prompt. Values well below 1 favor more focused, predictable output and suit question answering or summarization, while values approaching 1 introduce more randomness into the output.
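
As an illustrative sketch (same endpoint and model as above, hypothetical prompt), a temperature of 0 makes repeated calls return the same completion, while a higher value varies the output:

import requests

endpoint = 'https://api.together.xyz/inference'

def complete(temperature):
    # Hypothetical one-sentence summarization prompt.
    res = requests.post(endpoint, json={
        "model": "togethercomputer/RedPajama-INCITE-7B-Instruct",
        "prompt": "Summarize in one sentence: The package arrived two weeks late and the box was damaged.\nSummary:",
        "temperature": temperature,
        "max_tokens": 30,
        "stop": "\n"
    }, headers={"Authorization": "Bearer <YOUR_API_KEY>"})
    return res.json()['output']['choices'][0]['text']

print(complete(0))    # deterministic: repeated calls return the same text
print(complete(0.9))  # more random: repeated calls can differ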

Top_p

The top_p (nucleus) parameter dynamically adjusts the number of choices for each predicted token based on cumulative probabilities. It specifies a probability threshold: sampling is restricted to the smallest set of most likely tokens whose cumulative probability reaches top_p, and all less likely tokens outside that set are filtered out. This technique helps maintain diversity while keeping the generated text fluent and natural-sounding.
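
To make the cumulative-probability idea concrete, here is a toy sketch of the filtering step (this illustrates the concept only; it is not the API's implementation):

# Conceptual sketch of nucleus (top_p) filtering on a toy next-token distribution.
probs = {"dog": 0.45, "cat": 0.30, "fish": 0.15, "lizard": 0.07, "rock": 0.03}
top_p = 0.8

kept, cumulative = [], 0.0
for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    kept.append(token)
    cumulative += p
    if cumulative >= top_p:
        break  # smallest set of most likely tokens whose mass reaches top_p

print(kept)  # ['dog', 'cat', 'fish'] -- sampling happens only among these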

Top_k

The top_k parameter is used to limit the number of choices for the next predicted word or token. It specifies the maximum number of tokens to consider at each step, based on their probability of occurrence. This technique helps to speed up the generation process and can improve the quality of the generated text by focusing on the most likely options.
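
The same toy distribution illustrates top_k: only the k most likely tokens survive the cut, regardless of how much probability they cover (again a conceptual sketch, not the API's implementation):

# Conceptual sketch of top_k filtering on a toy next-token distribution.
probs = {"dog": 0.45, "cat": 0.30, "fish": 0.15, "lizard": 0.07, "rock": 0.03}
top_k = 2

# Keep only the 2 highest-probability tokens; sampling happens among these.
kept = sorted(probs, key=probs.get, reverse=True)[:top_k]
print(kept)  # ['dog', 'cat']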

Repetition penalty

Controls the diversity of generated text by reducing the likelihood of repeated sequences. Higher values decrease repetition.
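
For example, here is a request sketch with the penalty raised above 1 (hypothetical prompt; a value of 1 is the neutral setting used in the stop-word example above):

import requests

endpoint = 'https://api.together.xyz/inference'

# Sketch: same request shape as above, with the repetition penalty raised.
# A value of 1 applies no penalty; values above 1 make repeats less likely.
res = requests.post(endpoint, json={
    "model": "togethercomputer/RedPajama-INCITE-7B-Instruct",
    "prompt": "Write three different taglines for a coffee shop:\n1.",
    "max_tokens": 60,
    "temperature": 0.8,
    "repetition_penalty": 1.2,
}, headers={"Authorization": "Bearer <YOUR_API_KEY>"})
print(res.json()['output']['choices'][0]['text'])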

Logprobs (API only)

An integer that specifies how many top token log probabilities are included in the response for each token generation step.
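
Here is a request sketch that asks for the top log probability at each generation step (hypothetical prompt; the exact place the log probabilities appear in the response may vary, so the full JSON is printed for inspection):

import requests

endpoint = 'https://api.together.xyz/inference'

# Sketch: request the top-1 log probability for each generated token.
res = requests.post(endpoint, json={
    "model": "togethercomputer/RedPajama-INCITE-7B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 5,
    "logprobs": 1,
}, headers={"Authorization": "Bearer <YOUR_API_KEY>"})
print(res.json())  # inspect the choices for per-token log probabilities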