Data preparation
The Together Fine-tuning API accepts two data formats for training dataset files: text data and tokenized data (in the form of Parquet files). The sections below describe each format and the scenarios in which it is most useful.
Note: Regardless of the dataset format, the data file size must be under 5GB.
Text Data
Data formats
The Together Fine-tuning API accepts three text dataset formats for the training dataset. Your data file must be in the .jsonl format, with fields that indicate the dataset format. You can have other fields, but they will be ignored during training. To speed up data uploading and processing and to maximize the number of examples per file, we recommend removing unused fields.
Also, if the data matches two or more formats (e.g., it contains both text and messages), the Together client will show an error at the file check stage before the upload.
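The "exactly one format per record" rule can be checked locally before upload. The sketch below is only an illustration of that rule, not the Together client's own check; the helper name detect_format and the format labels are made up for this example.

```python
import json

# Field sets taken from the three text formats described on this page.
# The format names used as keys are labels chosen for this sketch.
FORMAT_FIELDS = {
    "conversational": {"messages"},
    "instruction": {"prompt", "completion"},
    "text": {"text"},
}


def detect_format(record: dict) -> str:
    """Return the name of the single dataset format a record matches."""
    matches = [
        name for name, fields in FORMAT_FIELDS.items()
        if fields.issubset(record.keys())
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one format, found {matches}")
    return matches[0]


# Extra fields such as "source" do not affect detection.
line = '{"prompt": "2 + 2 =", "completion": " 4", "source": "toy"}'
print(detect_format(json.loads(line)))  # instruction
```

A record that contains both text and messages matches two formats and raises ValueError, which mirrors the ambiguity error described above.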
Conversational Data
For conversational fine-tuning, your data file must contain a messages field on each line, with role and content specified for each message. Each sample should start with either a system or user message, followed by alternating user and assistant messages. The Together client will reject any dataset that does not follow this pattern.
{
"messages": [
{"role": "system", "content": "This is a system prompt."},
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing well, thank you! How can I help you?"},
{"role": "user", "content": "Can you explain machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]
}
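The ordering rule for messages can be approximated with a small local check. This is a sketch of one reading of the rule (an optional leading system message, then user and assistant turns alternating, starting with user); the function name validate_messages is hypothetical, and the Together client's actual validation may be stricter.

```python
def validate_messages(messages: list[dict]) -> None:
    """Raise ValueError if the roles do not follow the documented pattern."""
    if not messages:
        raise ValueError("messages must not be empty")
    for i, message in enumerate(messages):
        if "role" not in message or "content" not in message:
            raise ValueError(f"message {i} needs both 'role' and 'content'")
    roles = [message["role"] for message in messages]
    # An optional system message may come first; skip it if present.
    turns = roles[1:] if roles[0] == "system" else roles
    for i, role in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {i}: expected {expected!r}, got {role!r}")


validate_messages([
    {"role": "system", "content": "This is a system prompt."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you!"},
])
```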
The resulting conversation dataset will be automatically formatted into the model's chat template if it is defined for that model, or into the default template otherwise. As a general rule, all instruction-finetuned models have their own chat templates, and base models do not have them.
By default, models will be trained to predict only assistant messages. Use --train-on-inputs true to include other messages in training. See the API Reference for details.
Example datasets:
Instruction Data
For instruction-based fine-tuning, your data file must contain prompt and completion fields:
{"prompt": "...", "completion": "..."}
{"prompt": "...", "completion": "..."}
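Since fields other than prompt and completion are ignored during training, they can be dropped while writing the file. The following is a minimal sketch under that assumption; the helper name to_instruction_line and the sample rows are invented for illustration.

```python
import json

KEEP = ("prompt", "completion")  # the only fields used by this format


def to_instruction_line(record: dict) -> str:
    """Serialize one record as a single JSONL line, dropping unused fields."""
    return json.dumps({key: record[key] for key in KEEP}, ensure_ascii=False)


# Hypothetical raw rows carrying an extra "id" field that training would ignore.
rows = [
    {"id": 1, "prompt": "Translate 'chat' to English:", "completion": " cat"},
    {"id": 2, "prompt": "Translate 'chien' to English:", "completion": " dog"},
]
jsonl_text = "\n".join(to_instruction_line(row) for row in rows) + "\n"
print(jsonl_text, end="")
```

Writing jsonl_text to a file with a .jsonl extension gives one JSON object per line, as the format requires.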
By default, models will not be trained to predict the text from the prompt. Use --train-on-inputs true to include prompts in training. See the API Reference for details.
Here are some examples with this format that you can download from the Hugging Face Hub:
Generic Text Data
If you have no need for instruction or conversational training, you can put the data in the text field.
{"text": "..."}
{"text": "..."}
Here are some examples of datasets that you can download from the Hugging Face Hub:
Tokenized Data
You can also provide tokenized data for more advanced use cases. You may want to use this data format if you are:
- Using the same dataset for multiple experiments: this saves the tokenization step and accelerates your fine-tuning job.
- Using a custom tokenizer that's intentionally different from the base model tokenizer.
- Masking out certain parts of your examples for the loss calculation (which are not covered by instruction or conversational dataset use cases above)
Your data file must meet the following requirements:
- The data file size must be under 5GB.
- The file must be in the .parquet format.
- Allowed fields:
  - input_ids (required): List of token ids to be fed to a model.
  - attention_mask (required): List of indices specifying which tokens should be attended to by the model.
  - labels (optional): List of token ids to be used as target predictions. The default token ID to be ignored in the loss calculation is -100. To ignore certain predictions in the loss, replace their corresponding values with -100. If this field is not given, input_ids will be used.
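To make the role of -100 in labels concrete, here is a small sketch that builds one row whose loss covers only the completion tokens. The function name build_example and the token ids are arbitrary placeholders, not output from a real tokenizer. Labels are left unshifted, on the assumption that any shifting is handled during training.

```python
IGNORE_INDEX = -100  # label value ignored in the loss calculation


def build_example(prompt_ids: list[int], completion_ids: list[int]) -> dict:
    """Build one tokenized row whose loss covers only the completion."""
    input_ids = prompt_ids + completion_ids
    return {
        "input_ids": input_ids,
        # Every position holds a real token, so all are attended to.
        "attention_mask": [1] * len(input_ids),
        # Prompt positions are masked out; completion positions keep their ids.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + completion_ids,
    }


row = build_example([1, 523, 434], [6549, 3805])
print(row["labels"])  # [-100, -100, -100, 6549, 3805]
```

Rows shaped like this could then be written to a .parquet file with a Parquet library of your choice.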
Example
You can find an example script that converts text data in Hugging Face Hub to the tokenized format.
In this example, we will use a toy dataset, clam004/antihallucination_dataset, from the Hugging Face Hub with the tokenizer from the NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT model. The max sequence length of this model is 32768. To compare packing and padding, we will run the script twice, with and without --packing. When packing is not applied, each example will be (left-)padded with the tokenizer's own pad token to keep the length of all examples consistent. Note that packing is used during training by default, and we recommend using packing during the tokenization step by passing --packing in the example script. Also note that we shift labels internally for model training, so you do not need to do this yourself.
- With packing,
python tokenize_data.py --tokenizer="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT" --max-seq-length=32768 --add-labels --packing --out-filename="processed_dataset_packed.parquet"
processed_dataset_packed.parquet will be saved under the same directory.
- Without packing,
python tokenize_data.py --tokenizer="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT" --max-seq-length=32768 --add-labels --out-filename="processed_dataset_padded.parquet"
processed_dataset_padded.parquet will be saved under the same directory.
Let's load the generated files to see the results. In Python:
>>> from datasets import load_dataset
>>> dataset_packed = load_dataset("parquet", data_files={'train': 'processed_dataset_packed.parquet'})
>>> dataset_padded = load_dataset("parquet", data_files={'train': 'processed_dataset_padded.parquet'})
First, you will see that the packed dataset has only 6 examples, while the padded one has 238:
>>> dataset_packed['train']
Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 6
})
>>> dataset_padded['train']
Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 238
})
In the first example of dataset_padded, you will find that the first 31140 tokens are padding and have -100 as their labels, so they are ignored in the loss calculation. The pad token id for this tokenizer is 32000.
{
"input_ids": [32000, 32000, 32000, ..., 3409, 6898, 28767],
"attention_mask": [0, 0, 0, ..., 1, 1, 1],
"labels": [-100, -100, -100, ..., 3409, 6898, 28767]
}
On the other hand, in the first example of dataset_packed, no padding is used, and the first 1628 token ids match the last 1628 token ids from the first example of dataset_padded.
{
"input_ids": [1, 523, 434, ..., 6549, 3805, 7457],
"attention_mask": [1, 1, 1, ..., 1, 1, 1],
"labels": [1, 523, 434, ..., 6549, 3805, 7457]
}
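The difference between the padded and packed rows above can be reproduced with a toy sketch. It assumes that packing simply concatenates examples and cuts the token stream into chunks of the max sequence length; the real example script may handle example boundaries and the final partial chunk differently. The function names left_pad and pack are made up for this illustration.

```python
def left_pad(ids: list[int], max_len: int, pad_id: int) -> dict:
    """Left-pad one example to max_len and mask the padding positions."""
    n_pad = max_len - len(ids)
    if n_pad < 0:
        raise ValueError("example is longer than max_len")
    return {
        "input_ids": [pad_id] * n_pad + ids,
        "attention_mask": [0] * n_pad + [1] * len(ids),
        "labels": [-100] * n_pad + ids,
    }


def pack(examples: list[list[int]], max_len: int) -> list[list[int]]:
    """Concatenate examples and cut the stream into chunks of max_len."""
    stream = [token for ids in examples for token in ids]
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


# A 1628-token example padded to 32768 tokens gets 31140 pad positions,
# matching the numbers quoted for the first padded example.
padded = left_pad(list(range(1628)), max_len=32768, pad_id=32000)
print(padded["attention_mask"].count(0))  # 31140
```

Because packing fills each row with real tokens instead of padding, it produces far fewer rows for the same data, which is consistent with 6 packed rows versus 238 padded rows.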
File Check
To confirm that your dataset has the right format, run the following command. This step is optional, but we highly recommend running it before uploading the file and using it for fine-tuning.
together files check PATH_TO_DATA_FILE
Here's an example run and its output:
together files check joke_explanations.jsonl
{
"is_check_passed": true,
"message": "Checks passed",
"found": true,
"file_size": 781041,
"utf8": true,
"line_type": true,
"text_field": true,
"key_value": true,
"min_samples": true,
"num_samples": 238,
"load_json": true,
"filetype": "jsonl"
}
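If you run the check from a script, the report can be inspected programmatically. The sketch below assumes the command prints a JSON object shaped like the one above; the helper name failed_checks and the failing report string are invented for illustration.

```python
import json


def failed_checks(report_json: str) -> list[str]:
    """Return the names of boolean checks that are false in the report."""
    report = json.loads(report_json)
    # Only literal false values count; numeric fields such as
    # file_size or num_samples are skipped.
    return [key for key, value in report.items() if value is False]


# Hypothetical failing report, shaped like the sample output above.
report = (
    '{"is_check_passed": false, "utf8": true, '
    '"min_samples": false, "num_samples": 0}'
)
print(failed_checks(report))  # ['is_check_passed', 'min_samples']
```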
After your data is prepared, upload your file using either the CLI or the Python SDK.