Data preparation

Together Fine-tuning API accepts two data formats for the training dataset.

Text Data

To use raw text data, your data file must be in the .jsonl format with a text field. Note that your data can have fields other than "text", but they will be ignored during training. To speed up the data uploading and processing steps and to maximize the number of examples per file, we recommend to remove the unused fields.

{"text": "..."}
{"text": "..."}

Here are some examples that you can download from HuggingFace:

If you are following the prompt format of your base model of choice (if exists), or want to fine-tune a pre-trained model with a certain prompt format, structure your example with the desired prompt format. See this page for more details on the prompt format.

The data file size must be under 5GB.

Tokenized Data

You can also provide tokenized data for more advanced use cases. You may want to use this data format if you are:

  1. Using the same dataset for multiple experiments: this saves the tokenization step and accelerates your fine-tuning job.
  2. Using a custom tokenizer that's intentionally different than the base model tokenizer
  3. Masking out certain part(s) of your example for the loss calculation

Your data file must meet the following requirements:

  • The data file size must be under 5GB.
  • The file format must be in the .parquet format.
  • Allowed fields:
    • input_ids(required): List of token ids to be fed to a model.
    • attention_mask(required): List of indices specifying which tokens should be attended to by the model.
    • labels(optional): List of token ids to be used as target predictions. The default token ID to be ignored in the loss calculation is -100. To ignore certain predictions in the loss, replace their corresponding values with -100. If this field is not given, input_ids will be used.

Example

You can find an example script that converts text data in Hugging Face Hub to the tokenized format.

In this example, we will use a toy dataset clam004/antihallucination_dataset in Hugging Face Hub with the tokenizer from NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFTmodel. The max sequence length of this model is 32768. To compare the difference from packing and padding, we will run the script twice with and without --packing. When packing is not applied, each example will be (left-)padded with the tokenizer's own pad token to keep the length of all examples consistent. Note that packing is used during training by default, and we recommend to use packing during the tokenization step by passing --packing in the example script.

  • With packing,
python tokenize_data.py --tokenizer="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT" --max-seq-length=32768 --add-labels --packing --out-filename="processed_dataset_packed.parquet"

processed_dataset_packed.parquet will be saved under the same directory.

  • Without packing,
python tokenize_data.py --tokenizer="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT" --max-seq-length=32768 --add-labels --out-filename="processed_dataset.parquet"

processed_dataset_padded.parquet will be saved under the same directory.

Let's load the generated files to see the results. In python,

>>> from datasets import load_dataset
>>> dataset_packed = load_dataset("parquet", data_files={'train': 'processed_dataset_packed.parquet'})
>>> dataset_padded = load_dataset("parquet", data_files={'train': 'processed_dataset_padded.parquet'})

First, you will see the number of examples from the dataset with packing is only 6 while the one without packing has 238:

>>> dataset_packed['train']
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 6
})
>>> dataset_padded['train']
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 238
})

In the first example of dataset_padded you will find the first 31140 tokens are padded and have -100 as their labels to be ignored during the loss mask. The pad token for this tokenizer is 32000

{
  "input_ids": [32000, 32000, 32000, ..., 3409, 6898, 28767],
  "attention_masks": [0, 0, 0, ..., 1, 1, 1],
  "labels": [-100, -100, -100,..., 3409, 6898, 28767]
}

On the other hand, in the first example of dataset_packed, no padding is used. And the first 1628 token ids match the last 1628 token ids from the first example of dataset_padded.

{
  "input_ids": [1, 523, 434, ..., 6549, 3805, 7457],
  "attention_masks": [1, 1, 1, ..., 1, 1, 1],
  "labels": [1, 523, 434,..., 6549, 3805, 7457]
}

File Check

To confirm that your dataset has the right format, run the following command. This step is optional, but we highly recommend to run this step before uploading the file and using it for fine-tuning.

together files check PATH_TO_DATA_FILE

Here's the output:

together files check joke_explanations.jsonl
{
    "is_check_passed": true,
    "message": "Checks passed",
    "found": true,
    "file_size": 781041,
    "utf8": true,
    "line_type": true,
    "text_field": true,
    "key_value": true,
    "min_samples": true,
    "num_samples": 238,
    "load_json": true,
    "filetype": "jsonl"
}

After your data is prepared, upload your file using either CLI or Python SDK.