It depends. Factors that affect waiting time include the number of pending jobs from other customers, the number of jobs currently running, and the available hardware. If there are no other pending jobs and hardware is available, your job should start within a minute of submission. Typically, jobs start within an hour of submission. However, there is no guarantee on waiting time.
It depends. Factors that impact your job run time are model size, training data size, and network conditions when downloading/uploading model/training files. You can estimate how long your job will take to complete training by multiplying the number of epochs by the time to complete the first epoch.
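As a rough illustration of the estimate described above, the following Python sketch multiplies the number of epochs by the measured duration of the first epoch. The function name and the example numbers are hypothetical, and the result covers training only: it does not account for queue time or for downloading and uploading files.

```python
from datetime import timedelta


def estimate_training_time(num_epochs: int, first_epoch: timedelta) -> timedelta:
    """Estimate total training time as (number of epochs) x (first epoch duration)."""
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")
    return num_epochs * first_epoch


# Hypothetical numbers: 4 epochs, and the first epoch took 12 minutes.
estimate = estimate_training_time(4, timedelta(minutes=12))
print(estimate)  # 0:48:00
```

Treat the result as a ballpark figure, since later epochs may not take exactly as long as the first.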
Why am I getting an error when uploading a training file?
There are two common issues you may encounter:
(1) Your API key may be incorrect. A 403 status code indicates that your API key is incorrect.
(2) Your balance may be less than the job minimum. We verify that your account balance is at least the minimum job charge ($5). If it is not, you can increase your account limit by adding a credit card to your account, adjusting your spending limit if you already have a credit card, or paying your outstanding account balance. If you have sufficient balance on your account, contact support for assistance.
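The two checks above can be expressed as a small triage helper. This is only a sketch: the function name is hypothetical, and it assumes you already know the HTTP status code of the failed upload and your current account balance in US dollars. It does not call any Together API.

```python
MIN_JOB_CHARGE_USD = 5.00  # minimum job charge stated in this FAQ


def diagnose_upload_error(status_code: int, balance_usd: float) -> str:
    """Return a hint for the two common upload failures described in this FAQ."""
    if status_code == 403:
        # A 403 status code indicates an incorrect API key.
        return "Check your API key."
    if balance_usd < MIN_JOB_CHARGE_USD:
        # The balance must be at least the minimum job charge.
        return "Increase your balance or spending limit."
    # Neither common cause applies.
    return "Contact support."


print(diagnose_upload_error(403, 20.0))
print(diagnose_upload_error(400, 2.5))
```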
There are two reasons that a job may be automatically cancelled:
(1) You do not have sufficient balance on your account to cover the cost of the job.
(2) You have entered an incorrect WandB API key.

You can determine why your job was cancelled by:
(1) Checking the events list for your job via the together-CLI tool:
    $ together list-events <job-fine-tune-id>
(2) Checking the web interface: https://api.together.ai > Jobs > Cancelled Job > Events List

The following is an example of a job event log in the web jobs tab where the billing limit was reached:
What should I do if my job is cancelled due to billing limits?
You can add a credit card to your account to increase your spending limit. If you already have a credit card on your account, you can make a payment or adjust your spending limit. Contact support if you need assistance with your account balance.
If your job fails after downloading the training file, but before training starts, the most likely source of the error is the training data. For example, your event log might look like
You can verify the formatting of your input file with the Together CLI tool using the following command:

    $ together files check ~/Downloads/unified_joke_explanations.jsonl
    {
        "is_check_passed": true,
        "model_special_tokens": "we are not yet checking end of sentence tokens for this model",
        "file_present": "File found",
        "file_size": "File size 0.0 GB",
        "num_samples": 356
    }

Despite our best efforts, the file checker does not catch all errors. Please contact support if your training data file passes the checks, but you are still seeing the above error conditions.

If you see an error during other steps in your training job, this may be due to internal errors in our training stack (e.g. hardware failure or bugs). We actively monitor job failures, and work as quickly as we can to resolve these issues. Once the issue has been resolved by our engineers, your job will be automatically or manually restarted. Charges for the restarted job will be refunded.
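If you run the file check from a script, the JSON report shown above can be inspected programmatically. The sketch below parses the sample output from this FAQ with Python's standard json module. It assumes the CLI prints exactly this JSON object, and the helper name is hypothetical. Capturing the CLI output is left out.

```python
import json

# Sample report copied from the `together files check` example in this FAQ.
check_output = """
{
    "is_check_passed": true,
    "model_special_tokens": "we are not yet checking end of sentence tokens for this model",
    "file_present": "File found",
    "file_size": "File size 0.0 GB",
    "num_samples": 356
}
"""


def file_check_passed(raw_report: str) -> bool:
    """Return True only if the report explicitly says the check passed."""
    report = json.loads(raw_report)
    return report.get("is_check_passed") is True


report = json.loads(check_output)
print(file_check_passed(check_output))  # True
print(report["num_samples"])  # 356
```

A passing check is necessary but not sufficient, since the file checker does not catch all errors.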
A job will be automatically or manually restarted if the job fails to complete due to an internal error. You can view the event log to see if the job was restarted, to determine the new fine tune ID of the restarted job, and check the refund amount (if applicable). Any charges from the failed job will be refunded when your job is restarted. An example event log for a restarted job is:
If you would like to download the weights of your model so that you can use your fine-tuned model outside of our platform, run:

    together fine-tuning download <FT-ID>

This command downloads the ZSTD-compressed weights of the model. To extract the weights, run tar -xf filename.

Other arguments:
--output, -o (filename, optional): Specify the output filename. Default: <MODEL-NAME>.tar.zst
--step, -s (integer, optional): Download a specific checkpoint's weights. Defaults to downloading the latest weights. Default: -1
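For scripted downloads, the command and its optional arguments can be assembled in Python before being run. The sketch below only builds and prints the command line. The helper name, the fine-tune ID, and the output filename are hypothetical placeholders; the subcommand and the --output and --step flags are the ones listed above.

```python
import shlex
from typing import List, Optional


def build_download_command(
    ft_id: str,
    output: Optional[str] = None,
    step: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `together fine-tuning download`."""
    cmd = ["together", "fine-tuning", "download", ft_id]
    if output is not None:
        cmd += ["--output", output]  # custom output filename
    if step is not None:
        cmd += ["--step", str(step)]  # specific checkpoint instead of the latest
    return cmd


# Hypothetical fine-tune ID and filename, for illustration only.
cmd = build_download_command("ft-hypothetical-id", output="my-model.tar.zst", step=2)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where the Together CLI is installed and configured.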