Deploy models on your own custom endpoints for improved reliability at scale
The command blocks until the endpoint status is READY. To let it run asynchronously, remove the --wait flag.
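A complete create invocation is not preserved in this excerpt; a minimal sketch, assuming a `together endpoints create` subcommand and a placeholder model name (both assumptions, not confirmed here), might look like:

```shell
# Sketch only: the subcommand and <model-name> placeholder are assumptions.
together endpoints create \
  --model <model-name> \
  --gpu h100 --gpu-count 2 \
  --wait    # blocks until READY; omit to create asynchronously
```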
To view the hardware options for a specific model, run:
- --gpu a100 --gpu-count 2
- --gpu a100 --gpu-count 4
- --gpu a100 --gpu-count 8
- --gpu h100 --gpu-count 2
- --gpu h100 --gpu-count 4
- --gpu h100 --gpu-count 8
If you did not pass the --wait flag on creation, or you previously stopped the endpoint, you can start it again by running:
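The start command itself is missing from this excerpt; a plausible invocation, assuming an `endpoints start` subcommand and a placeholder endpoint ID (both assumptions), might look like:

```shell
# Assumed subcommand and placeholder ID; verify against the CLI's --help output.
together endpoints start <endpoint-id>
```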
You can configure autoscaling with the --min-replicas and --max-replicas options. Both default to 1. When the max replica count is increased, the endpoint automatically scales based on server load.
2x_nvidia_h100_80gb_sxm
When configuring the hardware via the CLI, you can specify which hardware you would like with the --gpu (the hardware type), --gpu-count, and --gpu-type options. Increasing --gpu-count increases the number of GPUs per replica, which results in higher generation speed, lower time-to-first-token, and higher maximum QPS.
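For example, doubling the GPUs per replica (flags from this excerpt; the subcommand and placeholder are assumptions):

```shell
# 4 GPUs per replica instead of 2: faster generation, lower time-to-first-token,
# higher max QPS. Subcommand and <model-name> are assumed.
together endpoints create \
  --model <model-name> \
  --gpu h100 --gpu-count 4
```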
Speculative decoding can be disabled by passing the --no-speculative-decoding flag.
To disable the prompt cache, add --no-prompt-cache to the create command.
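Putting the two disable flags together (flag names from this excerpt; the subcommand and placeholder are assumptions):

```shell
# Disable both speculative decoding and the prompt cache at creation time.
# Subcommand and <model-name> are assumed, not confirmed by this excerpt.
together endpoints create \
  --model <model-name> \
  --gpu h100 --gpu-count 2 \
  --no-speculative-decoding \
  --no-prompt-cache
```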