Together GPU Clusters provides multiple storage options. It is critical to understand which storage is persistent and which is ephemeral so you can architect your workloads to avoid data loss.
Local NVMe disks and node-local storage are ephemeral. Data on these drives can be lost at any time during node migrations/recreations, maintenance, or other cluster operations. Always use shared volumes (PVC-backed storage) for any data you need to keep.

Storage Types at a Glance

Use this to decide where to store your data:
  • Shared volumes (PVC) — Persistent. Survives pod restarts, node reboots/migrations/recreations, cluster operations, and even cluster deletion. Use this for training data, checkpoints, model weights, and anything you cannot lose.
  • Local NVMe disks — Ephemeral. Fast local storage on each node. Data can be lost during node migrations/recreations or cluster operations. Use only for temporary scratch data (e.g., intermediate computation files).
  • /home directory — Persistence depends on cluster type (see below).

Persistent Storage: Shared Volumes

Shared volumes are remote-attached, high-speed filesystems. They are created during cluster setup (or attached from an existing volume) and are accessible from all nodes. Data on a shared volume persists across:
  • Pod restarts and rescheduling
  • Node reboots, migrations, recreations, and maintenance
  • Cluster scaling operations
  • Cluster deletion (volumes persist independently; reserved volumes move to on-demand pricing and can be reattached to other clusters)
How to use shared volumes:
  • Kubernetes clusters — We provide a static PersistentVolume (PV) with the same name as your shared volume. Create a PersistentVolumeClaim (PVC) referencing it, then mount it in your pods. Step-by-step setup →
  • Slurm clusters — The shared volume is mounted at /home and accessible from all compute and login nodes.
Best practice: Always store training data, checkpoints, model weights, logs, and application state on shared volumes. This ensures your data survives any cluster event.
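On Kubernetes clusters, the claim-and-mount flow might look like the following sketch. The volume name my-shared-volume, the storage size, the image, and the mount path are placeholders; substitute your actual shared volume name.

```yaml
# Sketch: claim the pre-provisioned static PV that matches your shared
# volume, then mount the claim in a pod. Names and sizes are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  volumeName: my-shared-volume   # must match the static PV's name
  storageClassName: ""           # often required for static binding; check your cluster
  accessModes:
    - ReadWriteMany              # shared volumes are accessible from all nodes
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: trainer
      image: your-training-image:latest   # placeholder image
      volumeMounts:
        - name: shared
          mountPath: /data                # checkpoints, datasets, weights
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-data
```

Anything the pod writes under /data then survives pod restarts, node events, and cluster operations.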

Ephemeral Storage: Local NVMe Disks

Each node has local NVMe drives that provide high-speed read/write performance.
Data on local NVMe disks is not durable. It can be lost without warning during:
  • Node migrations/recreations (scheduled or unscheduled)
  • Cluster maintenance operations
  • Hardware failures
  • Pod rescheduling to a different node
Do not rely on local NVMe for any data you need to keep. Use it only for temporary scratch files that can be regenerated.
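In Kubernetes, one common way to use node-local scratch space is an emptyDir volume, which is deleted whenever the pod leaves the node. This is a sketch only; whether an emptyDir actually lands on local NVMe depends on how the node's ephemeral storage is configured.

```yaml
# Sketch: ephemeral scratch space via emptyDir. Contents are lost when the
# pod is evicted or rescheduled. Image and size are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-job
spec:
  containers:
    - name: worker
      image: your-image:latest
      volumeMounts:
        - name: scratch
          mountPath: /scratch    # regenerable intermediate files only
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 50Gi
```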

/home Directory

The behavior of /home differs between cluster types:

Slurm Clusters

On Slurm clusters, /home is a persistent NFS-backed file system shared across all nodes (compute and login). It is mounted from the head node and is suitable for:
  • Code and scripts
  • Configuration files
  • Logs
  • Small datasets
  • Model weights and training data
We recommend logging into the Slurm head node first to set up your user folder with the correct permissions.
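As a sketch of that first-login setup (the HOME_ROOT override is hypothetical so the snippet can be tried anywhere; on the Slurm head node you would use /home directly):

```shell
# Sketch: create your user folder on the shared /home with owner-only access.
# HOME_ROOT defaults to a temp path here for safe experimentation; on the
# head node it is simply /home.
HOME_ROOT="${HOME_ROOT:-${TMPDIR:-/tmp}/home-demo}"
USER_NAME="${USER_NAME:-$(whoami)}"
mkdir -p "$HOME_ROOT/$USER_NAME"
chmod 700 "$HOME_ROOT/$USER_NAME"   # read/write/execute for the owner only
ls -ld "$HOME_ROOT/$USER_NAME"
```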

Kubernetes Clusters

On Kubernetes clusters, /home is local to each node and ephemeral. It is not shared across nodes and is subject to the same data loss risks as local NVMe storage.
On Kubernetes clusters, do not store important data in /home. Use a shared volume (PVC) instead.

Which Storage Should I Use?

  • Training data, datasets → Shared volume (PVC), or /home on Slurm clusters
  • Checkpoints, model weights → Shared volume (PVC), or /home on Slurm clusters
  • Application state, databases → Shared volume (PVC), or /home on Slurm clusters
  • Code, configs → Shared volume (PVC), or /home on Slurm clusters
  • Temporary scratch files → Local NVMe (acceptable to lose)
  • Intermediate computation artifacts → Local NVMe (acceptable to lose)

Upload Your Data

For small datasets:
  1. Create a PVC that references the shared volume name as its volumeName, and a pod that mounts the claim
  2. Run kubectl cp LOCAL_FILENAME YOUR_POD_NAME:/data/
For large datasets: Schedule a pod on the cluster that downloads directly from S3 or your data source. See example →
Learn more about GPU Clusters →
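The small-dataset steps above might be sketched as the following data-loader pod (the PVC name shared-data and the mount path /data are placeholders); once the pod is Running, kubectl cp can target it.

```yaml
# Sketch: a minimal pod that mounts the shared-volume PVC at /data so
# "kubectl cp" has a destination. Names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: data-loader
spec:
  containers:
    - name: loader
      image: busybox:1.36
      command: ["sleep", "86400"]   # keep the pod alive while copying
      volumeMounts:
        - name: shared
          mountPath: /data
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-data      # PVC bound to your shared volume
```

Then run kubectl cp LOCAL_FILENAME data-loader:/data/. For large datasets, run a similar pod whose command downloads straight from S3 (for example, with the aws CLI) instead of sleeping.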