Node repair restores GPU nodes that health checks have flagged as unhealthy. You can repair nodes through two paths: auto repair, where the system generates a recommendation based on detected issues, and manual repair, where you trigger a repair action directly from the UI.

Auto node repair

When passive or active health checks detect a node-level issue, the system generates a repair recommendation and surfaces it for your review. This is a human-in-the-loop process: Together handles detection and recommends a remediation, but you decide when to proceed.

How auto repair works

  1. Health checks detect an issue on a node and create an alert with supporting evidence.
  2. The system evaluates the alert and generates a repair recommendation with a suggested mode (for example, migrate to new host).
  3. The recommendation appears in the Repairs tab of your cluster.
  4. You review the recommendation and accept or dismiss it.
  5. Once accepted, Together executes the repair: cordon, graceful drain, remediation action, and node rejoin.
Auto repair accounts for in-flight work. Training jobs need to checkpoint before a node drains, and inference workloads need their replicas rebalanced. Review recommendations before accepting to confirm your workloads are ready for the disruption.
When the system generates a repair recommendation, it selects the appropriate action based on the detected issue:
Detected issue                                  | Recommended action
GPU fell off the bus                            | Migrate to new host
GPU thermal throttling (SmClockThermalThrottle) | Migrate to new host
Xid error                                       | Reboot
Drained Slurm node                              | Migrate to new host
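The mapping in the table above can be sketched as a small lookup function, which may be handy in runbooks or alert-routing scripts. This is only an illustration: the issue keys (`gpu_off_bus`, `thermal_throttle`, `xid_error`, `slurm_drained`) are hypothetical labels chosen for this sketch, not identifiers used by Together.

```shell
#!/bin/sh
# Sketch: map a detected issue to the recommended repair action.
# The issue keys below are hypothetical labels, not official identifiers.
recommended_action() {
  case "$1" in
    gpu_off_bus)      echo "Migrate to new host" ;;
    thermal_throttle) echo "Migrate to new host" ;;
    xid_error)        echo "Reboot" ;;
    slurm_drained)    echo "Migrate to new host" ;;
    *)
      echo "unknown issue: $1" >&2
      return 1
      ;;
  esac
}

recommended_action xid_error   # prints: Reboot
```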

Behavioral details

  • Auto-resolution mid-approval: Recommendations can disappear if the underlying alert clears before you accept (5-minute default CompactTTL).
  • Cooldown window: After a repair completes (succeeded, failed, or cancelled), no new recommendation is generated for ~30 minutes on the same node.
  • Mode escalation: A pending recommendation can change its suggested mode in-place if a higher-severity failure is detected while it’s waiting in the queue.
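If you script around repairs, the cooldown window can be approximated with a timestamp comparison. The sketch below assumes you record the repair completion time yourself as a Unix epoch; the 30-minute value mirrors the approximate figure above and is not an exact guarantee.

```shell
#!/bin/sh
# Sketch: decide whether a node is probably still inside the post-repair
# cooldown window. 30 minutes is the approximate figure from the docs.
COOLDOWN_SECONDS=$((30 * 60))

# in_cooldown FINISHED_EPOCH NOW_EPOCH
# Succeeds (exit 0) if fewer than COOLDOWN_SECONDS have elapsed.
in_cooldown() {
  finished_epoch=$1
  now_epoch=$2
  [ $((now_epoch - finished_epoch)) -lt "$COOLDOWN_SECONDS" ]
}

# Example: a repair that finished 10 minutes ago.
now=$(date +%s)
if in_cooldown $((now - 600)) "$now"; then
  echo "Still in cooldown: a new recommendation for this node is unlikely yet."
fi
```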

The Repairs tab

To view repair recommendations and history:
  1. Navigate to your cluster in the Together Cloud UI.
  2. Select the Repairs tab.
The Repairs table shows all repair events with the following columns:
  • Node: The affected node name.
  • State: The current status of the repair. Values include Auto Resolved (issue resolved before action was taken), Succeeded (repair completed), and in-progress states.
  • Mode: The remediation action (for example, Migrate to new host).
  • Trigger: How the repair was initiated. Automated (generated by health checks) or Manual (triggered by a user).
  • Created: When the repair recommendation was generated.

Repair details

Select any row in the Repairs table to view the full repair details:
  • Node: The affected node name.
  • State: The current repair state (for example, Succeeded).
  • Mode: The remediation action taken.
  • Created / Started: When the recommendation was generated and when the repair execution began.
  • Requested by: The source that initiated the repair. For auto repairs, this shows Together Health Checker.
  • Reviewed by: Who approved the repair (your user name or Auto-Approved for auto-approved repairs).
  • Review time: When the repair was approved.
  • Review comment: Any notes from the approval (for example, “auto-approved: approved”).
  • Repair ID: Unique identifier for tracking and support requests.
  • Alert evidence: Expandable section showing the underlying alerts that triggered the recommendation, including failure type and affected hardware.

Manual node repair

When you encounter node problems or want to trigger a repair without waiting for an automated recommendation, you can start a repair directly from the Worker Nodes UI.

How to trigger manual repair

  1. Navigate to your cluster in the Together Cloud UI.
  2. Go to the Worker Nodes section.
  3. Find the problematic node.
  4. Select the three-dot menu (⋯) in the State column.
  5. Select Repair from the dropdown.
  6. A repair dialog appears showing:
    • Node details (name, GPU configuration).
    • Issue detected (if applicable).
    • Impact warning.
  7. Choose one of the repair actions:
    • Reboot: For transient software issues (preserves local data).
    • Quick reprovision: For persistent software issues.
    • Migrate to new host: For hardware issues.
    • Remove: Permanently removes the node for RMA (return merchandise authorization).
    • Report an issue (optional): To notify support.
The repair process begins immediately. For every action except Remove, the node rejoins your cluster once the repair completes.

Available repair actions

Reboot: Restarts the VM in place on the same physical host.
  • When to use: Transient software issues (GPU driver hangs, stuck processes, kernel-level errors) where a restart is likely to clear the problem.
  • What happens: The node follows the Cordon → Drain → Reboot → Rejoin lifecycle. The VM restarts on the same physical hardware without reimaging. Local scratch and temporary data on /scratch and /tmp is preserved.
Reboot is the lightest repair action. Because the VM is not reimaged, it is faster than a reprovision and preserves local data. Try a reboot first for transient issues before escalating to a reprovision.
Quick reprovision: Recreates the GPU node VM on a randomly selected physical host, which may be the same host as before.
  • When to use: Persistent software-level issues (driver crashes, library corruption), VM configuration problems, or application-level issues that a reboot did not resolve.
  • What happens: The node follows the Cordon → Drain → Reprovision lifecycle. The VM is recreated with a fresh software stack and rejoins the cluster automatically.
You lose all local VM data during reprovision. Store data on PersistentVolumes or back it up before proceeding. No new jobs are scheduled on this node until remediation completes.
Migrate to new host: Provisions a new VM on a different underlying physical host.
  • When to use: Hardware-level issues (GPU failures, PCIe problems), issues that persist after a quick reprovision, or physical component failures.
  • What happens: The node follows the Cordon → Drain → Migrate lifecycle. A new VM is created on different physical hardware with different GPUs assigned, and rejoins the cluster automatically.
You lose all local VM data during migration. Store data on PersistentVolumes or back it up before proceeding. No new jobs are scheduled on this node until remediation completes.
Remove: Permanently removes the node from the cluster. The cluster node count drops below the desired count.
  • When to use: Faulty GPU hardware that needs to be returned to the provider for RMA. Use this when the node has a confirmed hardware defect that cannot be resolved by migration.
  • What happens: The node follows the Cordon → Drain lifecycle, then is permanently removed from the cluster. The node is not replaced automatically.
Removing a node is irreversible from the cluster’s perspective. The node is taken out of service entirely and your cluster runs with fewer nodes until a replacement is provisioned. Only use this for confirmed hardware failures that require physical RMA.
Report an issue: Use this option if any of the following apply:
  • You are unsure which repair action to use.
  • You want Together support to investigate before taking action.
  • The issue requires additional context or diagnosis.

Repair lifecycle

Both auto and manual repairs follow the same lifecycle:
Cordon → Drain → Reboot/Reprovision/Migrate/Remove → Rejoin (or permanent removal)
Cordon: The node is marked as unschedulable. No new workloads are placed on the node, but existing workloads continue running.
Drain: Running workloads are gracefully terminated and pods are evicted from the node.
Reboot/Reprovision/Migrate/Remove:
  • Reboot: The VM restarts in place on the same hardware. Local /scratch and /tmp data is preserved.
  • Quick reprovision: The VM is recreated on a randomly selected physical host, which may be the same as the original host. Local data is lost.
  • Migrate to new host: A new VM is created on different physical hardware. Local data is lost.
  • Remove: The node is permanently removed from the cluster for RMA. No rejoin occurs.
Rejoin: The node automatically rejoins the cluster, becomes schedulable, and is ready to accept new workloads.
You can monitor repair progress in the Repairs tab (for auto repairs) or the Worker Nodes section (for manual repairs). The node progresses through these states: Cordoning → Draining → Repairing/Migrating → Joining → Running.
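The state progression above can be modeled as a simple transition function, for example to sanity-check the order of states you observe while watching a repair. This is a sketch only: the state names are taken from the text, but how you obtain the current state (UI or otherwise) is outside its scope, and a Remove repair never reaches Running.

```shell
#!/bin/sh
# Sketch: expected next state for a node under repair.
# State names follow the documented progression; "Repairing/Migrating" is
# treated as a single step, and either spelling is accepted as input.
next_state() {
  case "$1" in
    Cordoning)                               echo "Draining" ;;
    Draining)                                echo "Repairing/Migrating" ;;
    Repairing|Migrating|Repairing/Migrating) echo "Joining" ;;
    Joining)                                 echo "Running" ;;
    Running)                                 echo "Running" ;;
    *)
      echo "unknown state: $1" >&2
      return 1
      ;;
  esac
}

# Walk the full progression from Cordoning to Running.
state=Cordoning
while [ "$state" != "Running" ]; do
  printf '%s -> ' "$state"
  state=$(next_state "$state")
done
echo "$state"
```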

Choosing a repair action

Use this table to determine which repair action fits your issue. Start with the lightest action (reboot) and escalate if the issue persists.
Issue type                         | Reboot      | Reprovision       | Migrate to new host
GPU driver hang                    | ✓ Try first | ✓ If reboot fails | -
Stuck GPU processes                | ✓ Try first | ✓ If reboot fails | -
GPU watchdog timeouts              | ✓ Try first | ✓ If reboot fails | -
Stuck GPU contexts                 | ✓ Try first | ✓ If reboot fails | -
Recoverable Xid errors             | ✓ Try first | ✓ If reboot fails | -
Application memory leaks           | ✓ Try first | ✓ If reboot fails | -
Software-based throttling          | ✓ Try first | ✓ If reboot fails | -
Driver crashes/corruption          | -           | ✓ Yes             | -
CUDA/ROCm library issues           | -           | ✓ Yes             | -
Incorrect GPU mode settings        | -           | ✓ Yes             | -
GPU not attached to VM             | -           | ✓ Yes             | -
Device permissions/cgroup issues   | -           | ✓ Yes             | -
NUMA affinity problems             | -           | ✓ Yes             | -
Single-bit ECC errors (occasional) | -           | ✓ Yes             | -
Complete GPU card failure          | -           | -                 | ✓ Yes
Persistent multi-bit ECC errors    | -           | -                 | ✓ Yes
GPU falling off PCIe bus           | -           | -                 | ✓ Yes
Fan failures                       | -           | -                 | ✓ Yes
PCIe lane degradation              | -           | -                 | ✓ Yes
Power delivery (VRM) issues        | -           | -                 | ✓ Yes
Thermal/cooling problems           | -           | -                 | ✓ Yes
Persistent Xid errors              | -           | -                 | ✓ Yes
Physical connector damage          | -           | -                 | ✓ Yes
Backplane/riser issues             | -           | -                 | ✓ Yes
Escalation path: reboot → reprovision → migrate to new host. A quick reprovision can land on the same physical host, so if the issue persists on a freshly provisioned VM, treat it as a likely hardware problem and migrate to a new host.
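The escalation path can be written down as a small helper that, given the action you last tried, suggests the next one. The action keys (`reboot`, `reprovision`, `migrate`) and the `contact-support` result are naming choices made for this sketch; the final step reflects the guidance to contact support when issues persist after all repair actions.

```shell
#!/bin/sh
# Sketch: suggest the next repair action when the previous one did not help.
# Keys are hypothetical labels for the actions described in this guide.
next_action() {
  case "$1" in
    reboot)      echo "reprovision" ;;
    reprovision) echo "migrate" ;;
    migrate)     echo "contact-support" ;;
    *)
      echo "unknown action: $1" >&2
      return 1
      ;;
  esac
}

next_action reboot   # prints: reprovision
```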

Best practices

Before triggering a repair:
  • Store important data on PersistentVolumes, not local storage.
  • Optionally drain workloads manually for more control over migration.
  • Document symptoms for troubleshooting if the repair does not resolve the problem.
  • Check running jobs so you know what will be interrupted.
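For clusters where you have `kubectl` access (an assumption; this guide does not state how your cluster is exposed), the pre-repair checks above might look like the commands below. The helper only prints the commands so you can review them before running anything, and `gpu-node-0` is a hypothetical node name.

```shell
#!/bin/sh
# Sketch: print (do not run) kubectl commands for inspecting and manually
# draining a node before a repair. Assumes a Kubernetes cluster and kubectl
# access; review each command before executing it yourself.
pre_repair_commands() {
  node=$1
  if [ -z "$node" ]; then
    echo "usage: pre_repair_commands NODE_NAME" >&2
    return 1
  fi
  # List every pod currently scheduled on the node.
  echo "kubectl get pods --all-namespaces --field-selector spec.nodeName=$node -o wide"
  # Stop new pods from being scheduled on the node.
  echo "kubectl cordon $node"
  # Evict pods; --delete-emptydir-data discards emptyDir volume contents.
  echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
}

pre_repair_commands gpu-node-0
```

Draining manually is optional: the repair itself cordons and drains the node, so this is only useful when you want to control the timing yourself.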
Choosing the right action:
  • Start with reboot: It is the fastest option, preserves local data, and resolves most transient software issues.
  • Escalate to quick reprovision: When a reboot did not fix the issue, or the problem is a corrupted driver, library, or VM configuration that requires a fresh software stack.
  • Use migrate to new host: When reprovision did not fix the issue, you see hardware error indicators (persistent ECC or Xid errors, thermal warnings), or GPU diagnostics show hardware problems.
After a repair:
  • Verify the node shows as Running in the cluster.
  • Run a GPU workload to confirm operation.
  • Monitor for recurrence of the same issue.
  • Check GPU metrics to confirm normal operation.

Common diagnostic commands

Before triggering a repair, you can SSH into the node to diagnose issues:
# Check GPU status
nvidia-smi

# Check for Xid errors in system logs
sudo dmesg | grep -i xid

# Check GPU memory errors
nvidia-smi -q | grep -i ecc

# Check GPU temperature and throttling
nvidia-smi -q | grep -E 'Temperature|Throttle'

# Check PCIe link status
nvidia-smi -q | grep -E 'Link Width|Link Speed'

# Check running processes on GPU
nvidia-smi pmon

# Detailed GPU query
nvidia-smi -q
Learn how to SSH into nodes →
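To summarize Xid errors rather than read raw log lines, you can extract the PCI address and Xid code from each kernel message. The sketch below assumes the usual NVIDIA driver log shape, `NVRM: Xid (PCI:<address>): <code>, ...`; the sample line in the comment is illustrative and not taken from a real node.

```shell
#!/bin/sh
# Sketch: read kernel log lines on stdin and print "PCI_ADDRESS XID_CODE"
# for every NVIDIA Xid message. Lines that do not match are ignored.
# Illustrative input line:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
extract_xids() {
  sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Typical use on a node (run manually):
#   sudo dmesg | extract_xids | sort | uniq -c
```

Counting repeated address and code pairs helps distinguish a one-off error from a recurring one on the same GPU, which feeds into the reboot-versus-migrate decision.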

When to contact support

Contact Together support if:
  • Issues persist after all repair actions.
  • You see repeated failures on multiple nodes.
  • You need help diagnosing whether an issue is software or hardware.
  • Repair actions fail to complete.
  • You are unsure which repair action to use.
  • The node does not rejoin after repair completes.
Alternatively, use the Report an issue button in the repair dialog to notify support directly.

Next steps

Health checks

Monitor node health with active diagnostic tests and continuous passive monitoring.

Cluster management

Manage, monitor, and scale your GPU clusters.