Auto node repair
When passive or active health checks detect a node-level issue, the system generates a repair recommendation and surfaces it for your review. This is a human-in-the-loop process: Together handles detection and recommends a remediation, but you decide when to proceed.
How auto repair works
- Health checks detect an issue on a node and create an alert with supporting evidence.
- The system evaluates the alert and generates a repair recommendation with a suggested mode (for example, migrate to new host).
- The recommendation appears in the Repairs tab of your cluster.
- You review the recommendation and accept or dismiss it.
- Once accepted, Together executes the repair: cordon, graceful drain, remediation action, and node rejoin.
Auto repair accounts for in-flight work. Training jobs need to checkpoint before a node drains, and inference workloads need their replicas rebalanced. Review recommendations before accepting to confirm your workloads are ready for the disruption.
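One way to make a training job ready for a drain is to checkpoint when the process is asked to stop. The sketch below assumes the drain delivers SIGTERM to the job, which is common for Kubernetes pod eviction and Slurm job cancellation but is not confirmed by this page for Together clusters. `DrainGuard`, `train`, `save_checkpoint`, and `run_step` are hypothetical names used only for illustration.

```python
import signal


class DrainGuard:
    """Records a stop request so the training loop can checkpoint and exit."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # Assumption: the drain signals the job with SIGTERM.
        # Call this from the main thread before training starts.
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only set a flag here; the loop does the actual checkpointing.
        self.stop_requested = True


def train(total_steps, guard, save_checkpoint, run_step):
    """Run steps until finished or a stop is requested.

    Returns the number of completed steps. `save_checkpoint` and
    `run_step` are placeholders for your own training code.
    """
    step = 0
    while step < total_steps and not guard.stop_requested:
        run_step(step)
        step += 1
    # Persist progress before exiting, whether finished or interrupted.
    save_checkpoint(step)
    return step
```

Setting a flag in the handler and checkpointing at a step boundary avoids writing a checkpoint in the middle of an optimizer update.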
Recommended repair actions
When the system generates a repair recommendation, it selects the appropriate action based on the detected issue:

| Detected issue | Recommended action |
|---|---|
| GPU fell off the bus | Migrate to new host |
| GPU thermal throttling (SmClockThermalThrottle) | Migrate to new host |
| Xid error | Reboot |
| Drained Slurm node | Migrate to new host |
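If you script around recommendations (for example, to annotate alerts in your own tooling), the table above can be expressed as a lookup. This is a minimal sketch: the issue keys are hypothetical identifiers invented here, not names used by Together's API.

```python
from enum import Enum


class RepairMode(Enum):
    REBOOT = "Reboot"
    MIGRATE = "Migrate to new host"


# Keys are hypothetical identifiers for the issues in the table above.
RECOMMENDED_ACTION = {
    "gpu_fell_off_bus": RepairMode.MIGRATE,
    "gpu_thermal_throttling": RepairMode.MIGRATE,  # SmClockThermalThrottle
    "xid_error": RepairMode.REBOOT,
    "drained_slurm_node": RepairMode.MIGRATE,
}


def recommend(issue):
    """Return the recommended mode, or None for issues the table does not cover."""
    return RECOMMENDED_ACTION.get(issue)
```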
Behavioral details
- Auto-resolution mid-approval: Recommendations can disappear if the underlying alert clears before you accept (5-minute default CompactTTL).
- Cooldown window: After a repair completes (succeeded, failed, or cancelled), no new recommendation is generated for ~30 minutes on the same node.
- Mode escalation: A pending recommendation can change its suggested mode in place if a higher-severity failure is detected while it is waiting in the queue.
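The cooldown and escalation behaviors can be modeled as follows. This is an illustrative sketch, not Together's implementation: the exact 30-minute value stands in for the "~30 minutes" stated above, and the severity order is assumed from the escalation path described later on this page.

```python
from datetime import datetime, timedelta

# "~30 minutes" in the text; the exact value is an assumption.
COOLDOWN = timedelta(minutes=30)

# Assumed severity order, lightest first.
SEVERITY = ["Reboot", "Quick reprovision", "Migrate to new host"]


def in_cooldown(last_repair_finished, now):
    """True while a node is inside the post-repair cooldown window.

    The window starts when the last repair finished, regardless of
    whether it succeeded, failed, or was cancelled.
    """
    if last_repair_finished is None:
        return False
    return now - last_repair_finished < COOLDOWN


def escalate(pending_mode, new_mode):
    """Keep whichever mode is more severe for a pending recommendation."""
    return max(pending_mode, new_mode, key=SEVERITY.index)
```

A monitoring script could use `in_cooldown` to explain why a node with a fresh alert has no new recommendation yet.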
The Repairs tab
To view repair recommendations and history:
- Navigate to your cluster in the Together Cloud UI.
- Select the Repairs tab.
Each row in the Repairs table shows the following fields:
- Node: The affected node name.
- State: The current status of the repair. Values include Auto Resolved (issue resolved before action was taken), Succeeded (repair completed), and in-progress states.
- Mode: The remediation action (for example, Migrate to new host).
- Trigger: How the repair was initiated. Automated (generated by health checks) or Manual (triggered by a user).
- Created: When the repair recommendation was generated.
Repair details
Select any row in the Repairs table to view the full repair details:
- Node: The affected node name.
- State: The current repair state (for example, Succeeded).
- Mode: The remediation action taken.
- Created / Started: When the recommendation was generated and when the repair execution began.
- Requested by: The source that initiated the repair. For auto repairs, this shows Together Health Checker.
- Reviewed by: Who approved the repair (your user name or Auto-Approved for auto-approved repairs).
- Review time: When the repair was approved.
- Review comment: Any notes from the approval (for example, “auto-approved: approved”).
- Repair ID: Unique identifier for tracking and support requests.
- Alert evidence: Expandable section showing the underlying alerts that triggered the recommendation, including failure type and affected hardware.
Manual node repair
When you encounter node problems or want to trigger a repair without waiting for an automated recommendation, you can start a repair directly from the Worker Nodes UI.
How to trigger manual repair
- Navigate to your cluster in the Together Cloud UI.
- Go to the Worker Nodes section.
- Find the problematic node.
- Select the ⋮ (three dots) menu in the State column.
- Select Repair from the dropdown.
- A repair dialog appears showing:
  - Node details (name, GPU configuration).
  - Issue detected (if applicable).
  - Impact warning.
- Choose one of the repair actions:
  - Reboot: For transient software issues (preserves local data).
  - Quick reprovision: For persistent software issues.
  - Migrate to new host: For hardware issues.
  - Remove: Permanently removes the node for RMA (return merchandise authorization).
  - Report an issue (optional): To notify support.
Available repair actions
Reboot
Reboots the VM in place on the same physical host.
- When to use: Transient software issues (GPU driver hangs, stuck processes, kernel-level errors) where a restart is likely to clear the problem.
- What happens: The node follows the Cordon → Drain → Reboot → Rejoin lifecycle. The VM restarts on the same physical hardware without reimaging. Local scratch and temporary data in /scratch and /tmp is preserved.
Reboot is the lightest repair action. Because the VM is not reimaged, it is faster than a reprovision and preserves local data. Try a reboot first for transient issues before escalating to a reprovision.
Quick reprovision
- When to use: Persistent software-level issues (driver crashes, library corruption), VM configuration problems, or application-level issues that a reboot did not resolve.
- What happens: The node follows the Cordon → Drain → Reprovision lifecycle. The VM is recreated with a fresh software stack and rejoins the cluster automatically.
Migrate to new host
- When to use: Hardware-level issues (GPU failures, PCIe problems), issues that persist after a quick reprovision, or physical component failures.
- What happens: The node follows the Cordon → Drain → Migrate lifecycle. A new VM is created on different physical hardware with different GPUs assigned, and rejoins the cluster automatically.
Remove
- When to use: Faulty GPU hardware that needs to be returned to the provider for RMA. Use this when the node has a confirmed hardware defect that cannot be resolved by migration.
- What happens: The node follows the Cordon → Drain lifecycle, then is permanently removed from the cluster. The node is not replaced automatically.
Report an issue
Use this option when:
- You are unsure which repair action to use.
- You want Together support to investigate before taking action.
- The issue requires additional context or diagnosis.
Repair lifecycle
Both auto and manual repairs follow the same lifecycle: cordon, graceful drain, remediation action, and node rejoin. The remediation step depends on the repair action:
- Reboot: The VM restarts in place on the same hardware. Local /scratch and /tmp data is preserved.
- Quick reprovision: The VM is recreated on a physical host chosen at random (which may be the original host). Local data is lost.
- Migrate to new host: A new VM is created on different physical hardware. Local data is lost.
- Remove: The node is permanently removed from the cluster for RMA. No rejoin occurs.
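The per-action differences above can be captured in a small table that a pre-repair script could consult, for example to decide whether /scratch needs to be copied off the node first. This is a sketch with hypothetical names (`RepairAction`, `ACTIONS`, `needs_backup`); it is not part of any Together SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairAction:
    name: str
    steps: tuple                 # lifecycle stages, in order
    preserves_local_data: bool   # whether /scratch and /tmp survive
    rejoins: bool                # whether the node returns to the cluster


ACTIONS = {
    "reboot": RepairAction(
        "Reboot", ("cordon", "drain", "reboot", "rejoin"), True, True),
    "quick_reprovision": RepairAction(
        "Quick reprovision", ("cordon", "drain", "reprovision", "rejoin"), False, True),
    "migrate": RepairAction(
        "Migrate to new host", ("cordon", "drain", "migrate", "rejoin"), False, True),
    # Assumption: local data is treated as lost because the node never returns.
    "remove": RepairAction(
        "Remove", ("cordon", "drain", "remove"), False, False),
}


def needs_backup(action_key):
    """True when local /scratch and /tmp data will not survive the action."""
    return not ACTIONS[action_key].preserves_local_data
```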
Choosing a repair action
Use this table to determine which repair action fits your issue. Start with the lightest action (reboot) and escalate if the issue persists.

| Issue type | Reboot | Reprovision | Migrate to new host |
|---|---|---|---|
| GPU driver hang | ✓ Try first | ✓ If reboot fails | |
| Stuck GPU processes | ✓ Try first | ✓ If reboot fails | |
| GPU watchdog timeouts | ✓ Try first | ✓ If reboot fails | |
| Stuck GPU contexts | ✓ Try first | ✓ If reboot fails | |
| Recoverable Xid errors | ✓ Try first | ✓ If reboot fails | |
| Application memory leaks | ✓ Try first | ✓ If reboot fails | |
| Software-based throttling | ✓ Try first | ✓ If reboot fails | |
| Driver crashes/corruption | | ✓ Yes | |
| CUDA/ROCm library issues | | ✓ Yes | |
| Incorrect GPU mode settings | | ✓ Yes | |
| GPU not attached to VM | | ✓ Yes | |
| Device permissions/cgroup issues | | ✓ Yes | |
| NUMA affinity problems | | ✓ Yes | |
| Single-bit ECC errors (occasional) | | ✓ Yes | |
| Complete GPU card failure | | | ✓ Yes |
| Persistent multi-bit ECC errors | | | ✓ Yes |
| GPU falling off PCIe bus | | | ✓ Yes |
| Fan failures | | | ✓ Yes |
| PCIe lane degradation | | | ✓ Yes |
| Power delivery (VRM) issues | | | ✓ Yes |
| Thermal/cooling problems | | | ✓ Yes |
| Persistent Xid errors | | | ✓ Yes |
| Physical connector damage | | | ✓ Yes |
| Backplane/riser issues | | | ✓ Yes |
Escalation path: reboot → reprovision → migrate to new host. If the issue persists after the VM has been reprovisioned with a fresh software stack, the cause is most likely hardware, and the node should be migrated to a new host.
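The escalation path can be written as a short planning helper: pick a starting action from the broad issue category, then list what remains to try if it fails. The category names below are hypothetical groupings of the table rows, chosen for illustration only.

```python
ESCALATION_PATH = ["reboot", "reprovision", "migrate"]

# Hypothetical categories summarizing the rows of the table above.
START_AT = {
    "transient_software": "reboot",
    "persistent_software": "reprovision",
    "hardware": "migrate",
}


def plan(category):
    """Return the actions to try, in order, for an issue category."""
    start = START_AT[category]
    return ESCALATION_PATH[ESCALATION_PATH.index(start):]


def next_action(last_tried):
    """Return the next action after a failed attempt, or None if none remain."""
    i = ESCALATION_PATH.index(last_tried)
    return ESCALATION_PATH[i + 1] if i + 1 < len(ESCALATION_PATH) else None
```

When `next_action` returns `None`, every repair action has been tried, which is one of the cases where this page recommends contacting support.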
Best practices
Before triggering a repair:
- Store important data on PersistentVolumes, not local storage.
- Optionally drain workloads manually for more control over migration.
- Document symptoms for troubleshooting if the repair does not resolve the problem.
- Check running jobs so you know what will be interrupted.
When choosing a repair action:
- Start with reboot: It is the fastest option, preserves local data, and resolves most transient software issues.
- Escalate to quick reprovision: When a reboot did not fix the issue, or the problem is a corrupted driver, library, or VM configuration that requires a fresh software stack.
- Use migrate to new host: When reprovision did not fix the issue, you see hardware error indicators (ECC errors, Xid errors, thermal warnings), or GPU diagnostics show hardware problems.
After the repair completes:
- Verify the node shows as Running in the cluster.
- Run a GPU workload to confirm operation.
- Monitor for recurrence of the same issue.
- Check GPU metrics to confirm normal operation.
Common diagnostic commands
Before triggering a repair, you can SSH into the node to diagnose issues.
When to contact support
Contact Together support if:
- Issues persist after all repair actions.
- You see repeated failures on multiple nodes.
- You need help diagnosing whether an issue is software or hardware.
- Repair actions fail to complete.
- You are unsure which repair action to use.
- The node does not rejoin after repair completes.
Next steps
Health checks
Monitor node health with active diagnostic tests and continuous passive monitoring.
Cluster management
Manage, monitor, and scale your GPU clusters.