Overview
This page covers two key features for maintaining healthy GPU nodes:
- Health Checks - Proactive stress testing and validation of GPU nodes and underlying hardware
- Node Repair - User-triggered remediation actions for node health issues
Health Checks
Health Checks allow you to proactively stress test and validate the health of your GPU nodes and underlying hardware. You can run targeted diagnostic tests to ensure your GPUs, InfiniBand networking, and other components are functioning correctly.
How to Run Health Checks
Quick Steps
- Navigate to your cluster in the Together Cloud UI
- Go to the Cluster Details tab and select the Health Checks sub-tab
- Click the Run a health check button (top right)
- In the “Run Health Checks” dialog:
- Select tests - Choose one or more health check tests to run:
- DCGM Diag - NVIDIA GPU diagnostics
- GPU Burn - GPU stress test
- Single-Node NCCL - Single-node GPU communication test
- NVBandwidth: CPU to GPU Bandwidth - PCIe bandwidth test
- NVBandwidth: GPU to CPU Bandwidth - PCIe bandwidth test
- NVBandwidth: GPU-CPU Latency - PCIe latency test
- InfiniBand Write Bandwidth - InfiniBand network performance test
- Click Next: Select Nodes
- Choose which nodes to test
- (Optional) Configure test parameters like duration or diagnostic level
- Click Run to start the health checks
Active Tests: These health checks require full GPU utilization from the node and will impact any running workloads during the test.
Available Health Check Tests
Each health check validates different aspects of your GPU infrastructure:
GPU Diagnostics
DCGM Diag
- Runs NVIDIA Data Center GPU Manager diagnostics
- Validates GPU compute capability, memory integrity, and thermal performance
- Configurable: Diagnostic level (1-3, where 3 is most comprehensive)
- Use for: Comprehensive GPU health validation
- Learn more: NVIDIA DCGM Documentation
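If you have SSH access to a node, the same diagnostics can also be run by hand; a minimal sketch, assuming the `dcgmi` CLI from NVIDIA DCGM is installed on the node:

```shell
# Manual DCGM diagnostic run (assumes the dcgmi CLI from NVIDIA DCGM is installed).
# -r selects the level: 1 = quick, 2 = medium, 3 = long (most comprehensive).
run_dcgm_diag() {
  if command -v dcgmi >/dev/null 2>&1; then
    dcgmi diag -r 1
  else
    echo "dcgmi not installed; run the DCGM Diag health check from the UI instead"
  fi
}
run_dcgm_diag
```

Higher levels (`-r 2`, `-r 3`) exercise more of the GPU and take correspondingly longer, matching the diagnostic-level parameter in the UI.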
GPU Burn
- Stress tests GPUs with intensive compute workloads
- Validates stability under sustained high utilization
- Configurable: Test duration
- Use for: Identifying thermal issues, power problems, or instability
- Learn more: GPU Burn on GitHub
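A manual run can look like the following sketch, assuming the CUDA toolkit, `git`, and `make` are available on the node (gpu-burn is built from source):

```shell
# Build and run gpu-burn by hand (assumes CUDA toolkit, git, and make are present).
run_gpu_burn() {
  if command -v nvcc >/dev/null 2>&1; then
    git clone https://github.com/wilicc/gpu-burn.git /tmp/gpu-burn
    make -C /tmp/gpu-burn
    (cd /tmp/gpu-burn && ./gpu_burn 60)   # stress all visible GPUs for 60 seconds
  else
    echo "CUDA toolkit not found; use the GPU Burn health check from the UI instead"
  fi
}
run_gpu_burn
```

The duration argument (60 seconds here) plays the same role as the test-duration parameter in the UI; longer runs surface thermal and power issues more reliably.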
Network Performance
Single-Node NCCL
- Tests NVIDIA Collective Communications Library on a single node
- Validates GPU-to-GPU communication within the node
- Use for: Multi-GPU training readiness
- Learn more: NVIDIA NCCL Documentation
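For a manual equivalent, NVIDIA's nccl-tests suite is commonly used; a sketch assuming CUDA, NCCL, and `git` are installed on the node (the GPU count `-g 8` is an example, not a Together default):

```shell
# Manual single-node NCCL all-reduce check using NVIDIA's nccl-tests (built from source).
run_nccl_test() {
  if command -v nvcc >/dev/null 2>&1; then
    git clone https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests
    make -C /tmp/nccl-tests
    # All-reduce across 8 GPUs on this node, message sizes 8 B to 128 MB, doubling each step.
    /tmp/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
  else
    echo "CUDA toolkit not found; use the Single-Node NCCL health check from the UI instead"
  fi
}
run_nccl_test
```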
InfiniBand Write Bandwidth
- Measures InfiniBand network write throughput
- Validates high-speed interconnect performance
- Use for: Distributed training and multi-node workloads
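A manual check is possible with the perftest suite's `ib_write_bw`, run as a server/client pair between two nodes; a sketch (assumes perftest is installed on both nodes, and `NODE_A_IP` is a placeholder for the server node's address):

```shell
# Manual InfiniBand write-bandwidth test with perftest's ib_write_bw (needs two nodes).
# On the server node:   ib_write_bw
# On the client node:   ib_write_bw NODE_A_IP
# Compare the reported peak/average MB/s against the link's rated speed.
run_ib_check() {
  if command -v ib_write_bw >/dev/null 2>&1; then
    echo "ib_write_bw is available; run the server/client pair shown above"
  else
    echo "perftest not installed; use the InfiniBand Write Bandwidth health check from the UI instead"
  fi
}
run_ib_check
```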
PCIe Performance
NVBandwidth Tests
- CPU to GPU Bandwidth - Host-to-device transfer rates
- GPU to CPU Bandwidth - Device-to-host transfer rates
- GPU-CPU Latency - Data transfer latency
- Use for: Identifying PCIe bottlenecks or degraded lanes
- Learn more: NVIDIA nvbandwidth Documentation
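The tool can also be run by hand if it is built on the node; testcase names vary by nvbandwidth version, so list them first (the names below are examples, not confirmed for your build):

```shell
# Manual PCIe bandwidth checks with NVIDIA's nvbandwidth (assumed built on the node).
run_nvbandwidth() {
  if command -v nvbandwidth >/dev/null 2>&1; then
    nvbandwidth --list                        # enumerate testcases available in this build
    nvbandwidth -t host_to_device_memcpy_ce   # CPU -> GPU bandwidth (example testcase name)
    nvbandwidth -t device_to_host_memcpy_ce   # GPU -> CPU bandwidth (example testcase name)
  else
    echo "nvbandwidth not built; use the NVBandwidth health checks from the UI instead"
  fi
}
run_nvbandwidth
```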
Understanding Test Results
Health check results are displayed in the Health Checks table:
- Status - Passed (green) or Failed (red) indicator
- Last Run - Timestamp of test execution
- Node Tested - Which nodes were included in the test
- Details - Click “View details” to see:
- Full test output
- Detailed metrics and measurements
- Workflow CR (Custom Resource) with complete results
- Pass/fail criteria details
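If the cluster exposes these Workflow CRs to your kubeconfig, they can also be listed from the CLI; a hedged sketch (the resource name `workflows` is an assumption based on Argo-style workflow CRDs, not a confirmed Together detail):

```shell
# List health-check Workflow custom resources (read-only; resource name is an assumption).
run_list_workflows() {
  if command -v kubectl >/dev/null 2>&1; then
    kubectl get workflows --all-namespaces 2>/dev/null \
      || echo "no Workflow CRD visible or no cluster access"
  else
    echo "kubectl not available; use 'View details' in the UI instead"
  fi
}
run_list_workflows
```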
When to Run Health Checks
Proactive Testing:
- Before deploying critical workloads
- After cluster scaling events
- On a regular schedule (weekly/monthly)
- After maintenance windows
Reactive Testing:
- When experiencing unexplained job failures
- Before triggering node repair actions
- When investigating performance degradation
- After node repairs to validate fixes
Symptom Guide:
- Training instability → Run GPU Burn, DCGM Diag
- Slow data loading → Run NVBandwidth tests
- Multi-GPU failures → Run Single-Node NCCL
- Distributed training issues → Run InfiniBand tests
Best Practices
- Schedule workload-free windows - Health checks require full GPU utilization
- Start with DCGM Diag - Provides comprehensive overview of GPU health
- Run baseline tests - Test new nodes immediately to establish performance baseline
- Document results - Keep records of passed tests for comparison
- Test after repair - Always validate node health after repair actions
- Use appropriate test levels - Higher DCGM diagnostic levels take longer but are more thorough
Node Repair
When health checks identify issues or you encounter node problems, you can trigger repair actions directly from the UI to restore node functionality.
How to Trigger Node Repair
Quick Steps
- Navigate to your cluster in the Together Cloud UI
- Go to the Worker Nodes section
- Find the problematic node
- Click the ⋮ (three dots) menu in the State column
- Select Repair from the dropdown
- A repair dialog will appear showing:
- Node details (name, GPU configuration)
- Issue detected (if applicable)
- Impact warning
- Choose one of the repair actions:
- Quick reprovision - For software issues
- Migrate to new host - For hardware issues
- Report an issue (optional) - To notify support
Node Repair Lifecycle
When you trigger a repair action, the node goes through the following stages:
1. Cordon
- Node is marked as unschedulable
- No new workloads will be placed on the node
- Existing workloads continue running
2. Drain
- Running workloads are gracefully terminated
- Pods are evicted from the node
- Node becomes empty
3. Reprovision or Migrate
- Quick Reprovision: VM recreated on a random physical node (could be the same as the original host)
- Migrate to New Host: New VM created on different physical hardware
4. Rejoin
- Node automatically rejoins the cluster
- Node becomes schedulable again
- Ready to accept new workloads
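The cordon and drain stages correspond to standard Kubernetes operations, which is useful context if you prefer to drain a node yourself before triggering repair. A sketch with a hypothetical node name, printed rather than executed so it is safe to paste:

```shell
# Cordon/drain stages as plain kubectl commands (NODE is a hypothetical name).
# DRY_RUN=echo prints each command; set DRY_RUN= to actually execute them.
NODE=gpu-node-01
DRY_RUN=echo
$DRY_RUN kubectl cordon "$NODE"        # stage 1: mark the node unschedulable
$DRY_RUN kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data   # stage 2: evict pods
$DRY_RUN kubectl get node "$NODE"      # after rejoin: confirm the node is Ready again
```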
Available Repair Actions
1. Quick Reprovision
Reprovisions the GPU node VM on a random underlying physical host.
When to use:
- Software-level issues (driver crashes, library corruption)
- VM configuration problems
- Application-level issues
What happens:
- Node follows Cordon → Drain → Reprovision lifecycle
- VM is recreated with fresh software stack
- Node rejoins cluster automatically
2. Migrate to New Host
Creates a new VM on different physical hardware.
When to use:
- Hardware-level issues (GPU failures, PCIe problems)
- Issues persist after Quick Reprovision
- Physical component failures
What happens:
- Node follows Cordon → Drain → Migrate lifecycle
- New VM created on different physical hardware
- Different GPU hardware assigned
- Node rejoins cluster automatically
3. Report an Issue
Notifies Together support without immediately changing the node.
When to use:
- You’re unsure which repair action to use
- You want Together support to investigate before taking action
- The issue requires additional context or diagnosis
Decision Guide: Which Repair Action to Use
Use this table to determine whether Quick Reprovision can fix your issue or if you need to Migrate to New Host:

| Issue Type | Can Reprovision Fix? | Needs Physical Repair? |
|---|---|---|
| Driver crashes/corruption | ✓ Yes | |
| CUDA/ROCm library issues | ✓ Yes | |
| GPU process hangs | ✓ Yes | |
| Application memory leaks | ✓ Yes | |
| Incorrect GPU mode settings | ✓ Yes | |
| GPU not attached to VM | ✓ Yes | |
| Device permissions/cgroup issues | ✓ Yes | |
| NUMA affinity problems | ✓ Yes | |
| Software-based throttling | ✓ Yes | |
| Recoverable Xid errors | ✓ Yes | |
| Single-bit ECC errors (occasional) | ✓ Yes | |
| GPU watchdog timeouts | ✓ Yes | |
| Stuck GPU contexts | ✓ Yes | |
| Complete GPU card failure | | ✓ Yes |
| Persistent multi-bit ECC errors | | ✓ Yes |
| GPU falling off PCIe bus | | ✓ Yes |
| Fan failures | | ✓ Yes |
| PCIe lane degradation | | ✓ Yes |
| Power delivery (VRM) issues | | ✓ Yes |
| Thermal/cooling problems | | ✓ Yes |
| Persistent Xid errors | | ✓ Yes |
| Physical connector damage | | ✓ Yes |
| Backplane/riser issues | | ✓ Yes |
Key Diagnostic Rule: If the issue persists after reprovisioning the VM to a fresh instance on the same physical GPU, it’s a hardware problem requiring physical node repair (Migrate to New Host).
Monitoring Repair Progress
During the repair process, you’ll see the node progress through different states:
- Cordoning - Node marked as unschedulable
- Draining - Workloads being evicted
- Repairing / Migrating - VM being recreated or migrated
- Joining - Node rejoining cluster
- Running - Node ready for workloads
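The same transitions can also be followed from the CLI if you have kubectl access; a sketch (the node name is hypothetical):

```shell
# Follow node state from the CLI during repair (NODE is a hypothetical name).
run_node_status() {
  NODE=gpu-node-01
  if command -v kubectl >/dev/null 2>&1; then
    kubectl get node "$NODE" 2>/dev/null \
      || echo "node not visible yet (mid-repair) or no cluster access"
    # kubectl get node "$NODE" -w     # stream transitions (e.g. NotReady -> Ready)
    # kubectl describe node "$NODE"   # conditions and recent events in detail
  else
    echo "kubectl not available; monitor progress in the Together Cloud UI"
  fi
}
run_node_status
```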
Best Practices for Node Repair
Before Triggering Repair:
- Save your data - Ensure important data is on PersistentVolumes, not local storage
- Drain workloads manually (optional) - For more control over workload migration
- Document the issue - Note symptoms for troubleshooting if repair doesn’t resolve the problem
- Check running jobs - Be aware of what will be interrupted
Start with Quick Reprovision:
- Faster (same hardware)
- Resolves most software issues
- Can always escalate to migration if needed
Escalate to Migrate to New Host when:
- Quick Reprovision didn’t fix the issue
- You see hardware error indicators (ECC errors, Xid errors, thermal warnings)
- GPU diagnostics show hardware problems
After Repair:
- Verify node health - Check that the node shows as “Running” in the cluster
- Test GPU functionality - Run a simple GPU workload to confirm operation
- Monitor for recurrence - Watch for the same issues returning
- Check node metrics - Ensure GPU metrics look normal
Common Diagnostic Commands
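Which tools exist varies by node image, so the sketch below skips anything that is missing; the `run` helper is a local convenience defined here, not a Together tool:

```shell
# Common over-SSH diagnostics; `run` skips tools that are missing or that fail.
run() { command -v "$1" >/dev/null 2>&1 && "$@" || echo "skipped or failed: $1"; }

run nvidia-smi                   # GPU visibility, temperature, utilization, memory
run nvidia-smi -q -d ECC         # per-GPU ECC error counters (persistent multi-bit -> hardware)
run dcgmi diag -r 1              # quick DCGM sanity check, if DCGM is installed
run ibstat                       # InfiniBand port state, if IB tools are installed
dmesg 2>/dev/null | grep -iE "xid|nvrm" | tail -n 20   # recent NVIDIA driver (Xid) errors, if any
```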
Before triggering repair, you can SSH into the node and run diagnostic checks to determine whether an issue is software-level (suited to Quick Reprovision) or hardware-level (suited to Migrate to New Host).
When to Contact Support
Contact [email protected] if:
- Issues persist after both repair actions
- You see repeated failures on multiple nodes
- You need help diagnosing whether an issue is software or hardware
- Repair actions fail to complete
- You’re unsure which repair action to use
- The node doesn’t rejoin after repair completes