Overview
Health Checks let you proactively stress test and validate the health of your GPU nodes and underlying hardware. You can run targeted diagnostic tests to confirm that your GPUs, InfiniBand networking, and other components are functioning correctly.
How to Run Health Checks
Quick Steps
- Navigate to your cluster in the Together Cloud UI
- Go to the Cluster Details tab and select the Health Checks sub-tab
- Click the Run a health check button (top right)
- In the “Run Health Checks” dialog:
- Select tests - Choose one or more health check tests to run:
  - DCGM Diag - NVIDIA GPU diagnostics
  - GPU Burn - GPU stress test
  - Single-Node NCCL - Single-node GPU communication test
  - NVBandwidth: CPU to GPU Bandwidth - Host-to-device PCIe bandwidth test
  - NVBandwidth: GPU to CPU Bandwidth - Device-to-host PCIe bandwidth test
  - NVBandwidth: GPU-CPU Latency - PCIe latency test
  - InfiniBand Write Bandwidth - InfiniBand network performance test
- Click Next: Select Nodes
- Choose which nodes to test
- (Optional) Configure test parameters like duration or diagnostic level
- Click Run to start the health checks
Active Tests: These health checks require full GPU utilization from the node and will impact any running workloads during the test.
Available Health Check Tests
Each health check validates different aspects of your GPU infrastructure:
GPU Diagnostics
DCGM Diag
- Runs NVIDIA Data Center GPU Manager diagnostics
- Validates GPU compute capability, memory integrity, and thermal performance
- Configurable: Diagnostic level (1-3, where 3 is most comprehensive)
- Use for: Comprehensive GPU health validation
- Learn more: NVIDIA DCGM Documentation
GPU Burn
- Stress tests GPUs with intensive compute workloads
- Validates stability under sustained high utilization
- Configurable: Test duration
- Use for: Identifying thermal issues, power problems, or instability
- Learn more: GPU Burn on GitHub
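For readers who also have shell access to a node, the two configurable parameters above correspond to arguments of the underlying tools. The sketch below only builds the command lines. It assumes `dcgmi` and a compiled `gpu_burn` binary are present on the node, which this page does not state, and it does not describe how Together Cloud itself launches the tests. The helper names are hypothetical.

```python
def dcgm_diag_command(level: int) -> list[str]:
    """Build a `dcgmi diag` command line for the given run level."""
    # This page documents levels 1-3, so anything else is rejected here.
    if level not in (1, 2, 3):
        raise ValueError("diagnostic level must be 1, 2, or 3")
    return ["dcgmi", "diag", "-r", str(level)]


def gpu_burn_command(duration_seconds: int, binary: str = "./gpu_burn") -> list[str]:
    """Build a gpu_burn command line; the tool takes its run time in seconds."""
    # The binary path is an assumption; adjust it to wherever gpu_burn was built.
    if duration_seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")
    return [binary, str(duration_seconds)]


if __name__ == "__main__":
    print(" ".join(dcgm_diag_command(3)))
    print(" ".join(gpu_burn_command(300)))
```

Passing the resulting lists to `subprocess.run` would execute them. As with the UI-driven checks, both tools load the GPUs heavily while they run.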
Network Performance
Single-Node NCCL
- Tests NVIDIA Collective Communications Library on a single node
- Validates GPU-to-GPU communication within the node
- Use for: Multi-GPU training readiness
- Learn more: NVIDIA NCCL Documentation
InfiniBand Write Bandwidth
- Measures InfiniBand network write throughput
- Validates high-speed interconnect performance
- Use for: Distributed training and multi-node workloads
PCIe Performance
NVBandwidth Tests
- CPU to GPU Bandwidth - Host-to-device transfer rates
- GPU to CPU Bandwidth - Device-to-host transfer rates
- GPU-CPU Latency - Data transfer latency
- Use for: Identifying PCIe bottlenecks or degraded lanes
- Learn more: NVIDIA nvbandwidth Documentation
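As a rough yardstick for reading bandwidth results, it can help to know the theoretical ceiling of the PCIe link. The sketch below computes it from the per-lane transfer rate and the 128b/130b line encoding used by PCIe 3.0 through 5.0. The function names are hypothetical, and which PCIe generation and lane width your nodes use is an assumption you need to check against your own hardware.

```python
# Per-lane transfer rate in GT/s for PCIe generations that use 128b/130b encoding.
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_peak_gb_per_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, ignoring packet overhead."""
    if generation not in GT_PER_SEC:
        raise ValueError("only PCIe generations 3, 4, and 5 are covered")
    raw_gbit_per_s = GT_PER_SEC[generation] * lanes
    # 128 of every 130 transferred bits are payload; divide by 8 for bytes.
    return raw_gbit_per_s * (128 / 130) / 8


def fraction_of_peak(measured_gb_per_s: float, generation: int, lanes: int) -> float:
    """How much of the theoretical ceiling a measured transfer rate reaches."""
    return measured_gb_per_s / pcie_peak_gb_per_s(generation, lanes)


if __name__ == "__main__":
    for lanes in (16, 8):
        print(f"Gen5 x{lanes}: {pcie_peak_gb_per_s(5, lanes):.1f} GB/s")
```

Real transfers always land somewhat below this ceiling because of packet and protocol overhead. A measurement close to half of the x16 figure is one sign that a link may have trained at x8, which is the kind of degraded-lane condition these tests are meant to surface.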
Understanding Test Results
Health check results are displayed in the Health Checks table:
- Status - Passed (green) or Failed (red) indicator
- Last Run - Timestamp of test execution
- Node Tested - Which nodes were included in the test
- Details - Click “View details” to see:
- Full test output
- Detailed metrics and measurements
- Workflow CR (Custom Resource) with complete results
- Pass/fail criteria details
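If you copy results out of the table for your own tracking, a small helper can reduce them to the nodes that currently need attention. The sketch below is purely illustrative: the `HealthCheckResult` record and its field names are hypothetical, modeled on the table columns above, and do not represent an export format or API provided by Together Cloud.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckResult:
    test: str       # e.g. "DCGM Diag"
    status: str     # "Passed" or "Failed", as in the Status column
    last_run: str   # ISO 8601 timestamp, so string order matches time order
    node: str       # node name from the Node Tested column


def nodes_with_failures(results: list[HealthCheckResult]) -> dict[str, list[str]]:
    """Map each node to the tests whose most recent run failed."""
    latest: dict[tuple[str, str], HealthCheckResult] = {}
    for result in results:
        key = (result.node, result.test)
        # Keep only the newest run of each test on each node.
        if key not in latest or result.last_run > latest[key].last_run:
            latest[key] = result
    failed: dict[str, list[str]] = {}
    for (node, test), result in sorted(latest.items()):
        if result.status == "Failed":
            failed.setdefault(node, []).append(test)
    return failed
```

Keeping only the latest run per node and test means a node that failed earlier but passed on a later run is not reported.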
When to Run Health Checks
Proactive Testing:
- Before deploying critical workloads
- After cluster scaling events
- On a regular schedule (weekly/monthly)
- After maintenance windows
Reactive Testing:
- When experiencing unexplained job failures
- Before triggering node repair actions
- When investigating performance degradation
- After node repairs to validate fixes
Recommended tests by symptom:
- Training instability → Run GPU Burn, DCGM Diag
- Slow data loading → Run NVBandwidth tests
- Multi-GPU failures → Run Single-Node NCCL
- Distributed training issues → Run InfiniBand tests
Best Practices
- Schedule workload-free windows - Health checks require full GPU utilization
- Start with DCGM Diag - It provides a comprehensive overview of GPU health
- Run baseline tests - Test new nodes immediately to establish a performance baseline
- Document results - Keep records of passed tests for comparison
- Test after repair - Always validate node health after repair actions
- Use appropriate test levels - Higher DCGM diagnostic levels take longer but are more thorough
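One way to put recorded baselines to use is to compare each new run against them and flag metrics that have dropped noticeably. The sketch below assumes you have noted numeric results yourself (for example, bandwidth figures from the test details) under metric names of your own choosing. The function name and the 10% tolerance are illustrative assumptions, not thresholds used by the platform.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return metrics that fell more than `tolerance` below their baseline.

    Assumes higher values are better (e.g. bandwidth). For latency-style
    metrics, where lower is better, the comparison would need to be inverted.
    """
    flagged: dict[str, float] = {}
    for name, base in baseline.items():
        value = current.get(name)
        # Skip metrics that were not measured this time or have no usable baseline.
        if value is None or base <= 0:
            continue
        drop = (base - value) / base
        if drop > tolerance:
            flagged[name] = round(drop, 3)
    return flagged
```

Each flagged value is the fractional drop from the baseline, so a metric that went from 50.0 to 40.0 is reported as 0.2.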