
Overview

Health Checks let you proactively stress-test and validate your GPU nodes and their underlying hardware. You can run targeted diagnostic tests to confirm that your GPUs, InfiniBand networking, and other components are functioning correctly.

How to Run Health Checks

Quick Steps

  1. Navigate to your cluster in the Together Cloud UI
  2. Go to the Cluster Details tab and select the Health Checks sub-tab
  3. Click the Run a health check button (top right)
  4. In the “Run Health Checks” dialog:
    • Select tests - Choose one or more health check tests to run:
      • DCGM Diag - NVIDIA GPU diagnostics
      • GPU Burn - GPU stress test
      • Single-Node NCCL - Single-node GPU communication test
      • NVBandwidth: CPU to GPU Bandwidth - Host-to-device PCIe bandwidth test
      • NVBandwidth: GPU to CPU Bandwidth - Device-to-host PCIe bandwidth test
      • NVBandwidth: GPU-CPU Latency - PCIe latency test
      • InfiniBand Write Bandwidth - InfiniBand network performance test
  5. Click Next: Select Nodes
  6. Choose which nodes to test
  7. (Optional) Configure test parameters like duration or diagnostic level
  8. Click Run to start the health checks

Active Tests: These health checks require full GPU utilization from the node and will impact any running workloads during the test.

Available Health Check Tests

Each health check validates different aspects of your GPU infrastructure:

GPU Diagnostics

DCGM Diag
  • Runs NVIDIA Data Center GPU Manager (DCGM) diagnostics
  • Validates GPU compute functionality, memory integrity, and thermal performance
  • Configurable: Diagnostic level (1-3, where 3 is most comprehensive)
  • Use for: Comprehensive GPU health validation
  • Learn more: NVIDIA DCGM Documentation

GPU Burn
  • Stress tests GPUs with intensive compute workloads
  • Validates stability under sustained high utilization
  • Configurable: Test duration
  • Use for: Identifying thermal issues, power problems, or instability
  • Learn more: GPU Burn on GitHub
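
This page does not document the exact commands the platform runs for these two checks. As a rough sketch, assuming the checks wrap the upstream `dcgmi` CLI and the `gpu_burn` binary from the GPU Burn repository, the configurable parameters above might map onto command lines like the ones assembled below. The helper names are hypothetical, and nothing is executed; the functions only build argument lists that you could run manually on a node.

```python
import shlex


def dcgm_diag_command(level: int) -> list[str]:
    """Build a `dcgmi diag` command for the given diagnostic level.

    Levels 1-3 mirror the range offered in the UI; higher levels run longer.
    """
    if level not in (1, 2, 3):
        raise ValueError("diagnostic level must be 1, 2, or 3")
    return ["dcgmi", "diag", "-r", str(level)]


def gpu_burn_command(duration_seconds: int) -> list[str]:
    """Build a `gpu_burn` command that stresses the GPUs for the given duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")
    # Assumption: the binary is on PATH under the name `gpu_burn`.
    return ["gpu_burn", str(duration_seconds)]


if __name__ == "__main__":
    print(shlex.join(dcgm_diag_command(3)))
    print(shlex.join(gpu_burn_command(300)))
```

Keeping the commands as argument lists rather than shell strings makes them safe to pass to `subprocess.run` if you later decide to execute them.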

Network Performance

Single-Node NCCL
  • Runs NVIDIA Collective Communications Library (NCCL) tests on a single node
  • Validates GPU-to-GPU communication within the node
  • Use for: Multi-GPU training readiness
  • Learn more: NVIDIA NCCL Documentation

InfiniBand Write Bandwidth
  • Measures InfiniBand network write throughput
  • Validates high-speed interconnect performance
  • Use for: Distributed training and multi-node workloads
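
For manual reproduction outside the UI, the two checks above resemble an all-reduce benchmark from the open-source nccl-tests suite and the `ib_write_bw` tool from the perftest package. This is an assumption; the page does not say which binaries or flags the platform uses. The sketch below only assembles argument lists. The device name `mlx5_0` and host name `node-a` are placeholders, and the function names are hypothetical.

```python
import shlex


def nccl_all_reduce_command(num_gpus: int) -> list[str]:
    """Build a single-node `all_reduce_perf` run across `num_gpus` GPUs.

    Message sizes sweep from 8 bytes to 128 MB, doubling at each step.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ["all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", str(num_gpus)]


def ib_write_bw_commands(device: str, server_host: str) -> tuple[list[str], list[str]]:
    """Build the server-side and client-side `ib_write_bw` commands.

    The server is started first and waits; the client connects to it by host name.
    """
    base = ["ib_write_bw", "-d", device, "--report_gbits"]
    server = list(base)
    client = base + [server_host]
    return server, client


if __name__ == "__main__":
    print(shlex.join(nccl_all_reduce_command(8)))
    server_cmd, client_cmd = ib_write_bw_commands("mlx5_0", "node-a")
    print("server:", shlex.join(server_cmd))
    print("client:", shlex.join(client_cmd))
```

Note that the InfiniBand measurement needs two endpoints, which is why it is the relevant check for multi-node workloads.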

PCIe Performance

NVBandwidth Tests
  • CPU to GPU Bandwidth - Host-to-device transfer rates
  • GPU to CPU Bandwidth - Device-to-host transfer rates
  • GPU-CPU Latency - Data transfer latency
  • Use for: Identifying PCIe bottlenecks or degraded lanes
  • Learn more: NVIDIA nvbandwidth Documentation
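
To judge whether a measured bandwidth points to a degraded link, it helps to compare it against the theoretical PCIe ceiling. The sketch below computes that ceiling for PCIe generations 3 to 5 (which use 128b/130b encoding) and flags measurements that fall well below it. The 70% threshold and the function names are illustrative assumptions, not the platform's pass/fail criteria; real transfers never reach the theoretical figure because of protocol overhead.

```python
# Per-lane signalling rate in GT/s for PCIe generations using 128b/130b encoding.
PCIE_GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}


def theoretical_pcie_gb_per_s(generation: int, lanes: int) -> float:
    """Return the one-direction theoretical bandwidth in GB/s for a PCIe link."""
    if generation not in PCIE_GT_PER_SEC:
        raise ValueError("only PCIe generations 3, 4, and 5 are covered")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    gigatransfers = PCIE_GT_PER_SEC[generation] * lanes
    # 128 payload bits per 130 transferred bits, then bits to bytes.
    return gigatransfers * (128 / 130) / 8


def looks_degraded(measured_gb_per_s: float, generation: int, lanes: int,
                   min_fraction: float = 0.7) -> bool:
    """Flag a measurement below `min_fraction` of the theoretical ceiling.

    `min_fraction` is an illustrative threshold, not an official criterion.
    """
    ceiling = theoretical_pcie_gb_per_s(generation, lanes)
    return measured_gb_per_s < min_fraction * ceiling


if __name__ == "__main__":
    # A Gen4 x16 link tops out at roughly 31.5 GB/s per direction.
    print(round(theoretical_pcie_gb_per_s(4, 16), 1))
    # A reading near half the expected rate suggests the link trained at x8.
    print(looks_degraded(13.0, generation=4, lanes=16))
```

A host-to-device reading close to half of what sibling GPUs report is a typical sign that a link negotiated fewer lanes or a lower generation than expected.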

Understanding Test Results

Health check results are displayed in the Health Checks table:
  • Status - Passed (green) or Failed (red) indicator
  • Last Run - Timestamp of test execution
  • Node Tested - Which nodes were included in the test
  • Details - Click “View details” to see:
    • Full test output
    • Detailed metrics and measurements
    • Workflow CR (Custom Resource) with complete results
    • Pass/fail criteria details

When to Run Health Checks

Proactive Testing:
  • Before deploying critical workloads
  • After cluster scaling events
  • On a regular schedule (weekly/monthly)
  • After maintenance windows

Reactive Testing:
  • When experiencing unexplained job failures
  • Before triggering node repair actions
  • When investigating performance degradation
  • After node repairs to validate fixes

Specific Issue Investigation:
  • Training instability → Run GPU Burn, DCGM Diag
  • Slow data loading → Run NVBandwidth tests
  • Multi-GPU failures → Run Single-Node NCCL
  • Distributed training issues → Run InfiniBand Write Bandwidth

Best Practices

  1. Schedule workload-free windows - Health checks require full GPU utilization
  2. Start with DCGM Diag - It provides a comprehensive overview of GPU health
  3. Run baseline tests - Test new nodes immediately to establish a performance baseline
  4. Document results - Keep records of passed tests for comparison
  5. Test after repair - Always validate node health after repair actions
  6. Use appropriate test levels - Higher DCGM diagnostic levels take longer but are more thorough

Workload Impact: Health checks will fully utilize the GPU and may interfere with running workloads. Run tests during maintenance windows or on idle nodes.
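
Best practices 3 and 4 (baselines and record keeping) are easier to follow with a simple comparison routine. The sketch below assumes you copy metrics from the "View details" output into your own records by hand; the `Measurement` type, the node and test names, the numbers, and the 10% tolerance are all hypothetical and not part of the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    node: str
    test: str
    value: float  # higher is better, e.g. bandwidth in GB/s


def regressions(baseline: list[Measurement], current: list[Measurement],
                tolerance: float = 0.10) -> list[Measurement]:
    """Return current measurements more than `tolerance` below their baseline.

    Measurements with no matching (node, test) baseline entry are skipped.
    """
    reference = {(m.node, m.test): m.value for m in baseline}
    flagged = []
    for m in current:
        ref = reference.get((m.node, m.test))
        if ref is not None and m.value < ref * (1 - tolerance):
            flagged.append(m)
    return flagged


if __name__ == "__main__":
    # Illustrative numbers only.
    baseline = [
        Measurement("node-1", "CPU to GPU Bandwidth", 25.0),
        Measurement("node-2", "CPU to GPU Bandwidth", 25.2),
    ]
    current = [
        Measurement("node-1", "CPU to GPU Bandwidth", 24.6),
        Measurement("node-2", "CPU to GPU Bandwidth", 12.9),
    ]
    for m in regressions(baseline, current):
        print(f"{m.node}: {m.test} dropped to {m.value}")
```

Comparing against a per-node baseline catches gradual degradation that a fixed pass/fail threshold can miss.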