Overview
All cluster management operations are available through multiple interfaces for programmatic control and automation:
- tcloud CLI: Command-line tool for cluster operations
- REST API: Full HTTP API for custom integrations
- Terraform Provider: Infrastructure-as-code for reproducible deployments
- SkyPilot: Orchestrate AI workloads across clusters
tcloud CLI
The tcloud CLI provides a command-line interface for managing clusters, storage, and scaling.
Installation
Download the CLI for your platform:
Authentication
Authenticate via Google SSO with tcloud sso login.
Common Commands
Create a cluster:
REST API
All cluster management actions are available via REST API endpoints.
API Reference
Complete API documentation is available in the GPU Cluster API Reference.
Example: Create Cluster
Example: List Clusters
Example: Delete Cluster
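The create, list, and delete calls named above can be sketched with Python's standard library. This is a minimal sketch under stated assumptions, not the documented API: the base URL, the /clusters paths, the Bearer authorization scheme, and the payload shape are all placeholders chosen for illustration, so take the real values from the GPU Cluster API Reference. The helpers only build requests; nothing is sent.

```python
import json
import urllib.request

# Placeholder base URL and paths: assumptions for illustration only.
# Replace them with the values from the GPU Cluster API Reference.
BASE_URL = "https://api.example.invalid/v1"


def build_request(method, path, api_key, payload=None):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    # Bearer authentication is assumed here, not confirmed by this page.
    request.add_header("Authorization", f"Bearer {api_key}")
    request.add_header("Accept", "application/json")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def create_cluster_request(api_key, spec):
    """Hypothetical create call: POST the cluster spec as JSON."""
    return build_request("POST", "/clusters", api_key, payload=spec)


def list_clusters_request(api_key):
    """Hypothetical list call: GET the collection."""
    return build_request("GET", "/clusters", api_key)


def delete_cluster_request(api_key, cluster_id):
    """Hypothetical delete call: DELETE one cluster by its ID."""
    return build_request("DELETE", f"/clusters/{cluster_id}", api_key)
```

Once the real base URL and paths are substituted, a request built this way can be sent with urllib.request.urlopen(request).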
Terraform Provider
Use the Together Terraform Provider to define clusters, storage, and scaling policies as code.
Setup
Example: Define a Cluster
Benefits
- Version control: Track infrastructure changes in Git
- Reproducibility: Deploy identical clusters across environments
- Automation: Integrate with CI/CD pipelines
- State management: Terraform tracks cluster state automatically
SkyPilot Integration
Orchestrate AI workloads on GPU Clusters using SkyPilot for simplified cluster management and job scheduling.
Installation
Setup
- Launch a Kubernetes cluster via Together Cloud
- Configure kubeconfig:
- Verify SkyPilot access:
- Check available GPUs:
Example: Launch a Workload
Create a SkyPilot task file (task.yaml):
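As a stand-in for the task file, here is a small Python sketch that writes a minimal task.yaml. The task name, the accelerator type and count, and the run command are assumptions chosen for illustration, not the original example; adjust them to the GPUs your cluster actually exposes.

```python
from pathlib import Path
from textwrap import dedent

# Assumed values for illustration: the task name, accelerator type and count,
# and the run command are placeholders, not the original example.
TASK_YAML = dedent("""\
    name: hello-gpu

    resources:
      accelerators: H100:1

    run: |
      nvidia-smi
""")


def write_task_file(path="task.yaml"):
    """Write the sketch task definition to disk and return its path."""
    target = Path(path)
    target.write_text(TASK_YAML, encoding="utf-8")
    return target
```

A file like this would typically be launched with sky launch -c followed by a cluster name of your choosing and the path to task.yaml.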
Example: Fine-tune GPT OSS
Download the gpt-oss-20b.yaml configuration.
Launch fine-tuning:
Benefits
- Simplified orchestration: Abstract away Kubernetes complexity
- Multi-cloud support: Same workflow across different clouds
- Cost optimization: Auto-select cheapest available resources
- Job management: Easy monitoring and cancellation
Automation Patterns
CI/CD Integration
GitHub Actions example:
Scheduled Jobs
Cron-based cluster creation:
Auto-scaling Scripts
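One way an auto-scaling script might decide on a target size is sketched below. The policy itself is an assumption made for illustration: it sizes the cluster from the number of pending jobs and a hypothetical jobs-per-node capacity, clamped to a minimum and maximum. Applying the result would go through whichever interface you use (CLI, REST API, or Terraform) and is not shown.

```python
import math


def desired_node_count(pending_jobs, jobs_per_node, min_nodes, max_nodes):
    """Return a node count sized to the queue, clamped to [min_nodes, max_nodes].

    All four parameters are hypothetical inputs for this sketch; they are not
    settings defined by the platform.
    """
    if jobs_per_node <= 0:
        raise ValueError("jobs_per_node must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    needed = math.ceil(pending_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

Keeping the decision in a pure function makes the policy easy to unit-test separately from the code that talks to the cluster.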
Best Practices
API Usage
- Use environment variables for API keys (never hardcode)
- Implement retry logic for transient failures
- Check cluster status before submitting jobs
- Clean up resources after completion
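Three of the practices above (key from the environment, status check before submitting, cleanup afterwards) can be sketched as follows. This is a sketch under assumptions: get_status, submit_job, and delete_cluster are hypothetical callables that you would implement against the real API, and the "ready" status string is a placeholder rather than a documented value.

```python
import os
import time


def load_api_key(env=os.environ):
    """Read the API key from TOGETHER_API_KEY instead of hardcoding it."""
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError("TOGETHER_API_KEY is not set")
    return key


def wait_until_ready(get_status, ready="ready", timeout_s=600.0, poll_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `ready` or the timeout elapses.

    The "ready" status name is an assumption for this sketch.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == ready:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"cluster still {status!r} after {timeout_s}s")
        sleep(poll_s)


def run_on_cluster(get_status, submit_job, delete_cluster, **wait_kwargs):
    """Submit only once the cluster is ready, and always clean up afterwards."""
    try:
        wait_until_ready(get_status, **wait_kwargs)
        return submit_job()
    finally:
        # Runs whether the job succeeded, failed, or the wait timed out.
        delete_cluster()
```

The try/finally block is the design choice that matters here: the cluster is released even when the job or the readiness check raises.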
CLI Usage
- Authenticate once per session with tcloud sso login
- Use UUIDs for cluster references (more reliable than names)
- Script common operations for team consistency
- Version control your cluster configuration scripts
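For scripts that take a cluster reference as input, a small guard can reject anything that is not a UUID before it reaches the CLI. This is a minimal sketch using Python's standard uuid module; the function name is hypothetical, and it checks only the format, not whether the cluster exists.

```python
import uuid


def normalize_cluster_id(value):
    """Return the canonical lowercase UUID string, or raise ValueError."""
    try:
        return str(uuid.UUID(value.strip()))
    except (ValueError, AttributeError, TypeError) as exc:
        raise ValueError(f"not a valid cluster UUID: {value!r}") from exc
```

Calling it on a cluster name such as "my-cluster" raises ValueError, which lets a shared script fail early with a clear message.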
Terraform
- Use remote state for team collaboration
- Tag resources for cost tracking
- Use variables for environment-specific configs
- Test in dev before applying to production
Troubleshooting
Authentication issues
- Verify API key is set: echo $TOGETHER_API_KEY
- Re-authenticate with SSO: tcloud sso login
- Check token expiration
API rate limits
- Implement exponential backoff
- Batch operations when possible
- Contact support for higher limits
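Exponential backoff can be sketched as a small retry wrapper. The version below uses a capped delay with full jitter; TransientError is a hypothetical exception that your own request code would raise for retryable responses (for example a rate-limit response), and the default attempt count and delays are illustrative, not recommended values from the platform.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as rate limiting."""


def call_with_backoff(operation, max_attempts=5, base_delay_s=1.0,
                      max_delay_s=30.0, sleep=time.sleep, rng=random.random):
    """Retry operation() on TransientError with capped exponential backoff.

    The wait before retry n is a random fraction (full jitter) of
    min(max_delay_s, base_delay_s * 2**n).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last failure.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng() * ceiling)
```

The jitter spreads retries from concurrent clients over time, so they are less likely to hit the rate limit again in lockstep.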
Terraform state conflicts
- Use remote state locking
- Coordinate with team on apply operations
- Use terraform plan before apply