Monitor servers

The monitoring feature is bundled with the AI Infrastructure - Metal Cloud service.

Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You can choose the observability solution that best fits your needs.

Metrics | Cluster (same VPC) | Single server
Total nodes and nodes down
GPU model, driver, CUDA version
Control state
Uptime
Total GPUs and GPUs down
GPU utilization
GPU memory
CPU utilization
System memory
Root storage usage
Local disk usage
Per-GPU details: Power consumption, temperature, GPU utilization, VRAM usage
Network bandwidth Inbound/Outbound
Network packets sent and received
Network error rate receive/send
System fan speed
System voltage
Common alerts
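
To cross-check the per-GPU figures listed above (power consumption, temperature, GPU utilization, VRAM usage) directly on a server, a minimal sketch along the following lines may help. It is not the monitoring service's own collection method. It assumes the NVIDIA driver and the `nvidia-smi` CLI are installed on the server, and the helper names `parse_gpu_csv` and `read_gpu_metrics` are hypothetical.

```python
import csv
import io
import subprocess

# Standard nvidia-smi --query-gpu property names, one per per-GPU metric.
FIELDS = ["index", "power.draw", "temperature.gpu", "utilization.gpu", "memory.used"]


def _num(value):
    """Convert a CSV cell to float; return None for values such as '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [v.strip() for v in record]
        rows.append({
            "gpu": int(values[0]),
            "power_w": _num(values[1]),
            "temperature_c": _num(values[2]),
            "utilization_pct": _num(values[3]),
            "vram_used_mib": _num(values[4]),
        })
    return rows


def read_gpu_metrics():
    """Query the local GPUs (assumes nvidia-smi is on PATH)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `read_gpu_metrics()` on a GPU server returns one dictionary per GPU. The parser is kept separate so it can be exercised on captured output without GPU hardware.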