Monitor servers

The monitoring feature is bundled with the AI Infrastructure - Metal Cloud service.

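Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You can choose the observability solution that best fits your needs. The following table lists the metrics available at the cluster level (servers in the same VPC) and for a single server.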

Metrics	Cluster (same VPC)	Single server
Total nodes and nodes down	✔
GPU model, driver, CUDA version		✔
Control state	✔
Uptime		✔
Total GPUs and GPUs down	✔	✔
GPU utilization	✔	✔
GPU memory	✔	✔
CPU utilization	✔	✔
System memory	✔	✔
Root storage usage	✔	✔
Local disk usage	✔	✔
Per-GPU details: power consumption, temperature, GPU utilization, VRAM usage		✔
Network bandwidth (inbound/outbound)	✔	✔
Network packets sent and received	✔	✔
Network error rate (receive/send)		✔
System fan speed		✔
System voltage		✔
Common alerts	✔
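
To cross-check the per-GPU details listed for a single server (power consumption, temperature, GPU utilization, VRAM usage), you can read the same kinds of values directly on the server. The sketch below is not how the bundled monitoring feature collects its data, which this page does not describe. It assumes NVIDIA GPUs with the `nvidia-smi` tool on the PATH, and the helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; chosen to mirror the per-GPU details
# in the table above (an assumption, not the service's own field list).
QUERY_FIELDS = [
    "index",
    "name",
    "power.draw",
    "temperature.gpu",
    "utilization.gpu",
    "memory.used",
    "memory.total",
]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for record in reader:
        if not record:
            continue  # skip blank lines
        values = (value.strip() for value in record)
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi locally and return one dict per GPU (values stay strings)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Values are kept as strings because `nvidia-smi` can report placeholders such as `[N/A]` for fields a GPU does not expose. Convert to numbers only after checking for those.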