Monitor servers
The monitoring feature is bundled with the AI Infrastructure - Metal Cloud service.
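Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You can choose the observability solution that best fits your needs. The table below lists which metrics are available for a cluster (servers in the same VPC) and for a single server.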
| Metrics | Cluster (same VPC) | Single server |
|---|---|---|
| Total nodes and nodes down | ✔ | |
| GPU model, driver, CUDA version | ✔ | |
| Control state | ✔ | |
| Uptime | ✔ | |
| Total GPUs and GPUs down | ✔ | ✔ |
| GPU utilization | ✔ | ✔ |
| GPU memory | ✔ | ✔ |
| CPU utilization | ✔ | ✔ |
| System memory | ✔ | ✔ |
| Root storage usage | ✔ | ✔ |
| Local disk usage | ✔ | ✔ |
| Per-GPU details: power consumption, temperature, GPU utilization, VRAM usage | ✔ | |
| Network bandwidth (inbound/outbound) | ✔ | ✔ |
| Network packets sent and received | ✔ | ✔ |
| Network error rate (receive/send) | ✔ | |
| System fan speed | ✔ | |
| System voltage | ✔ | |
| Common alerts | ✔ | |
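
If you script against this availability matrix (for example, to decide which dashboard panels to expect for a given deployment), the table can be transcribed into a small lookup. The following is a minimal sketch, not part of the service: the names `AVAILABILITY`, `metrics_for`, and `cluster_only` are hypothetical, and "Common alerts" is assumed to be cluster-only because the table marks only the cluster column for it.

```python
# Availability matrix transcribed from the table above.
# Each metric maps to the set of scopes in which it is reported.
CLUSTER = "cluster"
SINGLE = "single_server"

AVAILABILITY = {
    "Total nodes and nodes down": {CLUSTER},
    "GPU model, driver, CUDA version": {CLUSTER},
    "Control state": {CLUSTER},
    "Uptime": {CLUSTER},
    "Total GPUs and GPUs down": {CLUSTER, SINGLE},
    "GPU utilization": {CLUSTER, SINGLE},
    "GPU memory": {CLUSTER, SINGLE},
    "CPU utilization": {CLUSTER, SINGLE},
    "System memory": {CLUSTER, SINGLE},
    "Root storage usage": {CLUSTER, SINGLE},
    "Local disk usage": {CLUSTER, SINGLE},
    "Per-GPU details": {CLUSTER},
    "Network bandwidth (inbound/outbound)": {CLUSTER, SINGLE},
    "Network packets sent and received": {CLUSTER, SINGLE},
    "Network error rate (receive/send)": {CLUSTER},
    "System fan speed": {CLUSTER},
    "System voltage": {CLUSTER},
    # Assumption: the table marks only the cluster column for this row.
    "Common alerts": {CLUSTER},
}


def metrics_for(scope):
    """Return the metric names reported for the given scope."""
    return [name for name, scopes in AVAILABILITY.items() if scope in scopes]


def cluster_only():
    """Return the metric names reported only at cluster level."""
    return [name for name, scopes in AVAILABILITY.items() if scopes == {CLUSTER}]


if __name__ == "__main__":
    print("Single-server metrics:")
    for name in metrics_for(SINGLE):
        print(" -", name)
    print("Cluster-only metrics:")
    for name in cluster_only():
        print(" -", name)
```

Keeping the matrix as plain data makes it easy to update if the table changes, and the two helpers cover the common questions: what a single server reports, and what requires a cluster view.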