Using GPU telemetry
FPT Cloud uses NVIDIA GPU Telemetry integrated with kube-prometheus-stack as the monitoring and observability toolset for GPU-based systems on Kubernetes. The monitoring stack includes a collector, a time-series database for storing metrics, and a visualization layer, built on the popular open-source applications Prometheus and Grafana. The stack also includes Alertmanager for creating and managing alerts. Prometheus is deployed together with kube-state-metrics and node_exporter, which expose cluster-level metrics for Kubernetes API objects and node-level metrics such as CPU utilization. GPU metrics such as GPU utilization are exposed separately as DCGM metrics.
Check GPU custom metrics with the following command:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM
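The grep above prints whole JSON lines. As a minimal sketch, the hypothetical helper below (list_dcgm_metrics, not part of any FPT Cloud or NVIDIA tooling) reduces the same API response to a sorted list of unique metric names. It assumes the response carries each metric in a "name" field and relies only on grep, sed, and sort.

```shell
# Hypothetical helper: read a custom metrics API response on stdin and
# print each unique resource name that contains "DCGM".
list_dcgm_metrics() {
  grep -o '"name": *"[^"]*DCGM[^"]*"' \
    | sed 's/^"name": *"//; s/"$//' \
    | sort -u
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | list_dcgm_metrics
```

An empty result suggests that no DCGM metrics are registered with the custom metrics API yet.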
Access Prometheus to check GPU DCGM metrics:
kubectl port-forward service/kube-prometheus-stack-1679-prometheus 63090:9090
http://localhost:63090/
On the Prometheus interface, enter a DCGM metric name (for example, DCGM_FI_DEV_GPU_UTIL) in the expression field and click Execute to check GPU DCGM metrics.
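The same check can also be made from the command line through the Prometheus HTTP API. The sketch below is an illustration under stated assumptions: prom_instant_query is a hypothetical helper name, the port-forward to localhost:63090 is still running, and DCGM_FI_DEV_GPU_UTIL (a standard dcgm-exporter metric) is enabled in the deployment.

```shell
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# $1 = base URL of the forwarded Prometheus service, $2 = PromQL expression.
prom_instant_query() {
  # --get sends the URL-encoded expression as a query string parameter.
  curl -s --get --data-urlencode "query=$2" "${1%/}/api/v1/query"
}

# Usage while the port-forward is running:
#   prom_instant_query http://localhost:63090 DCGM_FI_DEV_GPU_UTIL
```

The response is JSON. A "status" of "success" with an empty "result" list means the query ran but Prometheus holds no samples for that metric.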
Access the Grafana dashboard:
kubectl port-forward service/kube-prometheus-stack-1679050354-grafana 63080:80
http://localhost:63080/
Default credentials for logging in to Grafana:
- User: admin
- Password: prom-operator
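If the default password has been changed, it can usually be read back from the cluster. The sketch below assumes the Grafana Helm chart stored the credentials in a Secret named after the Grafana service, with an admin-password key, in the current namespace. grafana_admin_password is a hypothetical helper name.

```shell
# Hypothetical helper: print the Grafana admin password stored in a Secret.
# $1 = Secret name (assumed here to match the Grafana service name).
grafana_admin_password() {
  # Secret values are base64-encoded, so decode before printing.
  kubectl get secret "$1" -o jsonpath='{.data.admin-password}' | base64 -d
}

# Usage (requires kubectl access to the cluster):
#   grafana_admin_password kube-prometheus-stack-1679050354-grafana
```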
Import a Grafana Dashboard for GPU:
To import a dashboard, open the Grafana interface and navigate to Dashboards > Manage > Import. To use the FPT Cloud dashboard, paste the FPT Cloud GPU Dashboard JSON content into the import field and click Load.
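The import can also be scripted against Grafana's HTTP API instead of the web interface. This is a sketch under assumptions, not the documented FPT Cloud procedure. It assumes the dashboard JSON has been saved to a local file, that the file is a plain dashboard model (no "__inputs" placeholders and no "id" tied to another Grafana instance), and that the port-forward to localhost:63080 is running. Both function names are hypothetical.

```shell
# Hypothetical helper: wrap a dashboard JSON file in the request body
# expected by Grafana's POST /api/dashboards/db endpoint.
dashboard_payload() {
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$1")"
}

# Hypothetical helper: upload the dashboard through the forwarded Grafana port.
# $1 = dashboard JSON file, $2 = Grafana base URL, $3 = user:password.
import_dashboard() {
  dashboard_payload "$1" | curl -s -X POST -u "$3" \
    -H 'Content-Type: application/json' \
    --data-binary @- "${2%/}/api/dashboards/db"
}

# Usage while the port-forward is running (gpu-dashboard.json is a placeholder):
#   import_dashboard gpu-dashboard.json http://localhost:63080 admin:prom-operator
```

Setting "overwrite" to true replaces an existing dashboard with the same uid, which keeps repeated imports from failing.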
FPT Cloud GPU Dashboard:


