
MPS GPU sharing

MPS is an NVIDIA GPU feature that allows multiple containers to share the same physical GPU.

Compared with MIG, MPS allows finer-grained GPU partitioning: up to 48 containers can use the same GPU simultaneously, whereas MIG supports at most 7 instances per GPU on the GPUs that offer it.

MPS is based on NVIDIA's Multi-Process Service for CUDA, which allows multiple CUDA applications to run concurrently on a single GPU.

With MPS, you define the number of replicas for a GPU in advance. This value sets the maximum number of containers that can access and use the GPU at the same time.

In addition, you can limit GPU resources for each container by setting the following environment variables in the container:

  • CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
  • CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
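As a minimal sketch of the two variables above, the following shell lines set them for a process started inside the container. The values (20 percent, 1 GB) and the device ordinal 0 are illustrative assumptions, not recommendations from this guide; check the NVIDIA MPS documentation for the exact semantics on your driver version.

```shell
# Illustrative values only; adjust them to your workload.

# Limit this client to roughly 20% of the GPU's available threads.
# The value is a percentage between 0 and 100.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20

# Limit the device memory this client may allocate.
# Format: "<device ordinal>=<limit>"; device 0 is assumed to be the
# only GPU visible inside the container.
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G"
```

The same name/value pairs can instead be declared under the container's env section in the workload manifest, so they are in place before the application starts.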

To learn more about how MPS works, visit: https://docs.nvidia.com/deploy/mps/

Configuring MPS on FPTCloud K8s GPU service

You can configure a GPU worker group to use MPS during worker group initialization as shown in the following illustration:

With this configuration, the GPU is "divided" into 48 parts, each with 1/48 of the compute capability and memory of the original physical GPU.
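To make the 1/48 share concrete, here is a small arithmetic sketch. The 80 GiB GPU memory size is an assumption chosen for illustration; substitute the memory size of your own GPU model.

```shell
replicas=48           # MPS replicas configured for the worker group
gpu_mem_mib=81920     # assumption: a GPU with 80 GiB of memory

# Memory available to each replica (integer division, in MiB).
per_replica_mem_mib=$(( gpu_mem_mib / replicas ))

# Compute share of each replica, as a percentage of the physical GPU.
per_replica_compute_pct=$(awk -v r="$replicas" 'BEGIN { printf "%.2f", 100 / r }')

echo "memory per replica:  ${per_replica_mem_mib} MiB"    # 1706 MiB
echo "compute per replica: ${per_replica_compute_pct} %"  # 2.08 %
```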

Verify MPS

You can check the MPS configuration on your GPU node with:

kubectl describe nodes $NODE_NAME

Output:

At this point, you can request up to 48 nvidia.com/gpu resources for your pods. Each nvidia.com/gpu resource corresponds to 1/48 of the compute capability and memory of the original physical GPU.

If your node has 2 GPUs, it reports 96 nvidia.com/gpu resources (2 GPUs × 48 replicas each).
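The check described above can be scripted. The sketch below defines two helper functions whose names are hypothetical: one computes the expected resource count from the number of physical GPUs and the replicas per GPU, the other reads the allocatable nvidia.com/gpu count of a node with a JSONPath query. It assumes kubectl is already configured for your cluster.

```shell
# Expected number of nvidia.com/gpu resources on a node.
#   $1 = physical GPUs on the node
#   $2 = MPS replicas configured per GPU
expected_gpu_resources() {
  echo $(( $1 * $2 ))
}

# Allocatable nvidia.com/gpu resources reported by a node.
#   $1 = node name
# The dots in the resource name are escaped so JSONPath treats
# "nvidia.com/gpu" as a single key.
allocatable_gpu_resources() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

For example, `expected_gpu_resources 2 48` prints 96, which you can compare with the output of `allocatable_gpu_resources "$NODE_NAME"`.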

Important notes

  • Each container must request exactly 1 nvidia.com/gpu resource.
  • The number of clients (replicas) must be between 2 and 48. Physical GPU resources are divided equally among the configured maximum number of clients.
  • Each container should run a single process; this avoids errors in MPS sharing mode.
  • The hostIPC: true field is required in the workload deployment manifest.
  • MPS has limitations regarding error containment and workload isolation. Please review and consider these before use.
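The notes above can be combined into a workload manifest. The following is a minimal sketch, not an official FPTCloud template: the Deployment name, labels, file name, and container image are placeholders chosen for illustration. It writes the manifest to a local file so you can review it before applying.

```shell
# Write a sample Deployment manifest to mps-demo.yaml (placeholder file name).
cat > mps-demo.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mps-demo                # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mps-demo
  template:
    metadata:
      labels:
        app: mps-demo
    spec:
      hostIPC: true             # required for MPS sharing
      containers:
        - name: cuda-app
          image: registry.example.com/cuda-app:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # each container requests exactly 1
EOF

# After reviewing the file, apply it with:
#   kubectl apply -f mps-demo.yaml
```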