Installing the NVIDIA GPU Operator
This guide explains how to install the NVIDIA GPU Operator on your Gardener Kubernetes cluster. The GPU Operator automates the management of all NVIDIA software components needed to provision and manage GPUs in Kubernetes.
Prerequisites
- A Kubernetes cluster on Gardener with GPU-enabled worker nodes (e.g., V100, A30, A100)
- Helm installed on your local machine
- kubectl configured to access your cluster
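Before starting, it can help to confirm that both command-line tools are on your PATH. The loop below is a small sketch rather than part of the official procedure; the variable name `missing_tools` is chosen here for illustration.

```shell
# Collect the names of any required tools that are not installed.
missing_tools=""
for tool in helm kubectl; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "missing: $tool"
    missing_tools="$missing_tools $tool"
  fi
done

# An empty result means both tools were found.
if [ -z "$missing_tools" ]; then
  echo "helm and kubectl are available"
fi
```

If both tools are present, `kubectl get nodes` is a quick way to confirm that kubectl can actually reach your cluster.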
What the GPU Operator Provides
The NVIDIA GPU Operator automatically installs and manages:
- NVIDIA drivers
- NVIDIA Container Toolkit
- Kubernetes device plugin for GPUs
- GPU monitoring tools (DCGM Exporter)
This means you don't need to manually install drivers on your nodes.
Installation
Add the NVIDIA Helm repository:
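The command for this step is not reproduced above. A minimal sketch follows, assuming the public NVIDIA NGC Helm repository URL and the repository alias `nvidia` that the install command relies on; check the URL against NVIDIA's current documentation.

```shell
# Assumption: public NVIDIA NGC Helm repository URL.
NVIDIA_HELM_REPO_URL="https://helm.ngc.nvidia.com/nvidia"

# Register the repository under the alias "nvidia".
helm repo add nvidia "$NVIDIA_HELM_REPO_URL"

# Refresh the local chart index so the requested chart version can be resolved.
helm repo update
```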
Create the namespace and install the operator:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=true \
  --version v24.9.0 \
  --wait
Verifying the Installation
Wait for all pods to be ready (this can take a few minutes):
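One way to follow progress is to list the pods in the operator namespace, as sketched below. The namespace matches the one created during installation; the variable name is illustrative.

```shell
# Namespace created in the installation step.
GPU_OPERATOR_NAMESPACE="gpu-operator"

# List the operator pods; re-run the command until the statuses settle.
kubectl get pods --namespace "$GPU_OPERATOR_NAMESPACE"
```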
All pods should eventually show Running or Completed status.
Verify the GPU is detected by checking the node labels:
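The exact command is not preserved here. The sketch below shows selected labels as extra columns, assuming the label keys that GPU Feature Discovery typically sets; adjust the list if your cluster uses different keys.

```shell
# Assumption: label keys typically applied by GPU Feature Discovery.
GPU_LABEL_COLUMNS="nvidia.com/gpu.present,nvidia.com/gpu.product,nvidia.com/gpu.count"

# Print every node with the chosen labels rendered as columns.
kubectl get nodes --label-columns "$GPU_LABEL_COLUMNS"
```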
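On a GPU node, you should see NVIDIA-specific labels such as nvidia.com/gpu.present=true, together with nvidia.com/gpu.count and nvidia.com/gpu.product.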
You can also check the allocatable resources:
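A possible way to do this is a custom-columns query that prints the allocatable `nvidia.com/gpu` count per node. This is a sketch: the column titles are arbitrary, and the backslash escapes the dot inside the resource name.

```shell
# Column specification: node name plus allocatable GPU count.
GPU_COLUMNS='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Nodes without GPUs show <none> in the GPUS column.
kubectl get nodes --output "custom-columns=$GPU_COLUMNS"
```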
Using GPUs in Your Pods
Once the operator is installed, you can request GPUs in your pod specifications:
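As an illustration, the snippet below writes a minimal pod manifest that requests one GPU through the `nvidia.com/gpu` resource limit. The pod name, file name, and container image tag are assumptions made for this sketch; substitute your own workload.

```shell
# Write a minimal test manifest to the current directory.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # request exactly one GPU
EOF
```

Apply it with `kubectl apply -f gpu-test-pod.yaml`. Once the pod has completed, `kubectl logs gpu-test` should print the `nvidia-smi` report.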
To ensure your pods are scheduled on GPU nodes, use a node selector:
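The original selector is not shown here. The sketch below assumes the `nvidia.com/gpu.present` label set by GPU Feature Discovery; a worker-pool label from your cluster would work just as well. Pod name, file name, and image tag are again hypothetical.

```shell
# Write a manifest that pins the pod to nodes labelled as having a GPU.
cat > gpu-selector-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-selector-test       # hypothetical pod name
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed label; the value must be a string
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```

Apply it with `kubectl apply -f gpu-selector-pod.yaml`. The label value is quoted because node selector values must be strings, not booleans.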
Troubleshooting
Pods stuck in Pending
Check if the GPU node is available and has allocatable GPUs:
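A sketch of such a check, assuming GPU nodes carry the `nvidia.com/gpu.present=true` label:

```shell
# Assumption: label applied to GPU nodes by GPU Feature Discovery.
GPU_NODE_SELECTOR="nvidia.com/gpu.present=true"

# Confirm that at least one GPU node exists and is Ready.
kubectl get nodes --selector "$GPU_NODE_SELECTOR"

# Keep only the GPU resource lines from the node description.
kubectl describe nodes --selector "$GPU_NODE_SELECTOR" | grep -E 'nvidia\.com/gpu[: ]'
```

If the first command returns no nodes, or the GPU resource shows zero available, the pod cannot be scheduled. `kubectl describe pod <pod-name>` shows the scheduler's reason in its events.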
Driver installation failing
Check the driver installer pod logs:
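The following is a sketch that selects the driver pods by label. The selector `app=nvidia-driver-daemonset` is an assumption about how the operator labels its driver DaemonSet; confirm it with `kubectl get pods -n gpu-operator --show-labels`.

```shell
GPU_OPERATOR_NAMESPACE="gpu-operator"
# Assumption: label carried by the driver DaemonSet pods.
DRIVER_POD_SELECTOR="app=nvidia-driver-daemonset"

# Show the most recent log lines from every container in the driver pods.
kubectl logs --namespace "$GPU_OPERATOR_NAMESPACE" \
  --selector "$DRIVER_POD_SELECTOR" \
  --all-containers \
  --tail=100
```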
GPU not detected
Verify the GPU hardware is present on the node:
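The original command is not preserved. One indirect check, sketched below, relies on Node Feature Discovery, which labels nodes that expose a PCI device with NVIDIA's vendor ID (10de). The label key is an assumption about the default configuration shipped with the operator.

```shell
# Assumption: NFD label for nodes with an NVIDIA PCI device (vendor ID 10de).
NVIDIA_PCI_LABEL="feature.node.kubernetes.io/pci-10de.present=true"

# List nodes on which NFD detected NVIDIA hardware.
kubectl get nodes --selector "$NVIDIA_PCI_LABEL"
```

If the expected node is missing from this list, the hardware may not be visible to the operating system, which points to the machine type or worker pool configuration rather than the operator.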
Upgrading
To upgrade the GPU Operator to a newer version:
helm repo update
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=true \
  --version <new-version>
Uninstalling
To remove the GPU Operator:
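The removal commands are not reproduced above. A minimal sketch follows, assuming the release name and namespace used in the installation step; deleting the namespace is optional and removes everything left inside it.

```shell
GPU_OPERATOR_RELEASE="gpu-operator"
GPU_OPERATOR_NAMESPACE="gpu-operator"

# Remove the Helm release and the resources it manages.
helm uninstall "$GPU_OPERATOR_RELEASE" --namespace "$GPU_OPERATOR_NAMESPACE"

# Optional: delete the now-empty namespace.
kubectl delete namespace "$GPU_OPERATOR_NAMESPACE"
```

Helm does not necessarily remove custom resource definitions on uninstall, so `kubectl get crd | grep nvidia.com` is a reasonable follow-up check.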