Installing the NVIDIA GPU Operator

This guide explains how to install the NVIDIA GPU Operator on your Gardener Kubernetes cluster. The GPU Operator automates the management of all NVIDIA software components needed to provision and manage GPUs in Kubernetes.

Prerequisites

What the GPU Operator Provides

The NVIDIA GPU Operator automatically installs and manages:

  • NVIDIA drivers
  • NVIDIA Container Toolkit
  • Kubernetes device plugin for GPUs
  • GPU monitoring tools

This means you don't need to manually install drivers on your nodes.

Installation

Add the NVIDIA Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Create the namespace and install the operator:

kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=true \
  --version v24.9.0 \
  --wait

Verifying the Installation

Wait for all pods to be ready (this can take a few minutes):

kubectl get pods -n gpu-operator -w

All pods should eventually show Running or Completed status. Completed is expected for one-shot pods such as the validators, which exit once their checks pass.
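For scripted or CI use, the interactive watch can be replaced by a polling loop. The following is a minimal sketch, not part of the operator tooling: the function names `all_pods_settled` and `wait_for_gpu_operator` and the default timeout values are assumptions chosen for illustration. It parses the STATUS column of `kubectl get pods --no-headers`, because `kubectl wait --for=condition=Ready` would never succeed for pods that end in Completed.

```shell
# Returns 0 when every pod in the namespace is Running or Completed.
# (Hypothetical helper; the name is not part of any NVIDIA or Gardener tooling.)
all_pods_settled() {
  local ns statuses
  ns="${1:-gpu-operator}"
  # Column 3 of `kubectl get pods --no-headers` is STATUS.
  statuses="$(kubectl get pods -n "$ns" --no-headers | awk '{print $3}')"
  # No pods listed yet counts as "not settled".
  [ -n "$statuses" ] || return 1
  # Succeed only if no status differs from Running or Completed.
  ! printf '%s\n' "$statuses" | grep -qvE '^(Running|Completed)$'
}

# Polls all_pods_settled until it succeeds or the timeout (seconds) expires.
wait_for_gpu_operator() {
  local ns timeout interval waited
  ns="${1:-gpu-operator}"
  timeout="${2:-600}"   # assumed default: 10 minutes
  interval="${3:-10}"   # assumed default: poll every 10 seconds
  waited=0
  until all_pods_settled "$ns"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for pods in $ns" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "all pods in $ns are Running or Completed"
}
```

After sourcing these functions, `wait_for_gpu_operator gpu-operator 900 15` would poll every 15 seconds for up to 15 minutes and return a non-zero exit code on timeout.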

Verify the GPU is detected by checking the node labels:

kubectl describe nodes | grep -A5 "nvidia.com/gpu"

You should see output similar to:

nvidia.com/gpu.count=1
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-A30

You can also check the allocatable resources:

kubectl get nodes -o json | jq '.items[].status.allocatable | select(.["nvidia.com/gpu"] != null)'

Using GPUs in Your Pods

Once the operator is installed, you can request GPUs in the resources section of a container in your pod specification. GPUs are requested under limits:

resources:
  limits:
    nvidia.com/gpu: 1

To ensure your pods are scheduled on GPU nodes, use a node selector:

nodeSelector:
  nvidia.com/gpu.present: "true"
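Putting the two fragments together, a complete test pod might look like the sketch below. The file name `gpu-test-pod.yaml`, the pod name `gpu-test`, the container name, and the sample image tag are assumptions for illustration; substitute any CUDA-capable image your cluster can pull.

```shell
# Write a minimal GPU test pod manifest to the current directory.
# Pod name, container name, and image are illustrative assumptions.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  # Schedule only onto nodes the operator has labeled as having a GPU.
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          # Request one GPU from the device plugin.
          nvidia.com/gpu: 1
EOF
```

Apply it with `kubectl apply -f gpu-test-pod.yaml` and inspect the result with `kubectl logs gpu-test`. If the pod stays in Pending, see the Troubleshooting section.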

Troubleshooting

Pods stuck in Pending

Check if the GPU node is available and has allocatable GPUs:

kubectl describe node <node-name> | grep -A10 "Allocatable"
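If you only need the GPU count rather than the full Allocatable block, a JSONPath query is more precise than grep. This is a small sketch; the helper name `gpu_allocatable` is an assumption, and it prints 0 when the node does not advertise the `nvidia.com/gpu` resource at all.

```shell
# Print the number of allocatable GPUs on a node (0 if the resource is absent).
# (Hypothetical helper name, for illustration only.)
gpu_allocatable() {
  local node count
  node="$1"
  # The dots inside the resource name must be escaped in JSONPath.
  count="$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
  echo "${count:-0}"
}
```

Call it as `gpu_allocatable <node-name>`. A result of 0 on a GPU node suggests the device plugin has not registered the GPUs yet, so check the operator pods before looking at the workload.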

Driver installation failing

Check the driver installer pod logs:

kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

GPU not detected

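Verify the GPU hardware is present on the node. The stock ubuntu image does not include lspci, so install pciutils inside the debug container first (this requires outbound network access from the node):

kubectl debug node/<node-name> -it --image=ubuntu -- bash -c "apt-get update && apt-get install -y pciutils && lspci | grep -i nvidia"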

Upgrading

To upgrade the GPU Operator to a newer version:

helm repo update
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=true \
  --version <new-version>

Uninstalling

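To remove the GPU Operator and its namespace (Helm does not delete the CustomResourceDefinitions the chart installed, such as clusterpolicies.nvidia.com, so remove those separately if you no longer need them):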

helm uninstall gpu-operator -n gpu-operator
kubectl delete ns gpu-operator