How to build an NVIDIA GPU driver image for Rocky Linux for the GPU Operator

Nov 9, 2022

In the last article, we looked at a simple way to use the GPU Operator on Rocky Linux. However, no centos8 image is provided for GPU driver 470 or higher, so that approach does not work for those driver versions. This article therefore summarizes how to build the GPU driver image yourself so that driver 470 or higher can be used. (Browsing the NVIDIA GitLab driver project before reading should make the content easier to follow.)

Prerequisite

Since we are going to create a GPU driver image, we need to prepare a container build environment.

Linux
- docker 20.10.12
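
The article only records the Docker version it was tested with; whether older versions also work is not stated. As a minimal sketch, assuming GNU `sort -V` is available, the following compares the local Docker server version against 20.10.12. The function name `version_ge` is made up for this example.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to B.
# Relies on GNU sort's -V (version sort), which is an assumption about the host.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# 20.10.12 is the version used in this article, not a documented minimum.
TESTED_DOCKER_VERSION="20.10.12"

if command -v docker >/dev/null 2>&1; then
    current="$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
    if [ -n "$current" ] && version_ge "$current" "$TESTED_DOCKER_VERSION"; then
        echo "docker $current is at least $TESTED_DOCKER_VERSION"
    else
        echo "docker version could not be confirmed (got: ${current:-none})"
    fi
else
    echo "docker not found on PATH"
fi
```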

Dockerfile

If you look at the driver project on NVIDIA GitLab, there is a Dockerfile for building the CentOS 8 driver image, and this file was used almost without modification.

ARG BUNDLE_IMAGE=nvcr.io/nvidia/cuda
ARG CUDA_VERSION=11.6.0

FROM ${BUNDLE_IMAGE}:${CUDA_VERSION}-base-ubi8

ARG DONKEY_VERSION=v1.1.0

ENV NVIDIA_VISIBLE_DEVICES=void

RUN NVIDIA_GPGKEY_SUM=d0664fbbdb8c32356d45de36c5984617217b2d0bef41b93ccecd326ba3b80c87 && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM  /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

RUN dnf install -y \
        ca-certificates \
        curl \
        gcc \
        glibc.i686 \
        make \
        dnf-utils \
        kmod && \
    rm -rf /var/cache/dnf/*

RUN curl -fsSL -o /usr/local/bin/donkey https://github.com/3XX0/donkey/releases/download/${DONKEY_VERSION}/donkey && \
    curl -fsSL -o /usr/local/bin/extract-vmlinux https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux && \
    chmod +x /usr/local/bin/donkey /usr/local/bin/extract-vmlinux

#ARG BASE_URL=http://us.download.nvidia.com/XFree86/Linux-x86_64
ARG BASE_URL=https://us.download.nvidia.com/tesla
ARG DRIVER_VERSION
ENV DRIVER_VERSION=$DRIVER_VERSION

# Install the userspace components and copy the kernel module sources.
RUN cd /tmp && \
    curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run && \
    sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run -x && \
    cd NVIDIA-Linux-x86_64-$DRIVER_VERSION && \
    ./nvidia-installer --silent \
                       --no-kernel-module \
                       --install-compat32-libs \
                       --no-nouveau-check \
                       --no-nvidia-modprobe \
                       --no-rpms \
                       --no-backup \
                       --no-check-for-alternate-installs \
                       --no-libglx-indirect \
                       --no-install-libglvnd \
                       --x-prefix=/tmp/null \
                       --x-module-path=/tmp/null \
                       --x-library-path=/tmp/null \
                       --x-sysconfig-path=/tmp/null && \
    mkdir -p /usr/src/nvidia-$DRIVER_VERSION && \
    mv LICENSE mkprecompiled kernel /usr/src/nvidia-$DRIVER_VERSION && \
    sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest > /usr/src/nvidia-$DRIVER_VERSION/.manifest && \
    rm -rf /tmp/*

ADD --chown=root:root https://gitlab.com/nvidia/container-images/driver/-/raw/master/centos8/nvidia-driver /usr/local/bin/nvidia-driver

RUN chmod 755 /usr/local/bin/nvidia-driver
#COPY nvidia-driver /usr/local/bin

WORKDIR /usr/src/nvidia-$DRIVER_VERSION

#ARG PUBLIC_KEY=empty
#COPY ${PUBLIC_KEY} kernel/pubkey.x509

#ARG PRIVATE_KEY

# Remove cuda repository to avoid GPG errors
RUN rm -f /etc/yum.repos.d/cuda.repo

# Add NGC DL license from the CUDA image
# chmod of nvidia-driver
RUN mkdir /licenses && \
    mv /NGC-DL-CONTAINER-LICENSE /licenses/NGC-DL-CONTAINER-LICENSE && \
    chmod 755 /usr/local/bin/nvidia-driver

ENTRYPOINT ["nvidia-driver", "init"]
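
Two steps in the Dockerfile above are terse enough to be worth trying in isolation: the `sha256sum -c --strict -` check on the GPG key, and the `sed` expression that trims `.manifest`. The sketch below wraps both in small shell functions so they can be run against local files. The function names are made up for this example, and it assumes GNU `sed` and coreutils.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_SUM
# Same pattern as the Dockerfile: pipe "<sum>  <path>" into sha256sum and let
# it compare against the file on disk. Succeeds only when the sums match.
verify_sha256() {
    echo "$2  $1" | sha256sum -c --strict - >/dev/null 2>&1
}

# filter_manifest FILE
# The same sed expression the Dockerfile applies to .manifest:
# lines 1-8 are passed through untouched; from line 9 to the end, every line
# that does NOT start with "kernel" or "LICENSE" is deleted.
filter_manifest() {
    sed '9,${/^\(kernel\|LICENSE\)/!d}' "$1"
}
```

For example, `verify_sha256 ./key.pub "$EXPECTED"` fails as soon as the downloaded key differs from the pinned checksum, which is what makes the Dockerfile build stop if the key changes. Running `filter_manifest` on a manifest shows that only the entries for the `kernel` sources and `LICENSE`, the files moved to `/usr/src/nvidia-$DRIVER_VERSION`, survive past line 8.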

Makefile

I also wrote a simple Makefile to make building the image easier.

include $(CURDIR)/version.mk

CRI ?= docker
CRI_ARGS ?= --no-cache --force-rm --progress=plain

DOCKERFILE ?= $(CURDIR)/rocky/Dockerfile

BUNDLE_IMAGE ?= nvcr.io/nvidia/cuda
OUT_IMAGE_NAME := docker.io/awslife/nvidia/driver
OUT_IMAGE_TAG := $(DRIVER_VERSION)-$(OS_NAME)$(OS_VERSION)

.DEFAULT_GOAL := default

default: .build_rocky .tag .push

.PHONY: build
build: .build_rocky

.PHONY: tag
tag: .tag

.PHONY: push
push: .push

.PHONY: clean
clean: .clean

.build_rocky:
	DOCKER_BUILDKIT=1
	$(CRI) build $(CRI_ARGS) \
		--tag $(OUT_IMAGE_NAME):$(OUT_IMAGE_TAG) \
		--build-arg BUNDLE_IMAGE=$(BUNDLE_IMAGE) \
		--build-arg CUDA_VERSION=$(CUDA_VERSION) \
		--build-arg DRIVER_VERSION=$(DRIVER_VERSION) \
		--build-arg DONKEY_VERSION=$(DONKEY_VERSION) \
		--file $(DOCKERFILE) $(CURDIR)/rocky

.tag:
	@echo "tag"

.push:
	@echo "push"

.clean:
	$(CRI) rmi -f $(BUNDLE_IMAGE):$(CUDA_VERSION)-base-ubi8
	$(CRI) rmi -f $(OUT_IMAGE_NAME):$(OUT_IMAGE_TAG)
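
The Makefile pulls its version variables from `version.mk`, which is not shown here. As a rough sketch, the shell script below reproduces how the image reference is composed and prints the equivalent build command without running it. The default values are the ones visible in this article's build log; treating them as the contents of `version.mk` is an assumption, and the function names are made up for this example.

```shell
#!/bin/sh
# Defaults copied from the build log in this article. That they match
# version.mk is an assumption; override them through the environment.
DRIVER_VERSION="${DRIVER_VERSION:-470.141.03}"
CUDA_VERSION="${CUDA_VERSION:-11.6.0}"
DONKEY_VERSION="${DONKEY_VERSION:-v1.1.0}"
OS_NAME="${OS_NAME:-rocky}"
OS_VERSION="${OS_VERSION:-8.6}"
BUNDLE_IMAGE="${BUNDLE_IMAGE:-nvcr.io/nvidia/cuda}"
OUT_IMAGE_NAME="docker.io/awslife/nvidia/driver"

# Mirrors OUT_IMAGE_NAME:OUT_IMAGE_TAG from the Makefile.
image_ref() {
    printf '%s:%s-%s%s\n' "$OUT_IMAGE_NAME" "$DRIVER_VERSION" "$OS_NAME" "$OS_VERSION"
}

# Prints the docker build command instead of executing it (dry run).
build_command() {
    printf 'docker build --no-cache --force-rm --progress=plain'
    printf ' --tag %s' "$(image_ref)"
    printf ' --build-arg BUNDLE_IMAGE=%s' "$BUNDLE_IMAGE"
    printf ' --build-arg CUDA_VERSION=%s' "$CUDA_VERSION"
    printf ' --build-arg DRIVER_VERSION=%s' "$DRIVER_VERSION"
    printf ' --build-arg DONKEY_VERSION=%s' "$DONKEY_VERSION"
    printf ' --file rocky/Dockerfile rocky\n'
}

build_command
```

Printing the command first makes it easy to confirm the resulting tag before spending time on a `--no-cache` build.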

Build

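Let’s build the image using the Dockerfile and Makefile we wrote. Because DOCKER_BUILDKIT=1 sits on its own recipe line, make runs it in a separate shell from the docker build command, so the output below comes from the legacy builder.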

$ make
DOCKER_BUILDKIT=1
docker build --no-cache --force-rm --progress=plain \
        --tag docker.io/awslife/nvidia/driver:470.141.03-rocky8.6 \
        --build-arg BUNDLE_IMAGE=nvcr.io/nvidia/cuda \
        --build-arg CUDA_VERSION=11.6.0 \
        --build-arg DRIVER_VERSION=470.141.03 \
        --build-arg DONKEY_VERSION=v1.1.0 \
        --file /home/awslife/Projects/nvidia-driver/rocky/Dockerfile /home/awslife/Projects/nvidia-driver/rocky
Sending build context to Docker daemon  4.608kB
Step 1/18 : ARG BUNDLE_IMAGE=nvcr.io/nvidia/cuda
Step 2/18 : ARG CUDA_VERSION=11.6.0
Step 3/18 : FROM ${BUNDLE_IMAGE}:${CUDA_VERSION}-base-ubi8
11.6.0-base-ubi8: Pulling from nvidia/cuda
...
Removing intermediate container 8a415a844bdc
 ---> 27189e8ebee5
Step 13/18 : ADD --chown=root:root https://gitlab.com/nvidia/container-images/driver/-/raw/master/centos8/nvidia-driver /usr/local/bin/nvidia-driver
Downloading  13.74kB
 ---> 6982d9b062c1
Step 14/18 : RUN chmod 755 /usr/local/bin/nvidia-driver
 ---> Running in 0baff614ffcd
Removing intermediate container 0baff614ffcd
 ---> 9bdb330f425f
Step 15/18 : WORKDIR /usr/src/nvidia-$DRIVER_VERSION
 ---> Running in fc2a2237c03e
Removing intermediate container fc2a2237c03e
 ---> 794e04df2175
Step 16/18 : RUN rm -f /etc/yum.repos.d/cuda.repo
 ---> Running in 6b90324539ed
Removing intermediate container 6b90324539ed
 ---> ae616fdb0e2f
Step 17/18 : RUN mkdir /licenses &&     mv /NGC-DL-CONTAINER-LICENSE /licenses/NGC-DL-CONTAINER-LICENSE &&     chmod 755 /usr/local/bin/nvidia-driver
 ---> Running in 0d3452d00e20
Removing intermediate container 0d3452d00e20
 ---> b2c1615c4237
Step 18/18 : ENTRYPOINT ["nvidia-driver", "init"]
 ---> Running in 14275c9999d5
Removing intermediate container 14275c9999d5
 ---> 2a944aa508a6
Successfully built 2a944aa508a6
Successfully tagged awslife/nvidia/driver:470.141.03-rocky8.6

Install NVIDIA GPU Operator for A100 MIG and Rocky

Everything is now ready to use MIG on an A100 GPU with Rocky Linux. Redeploy the GPU Operator using helm.

$ helm upgrade \
    -n gpu-operator-resources \
    --version v1.11.1 \
    --set mig.strategy=mixed \
    --set driver.repository=docker.io/awslife/nvidia/driver \
    --set driver.version=470.141.03 \
    gpu-operator \
    nvidia/gpu-operator
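
Note that the helm values pass only `470.141.03` as `driver.version`, while the image was tagged `470.141.03-rocky8.6`. My understanding, which is an assumption and not something stated in this article, is that the GPU Operator appends an OS suffix built from the node's os-release `ID` and `VERSION_ID`. Under that assumption, the following sketch computes the tag a node would ask for, so it can be compared with the tag that was built. The function names are made up for this example.

```shell
#!/bin/sh
# os_tag_suffix [OS_RELEASE_FILE]
# Prints ID followed by VERSION_ID from an os-release file, e.g. "rocky8.6".
# Pass a path containing a slash; the file is sourced in a subshell so its
# variables do not leak into the caller.
os_tag_suffix() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# expected_driver_tag DRIVER_VERSION [OS_RELEASE_FILE]
# Assumption: the operator requests "<driver.version>-<ID><VERSION_ID>".
expected_driver_tag() {
    printf '%s-%s\n' "$1" "$(os_tag_suffix "$2")"
}

if [ -r /etc/os-release ]; then
    echo "this host would ask for tag: $(expected_driver_tag 470.141.03)"
fi
```

If the printed tag differs from the one pushed to the registry, the driver pod would most likely fail to pull the image, so it is worth running this on a GPU node before redeploying.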

Conclusion

I created an NVIDIA GPU driver image for Rocky Linux. Everything here was put together with reference to the NVIDIA GitLab project, so visit it if you need more detailed information.

The full source can be found on GitHub at the link below.

https://github.com/awslife/nvidia-driver

References

nvidia / container-images / driver · GitLab (gitlab.com)

Platform Support - NVIDIA Cloud Native Technologies documentation (docs.nvidia.com)