Version v1.7 of the documentation is for the Talos version being developed. For the latest stable version of Talos, see the latest version.

NVIDIA Fabric Manager

In this guide we’ll follow the procedure to enable NVIDIA Fabric Manager.

NVIDIA GPUs that have nvlink support (for eg: A100) will need the nvidia-fabricmanager system extension also enabled in addition to the NVIDIA drivers. For more information on Fabric Manager refer https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html

The published versions of the NVIDIA fabricmanager system extensions is available here

The nvidia-fabricmanager extension version has to match with the NVIDIA driver version in use.

Enabling the NVIDIA fabricmanager system extension

Create the boot assets or a custom installer and perform a machine upgrade which include the following system extensions:

ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:535.129.03-v1.7.0-alpha.0
ghcr.io/siderolabs/nvidia-container-toolkit:535.129.03-v1.13.5
ghcr.io/siderolabs/nvidia-fabricmanager:535.129.03

Patch the machine configuration to load the required modules:

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1