NVIDIA Fabric Manager

This guide walks through the procedure to enable NVIDIA Fabric Manager.

NVIDIA GPUs with NVLink support (for example, the A100) need the nvidia-fabricmanager system extension enabled in addition to the NVIDIA drivers. For more information on Fabric Manager, refer to https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html

The published versions of the NVIDIA fabricmanager system extension are available here

The nvidia-fabricmanager extension version must match the NVIDIA driver version in use.

Upgrading Talos and enabling the NVIDIA fabricmanager system extension

In addition to the patch defined in the NVIDIA drivers guide, we need to add the nvidia-fabricmanager system extension to the patch file gpu-worker-patch.yaml:

- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:515.65.01-v1.2.3
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:515.65.01-v1.10.0
    - image: ghcr.io/siderolabs/nvidia-fabricmanager:515.65.01
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
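
Since the nvidia-fabricmanager version must match the driver version, it can help to check the image tags in the patch file before applying it. The following is a minimal sketch, not part of Talos: check_nvidia_versions is a hypothetical helper name, and it assumes the driver version is the part of each nvidia-* image tag before the first hyphen, as in the tags shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a Talos command): verify that all NVIDIA extension
# images in a patch file are pinned to the same driver version.
# Assumption: the driver version is the leading digits-and-dots part of the tag.
check_nvidia_versions() {
    patch_file="$1"

    # Extract the driver version from every ghcr.io/siderolabs/nvidia-* image
    # line and collapse duplicates.
    versions=$(sed -n 's|.*image: ghcr.io/siderolabs/nvidia-[a-z-]*:\([0-9.]*\).*|\1|p' "$patch_file" | sort -u)

    # Count the distinct, non-empty versions found.
    count=$(printf '%s\n' "$versions" | grep -c .)

    if [ "$count" -ne 1 ]; then
        echo "driver version mismatch, or no NVIDIA images found:" >&2
        printf '%s\n' "$versions" >&2
        return 1
    fi

    # Exactly one version is shared by all NVIDIA images.
    echo "$versions"
}
```

Running check_nvidia_versions gpu-worker-patch.yaml against the patch above prints 515.65.01; if the tags disagree, the function lists the versions it found on standard error and returns a non-zero status.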