GPU VM passthrough - Rocky9 VM

Hi there,

I have two GPUs (both Titan Xps, I love them): one for TrueNAS SCALE Dragonfish, and an isolated Titan Xp passed through to a dedicated Rocky 9 VM for CUDA development in Python and Fortran.

The Rocky 9 VM boots fine, so I installed the NVIDIA and CUDA packages via a normal “dnf” with no issues:

[nynros@rocky9entw ~]$ dnf list installed |grep nvidia-driver |grep x86
nvidia-driver.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
nvidia-driver-cuda.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
nvidia-driver-cuda-libs.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
nvidia-driver-libs.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
[nynros@rocky9entw ~]$ dnf list installed |grep cuda-driver |grep x86
cuda-driver-devel-12-6.x86_64 12.6.68-1 @cuda-rhel9-x86_64
[nynros@rocky9entw ~]$

Everything seems fine, including IOMMU and the rest:

IOMMU Group *
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237] (rev 02)
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] [8086:7010]
00:01.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 03)
00:02.0 VGA compatible controller [0300]: Red Hat, Inc. QXL paravirtual graphic card [1b36:0100] (rev 05)
00:03.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
00:04.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 03)
00:05.0 Communication controller [0780]: Red Hat, Inc. Virtio console [1af4:1003]
00:06.0 SCSI storage controller [0100]: Red Hat, Inc. Virtio block device [1af4:1001]
00:07.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [TITAN Xp] [10de:1b02] (rev a1)
00:08.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon [1af4:1002]

[root@rocky9entw ~]# grubby --info=DEFAULT
index=0
kernel="/boot/vmlinuz-5.14.0-427.33.1.el9_4.x86_64"
args="ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=6bdb4f8c-e8ff-4b5a-9d72-b03d59687b41 rhgb quiet $tuned_params rd.driver.blacklist=nouveau modprobe.blacklist=nouveau intel_iommu=on iommu=pt"
root="UUID=cbd44d25-e5be-4a46-8c4f-32d448d386c9"
initrd="/boot/initramfs-5.14.0-427.33.1.el9_4.x86_64.img $tuned_initrd"
title="Rocky Linux (5.14.0-427.33.1.el9_4.x86_64) 9.4 (Blue Onyx)"
id="e1ea827cb91d4815ba486c4d2b405f62-5.14.0-427.33.1.el9_4.x86_64"
[root@rocky9entw ~]#

[root@rocky9entw ~]# lspci -vnnn | grep -i nvidia
00:07.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [TITAN Xp] [10de:1b02] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:123f]
Kernel modules: nouveau, nvidia_drm, nvidia
[root@rocky9entw ~]#

[root@rocky9entw ~]# dkms status
nvidia-open/560.35.03, 5.14.0-427.33.1.el9_4.x86_64, x86_64: installed
[root@rocky9entw ~]#

[root@rocky9entw ~]# sestatus
SELinux status: disabled
[root@rocky9entw ~]#

[root@rocky9entw ~]# dnf repolist
repo id repo name
appstream Rocky Linux 9 - AppStream
baseos Rocky Linux 9 - BaseOS
crb Rocky Linux 9 - CRB
cuda-rhel9-x86_64 cuda-rhel9-x86_64
epel Extra Packages for Enterprise Linux 9 - x86_64
epel-cisco-openh264 Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64
extras Rocky Linux 9 - Extras
nvhpc NVIDIA HPC SDK
[root@rocky9entw ~]#

[nynros@rocky9entw ~]$ lsmod|grep nv
nvidia 9760768 1
libnvdimm 245760 1 nfit
drm 741376 7 drm_kms_helper,qxl,nvidia,drm_ttm_helper,ttm
[nynros@rocky9entw ~]$

But the driver can't communicate with the card, and I don't know why:

[Sat Aug 31 15:49:49 2024] nvidia: probe of 0000:00:07.0 failed with error -1
[Sat Aug 31 15:49:49 2024] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[Sat Aug 31 15:49:49 2024] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[Sat Aug 31 15:49:49 2024] nvidia 0000:00:07.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
NVRM: nvidia.ko because it does not include the required GPU
NVRM: www.nvidia.com.
[Sat Aug 31 15:49:49 2024] nvidia: probe of 0000:00:07.0 failed with error -1
[Sat Aug 31 15:49:49 2024] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[Sat Aug 31 15:49:49 2024] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[Sat Aug 31 15:49:49 2024] nvidia 0000:00:07.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
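
To cut a kernel log like the one above down to the driver's own messages, a small filter can help. This is only a minimal sketch: the function name `nvidia_log_lines` is made up for illustration, and it simply reads log text on stdin and keeps lines that mention NVRM or nvidia.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not part of any NVIDIA tool):
# keep only NVIDIA-related kernel log lines.
# Reads log text on stdin, so it works with `dmesg` or a saved log file.
nvidia_log_lines() {
    grep -E 'NVRM:|nvidia'
}

# Typical use (may need root where dmesg access is restricted):
#   dmesg -T | nvidia_log_lines
```

The NVRM lines are usually the most informative part, since they carry the driver's own explanation of why the probe failed.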

[nynros@rocky9entw ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Please help, thx

Solved it with the following document from NVIDIA and the post-installation instructions
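
For anyone landing here with the same error: the `dkms status` output above shows the `nvidia-open` flavor of the kernel module, and the NVRM message complains that nvidia.ko "does not include the required GPU". NVIDIA's open kernel modules only support GPUs that have a GSP (Turing and newer), while the Titan Xp is a Pascal (GP102) card, so that mismatch is a likely cause. The post does not spell out the exact fix, so treat this reading as an assumption. A minimal sketch for checking which flavor DKMS has built follows; the function name `nvidia_dkms_flavor` is hypothetical, and it assumes the proprietary module is registered in DKMS as `nvidia/<version>`.

```shell
#!/bin/sh
# Hypothetical helper: classify `dkms status` output read from stdin.
# Prints "open" if an nvidia-open/... entry is listed,
# "proprietary" if only an nvidia/... entry is listed (assumed DKMS name),
# and "none" if neither appears. "open" takes precedence if both are listed.
nvidia_dkms_flavor() {
    awk '
        /^nvidia-open\// { f = "open" }
        /^nvidia\//      { if (f == "") f = "proprietary" }
        END              { print (f == "" ? "none" : f) }
    '
}

# Typical use:
#   dkms status | nvidia_dkms_flavor
```

If this prints `open` on a pre-Turing card such as the Titan Xp, the driver installation guide and its post-installation section are the place to look for how to install the proprietary kernel module flavor instead.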