NVIDIA driver failing to install

jacob6707 · November 14, 2024, 9:27am

Hello, I recently upgraded from 23.10 to 24.04 and then to 24.10, and now Jellyfin won’t use my NVIDIA GPU. I searched the forums for other people with the same issue and tried installing the driver from the UI, which failed. Then I tried the command:

midclt call -job docker.update '{"nvidia": true}'

which failed with the same message.
nvidia-smi sometimes returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and other times it shows my GTX 1060 working, but even then it doesn’t show up as an option in apps.

Below is the error that the driver installation showed and the nvidia-installer.log file:

[EFAULT] Command /root/tmpnveu_ubz/NVIDIA-Linux-x86_64-550.127.05-no-compat32.run --tmpdir /root/tmpnveu_ubz -s failed (code 1): 
Verifying archive integrity... OK 
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.127.05
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release. Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information. 
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[  501.161113] NVRM: No NVIDIA devices probed.
[  501.166152] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[  524.516640] br-691cc3a43119: port 1(vethaf62654) entered blocking state
[  524.516649] br-691cc3a43119: port 1(vethaf62654) entered disabled state
[  524.516665] vethaf62654: entered allmulticast mode
[  524.516727] vethaf62654: entered promiscuous mode
[  524.516888] br-691cc3a43119: port 1(vethaf62654) entered blocking state
[  524.516894] br-691cc3a43119: port 1(vethaf62654) entered forwarding state
[  524.520037] br-691cc3a43119: port 1(vethaf62654) entered disabled state
[  524.752306] eth0: renamed from vethd64b6f1
[  524.789116] br-691cc3a43119: port 1(vethaf62654) entered blocking state
[  524.789120] br-691cc3a43119: port 1(vethaf62654) entered forwarding state
[  636.141028] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[  636.141038] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[  636.141880] NVRM: This can occur when another driver was loaded and 
               NVRM: obtained ownership of the NVIDIA device(s).
[  636.141881] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[  636.141882] NVRM: No NVIDIA devices probed.
[  636.142076] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Also, I have checked that none of my apps is using the GPU and that the GPU isn’t isolated in the UI. Even so, the driver fails to install and the apps don’t show the GPU as an option.
If anyone knows how to fix the issue, I would greatly appreciate the help.

NugentS · November 14, 2024, 12:41pm

It looks to me like you are trying to install the Nvidia driver manually. Don’t.
Install it from the Apps menu.

jacob6707 · November 14, 2024, 4:40pm

No, I mentioned I tried both ways, first from the Apps menu and manually with the command and both failed with the same message. When running lspci -v I see that the kernel modules are specified but they don’t exist in lsmod, which is super weird to me.

HoneyBadger · November 14, 2024, 4:56pm

Check to ensure that your GPU is not isolated for VM use, and see if Jellyfin has been impacted by the UUID issue outlined here:

scyto · November 14, 2024, 5:14pm

what messages, if any, do you see in dmesg from the same time period that may relate? (it will have additional information if the kernel module tried to load at the end of compilation and couldn’t)

you should also look in /var/log/nvidia-installer.log as the error says and post anything useful / different from there

how is your GPU connected?
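
For the /var/log/nvidia-installer.log suggestion above, here is a minimal Python sketch that pulls out the lines worth posting. The function name installer_log_findings is made up for illustration, and the two prefixes it matches are taken from the log excerpt in the first post; other installer versions may format them differently.

```python
def installer_log_findings(log_text):
    """Collect the kernel module load error and the distinct ERROR lines."""
    load_prefix = "-> Kernel module load error:"
    findings = {"load_error": None, "errors": []}
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith(load_prefix):
            # e.g. "-> Kernel module load error: No such device"
            findings["load_error"] = line[len(load_prefix):].strip()
        elif line.startswith("ERROR:"):
            message = line[len("ERROR:"):].strip()
            # The installer repeats some errors; keep each one once.
            if message not in findings["errors"]:
                findings["errors"].append(message)
    return findings
```

It takes the log as text, so it could be called as installer_log_findings(open("/var/log/nvidia-installer.log").read()) on the affected machine.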

jacob6707 · November 14, 2024, 7:18pm

During installation nothing showed up except nvidia-nvlink failing to probe the GPU, but upon inspection I saw that at boot time, for some reason, i915 was taking control of my NVIDIA GPU? (I have an Intel iGPU in my system and the NVIDIA GPU is connected to PCI-E (0000:00:02.0)):

[    8.765593] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[    9.119204] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[    9.119212] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[    9.127125] NVRM: This can occur when another driver was loaded and
               NVRM: obtained ownership of the NVIDIA device(s).
[    9.139225] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[    9.160872] NVRM: No NVIDIA devices probed.
[    9.165419] nvidia-nvlink: Unregistered Nvlink Core, major device number 238
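
One way to check which driver actually holds a given PCI device is the driver symlink that Linux exposes under /sys/bus/pci/devices/. Note that the i915 line in the log above names 0000:00:02.0, while the lspci output later in the thread lists the GTX 1060 at 01:00.0, so the two may simply be different devices. The sketch below is an illustration only: the function name bound_driver and the sysfs_root parameter are made up here (the parameter exists so the function can be pointed at a fake tree for testing).

```python
import os

def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    # The "driver" entry is a symlink to the driver's directory and is
    # absent when no driver has claimed the device.
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))

if __name__ == "__main__":
    # Addresses taken from the logs in this thread.
    for addr in ("0000:00:02.0", "0000:01:00.0"):
        print(addr, "->", bound_driver(addr))
```

If the second address prints None, nothing has claimed the card, which would point away from i915 holding it.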

I have checked and also removed Plex in the meantime, as I only just installed Jellyfin, but no apps are erroring because of the NVIDIA GPU and it’s not isolated within the UI either. Also, midclt call app.gpu_choices | jq only returns the Intel iGPU.

I’ve checked the log and only found the nvidia-prober errors as shown in the initial post, nothing else of note except a bunch of objtool warnings:

   /root/tmpnveu_ubz/selfgz19786/NVIDIA-Linux-x86_64-550.127.05-no-compat32/kernel/nvidia.o: warning: objtool: _nv047488rm+0x9e: 'naked' return found in RETHUNK build
   /root/tmpnveu_ubz/selfgz19786/NVIDIA-Linux-x86_64-550.127.05-no-compat32/kernel/nvidia.o: warning: objtool: _nv047378rm+0x73: 'naked' return found in RETHUNK build
   /root/tmpnveu_ubz/selfgz19786/NVIDIA-Linux-x86_64-550.127.05-no-compat32/kernel/nvidia.o: warning: objtool: _nv047368rm+0x4b: 'naked' return found in RETHUNK build

I will check if the i915 module is the culprit of my problems but I doubt it as I did manage to get my NVIDIA card working at one point but still not showing in the apps UI.

scyto · November 15, 2024, 3:03am

i would be surprised if the i915 affected it, i have never seen that probe message despite all the issues i had, might be tempted to take it at face value

can you provide the output of lspci -v and paste in the entry for the VGA adapter, let’s see if there is some other module loaded…

jacob6707 · November 15, 2024, 10:46am

Yeah, it wasn’t the i915 module, but it’s still confusing why it probed the NVIDIA GPU.

01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd GP106 [GeForce GTX 1060 6GB]
        Flags: fast devsel, IRQ 11
        Memory at f6000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [disabled] [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [disabled] [size=32M]
        I/O ports at e000 [disabled] [size=128]
        Expansion ROM at f7000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel modules: nouveau, nvidia_drm, nvidia

The weird part here is that none of the three kernel modules show up in lsmod, as if they don’t exist at all.
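
As far as I know, the "Kernel modules:" line in lspci -v lists modules that declare support for the device, whether or not they are loaded, while "Kernel driver in use:" only appears once a driver has bound to it. A rough Python sketch for summarizing a stanza and comparing it with lsmod (which reads /proc/modules) follows. The function names and the dictionary keys are made up for illustration.

```python
def summarize_lspci_entry(entry):
    """Pull the driver-related fields out of one `lspci -v` device stanza."""
    summary = {"driver_in_use": None, "modules": [],
               "bus_master": False, "disabled_regions": 0}
    for raw in entry.splitlines():
        line = raw.strip()
        if line.startswith("Kernel driver in use:"):
            summary["driver_in_use"] = line.split(":", 1)[1].strip()
        elif line.startswith("Kernel modules:"):
            summary["modules"] = [m.strip() for m in line.split(":", 1)[1].split(",")]
        elif line.startswith("Flags:"):
            summary["bus_master"] = "bus master" in line
        elif line.startswith(("Memory at", "I/O ports at")) and "[disabled]" in line:
            summary["disabled_regions"] += 1
    return summary

def loaded_modules(proc_modules_text):
    """Return the module names from /proc/modules-style text (first column)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}

if __name__ == "__main__":
    # Trimmed copy of the stanza posted above.
    entry = """01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB]
        Flags: fast devsel, IRQ 11
        Memory at f6000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        I/O ports at e000 [disabled] [size=128]
        Kernel modules: nouveau, nvidia_drm, nvidia
"""
    print(summarize_lspci_entry(entry))
```

For the stanza above this reports no driver in use, no bus master flag, and disabled memory and I/O regions, which fits a device that no driver has claimed rather than modules that are missing from disk.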

scyto · November 16, 2024, 12:11am

I was going to ask whether you are sure your motherboard has SR-IOV enabled, given the difference below, but then I remembered we are on the host, so that shouldn’t matter.

Can you check in the BIOS that you don’t have the feature enabled where the i915 and dGPU can work together (I forget what it’s called). I am also concerned that you show no IOMMU group, but I don’t know if that should or shouldn’t matter - it could be because I am using an eGPU…

At this point I think all we can do is look at what dmesg says when you modprobe nvidia… and what do you see at boot time?

39:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: eVga.com. Corp. TU102 [GeForce RTX 2080 Ti Rev. A]
        Flags: bus master, fast devsel, latency 0, IRQ 347, IOMMU group 39
        Memory at 52000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 6000000000 (64-bit, prefetchable) [size=256M]
        Memory at 6010000000 (64-bit, prefetchable) [size=32M]
        I/O ports at c000 [size=128]
        Expansion ROM at 53000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Capabilities: [bb0] Physical Resizable BAR
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

jacob6707 · November 16, 2024, 1:56am

While I doubt my motherboard supports hybrid graphics, I will have a look at all the settings in there to see if anything points towards the nvidia GPU not getting priority.

It doesn’t show an IOMMU group because my CPU (i7 2600k) doesn’t support IOMMU (VT-d). It can still pass through the GPU to a docker container but not to a virtual machine.

I just tried to modprobe the kernel module, and the same NVIDIA probe error comes up in dmesg (exactly the same as at boot time):

[146558.720836] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[146558.720843] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[146558.729343] NVRM: This can occur when another driver was loaded and
                NVRM: obtained ownership of the NVIDIA device(s).
[146558.742245] NVRM: Try unloading the conflicting kernel module (and/or
                NVRM: reconfigure your kernel without the conflicting
                NVRM: driver(s)), then try loading the NVIDIA kernel module
                NVRM: again.
[146558.765542] NVRM: No NVIDIA devices probed.
[146558.770627] nvidia-nvlink: Unregistered Nvlink Core, major device number 236

Fleshmauler · November 16, 2024, 2:23am

I’m not on 24., but I’ve had nvidia driver issues in the past when updating. Each time it was solved by saving the config, doing a clean install, and re-uploading the config.

That’d be my go to… on the other hand I’m also the guy that was goofing around with his drivers & corrupted them less than 30 minutes ago - so grain of salt & all that.

scyto · November 18, 2024, 2:50am

it seems very convinced that there is a driver loaded somewhere in the kernel, is there anything before this in dmesg about the PCIe address the nvidia card is on?

from what @Fleshmauler says, it sounds like the driver package can get itself into a bad state, so his approach is likely to be the right one.
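
For the dmesg question above, a small Python sketch that keeps only the lines mentioning a given PCI address that appear before the first NVRM probe failure. The function name is made up for illustration, and the marker string is copied from the NVRM messages posted in this thread; it is fed saved dmesg text rather than calling dmesg itself.

```python
PROBE_FAILURE = "NVRM: The NVIDIA probe routine was not called"

def lines_before_probe_failure(dmesg_text, pci_addr):
    """Return dmesg lines that mention pci_addr and precede the first probe failure."""
    hits = []
    for line in dmesg_text.splitlines():
        if PROBE_FAILURE in line:
            break  # stop at the first failure; later lines are not "before this"
        if pci_addr in line:
            hits.append(line)
    return hits

if __name__ == "__main__":
    # Lines copied from the boot-time excerpt earlier in the thread.
    sample = (
        "[    8.765593] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device\n"
        "[    9.119204] nvidia-nvlink: Nvlink Core is being initialized, major device number 238\n"
        "[    9.119212] NVRM: The NVIDIA probe routine was not called for 1 device(s).\n"
    )
    print(lines_before_probe_failure(sample, "0000:00:02.0"))
    print(lines_before_probe_failure(sample, "0000:01:00.0"))
```

On the short excerpt posted so far, nothing mentions 0000:01:00.0 before the failure, but the full dmesg output may contain earlier lines for that address.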