Nvidia-related crash causing kernel panic

Hey there,

I have been having a problem for a few months: when I deploy a container, sometimes my entire NAS crashes and reboots, and it’s bothered me enough to try to get to the bottom of it. My specs are:

Ryzen 7 3700X
Gigabyte AB350M-Gaming 3
24GB RAM
GTX 1070
Booting TrueNAS Scale from a 256GB SSD

I am on TrueNAS 25.04 RC, but I only recently changed update trains to see if that would fix it; the same behaviour was observed on the stable branch before.

When I deploy a container, sometimes the entirety of TrueNAS crashes. There’s no consistent trigger, and it happens even when no GPU device is explicitly passed to the container. I noticed that nvidia is set as the default Docker runtime in the Docker daemon config.
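For anyone wanting to check the same thing on their own box, the default runtime can be read from the daemon config. A quick sketch, assuming the usual /etc/docker/daemon.json location (on SCALE the middleware manages Docker, so this is inspection only):

```shell
# Sketch: report which runtime Docker uses by default, assuming the
# daemon config lives at the usual /etc/docker/daemon.json path.
CONF=/etc/docker/daemon.json
if [ -f "$CONF" ]; then
    # If the key is absent, Docker falls back to runc.
    grep -o '"default-runtime"[^,}]*' "$CONF" \
        || echo '"default-runtime" not set (Docker defaults to runc)'
else
    echo "no $CONF found; Docker is using its built-in default (runc)"
fi
```

`docker info` also reports this under “Default Runtime” if you’d rather ask the daemon directly.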

I obtained the log below by following dmesg as I deployed a container; this is the last output before the system went down and then came back up.

[45080.231532] list_add corruption. next->prev should be prev (ffff97e8a3573788), but was 0000000000000000. (next=ffff97ebe6e1d650).
[45080.231801] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x22:0x56:890)
[45080.231809] ------------[ cut here ]------------
[45080.231969] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
[45080.232082] kernel BUG at lib/list_debug.c:29!
[45080.232374] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[45080.232523] CPU: 2 UID: 0 PID: 4169131 Comm: nvidia-containe Tainted: P           OE      6.12.15-production+truenas #1
[45080.232825] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[45080.232984] Hardware name: Gigabyte Technology Co., Ltd. AB350M-Gaming 3/AB350M-Gaming 3-CF, BIOS F50a 11/27/2019
[45080.233320] RIP: 0010:__list_add_valid_or_report+0x61/0xa0
[45080.233500] Code: c7 c7 70 d7 f6 ad e8 ce 55 aa ff 0f 0b 48 c7 c7 98 d7 f6 ad e8 c0 55 aa ff 0f 0b 48 89 c1 48 c7 c7 c0 d7 f6 ad e8 af 55 aa ff <0f> 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 18 d8 f6 ad e8 98 55 aa
[45080.234049] RSP: 0018:ffffb27c70277ba0 EFLAGS: 00010246
[45080.234244] RAX: 0000000000000075 RBX: ffff97ebeba7e000 RCX: 0000000000000000
[45080.234443] RDX: 0000000000000000 RSI: ffff97ed86b21780 RDI: ffff97ed86b21780
[45080.234643] RBP: ffff97eabd59a3c0 R08: 0000000000000000 R09: 0000000000000003
[45080.234846] R10: ffffb27c70277a40 R11: ffffffffae6cc0a8 R12: ffff97e937bcee58
[45080.235053] R13: ffff97e8a3573000 R14: ffff97e8a3573770 R15: ffff97ebeba7e140
[45080.235260] FS:  00007f926a744040(0000) GS:ffff97ed86b00000(0000) knlGS:0000000000000000
[45080.235482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[45080.235704] CR2: 00007f926a809040 CR3: 000000046486c000 CR4: 0000000000350ef0

I have also noticed that the containers that were running at the time of the crash come back up without a problem after the reboot.

If any more information is needed, let me know. I am considering changing the default runtime as a temporary fix, as I don’t really use GPU acceleration for anything other than Jellyfin.
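For context, on a plain Docker host the default runtime is set in /etc/docker/daemon.json. A minimal sketch of a config that keeps the nvidia runtime registered but makes runc the default (on SCALE the middleware writes this file, so this is illustrative only and written to a temp path here; the nvidia-container-runtime path is the typical install location, not verified on SCALE):

```shell
# Sketch of a daemon.json that keeps the nvidia runtime registered but
# makes runc the default. Written to /tmp as an example rather than the
# live config, since SCALE's middleware manages the real file.
cat > /tmp/daemon.json.example <<'EOF'
{
  "default-runtime": "runc",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
```

With this shape, containers only get the nvidia runtime when you request it explicitly, instead of every deploy going through it.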

Thanks!

Also, here’s the output of dmesg | grep nvidia from a fresh boot; nothing stands out as out of the ordinary, and nvidia-smi works.

[   33.395915] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[   33.397200] nvidia 0000:07:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[   33.608796] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.142  Wed Dec 11 04:55:04 UTC 2024
[   33.695584] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[   33.695589] [drm] Initialized nvidia-drm 0.0.0 for 0000:07:00.0 on minor 0
[   92.798116] audit: type=1400 audit(1744645197.445:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=2899 comm="apparmor_parser"
[   92.798935] audit: type=1400 audit(1744645197.445:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=2899 comm="apparmor_parser"
[  114.262682] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  114.308039] nvidia-uvm: Loaded the UVM driver, major device number 238.

Update on this; sorry for yet another follow-up.

Changing my Docker runtime to runc fixed the random crashes, but attempting to deploy Jellyfin with the GPU attached killed TrueNAS again with a similar Nvidia-related error. Previously Jellyfin would (sometimes) deploy and transcode fine once it was up and running, so I’ll have to do without transcoding in the meantime.

Here’s the crash log from that:

Apr 14 18:43:16 truenas kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x22:0x56:890)
Apr 14 18:43:16 truenas kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
Apr 14 18:43:16 truenas kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x22:0x56:890)
Apr 14 18:43:16 truenas kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
Apr 14 18:43:18 truenas kernel: eth0: renamed from veth96416a9
Apr 14 18:43:18 truenas kernel: br-3bddb97e85aa: port 1(vethcf6399e) entered blocking state
Apr 14 18:43:18 truenas kernel: br-3bddb97e85aa: port 1(vethcf6399e) entered forwarding state
Apr 14 18:44:31 truenas kernel: br-3bddb97e85aa: port 1(vethcf6399e) entered disabled state
Apr 14 18:44:31 truenas kernel: veth96416a9: renamed from eth0
Apr 14 18:44:31 truenas kernel: br-3bddb97e85aa: port 1(vethcf6399e) entered disabled state
Apr 14 18:44:31 truenas kernel: vethcf6399e (unregistering): left allmulticast mode
Apr 14 18:44:31 truenas kernel: vethcf6399e (unregistering): left promiscuous mode
Apr 14 18:44:31 truenas kernel: br-3bddb97e85aa: port 1(vethcf6399e) entered disabled state
Apr 14 18:44:58 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered blocking state
Apr 14 18:44:58 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered disabled state
Apr 14 18:44:58 truenas kernel: veth7841ac3: entered allmulticast mode
Apr 14 18:44:58 truenas kernel: veth7841ac3: entered promiscuous mode
Apr 14 18:44:58 truenas kernel: eth0: renamed from vethde65334
Apr 14 18:44:58 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered blocking state
Apr 14 18:44:58 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered forwarding state
Apr 14 18:46:31 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered disabled state
Apr 14 18:46:31 truenas kernel: vethde65334: renamed from eth0
Apr 14 18:46:31 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered disabled state
Apr 14 18:46:31 truenas kernel: veth7841ac3 (unregistering): left allmulticast mode
Apr 14 18:46:31 truenas kernel: veth7841ac3 (unregistering): left promiscuous mode
Apr 14 18:46:31 truenas kernel: br-82faa5538128: port 1(veth7841ac3) entered disabled state
Apr 14 18:47:01 truenas kernel: br-bb43ea34fcc0: port 1(veth2edbbd6) entered blocking state
Apr 14 18:47:01 truenas kernel: br-bb43ea34fcc0: port 1(veth2edbbd6) entered disabled state
Apr 14 18:47:01 truenas kernel: veth2edbbd6: entered allmulticast mode
Apr 14 18:47:01 truenas kernel: veth2edbbd6: entered promiscuous mode
Apr 14 18:47:01 truenas kernel: list_add corruption. next->prev should be prev (ffffa0fce2288788), but was 0000000000000000. (next=ffffa10085850650).
Apr 14 18:47:01 truenas kernel: ------------[ cut here ]------------
Apr 14 18:47:01 truenas kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x22:0x56:890)
Apr 14 18:47:01 truenas kernel: kernel BUG at lib/list_debug.c:29!

> AB350M-Gaming 3/AB350M-Gaming 3-CF, BIOS F50a 11/27/2019

That’s a pretty old BIOS - are you able to try updating it? Random searching seems to correlate the RmInitAdapter fault with BIOS/UEFI funkiness.

Aha, thank you! I had come across people mentioning this alongside a PCIe-related issue from around that time, but none of it seemed to apply to me, since this only started happening about a year after I began using TrueNAS, and I saw others hitting the same error for software-related reasons. I decided to just go for it and update, and the problem seems to have gone away. That’s my bad, I should have tried it sooner! Thank you so much :slight_smile:

Unfortunately it happened again when I deployed Uptime Kuma.

Apr 15 21:29:42.826919 truenas kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x22:0x56:890)
Apr 15 21:29:42.827039 truenas kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0

Really at a loss here. It seemed to have been fixed after the BIOS update, but I just got the exact same problem again. I’ve SSHed into the server and am monitoring with dmesg -W, so hopefully I’ll have a stack trace when it happens next. It crashed while deploying Uptime Kuma, and since rebooting, the container wouldn’t deploy because it said the port was already allocated. I’ve since deployed it on another port and it’s working.

An update on this for anyone who may come across this thread:

As a last resort, I just took apart my system and built it back up again. I had previously reseated only the GPU, which didn’t seem to help, but since essentially rebuilding the entire thing it has been up without any issues since that last crash. Hoping it sticks, and thank you @HoneyBadger for the help; I definitely think the BIOS was most of the problem :smiley:

What PSU is that system running?

Speaking of BIOS things, I know the earlier Ryzen processors would sometimes have issues with the C6/“deep sleep” CPU states. (They would “sleep” really well, but “wake” was the issue.) Check whether those are enabled in your BIOS (C6/Cool’n’Quiet) and consider disabling them if they’re active and you still see instability.
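If it helps narrow things down, you can also see which idle states the kernel is actually using from a shell. A sketch assuming the standard Linux cpuidle sysfs interface (state names vary by platform, and the BIOS is still the authoritative place to disable C6):

```shell
# Sketch: list the CPU idle states the kernel knows about on CPU 0.
# Assumes the standard Linux cpuidle sysfs interface is exposed.
if [ -d /sys/devices/system/cpu/cpu0/cpuidle ]; then
    # Each stateN directory names one idle state (e.g. POLL, C1, C2).
    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
else
    echo "cpuidle sysfs interface not available on this system"
fi
```

If a deep C-state shows up there and the instability continues, that would support trying the BIOS toggle.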

Reseating the GPU will definitely help if there was intermittent contact on a data pin. Hoping it was something as “low-tech” as that. :slight_smile: