Arc GPU memory issue again, NAS-129492, VM not starting on 24.10.1

Hi,

this looks very much like the issue in NAS-129492 to me.

On 24.10.1, a VM with an Arc A380 fires up without issues right after the hardware has booted. I then had the VM turned off overnight, and when I tried to start it the next day, I got the following errors (my NAS is an iSCSI target, so “stuff” was happening the whole night).

Jan 16 11:07:12 NNET25NAS209 kernel: ? syscall_exit_to_user_mode+0x22/0x40
Jan 16 11:07:12 NNET25NAS209 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 11:07:12 NNET25NAS209 kernel: ? do_syscall_64+0x65/0xb0
Jan 16 11:07:12 NNET25NAS209 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 11:07:12 NNET25NAS209 kernel: ? syscall_exit_to_user_mode+0x22/0x40
Jan 16 11:07:12 NNET25NAS209 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 11:07:12 NNET25NAS209 kernel: ? do_syscall_64+0x65/0xb0
Jan 16 11:07:12 NNET25NAS209 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 11:07:12 NNET25NAS209 kernel: ? syscall_exit_to_user_mode+0x22/0x40
Jan 16 11:07:12 NNET25NAS209 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 11:07:12 NNET25NAS209 kernel: ? do_syscall_64+0x65/0xb0
Jan 16 11:07:12 NNET25NAS209 kernel: ? do_syscall_64+0x65/0xb0
Jan 16 11:07:12 NNET25NAS209 kernel: ? do_syscall_64+0x65/0xb0
Jan 16 11:07:12 NNET25NAS209 kernel: ? do_syscall_64+0x65/0xb0
Jan 16 11:07:12 NNET25NAS209 kernel: entry_SYSCALL_64_after_hwframe+0x78/0xe2
Jan 16 11:07:12 NNET25NAS209 kernel: RIP: 0033:0x7f583c934c5b
Jan 16 11:07:12 NNET25NAS209 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jan 16 11:07:12 NNET25NAS209 kernel: RSP: 002b:00007ffeba809cf0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 16 11:07:12 NNET25NAS209 kernel: RAX: ffffffffffffffda RBX: 00005644c69eedf0 RCX: 00007f583c934c5b
Jan 16 11:07:12 NNET25NAS209 kernel: RDX: 00007ffeba809d50 RSI: 0000000000003b71 RDI: 000000000000002c
Jan 16 11:07:12 NNET25NAS209 kernel: RBP: 0000000000100000 R08: 0000000000000000 R09: 0000000000000000
Jan 16 11:07:12 NNET25NAS209 kernel: R10: 00000000bff00000 R11: 0000000000000246 R12: 00000000bff00000
Jan 16 11:07:12 NNET25NAS209 kernel: R13: 00005644c69eeff0 R14: 00007ffeba809d50 R15: 00005644c69eedf0
Jan 16 11:07:12 NNET25NAS209 kernel:
Jan 16 11:07:12 NNET25NAS209 kernel: ---[ end trace 0000000000000000 ]---
Jan 16 11:07:13 NNET25NAS209 libvirtd[4595]: Unable to read from monitor: Connection reset by peer
Jan 16 11:07:13 NNET25NAS209 libvirtd[4595]: internal error: qemu unexpectedly closed the monitor: 2025-01-16T10:07:12.993365Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:08:00.0","id":"hostdev0","bus":"pci.0","addr":"0x9"}: VFIO_MAP_DMA failed: Bad address
2025-01-16T10:07:13.160218Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:08:00.0","id":"hostdev0","bus":"pci.0","addr":"0x9"}: vfio 0000:08:00.0: failed to setup container for group 29: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x>
Jan 16 11:07:13 NNET25NAS209 systemd[1]: machine-qemu\x2d5\x2d3NNET2538RDS83.scope: Deactivated successfully.
Jan 16 11:07:13 NNET25NAS209 systemd[1]: machine-qemu\x2d5\x2d3NNET2538RDS83.scope: Consumed 4.264s CPU time.
Jan 16 11:07:13 NNET25NAS209 systemd-machined[4591]: Machine qemu-5-3NNET2538RDS83 terminated.
Jan 16 11:07:13 NNET25NAS209 nscd[321327]: 321327 monitoring file /etc/nsswitch.conf (4)
Jan 16 11:07:13 NNET25NAS209 nscd[321327]: 321327 monitoring directory /etc (2)
Jan 16 11:07:13 NNET25NAS209 nscd[321327]: 321327 monitoring file /etc/resolv.conf (3)
Jan 16 11:07:13 NNET25NAS209 nscd[321327]: 321327 monitoring directory /etc (2)
Jan 16 11:07:13 NNET25NAS209 audit[339633]: AVC apparmor="STATUS" operation="profile_remove" profile="unconfined" name="libvirt-81d75922-fd41-4c0b-98cf-3e78904128bf" pid=339633 comm="apparmor_parser"
Jan 16 11:07:13 NNET25NAS209 kernel: audit: type=1400 audit(1737022033.788:54): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="libvirt-81d75922-fd41-4c0b-98cf-3e78904128bf" pid=339633 comm="apparmor_parser"
Jan 16 11:07:13 NNET25NAS209 middlewared[1764]: libvirt: QEMU Driver error : internal error: qemu unexpectedly closed the monitor: 2025-01-16T10:07:12.993365Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:08:00.0","id":"hostdev0","bus":"pci.0","addr":"0x9"}: VFIO_MAP_DMA failed: Bad address
Jan 16 11:07:13 NNET25NAS209 middlewared[1764]: 2025-01-16T10:07:13.160218Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:08:00.0","id":"hostdev0","bus":"pci.0","addr":"0x9"}: vfio 0000:08:00.0: failed to setup container for group 29: memory listener initialization failed: Region pc.ram: vfio_dma_map>

Hardware:
Gigabyte MC12-LE0
Ryzen 5700X
128GB DDR4
Arc A380

After I rebooted the system, the VM started again.
Any logs I can provide to help hunt down the issue?
Thanks

I suspect that the A380 may potentially be having driver issues.

In the past, with VMware, I’d run into weird issues with NVIDIA and AMD cards and power resets, for example: if I rebooted the guest, I would have to reboot the host or the VM wouldn’t start.

I don’t think it’s the same issue, but wanted to provide some history that PCIe passthrough of GPUs is sometimes weird. I’ve had a 2080 Ti passed through to a guest in TrueNAS without issues for about 18 months.

From the logs you posted:

entry_SYSCALL_64_after_hwframe

I found similar reports, e.g. on Intel’s community forum:
https://community.intel.com/t5/Graphics/Intel-Arc-770-ubuntu-22-04-oem-kernel-issue-BAR-failed-to-assign/m-p/1439171

Sounds like an issue with BAR, can you post the output of

grep -i BAR /var/log/messages

This message is also of concern

Jan 16 11:07:13 NNET25NAS209 libvirtd[4595]: internal error: qemu unexpectedly closed the monitor: 2025-01-16T10:07:12.993365Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:08:00.0","id":"hostdev0","bus":"pci.0","addr":"0x9"}: VFIO_MAP_DMA failed: Bad address

This sounds like userland trying to access memory in kernel space, which, given the rest of the logging, would make sense.

Jan 16 13:31:22 NNET25NAS209 kernel: Booting paravirtualized kernel on bare hardware
Jan 16 13:31:22 NNET25NAS209 kernel: Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Jan 16 13:31:22 NNET25NAS209 kernel: Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:01:00.0: VF(n) BAR2 space: [mem 0xfe02000000-0xfe21ffffff 64bit pref] (contains BAR2 for 16 VFs)
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:05:00.0: BAR 0: assigned to efifb
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:0a:00.0: working around ROM BAR overlap defect
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:0b:00.0: working around ROM BAR overlap defect

No entries during the time when I tried to start the VM.

I also have to admit that I had to set the following kernel options (not the ASPM ones) to make MMIO work:

"kernel_extra_options": "pcie_aspm=force pcie_aspm.policy=powersave quiet pcie_acs_override=downstream,multifunction"
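To double-check that those options actually made it onto the running kernel, you can look at /proc/cmdline (the ACS override patch in the TrueNAS kernel also logs a warning in dmesg when it is active). A small sketch; `has_opt` is just an illustrative helper name, and the second argument only exists so it can be tested against a fixed string:

```shell
#!/bin/sh
# Check whether a given option is present on the running kernel command line.
has_opt() {
    opt="$1"
    cmdline="${2:-}"
    # Fall back to the live command line if no string was passed in.
    [ -n "$cmdline" ] || cmdline=$(cat /proc/cmdline 2>/dev/null || true)
    # Pad with spaces so the match only hits whole, space-delimited options.
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_opt "pcie_acs_override=downstream,multifunction"; then
    echo "ACS override present on the kernel command line"
else
    echo "ACS override NOT on the kernel command line"
fi
```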

I definitely had these issues with an RX 560 I tried before; it somehow wasn’t able to reset. With the Arc card, I can shut down, power on and reboot the VM as much as I want, as long as there wasn’t too much happening on the system between turning the VM off and on.
For example, right now: I had the VM powered off, moved all my VMs back to the TrueNAS pool, and now the VM doesn’t power back on again:

[EFAULT] internal error: qemu unexpectedly closed the monitor: 2025-01-17T00:57:10.953380Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: VFIO_MAP_DMA failed: Bad address 2025-01-17T00:57:10.998228Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: VFIO_MAP_DMA failed: Bad address 2025-01-17T00:57:10.998444Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: vfio 0000:09:00.0: failed to setup container for group 30: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x556df2979500, 0x100000, 0xbff00000, 0x7f8dd3500000) = -2 (No such file or directory)

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 189, in start
    if self.domain.create() < 0:
       ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1373, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2025-01-17T00:57:10.953380Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: VFIO_MAP_DMA failed: Bad address
2025-01-17T00:57:10.998228Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: VFIO_MAP_DMA failed: Bad address
2025-01-17T00:57:10.998444Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: vfio 0000:09:00.0: failed to setup container for group 30: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x556df2979500, 0x100000, 0xbff00000, 0x7f8dd3500000) = -2 (No such file or directory)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 211, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1529, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1460, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_lifecycle.py", line 58, in start
    await self.middleware.run_in_thread(self._start, vm['name'])
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1367, in run_in_thread
    return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1364, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_supervisor.py", line 68, in _start
    self.vms[vm_name].start(vm_data=self._vm_from_name(vm_name))
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 198, in start
    raise CallError('\n'.join(errors))
middlewared.service_exception.CallError: [EFAULT] internal error: qemu unexpectedly closed the monitor: 2025-01-17T00:57:10.953380Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: VFIO_MAP_DMA failed: Bad address
2025-01-17T00:57:10.998228Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: VFIO_MAP_DMA failed: Bad address
2025-01-17T00:57:10.998444Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:09:00.0","id":"hostdev0","bus":"pci.0","addr":"0x8"}: vfio 0000:09:00.0: failed to setup container for group 30: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x556df2979500, 0x100000, 0xbff00000, 0x7f8dd3500000) = -2 (No such file or directory)

Also here is a stack trace of the time when the VM didn’t want to start:

[44772.242276] WARNING: CPU: 15 PID: 130313 at mm/gup.c:1313 __get_user_pages+0x5f9/0x6e0
[44772.242283] Modules linked in: mei_pxp(E) mei_hdcp(E) snd_hda_codec_hdmi(E) mei_gsc(E) mei_me(E) mei(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) macvtap(E) macvlan(E) tap(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) nft_compat(E) nf_tables(E) nfnetlink(E) br_netfilter(E) bridge(E) scst_vdisk(OE) isert_scst(OE) iscsi_scst(OE) scst(OE) rdma_cm(E) iw_cm(E) ib_cm(E) dlm(E) libcrc32c(E) crc32c_generic(E) nvme_fabrics(E) overlay(E) sunrpc(E) binfmt_misc(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) ipmi_ssif(E) intel_rapl_msr(E) intel_rapl_common(E) edac_mce_amd(E) kvm_amd(E) kvm(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) crypto_simd(E) cryptd(E) i915(E) evdev(E) rapl(E) snd_hda_intel(E) snd_intel_dspcfg(E) drm_buddy(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) drm_display_helper(E)
[44772.242328] wmi_bmof(E) sp5100_tco(E) ccp(E) cec(E) pcspkr(E) watchdog(E) k10temp(E) acpi_cpufreq(E) snd_pcm(E) rc_core(E) ast(E) snd_timer(E) ttm(E) drm_shmem_helper(E) snd(E) acpi_ipmi(E) video(E) soundcore(E) drm_kms_helper(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) button(E) sg(E) drm(E) loop(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) mlx4_ib(E) ib_uverbs(E) sd_mod(E) ib_core(E) mlx4_en(E) nvme(E) ahci(E) nvme_core(E) xhci_pci(E) ahciem(E) t10_pi(E) libahci(E) xhci_hcd(E) crc32_pclmul(E) crc64_rocksoft(E) libata(E) crc64(E) crc32c_intel(E) crc_t10dif(E) i2c_piix4(E) igb(E) crct10dif_generic(E) i2c_algo_bit(E) crct10dif_pclmul(E) crct10dif_common(E) dca(E) usbcore(E) scsi_mod(E) mlx4_core(E) scsi_common(E) usb_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E) vfio_pci(E) vfio_pci_core(E) irqbypass(E) vfio_iommu_type1(E) vfio(E)
[44772.242372] CPU: 15 PID: 130313 Comm: qemu-system-x86 Tainted: P OE 6.6.44-production+truenas #1
[44772.242374] Hardware name: GIGABYTE MC12-LE0-00/MC12-LE0-00, BIOS F18 10/18/2024
[44772.242376] RIP: 0010:__get_user_pages+0x5f9/0x6e0
[44772.242378] Code: 45 89 e0 0f 84 61 fe ff ff 41 83 fc 01 76 58 41 8d 74 24 ff 89 da 4c 89 f7 e8 73 e9 ff ff 48 85 c0 0f 85 13 fe ff ff 4c 89 f7 <0f> 0b 49 8b 46 08 a8 01 0f 85 a5 00 00 00 66 90 89 da be 01 00 00
[44772.242379] RSP: 0018:ffffaf60011efb50 EFLAGS: 00010246
[44772.242381] RAX: 0000000000000000 RBX: 00000000000d0101 RCX: 0000000000000030
[44772.242382] RDX: 0000000000000400 RSI: 0000000000000002 RDI: ffffec2084160000
[44772.242383] RBP: 00007f8e01200000 R08: 0000000000000100 R09: 80000001058008e7
[44772.242384] R10: 00000000000395c0 R11: ffff99a8ff354000 R12: 0000000000000100
[44772.242385] R13: ffff998b37478f18 R14: ffffec2084160000 R15: ffff998a700f6930
[44772.242386] FS: 00007f8fd8dc9ec0(0000) GS:ffff99a87edc0000(0000) knlGS:0000000000000000
[44772.242387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[44772.242389] CR2: 00007f2a8c000020 CR3: 00000006ca9ca000 CR4: 0000000000750ee0
[44772.242390] PKRU: 55555554
[44772.242391] Call Trace:
[44772.242392] <TASK>
[44772.242393] ? __get_user_pages+0x5f9/0x6e0
[44772.242395] ? __warn+0x81/0x130
[44772.242400] ? __get_user_pages+0x5f9/0x6e0
[44772.242402] ? report_bug+0x171/0x1a0
[44772.242406] ? handle_bug+0x41/0x70
[44772.242409] ? exc_invalid_op+0x17/0x70
[44772.242411] ? asm_exc_invalid_op+0x1a/0x20
[44772.242416] ? __get_user_pages+0x5f9/0x6e0
[44772.242420] __gup_longterm_locked+0x246/0xc10
[44772.242422] ? put_pages_list+0xd7/0x100
[44772.242427] pin_user_pages_remote+0x7f/0xb0
[44772.242430] vaddr_get_pfns+0x78/0x2a0 [vfio_iommu_type1]
[44772.242435] ? srso_alias_return_thunk+0x5/0xfbef5
[44772.242438] vfio_pin_pages_remote+0x386/0x500 [vfio_iommu_type1]
[44772.242443] vfio_iommu_type1_ioctl+0xfec/0x18e0 [vfio_iommu_type1]
[44772.242449] __x64_sys_ioctl+0x97/0xd0
[44772.242453] do_syscall_64+0x59/0xb0
[44772.242455] ? __x64_sys_ioctl+0xaf/0xd0
[44772.242456] ? srso_alias_return_thunk+0x5/0xfbef5
[44772.242458] ? syscall_exit_to_user_mode+0x22/0x40
[44772.242460] ? srso_alias_return_thunk+0x5/0xfbef5
[44772.242462] ? do_syscall_64+0x65/0xb0
[44772.242463] ? do_syscall_64+0x65/0xb0
[44772.242465] entry_SYSCALL_64_after_hwframe+0x78/0xe2
[44772.242467] RIP: 0033:0x7f8fd9975c5b
[44772.242469] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[44772.242470] RSP: 002b:00007ffd22d8b7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[44772.242472] RAX: ffffffffffffffda RBX: 0000556df2979500 RCX: 00007f8fd9975c5b
[44772.242473] RDX: 00007ffd22d8b850 RSI: 0000000000003b71 RDI: 000000000000002a
[44772.242474] RBP: 0000000000100000 R08: 0000000000000000 R09: 0000000000000000
[44772.242474] R10: 00000000bff00000 R11: 0000000000000246 R12: 00000000bff00000
[44772.242475] R13: 0000556df2979700 R14: 00007ffd22d8b850 R15: 0000556df2979500
[44772.242479] </TASK>
[44772.242479] ---[ end trace 0000000000000000 ]---

Ha! similar, mine was a 480 :stuck_out_tongue:

For now you may want to file a new bug report as it was requested by iX in the NAS ticket you linked.

But I suspect that perhaps Kernel 6.12 in Fangtooth may ultimately resolve the issue. The beta should be here soon. I could be wrong, but I believe the Alchemist and Battlemage cards both now use the new xe driver instead of the ancient i915 one.

As much as I hate that NVIDIA has proprietary drivers on Linux, it’s kind of a blessing here, because you aren’t beholden to kernel updates.

This thread looks interesting, although most of it exceeds my brain capabilities: https://forum.proxmox.com/threads/amd-ryzen-5600g-igpu-code-43-error.138665/

I tried, but Jira says I’m not allowed to do anything and if I click login (hoping it would allow me to register an account), it just reloads the page :melting_face:

Currently it seems to be using the i915 driver. That being said, I was wondering whether the card might still be in use by the host driver somehow, and how I’d be able to stop that. Could that be the issue?
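For what it’s worth, you can see which host driver currently claims the card by reading sysfs (assuming the standard /sys/bus/pci layout; `bound_driver` is just an illustrative helper, and the unbind/vfio-pci steps in the comments are a generic, untested sketch, not TrueNAS-specific advice):

```shell
#!/bin/sh
# Report which host driver (if any) is bound to a PCI device by reading the
# sysfs "driver" symlink. "i915" would mean the host still owns the Arc card;
# "vfio-pci" means it is already detached for passthrough. The second argument
# lets the helper be pointed at a test directory instead of the real sysfs.
bound_driver() {
    base="${2:-/sys/bus/pci/devices}"
    link="$base/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}

bound_driver "0000:08:00.0"   # address taken from the logs above

# To detach the card and hand it to vfio-pci by hand (as root; generic,
# untested sketch using the standard sysfs interfaces):
#   echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
#   echo vfio-pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
#   echo 0000:08:00.0 > /sys/bus/pci/drivers_probe
```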

Yeah, true, but I had to change to AMD on my main desktop because I couldn’t stand the NVIDIA glitches anymore. I am not willing to pay that price anymore :sweat_smile:

I probably really should read that part of the Arch wiki:
https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF

I don’t suspect that this is possible, because the GPU does seem to be isolated from the host. Presumably after you rebooted your system, we see some BAR initialization, and none of it has the same PCIe addresses as the device that is alarming in the KVM logs.

Jan 16 13:31:22 NNET25NAS209 kernel: Booting paravirtualized kernel on bare hardware
Jan 16 13:31:22 NNET25NAS209 kernel: Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Jan 16 13:31:22 NNET25NAS209 kernel: Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:01:00.0: VF(n) BAR2 space: [mem 0xfe02000000-0xfe21ffffff 64bit pref] (contains BAR2 for 16 VFs)
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:05:00.0: BAR 0: assigned to efifb
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:0a:00.0: working around ROM BAR overlap defect
Jan 16 13:31:22 NNET25NAS209 kernel: pci 0000:0b:00.0: working around ROM BAR overlap defect

I see:
0000:01:00.0
0000:05:00.0
0000:0a:00.0
0000:0b:00.0

Whereas you seem to be passing through
0000:08:00.0

I only suggested the Fangtooth/kernel 6.12 thing because, for whatever reason, after I upgraded from Dragonfish to Electric Eel, I had a problem. TrueNAS used to ship with the NVIDIA driver, but in EE it’s an optional package. The VM with my 2080 Ti would boot, but the graphics card didn’t work properly and the VM was experiencing weird, undefined issues. After I installed the NVIDIA drivers in the host TrueNAS install, everything started working again in the VM. I never dug into this, but it was weird.

Ironically, I had weird display issues with an RX 480, RX 580 and 5700 XT that evaporated when I switched to NVIDIA.

What specifically are you calling out?

EDIT: I also edited the title of this to call out Arc GPU, so other readers don’t get this topic confused with a memory issue in ZFS ARC cache.

Fair. Maybe my kernel options "pcie_acs_override=downstream,multifunction" are contributing to the issue as well. The Arc GPU is connected to the onboard NVMe x1 slot, which is provided by the PCH and not directly by the CPU. Without the ACS patch, I have all sorts of hardware in the same IOMMU group alongside the Arc GPU. I suspect that this board can’t do proper isolation of that NVMe slot. Unfortunately, I need the other slots for SATA controllers, NVMe drives and a dual-port 10G NIC.

I’ll definitely try the new kernel once it ships with a TrueNAS release.

And so it comes that Linux creates unique experiences for everyone of us :laughing: :grin:

Nothing specific, just the fact that I should probably know what I’m dealing with when messing around with kernel options, pasting random stuff into my config :sweat_smile:

Thanks, that was a bit ambiguous of me.

Yes… I think you may be onto something here. It may be worth swapping the GPU with another PCIe card in your system to see if the behavior improves, at least.

Weird stuff is happening again; slowly I’m getting a bit sick of it.
I have an ASRock Rack PCIe x16 to 4x M.2 x4 card installed in the PCIe x16 slot of the board, where normally I have 2x M.2 ASM1166 6-port SATA controllers and 2x M.2 Optane P1600X installed, which work flawlessly.

I’ve now moved one of the Optanes to the onboard PCH-backed M.2 slot, and the M.2 to PCIe riser adapter with the Arc A380 in it I moved to the ASRock PCIe to M.2 adapter.

Result: no dice. The moved Optane shows up, but the Intel Arc is just gone; it doesn’t show up in lspci. The x16 slot is, of course, set to x4x4x4x4 bifurcation.

I’m gonna try a different M.2 slot of the ASRock adapter card later.

Unfortunately, it sounds like your problem is ultimately that you need more PCIe lanes than you actually have. :frowning:

idk man, shouldn’t the M.2 to PCIe riser adapter work? It only had x1 in the onboard M.2 slot before, and now it even has x4 in the x16-to-4x M.2 x4 adapter:
https://www.asrockrack.com/general/productdetail.asp?Model=RB4M2

I might try to force PCIe 3.0 speed on that x16 slot, since it can do 4.0, I think, as can the Arc A380, afaik. They might not get a stable signal over all those adapters (especially since both adapters say they’re just PCIe 3.0).

If you have a PCIe x16 card, even if you bifurcate that slot, the devices might still all be in the same IOMMU group; lspci -vvv will verify. You’re certainly coloring outside of the lines a bit here.
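The grouping can also be read straight out of sysfs (standard /sys/kernel/iommu_groups layout assumed; `list_iommu_groups` is just an illustrative helper that takes an alternative root so it can be exercised against a test directory):

```shell
#!/bin/sh
# Print every device in each IOMMU group. Devices that share a group must be
# passed through to a VM together, so this shows what the ACS override (or a
# slot change) actually buys you.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for group in "$root"/*/; do
        [ -d "$group" ] || continue
        printf 'IOMMU group %s:\n' "$(basename "$group")"
        for dev in "$group"devices/*; do
            [ -e "$dev" ] || continue
            printf '    %s\n' "$(basename "$dev")"
        done
    done
    return 0
}

list_iommu_groups
```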

Fair enough, and I was worried about that too, but all 4 devices were in different IOMMU groups, so I had hope.
Even if they were in the same IOMMU group, they’d still show up in lspci, which the A380 didn’t, so there must be some other issue, maybe PCIe 4.0.

Annoyingly enough, this board doesn’t seem to allow forcing lower PCIe speeds.

Unbelievable.
My Arc A380 apparently doesn’t work at all, at least in this mainboard, when the PCIe slot is bifurcated.
So, I moved the ConnectX-3 card from the onboard x4 slot to the x16 to 4x M.2 adapter (via the M.2 to PCIe x4 riser adapter cable) and put the Arc A380 directly into the onboard x4 PCIe slot.

Guess what: the ConnectX-3 vomits all over the console output with "PCIe Bus Error, Bad DLLP" messages. So, once again, a new challenge. I disabled ASPM: no dice. I swapped the ConnectX-3 for a ConnectX-4, and voilà, no more errors. Not even with ASPM enabled.

Luckily, the onboard x4 slot is in its own IOMMU group, so apparently, I finally lucked out. I’ll now proceed with validating and testing the setup for the initial issue, as well as performance and stability.

Welp, rejoiced too soon.
As soon as there is any significant load on the NIC, I get the PCIe Bus Errors again. They say "corrected", but I still don’t like it. I guess signal integrity gets butchered across the x16 to M.2 adapter plus the M.2 to PCIe riser adapter.
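A rough way to keep an eye on whether the corrected errors keep piling up under load (a sketch; `count_dllp_errors` is a made-up helper name, and "Bad DLLP" is the string from the messages above):

```shell
#!/bin/sh
# Count "Bad DLLP" link-layer errors in the kernel log. Corrected AER errors
# don't corrupt data, but a growing count under load points at marginal
# signal integrity across the adapter chain.
count_dllp_errors() {
    log="${1:-}"
    # Fall back to the live kernel ring buffer if no log text was passed in.
    [ -n "$log" ] || log=$(dmesg 2>/dev/null || true)
    # grep -c prints the count; "|| true" keeps a zero count from being
    # treated as a failure.
    printf '%s\n' "$log" | grep -c 'Bad DLLP' || true
}

count_dllp_errors
```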

Since I need both the Mellanox card and the x16 to M.2 adapter card in this system, I don’t see another option than either aiming for a platform with more lanes or building a dedicated machine for the Arc A380 thing.

I’ve got a leftover Q370 i5 8400T box (Fujitsu Esprimo P758), maybe that’ll work.

Spoiler: it didn’t work. I’ve built a dedicated machine out of another MC12-LE0 board I had, and that’s working absolutely flawlessly now.
In the end it was a lane issue; I tried to cram too much into one box.