PCI passthrough in VM no longer working in Dragonfish 24.04.0

I have a VM with GPU passthrough as well as NVMe passthrough, and neither is working after the upgrade to 24.04.0. I read about some changes in this regard, so I have already removed and re-added the GPU to the isolated GPU list. I also removed all PCI passthrough devices from the VM and added them back.
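
For reference, here is one way to double-check from a shell what the middleware currently has isolated (just a sketch: midclt and system.advanced.config are standard TrueNAS middleware tooling, but the isolated_gpu_pci_ids field name is taken from the validation error further down, and jq is assumed to be installed):

  # Print the list of PCI IDs the middleware considers isolated
  midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'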

When I try to start the VM, I get this error for any PCI device I try to pass through (if I remove the device named in the error, the same error shows up for the next device, and so on).

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 182, in start
    if self.domain.create() < 0:
       ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1373, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2024-04-25T09:58:42.781242Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:03:00.0","id":"hostdev0","bus":"pci.0","addr":"0x6"}: VFIO_MAP_DMA failed: Bad address
2024-04-25T09:58:42.816518Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:03:00.0","id":"hostdev0","bus":"pci.0","addr":"0x6"}: vfio 0000:03:00.0: failed to setup container for group 11: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x55d3cfe74f30, 0x100000000, 0x1c0000000, 0x7fb107e00000) = -2 (No such file or directory)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 198, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1466, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1417, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 187, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 47, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_lifecycle.py", line 58, in start
    await self.middleware.run_in_thread(self._start, vm['name'])
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1324, in run_in_thread
    return await self.run_in_executor(self.thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1321, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_supervisor.py", line 68, in _start
    self.vms[vm_name].start(vm_data=self._vm_from_name(vm_name))
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 191, in start
    raise CallError('\n'.join(errors))
middlewared.service_exception.CallError: [EFAULT] internal error: qemu unexpectedly closed the monitor: 2024-04-25T09:58:42.781242Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:03:00.0","id":"hostdev0","bus":"pci.0","addr":"0x6"}: VFIO_MAP_DMA failed: Bad address
2024-04-25T09:58:42.816518Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:03:00.0","id":"hostdev0","bus":"pci.0","addr":"0x6"}: vfio 0000:03:00.0: failed to setup container for group 11: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x55d3cfe74f30, 0x100000000, 0x1c0000000, 0x7fb107e00000) = -2 (No such file or directory)
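
In case it helps anyone reproduce this, here are the generic VFIO sanity checks I ran from a shell. This is just a sketch, nothing TrueNAS-specific; the PCI address 0000:03:00.0 and group 11 are taken from the error above:

  # Confirm the device is actually bound to vfio-pci ("Kernel driver in use")
  lspci -nnk -s 0000:03:00.0
  # Confirm the IOMMU initialised at boot (look for DMAR or AMD-Vi lines)
  dmesg | grep -iE 'iommu|dmar|amd-vi'
  # The VFIO group from the error (11) should exist as a device node
  ls -l /dev/vfio/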

When I try to add my isolated GPU to the VM in the VM config, I get this error:

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 198, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1466, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1417, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 187, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 47, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/system_advanced/gpu.py", line 44, in update_gpu_pci_ids
    verrors.check()
  File "/usr/lib/python3/dist-packages/middlewared/service_exception.py", line 70, in check
    raise self
middlewared.service_exception.ValidationErrors: [EINVAL] gpu_settings.isolated_gpu_pci_ids: pci_0000_01_00_1, pci_0000_00_01_0, pci_0000_01_00_0 GPU pci slot(s) are not available or a GPU is not configured.
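
The underscore-separated IDs in that error appear to be libvirt-style node device names, i.e. pci_0000_01_00_0 should correspond to PCI address 0000:01:00.0. So a quick cross-check that those addresses still exist on the host looks like this (a sketch; substitute whatever addresses your error lists):

  lspci -nns 0000:00:01.0
  lspci -nns 0000:01:00.0
  lspci -nns 0000:01:00.1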

Anybody have any idea?

Odd; it seems to have fixed itself after another reboot of the server.

And it’s broken again after a reboot. :frowning:

I have been having a similar issue since 24.04.0, with an Nvidia T1000 and a GTX 1650.

Trying to start the VM hangs, and any attempt to check its status or start it again gives an error saying the VM is currently paused and must be stopped first.

I have been having a similar issue. When I pass my Arc A380 to my VM in the device list and boot the VM, the whole server crashes. I haven’t been able to figure out why the isolated GPU device is being used by TrueNAS if it has clearly been isolated already. The VM attempts to start, but then my fans go crazy and the server is inoperable; I can’t connect through the shell or otherwise.

I have the same problem with an AMD Radeon Pro W6600. Now here’s something interesting: I was running my system on a Threadripper 1950X on X399, and I just moved it to a TRX40 board with a 3960X. Both slots are connected directly to the CPU (Gen 4 x16), same installation, same boot drive. It was working fine on X399 but refuses to work on TRX40; I just keep getting Error 43 over and over. I’m on the latest version of Dragonfish.

So, to anyone who may come across this: I’ve given up on TrueNAS SCALE as my main hypervisor due to the issues here. I will continue to use TrueNAS as a NAS, but I will run it as a VM and pass through a SAS controller plus whatever other virtual drives I need for cache. I also have other problems with the Windows VMs randomly freezing during heavy IO, and I have none of these problems with Proxmox. Proxmox just isn’t as good at managing ZFS as TrueNAS, but it also doesn’t require a second GPU, meaning I can free up a very much needed PCIe slot on my current motherboard.

So right now that’s the best ‘fix’ for this problem. Hopefully in the future I can come back to SCALE as an all-in-one solution.

Is there a reason that you’re not using 24.04.2?

I’ve been running PCI passthrough with an Nvidia M2000 since early versions of SCALE and have not had any problems between updates, including the Electric Eel beta.

In the early days of SCALE I had PCI passthrough issues; the fix I found was changing the CPU mode to Host Model or Host Passthrough.
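
For anyone wanting to check what their VM is currently set to, something like this should show it (a sketch: <vm-name> is whatever virsh reports, and on SCALE plain virsh may need a -c URI because the libvirt socket is not at the default path):

  # List domains, then inspect the CPU mode of the one you care about
  virsh list --all
  virsh dumpxml <vm-name> | grep -A2 '<cpu'
  # Host Passthrough appears in the XML as <cpu mode="host-passthrough" ...>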

@Stux I don’t know if you’re referring to me or the person who started the thread, but I was using 24.04.2, the latest available. I checked for updates and everything, tried rebooting the server, etc. No good.

My issue still persists in 24.04.2. I provided them a debug, and they at least found one bug that they are going to fix. According to the dev who had a look at my debug, for some reason it reports that a critical PCI device and my GPU share the same IOMMU group, but they do not according to lspci and never did. Also, the second I go back to 23.10.x, everything is fine again.
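
For anyone who wants to verify the grouping themselves, walking the groups via sysfs is the standard check (a sketch using only sysfs and lspci; run as root). If the GPU and a critical device really did share a group, they would show up together here:

  for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
      printf '  '; lspci -nns "${d##*/}"
    done
  done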

One super odd thing, though: under a very rare circumstance it seems to work once, and then every subsequent attempt fails. I was not able to reliably reproduce what I need to do to get it working that once. It happened directly after I had switched versions/boot environments back to 24.04.2, but this is not always the case.

I have the same problem, with a slightly different error, on the latest 24.04.2.2 version. I have a Supermicro motherboard with an EPYC 3151 CPU and an Intel Arc A310 card. GPU isolation seems to be active, but as soon as I add the GPU to a VM, the VM won’t start anymore. Adding the GPU as a PCI passthrough device doesn’t help; the error is the same.

Could using the allow_unsafe_interrupts=1 option resolve this? It works for Debian-based QEMU systems.
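
For context, allow_unsafe_interrupts is a parameter of the vfio_iommu_type1 kernel module, so on a generic Linux host it would be set like this (a sketch; whether the modprobe.d route survives reboots on SCALE is an open question, and the option deliberately relaxes interrupt-remapping protection, hence the "unsafe"):

  # Toggle at runtime, as root:
  echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
  # Or persistently, on systems that honour modprobe.d:
  echo 'options vfio_iommu_type1 allow_unsafe_interrupts=1' > /etc/modprobe.d/vfio-unsafe-interrupts.conf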