GPU VM passthrough

Hi,

I just want to be sure: I have read in a few places that it is possible to pass a GPU through to a VM even if only one GPU is installed.

The UI says otherwise.

I have a Ryzen 4650G on a B450 board. I can enable SR-IOV and I do not need any display output for TrueNAS itself.

Is there any way to forward the iGPU of the CPU to a VM for hardware transcoding?

A dedicated GPU is not an option as my build is running in a small case on an mATX board.
The only PCIe x16 slot is in use by my HBA.
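
In case it matters, this is roughly how I’d verify from the TrueNAS shell that the IOMMU (AMD-Vi) is actually active before trying any passthrough (standard Linux commands, nothing TrueNAS-specific):

# AMD-Vi / IOMMU messages from the current boot
dmesg | grep -i -e 'AMD-Vi' -e 'IOMMU'

# If this lists any groups, the IOMMU is up and usable for VFIO
ls /sys/kernel/iommu_groups/ | wc -l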

It is not possible; to my knowledge, no one has actually got it to work.

Maybe sometime we will be able to use vGPU.

There is an M.2-slot VGA card:

Maybe that would be an option.

Thanks. I couldn’t find the Innodisk one for an acceptable price.
I also couldn’t find the ASRock Rack M2_VGA anywhere.

I tried a SUNIX VGA0411 which only cost about 10 bucks.
My board did not recognize the card …

How did you connect this? Do you have a PCIe x1 slot free? If yes, get an x1-to-x16 riser (with power) and use any old GPU.

I now ordered an old Nvidia NVS 300 with a PCIe x1 connector.
I don’t have the space in my Silverstone PS07 to use a riser. I would probably rather have "opened" the back of the slot :sweat_smile:

I also originally wanted to use such a low-power VGA card to avoid increasing the power consumption unnecessarily.
I hope the NVS 300 won’t either, but we’ll see.

That would have been my next suggestion :sweat_smile:


Update: it doesn’t work.

To be more precise: the Nvidia NVS 300 works. I have basic display output and I could now isolate my integrated GPU.
I also have SR-IOV enabled in UEFI.

When trying to add the GPU to my VM I get this error:
[EINVAL] gpu_settings.isolated_gpu_pci_ids: pci_0000_0a_00_6, pci_0000_0a_00_0, pci_0000_0a_00_4, pci_0000_0a_00_1, pci_0000_0a_00_2, pci_0000_0a_00_3 GPU pci slot(s) are not available or a GPU is not configured.

More info
Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 198, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1466, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1417, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 187, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 47, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/system_advanced/gpu.py", line 44, in update_gpu_pci_ids
    verrors.check()
  File "/usr/lib/python3/dist-packages/middlewared/service_exception.py", line 70, in check
    raise self
middlewared.service_exception.ValidationErrors: [EINVAL] gpu_settings.isolated_gpu_pci_ids: pci_0000_0a_00_6, pci_0000_0a_00_0, pci_0000_0a_00_4, pci_0000_0a_00_1, pci_0000_0a_00_2, pci_0000_0a_00_3 GPU pci slot(s) are not available or a GPU is not configured.
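
For what it’s worth, what the middleware itself has stored for GPU isolation can be checked from the shell. A quick sketch (field name taken from the error above, assuming jq is available):

# Isolated GPU PCI IDs as stored in the advanced settings
midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'

# GPUs lspci itself sees, for comparison
lspci -nn | grep -Ei 'vga|display|3d'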

Interestingly enough, though, the GPU then shows up as being added.
When I go to the VM’s devices there are lots of PCIe passthrough devices shown, although some are not available:
(screenshot of the VM’s device list)

When I try to power the VM on, I get lots of errors in /var/log/messages:

/var/log/messages
Aug  1 23:22:16 truenas1 kernel: pcieport 0000:00:08.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug  1 23:22:17 truenas1 kernel: pcieport 0000:00:08.1: retraining failed
Aug  1 23:22:18 truenas1 kernel: pcieport 0000:00:08.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug  1 23:22:19 truenas1 kernel: pcieport 0000:00:08.1: retraining failed
Aug  1 23:22:19 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 1023ms after bus reset; waiting
Aug  1 23:22:20 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 2047ms after bus reset; waiting
Aug  1 23:22:22 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 4095ms after bus reset; waiting
Aug  1 23:22:27 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 8191ms after bus reset; waiting
Aug  1 23:22:35 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 16383ms after bus reset; waiting
Aug  1 23:22:52 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 32767ms after bus reset; waiting
Aug  1 23:23:27 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 65535ms after bus reset; giving up
Aug  1 23:23:27 truenas1 kernel: vfio-pci 0000:0a:00.6: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:27 truenas1 kernel: vfio-pci 0000:0a:00.4: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:27 truenas1 kernel: vfio-pci 0000:0a:00.1: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:27 truenas1 kernel: vfio-pci 0000:0a:00.2: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:27 truenas1 kernel: vfio-pci 0000:0a:00.3: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.0: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.0: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.3: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.2: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.1: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.4: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:28 truenas1 kernel: vfio-pci 0000:0a:00.6: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.0: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.0: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.3: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.3: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.2: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.2: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.1: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.1: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.4: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.4: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.6: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:23:30 truenas1 kernel: vfio-pci 0000:0a:00.6: vfio_bar_restore: reset recovery - restoring BARs
Aug  1 23:24:18 truenas1 kernel: pcieport 0000:00:08.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug  1 23:24:19 truenas1 kernel: pcieport 0000:00:08.1: retraining failed
Aug  1 23:24:20 truenas1 kernel: pcieport 0000:00:08.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug  1 23:24:21 truenas1 kernel: pcieport 0000:00:08.1: retraining failed
Aug  1 23:24:21 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 1023ms after bus reset; waiting
Aug  1 23:24:22 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 2047ms after bus reset; waiting
Aug  1 23:24:24 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 4095ms after bus reset; waiting
Aug  1 23:24:29 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 8191ms after bus reset; waiting
Aug  1 23:24:37 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 16383ms after bus reset; waiting
Aug  1 23:24:55 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 32767ms after bus reset; waiting
Aug  1 23:25:30 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 65535ms after bus reset; giving up
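
One thing worth checking here is which reset mechanism the kernel actually has for the iGPU function, since the log is all about failed bus resets. A quick sketch (the reset_method sysfs attribute needs a reasonably recent kernel; address as above):

# Reset methods the kernel can use for the iGPU function
cat /sys/bus/pci/devices/0000:0a:00.0/reset_method

# Does the device advertise Function Level Reset in its PCIe capabilities?
lspci -vvv -s 0a:00.0 | grep -i FLReset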

I found this: [v2,1/1] PCI: Fix link activation wait logic - Patchwork

So this might be a kernel issue?
Has anyone faced (and fixed) this?

How are you adding the GPU for passthrough? Currently it does not seem to work if you try to add a GPU to the VM during VM creation.
You have to do it under Devices.

What does this script show you ?

#!/bin/bash
# List every PCI device together with the IOMMU group it belongs to.
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}   # group number taken from the sysfs path
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"                 # device address, name and vendor:device IDs
done
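
(Save it as e.g. iommu.sh and run it with bash; piping the output through sort -V keeps the groups in numeric order.)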

My steps are

  1. Create VM without GPU, uncheck "Ensure Display Device"
  2. Isolate GPU
  3. Reboot
  4. Add GPU and audio device
  5. Start VM

The GPU and its audio device have their own IOMMU groups, if that’s what you’re looking for:

IOMMU groups

IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
IOMMU Group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
IOMMU Group 10 02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 xHCI Compliant Host Controller [1022:43d5] (rev 01)
IOMMU Group 10 02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
IOMMU Group 10 03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU Group 10 03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU Group 10 03:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU Group 10 03:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU Group 10 03:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU Group 10 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
IOMMU Group 10 07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT218 [NVS 300] [10de:10d8] (rev a2)
IOMMU Group 10 07:00.1 Audio device [0403]: NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)
IOMMU Group 10 08:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
IOMMU Group 11 09:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Optane Memory Series [8086:2522]
IOMMU Group 12 0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir [1002:1636] (rev d9)
IOMMU Group 13 0a:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
IOMMU Group 14 0a:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
IOMMU Group 15 0a:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
IOMMU Group 16 0a:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
IOMMU Group 17 0a:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller [1022:15e3]
IOMMU Group 2 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
IOMMU Group 3 00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
IOMMU Group 4 00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
IOMMU Group 5 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
IOMMU Group 6 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
IOMMU Group 7 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
IOMMU Group 7 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 8 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 0 [1022:1448]
IOMMU Group 8 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 1 [1022:1449]
IOMMU Group 8 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 2 [1022:144a]
IOMMU Group 8 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 3 [1022:144b]
IOMMU Group 8 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 4 [1022:144c]
IOMMU Group 8 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 5 [1022:144d]
IOMMU Group 8 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 6 [1022:144e]
IOMMU Group 8 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 7 [1022:144f]
IOMMU Group 9 01:00.0 RAID bus controller [0104]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)

I already tried to add it to an existing VM, but under Edit → GPU.
I now disabled "Ensure Display Device" and added PCIe devices 0a:00.0 and 0a:00.1 manually as passthrough devices. This does not create a warning in the UI, but I still get these errors in messages:

Aug  2 08:14:37 truenas1 kernel: pcieport 0000:00:08.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug  2 08:14:38 truenas1 kernel: pcieport 0000:00:08.1: retraining failed
Aug  2 08:14:40 truenas1 kernel: pcieport 0000:00:08.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug  2 08:14:41 truenas1 kernel: pcieport 0000:00:08.1: retraining failed
Aug  2 08:14:41 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 1023ms after bus reset; waiting
Aug  2 08:14:42 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 2047ms after bus reset; waiting
Aug  2 08:14:44 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 4095ms after bus reset; waiting
Aug  2 08:14:48 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 8191ms after bus reset; waiting
Aug  2 08:14:57 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 16383ms after bus reset; waiting
Aug  2 08:15:14 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 32767ms after bus reset; waiting
Aug  2 08:15:49 truenas1 kernel: vfio-pci 0000:0a:00.0: not ready 65535ms after bus reset; giving up
Aug  2 08:15:50 truenas1 kernel: vfio-pci 0000:0a:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
Aug  2 08:15:50 truenas1 kernel: [drm] initializing kernel modesetting (RENOIR 0x1002:0x1636 0x1043:0x87E1 0xD9).
Aug  2 08:15:50 truenas1 kernel: [drm] register mmio base: 0xF5F00000
Aug  2 08:15:50 truenas1 kernel: [drm] register mmio size: 524288
Aug  2 08:15:50 truenas1 kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.
Aug  2 08:15:50 truenas1 kernel: amdgpu: probe of 0000:0a:00.0 failed with error -22
Aug  2 08:15:50 truenas1 kernel: snd_hda_intel 0000:0a:00.1: Handle vga_switcheroo audio client
Aug  2 08:15:50 truenas1 kernel: snd_hda_intel 0000:0a:00.1: number of I/O streams is 30, forcing separate stream tags

Also no idea why it says snd_hda_intel.

The machine can’t be turned on and logs this:

2024-08-02T06:14:36.043655Z qemu-system-x86_64: VFIO_MAP_DMA failed: Bad address
2024-08-02T06:14:36.044566Z qemu-system-x86_64: vfio_dma_map(0x5608b31db850, 0xf4000000, 0x4000000, 0x7fa423e00000) = -14 (Bad address)
qemu: hardware error: vfio: DMA mapping failed, unable to continue
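
For reference, whether both iGPU functions are really bound to vfio-pci on the host can be double-checked like this (addresses as above; after isolation both should report vfio-pci as the kernel driver in use):

# Driver currently bound to the iGPU and its audio function
lspci -nnk -s 0a:00.0
lspci -nnk -s 0a:00.1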

Yeah, if it doesn’t recognise the device by name, it never worked for me either.

Sometimes a reboot solved it and the devices would magically show up.

Also, there seems to be something going on since Dragonfish. Can you downgrade to Cobia?

I just used the addresses in my message to make it clear which devices I passed through. The names did show up.

I have no idea whether downgrades are a thing and which problems could arise… I think I may have enabled some new ZFS feature flags, which would probably be a showstopper.
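
If it matters, which feature flags are actually in use can be checked like this (pool name is just a placeholder):

# Features marked "active" are in use and may keep an older release from importing the pool
zpool get all tank | grep 'feature@' | grep -w active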

Pity, I don’t have any more input :worried:

Thanks nonetheless…

I also, just for fun, tried it without enabling SR-IOV and Resizable BAR (as the log talked a lot about "bar_restore") - no changes whatsoever.

Maybe someone else sees this and has an idea…
I also filed a bug report.

Update: my bug report was closed because it looked like a hardware issue to them.
As I was trying to forward the APU’s integrated GPU, and this GPU has worked fine, I don’t understand how this could be a hardware issue.

For different reasons I’ll soon change my mainboard to a Gigabyte MC12-LE0 (a great board with IPMI for under 50€… or at least I hope so). I don’t think it’ll make a difference, but it has integrated VGA through the IPMI, and I’ll try it again with this.

Btw, with Intel iGPUs it can be important to pass through the CPU host model. Maybe for AMD too.

If you continue to have issues, it may work to use the GPU in a sandbox or Docker app (via jailmaker), where you don’t have to use PCIe passthrough and can instead pass the actual device.

CPU mode is already on Host Model.

I already thought about that, but since TrueNAS 24.10 will support Docker Compose natively, I think that if I can’t get the GPU passed through I may just deploy the compose stack directly on TrueNAS when it becomes available.
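
Roughly what I have in mind, shown as the docker run equivalent (image name and paths are just examples; the point is handing /dev/dri to the container instead of doing PCIe passthrough):

# Pass the host’s render node straight into the container
docker run -d --name jellyfin \
  --device /dev/dri:/dev/dri \
  -v /mnt/pool/media:/media \
  -p 8096:8096 \
  jellyfin/jellyfin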

I originally wanted to leave it in a VM to make it independent of any TrueNAS features, though… As it is now, I could run that VM basically anywhere as long as I can access my TrueNAS via NFS, which makes this very flexible.

I’m just a bit disappointed that my bug report was dismissed like that… I really can’t understand why this should be a hardware defect. :neutral_face:

With so many other threads where people seem to have problems passing PCIe devices through to VMs on TrueNAS SCALE, I would’ve thought such problems would be taken more seriously.

And another update:
I have now switched to the Gigabyte MC12-LE0.

I can now (with a beta BIOS) pass my GPU through to my VM.
I can also pass through the audio device, but it warns me that the device cannot be reset, and indeed, when I shut the VM back down, the same PCIe-related errors appear as with my old mainboard.

No matter whether I pass through only the GPU or also the audio device and the multimedia controller, I don’t get a render device in /dev/dri in the guest.
I think that’s probably an Ubuntu problem, but I have no idea how to fix it.
Installing the AMDGPU driver didn’t help…
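
These are the standard things I’m looking at inside the guest to see whether amdgpu binds at all (nothing distro-specific):

# Render nodes the guest kernel exposes (renderD128 should appear if amdgpu is up)
ls -l /dev/dri/

# Which driver, if any, is bound to the passed-through GPU inside the VM
lspci -nnk | grep -E -A3 -i 'vga|display'

# amdgpu’s own messages during boot
dmesg | grep -i amdgpu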

I also found this in syslog:

2024-08-08T00:21:12.453464+02:00 mediastack kernel: [drm] initializing kernel modesetting (RENOIR 0x1002:0x1636 0x1458:0x1000 0xD9).
2024-08-08T00:21:12.453464+02:00 mediastack kernel: [drm] register mmio base: 0xFC000000
2024-08-08T00:21:12.453464+02:00 mediastack kernel: [drm] register mmio size: 524288
2024-08-08T00:21:12.453464+02:00 mediastack kernel: [drm] add ip block number 0 <soc15_common>
2024-08-08T00:21:12.453468+02:00 mediastack kernel: [drm] add ip block number 1 <gmc_v9_0>
2024-08-08T00:21:12.453468+02:00 mediastack kernel: [drm] add ip block number 2 <vega10_ih>
2024-08-08T00:21:12.453469+02:00 mediastack kernel: [drm] add ip block number 3 <psp>
2024-08-08T00:21:12.453469+02:00 mediastack kernel: [drm] add ip block number 4 <smu>
2024-08-08T00:21:12.453469+02:00 mediastack kernel: [drm] add ip block number 5 <dm>
2024-08-08T00:21:12.453470+02:00 mediastack kernel: [drm] add ip block number 6 <gfx_v9_0>
2024-08-08T00:21:12.453473+02:00 mediastack kernel: [drm] add ip block number 7 <sdma_v4_0>
2024-08-08T00:21:12.453474+02:00 mediastack kernel: [drm] add ip block number 8 <vcn_v2_0>
2024-08-08T00:21:12.453474+02:00 mediastack kernel: [drm] add ip block number 9 <jpeg_v2_0>
2024-08-08T00:21:12.453474+02:00 mediastack kernel: [drm] BIOS signature incorrect 0 0
2024-08-08T00:21:12.453475+02:00 mediastack kernel: [drm] BIOS header is broken
2024-08-08T00:21:12.453475+02:00 mediastack kernel: [drm] BIOS signature incorrect 0 0
2024-08-08T00:21:12.453475+02:00 mediastack kernel: amdgpu 0000:00:07.0: amdgpu: Unable to locate a BIOS ROM
2024-08-08T00:21:12.453479+02:00 mediastack kernel: amdgpu 0000:00:07.0: amdgpu: Fatal error during GPU init
2024-08-08T00:21:12.453479+02:00 mediastack kernel: amdgpu 0000:00:07.0: amdgpu: amdgpu: finishing device.

Isn’t that related to the AMD GPU reset issue?

It seems so, yes

Does anyone have an idea whether this might be resolved with 25.04, as VMs get switched to Incus instead of KVM?

I don’t suppose so, as Incus still uses QEMU for VMs, but I don’t know whether Incus might do something special about this.