Fangtooth VM or app with Blackwell?

I mean, RC1’s in a pretty good spot now - you could upgrade the system and just hold off on upgrading your ZFS pool version.

Could be that overly paranoid “critical PCI device” validation that’s getting in the way - but again, I would’ve expected it to throw that back at you in the output of one of those midclt commands. Unless something is stuck in the install itself - and possibly a reinstall of 25.04 would fix it - but glad to hear it seems to be sorting itself out.

Installed Gold…

GPU isolation shows 3 options now (instead of zero). I set it to the 5060 Ti bridge and rebooted. My Oracle VM reports this after adding the isolated GPU:

Error Name: EFAULT
Error Code: 14
Reason: [EFAULT] internal error: Non-endpoint PCI devices cannot be assigned to guests
Error Class: CallError
Trace: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 191, in start
    if self.domain.create() < 0:
       ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1373, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: internal error: Non-endpoint PCI devices cannot be assigned to guests

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 360, in process_method_call
    result = await method.call(app, id_, params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 57, in call
    result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 954, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 771, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 108, in wrapped
    result = await func(*args)
             ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_lifecycle.py", line 57, in start
    await self.middleware.run_in_thread(self._start, vm['name'])
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 668, in run_in_thread
    return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 665, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_supervisor.py", line 68, in _start
    self.vms[vm_name].start(vm_data=self._vm_from_name(vm_name))
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 201, in start
    raise CallError('\n'.join(errors))
middlewared.service_exception.CallError: [EFAULT] internal error: Non-endpoint PCI devices cannot be assigned to guests

root@nas[~]# sh ./gpu.sh
IOMMU Group 0:
00:02.0 VGA compatible controller: Intel Corporation CometLake-S GT2 [UHD Graphics 630] (rev 05)

IOMMU Group 2:
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2d04 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 22eb (rev a1)

IOMMU Group 20:
0a:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0b:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
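
For anyone who wants a similar listing without a shell script: the contents of gpu.sh aren't shown in this thread, so the following is only a minimal Python sketch under the assumption that it walks /sys/kernel/iommu_groups, which is where the Linux kernel exposes group membership. The function name list_iommu_groups and its root parameter are made up for illustration. It prints PCI addresses only, not the lspci descriptions shown above.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} as read from sysfs.

    `root` is a parameter only so the function can be pointed at a
    test directory; on a real host the default is what you want.
    """
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # IOMMU disabled in firmware/kernel, or not a Linux host.
        return groups
    for group_dir in root_path.iterdir():
        if not group_dir.name.isdigit():
            continue
        devices_dir = group_dir / "devices"
        if not devices_dir.is_dir():
            continue
        # Each entry under devices/ is named after a PCI address,
        # e.g. 0000:01:00.0
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for number, devices in list_iommu_groups().items():
        print(f"IOMMU Group {number}:")
        for address in devices:
            print(f"  {address}")
```

An empty result usually means the IOMMU is not enabled, so that is worth checking before looking at anything else.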

So it seems the VM needs the GPU endpoint, but GPU isolation offers the bridge above it. Trying to add it via device PCI passthrough fails for both the bridge and the GPU PCI ID.

The entire IOMMU group is passed through.
For it to work the GPU/audio device needs to be in its own group.
So IMO it’s still the same problem you had with the old mobo.

The bridge, GPU and audio are in their own group, and the isolation grabs the PCI ID of the bridge. So as far as I can see, it’s doing what it’s supposed to at the isolation layer. But at the VM layer it expects an “endpoint” (I presume) when adding a GPU. So is the problem that the VM adds the bridge, and not the GPU PCI ID? Not sure what it would look like if it was actually working. Google Gemini seems to indicate that things look the way they are supposed to in TrueNAS. The old mobo had one IOMMU group for everything. The Supermicro has a separate IOMMU group for the x16 bridge and the Nvidia GPU/audio attached to it.

I’m no expert, but from my quick reading around the Proxmox forums and this forum, your best chances are with a system that separates every device into its own group.

This means the PCI bridges sit outside of the “endpoint” device’s group.

This is from my system; it has 93 IOMMU groups!

IOMMU Group 0:
        b2:00.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port A [8086:2030] (rev 07)
IOMMU Group 1:
        b3:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD104 [GeForce RTX 4070] [10de:2786] (rev a1)
        b3:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bc] (rev a1)
IOMMU Group 2:
        64:00.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port A [8086:2030] (rev 07)
IOMMU Group 3:
        64:02.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port C [8086:2032] (rev 07)
IOMMU Group 4:
        66:00.0 Non-Volatile memory controller [0108]: Intel Corporation Optane SSD 900P Series [8086:2700]
IOMMU Group 5:
        16:00.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port A [8086:2030] (rev 07)
IOMMU Group 6:
        16:02.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port C [8086:2032] (rev 07)
IOMMU Group 7:
        16:03.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port D [8086:2033] (rev 07)
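
To compare the two listings in this thread mechanically, here is a rough Python sketch that parses that style of output and reports which groups contain a PCI bridge alongside other devices. The names parse_groups and groups_mixing_bridges_and_endpoints are invented for this example, and the regular expressions only assume the two line formats pasted above. It only describes the layout; whether a bridge sharing the group actually blocks passthrough is what is being debated here.

```python
import re

GROUP_RE = re.compile(r"^IOMMU Group (\d+):")
# Matches "01:00.0 VGA compatible controller: ..." and the variant with a
# class code, "b3:00.0 VGA compatible controller [0300]: ...".
DEVICE_RE = re.compile(
    r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+([^:\[]+?)\s*(?:\[[0-9a-f]{4}\])?:"
)

# First listing from the thread, copied as posted.
OP_LISTING = """\
IOMMU Group 0:
00:02.0 VGA compatible controller: Intel Corporation CometLake-S GT2 [UHD Graphics 630] (rev 05)

IOMMU Group 2:
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2d04 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 22eb (rev a1)

IOMMU Group 20:
0a:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0b:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
"""


def parse_groups(listing):
    """Return {group_number: [(pci_address, device_type), ...]}."""
    groups = {}
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        match = GROUP_RE.match(line)
        if match:
            current = int(match.group(1))
            groups[current] = []
            continue
        match = DEVICE_RE.match(line)
        if match and current is not None:
            groups[current].append((match.group(1), match.group(2)))
    return groups


def groups_mixing_bridges_and_endpoints(groups):
    """Group numbers that hold a PCI bridge together with other devices."""
    mixed = []
    for number, devices in groups.items():
        types = [device_type for _, device_type in devices]
        has_bridge = any(t == "PCI bridge" for t in types)
        has_other = any(t != "PCI bridge" for t in types)
        if has_bridge and has_other:
            mixed.append(number)
    return sorted(mixed)


if __name__ == "__main__":
    print(groups_mixing_bridges_and_endpoints(parse_groups(OP_LISTING)))
```

Run against the first listing this reports groups 2 and 20; run against the excerpt from the 93-group system it reports none, since each bridge there sits in a group of its own.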

Right-O, so here is an update.

If you reserve the GPU (or not), adding it to the VM adds the bridge, audio and video to devices for you. But unless you manually remove the bridge it will fail to boot. I’m still testing, but it appears with the bridge removed this may work. I am building an Oracle VM, and then installing the Blackwell drivers to see how far I get. I already saw the two devices in dmesg, but don’t have drivers yet. Then I’ll install the CUDA stuff and see if I can run an LLM. I think the default behavior of adding the bridge may just be wrong. #shrug
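
The manual step of removing the bridge can at least be checked from the host side. Below is a small Python sketch that, given a list of PCI addresses, drops anything whose sysfs class code marks it as a PCI-to-PCI bridge (class 0x0604, the same [0604] shown in lspci -nn output). The helper names is_pci_bridge and endpoints_only are hypothetical, it does not call the TrueNAS middleware at all, and the addresses in the usage comment are just the ones from the gpu.sh output with the 0000 domain prefix assumed.

```python
from pathlib import Path

# Base class 0x06 (bridge), subclass 0x04 (PCI-to-PCI bridge).
PCI_CLASS_PCI_BRIDGE = 0x0604


def is_pci_bridge(address, sysfs_root="/sys/bus/pci/devices"):
    """True if the device's sysfs class code says PCI-to-PCI bridge.

    The `class` file holds a 24-bit value such as 0x060400; the low
    byte is the programming interface, so it is shifted away.
    """
    class_file = Path(sysfs_root) / address / "class"
    class_code = int(class_file.read_text().strip(), 16)
    return (class_code >> 8) == PCI_CLASS_PCI_BRIDGE


def endpoints_only(addresses, sysfs_root="/sys/bus/pci/devices"):
    """Keep the addresses that are not PCI-to-PCI bridges, in order."""
    return [a for a in addresses if not is_pci_bridge(a, sysfs_root)]


# Example (on the host, addresses assumed from the listing above):
# endpoints_only(["0000:00:01.0", "0000:01:00.0", "0000:01:00.1"])
```

If the result still contains only the VGA and audio functions, those are the two devices the VM should end up with after the bridge is removed.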

I did manage to get an OpenWebUI container to use the GPU finally. I’ll abandon efforts to use a VM for now.

Yep - that’s why I just bought an X570 Taichi today, as it seems to have great IOMMU groups…
I had my fair share of issues with my B550-A Gaming.
No way to get an RTX4000 Blackwell working reliably.

(And in addition, the NIC has some issues with random sleep during boot, which cost me 3-4 nights of sleep, since I could not access the network at all; only Claude found this issue online.)