HBA gets assigned vfio-pci driver after NIC replacement

I replaced the NIC in my R730xd, and my HBA ended up being treated as an Isolated GPU Device.

root@truenas[~]# lspci |grep -i lsi
02:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
04:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)


root@truenas[~]# lspci -nnk -s 04:00.0 
04:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 [1000:00c4] (rev 01) 
  Subsystem: Broadcom / LSI SAS9305-16i [1000:3190] 
  Kernel driver in use: vfio-pci 
  Kernel modules: mpt3sas

I was able to remove the PCI device from the Isolated GPU list with:

midclt call datastore.update system.advanced 1 '{"adv_isolated_gpu_pci_ids": []}'

and rebind the correct driver with:

# release the device from vfio-pci
echo 0000:04:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
# tell the kernel which driver to use on the next probe
echo mpt3sas > /sys/bus/pci/devices/0000:04:00.0/driver_override
# re-probe the device
echo 0000:04:00.0 > /sys/bus/pci/drivers_probe
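
Re-checking the device confirms the rebind took; "Kernel driver in use" should now read mpt3sas instead of vfio-pci:

lspci -nnk -s 04:00.0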

After that, the disks attached to the HBA show up again.
But this doesn't persist across reboots.

Verifying the IOMMU grouping shows 04:00.0 is alone in group 22:

root@truenas[~]# readlink /sys/bus/pci/devices/0000:04:00.0/iommu_group
../../../../kernel/iommu_groups/22
root@truenas[~]#  ls /sys/kernel/iommu_groups/22/devices
0000:04:00.0
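
For completeness, here's a quick way to dump every PCI device by IOMMU group, to check whether anything else shares a group (plain sysfs, nothing TrueNAS-specific):

# list each IOMMU group and the devices in it
for d in /sys/kernel/iommu_groups/*/devices/*; do
  g=${d#/sys/kernel/iommu_groups/}
  echo "group ${g%%/*}: $(basename "$d")"
done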

I don’t know what to do to fix this permanently.

Hey, I really need some help here, as I have no idea how to solve this permanently.
Swapping the NIC is quite a hassle, so I'd like to have a strategy for avoiding this issue before I try again.

AI bots say I could use a systemd unit for this, but it didn't work.
Besides, it sounds like something that should be configurable from TN itself.
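
For what it's worth, TN does ship a hook that might fit: System Settings > Advanced > Init/Shutdown Scripts, which can run a script at Post Init, i.e. after the middleware has done its vfio binding (which could be why a plain systemd unit fired too early). A minimal sketch of what I'm considering registering there, reusing the rebind commands from above; the path /root/rebind-hba.sh and the driver check are my own untested additions:

#!/bin/sh
# rebind-hba.sh: registered as a Post Init script; adjust DEV to your HBA
DEV=0000:04:00.0
# only act if the device is currently claimed by vfio-pci
if [ "$(readlink -f /sys/bus/pci/devices/$DEV/driver)" = "/sys/bus/pci/drivers/vfio-pci" ]; then
  echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/unbind
  echo mpt3sas > "/sys/bus/pci/devices/$DEV/driver_override"
  echo "$DEV" > /sys/bus/pci/drivers_probe
fi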

I think most of us don’t clearly understand the problem.

In my case, even if I understood the problem, I'm not sure I would have any suggestions, because I don't have an isolated GPU device.

Isolated GPU Device(s) is a new feature offered from 25.10 onward, I think.

It allows you to select a specific PCI device and isolate it from the TrueNAS host (using its IOMMU group, if I'm not mistaken) so that it can be allocated to a VM.
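
To see what the middleware actually has isolated at any given moment, reading the advanced config should work; the field name isolated_gpu_pci_ids is my assumption, inferred from the adv_-prefixed datastore column in the command earlier in the thread:

midclt call system.advanced.config | python3 -c 'import sys, json; print(json.load(sys.stdin).get("isolated_gpu_pci_ids"))'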

I had previously made an attempt at adding an Isolated GPU Device (an NVIDIA Tesla T4, for use with TrueNAS VMs on a Dell R730xd). It failed.
And it was apparently not properly removed afterwards, or at least a device was still listed under Isolated GPU.

And it seems to have messed up TN in an unexpected way:
when I installed the new NIC, I suspect the system reassigned PCI bus addresses, and my HBA ended up with the address of the originally isolated GPU. Or at least that's how I understand it.
That would explain why the wrong driver (vfio-pci instead of mpt3sas) was bound to it.
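
One way to sanity-check that theory is to locate the HBA by its vendor:device ID (1000:00c4 in the lspci -nn output above) instead of by bus address, since the ID follows the card even when the bus address changes:

lspci -d 1000:00c4 -nnk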

It wasn't possible to deal with the leftover Isolated GPU entry from the UI, but using the very specific midclt command given by an AI chat, I was able to unregister it.
(I realize now that midclt call datastore.update is undocumented and reserved for internal use.)
Anyway, it worked: the Isolated GPU list is now empty.
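
If I read the API right, the supported surface for the same change would be system.advanced.update rather than datastore.update, although its validation may be exactly what refused to clear the entry from the UI in the first place, so treat this as an untested alternative:

midclt call system.advanced.update '{"isolated_gpu_pci_ids": []}'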

But even after that and a reboot, the new NIC kept disturbing the existing PCI bus addresses in the same way (the HBA getting the vfio-pci driver).

I had to give up on the new 10Gb NIC and revert to the existing one.
