After upgrading to 25.10 yesterday, I finally tried to add an NVIDIA Tesla T4 that I have had running in a spare server (R640) that is too expensive to keep up 24/7.
I thought I'd add it to my main TrueNAS storage server, pass it through to a VM, and start playing with something available 24/7.
It went well at first: lspci displayed it, and I could pick it in the VM Edit dialog.
I had an issue adding it the first time; the UI spun for a very long time, and in the end it wasn't added.
I tried again, and then starting the VM errored. I realized I had two passthrough devices added and thought that was the cause.
I couldn't remove the extra passthrough device; the UI errored with a Python trace, and then refreshing the VM page led to a red message and no VMs.
Uh-oh. I rebooted TN, hoping it would help. It rarely does, but it did this time.
Everything looked good after I removed the extra passthrough device, and I decided to boot the VM.
Then bam, it hard-crashed the server, and after reboot only an unknown device was shown in the TN VM GPU Edit dialog.
lspci didn't show anything for the NVIDIA device anymore.
I cleaned up the device from the VM and started it, and it went OK.
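For what it's worth, when a card drops off the bus like this, a PCIe rescan from the shell can sometimes bring it back without a full reboot. A hedged sketch (run as root; on my box the device did not come back this way, so no guarantees):

```shell
# Is the T4 visible at all right now?
lspci -nn | grep -i nvidia

# Ask the kernel to re-enumerate the PCIe bus (root required).
echo 1 > /sys/bus/pci/rescan

# Check again after the rescan.
lspci -nn | grep -i nvidia
```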
Today, incidentally, I asked ChatGPT about power draw and how to save some watts with my main box. As it proved resourceful in its advice, I decided to have a go at tweaking the BIOS settings (thinking I needed to reboot anyway, and maybe remove the NVIDIA GPU and reinsert it to solve the issue).
That's when the BIOS started to freeze, oddly.
I shared that with ChatGPT and it told me this, which is what I ultimately want to share:
You didn’t hit a random PCIe incompatibility.
You hit a known failure mode of GPU passthrough on Dell 13G + NVIDIA T4. Let’s unpack what happened and what it means.
**Why it worked… until the VM started**

What you observed:

- T4 enumerates (`lspci` sees it)
- TrueNAS boots fine
- Crash only when the VM starts with the GPU attached

This tells us:

- Basic PCIe enumeration is fine
- The failure happens when the GPU transitions into “active / bus-master / DMA” mode
That transition is exactly what happens when:

- VFIO binds the device
- The VM powers on
- GPU firmware is initialized
- DMA engines start
- PCIe AER is exercised
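As a side note, you can watch that first step happen from the TrueNAS shell. This is my own hedged sketch, not part of ChatGPT's answer; the `06:00.0` slot address is a placeholder for your actual one:

```shell
# Show the device with its kernel driver; once passthrough is armed,
# the "Kernel driver in use" line should read vfio-pci instead of
# nvidia/nouveau.
lspci -nnk -s 06:00.0
```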
**What actually triggered the crash**

When the VM booted:

1. VFIO took ownership of the T4
2. The GPU enabled bus mastering, DMA, and MSI/MSI-X interrupts
3. The GPU tried to access host memory and negotiate PCIe power states
4. The R730xd firmware detected an uncorrectable PCIe error
5. iDRAC escalated → NMI
6. BIOS halted the system

This is exactly what your SEL shows:

> “PCIe fatal error on component at slot 6”
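If you want to read that SEL yourself from the host rather than through the iDRAC web UI, `ipmitool` can dump it. A sketch of mine (assumes `ipmitool` is installed and the host can talk to the BMC):

```shell
# Show the most recent System Event Log entries from the BMC/iDRAC.
ipmitool sel elist | tail -20
# Entries like "PCIe fatal error" point at the slot that faulted.
```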
**Root causes (this combo is the problem)**

1. T4 + Dell 13G + passthrough = unsupported
   - Dell 13G firmware has buggy PCIe AER handling
   - The NVIDIA T4 expects newer PCIe error handling and better ACS isolation
   - When passthrough activates full DMA → fatal error
   - This is a known issue on R730 / R730xd, R630, and some R720s
2. ACS / IOMMU limitations on 13G
   - PCIe slots often share IOMMU groups
   - GPU passthrough stresses ACS routing
   - Dell firmware reacts badly to GPU DMA remapping
   - This usually shows up only when the VM starts
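You can see that grouping concretely. This small sketch of mine (assumes a stock Linux sysfs layout, as on TrueNAS SCALE) prints every PCI device by IOMMU group; a GPU sharing a group with other devices is the symptom described above:

```shell
#!/bin/sh
# Print every PCI device, prefixed by the IOMMU group it belongs to.
base=/sys/kernel/iommu_groups
for dev in "$base"/*/devices/*; do
  [ -e "$dev" ] || continue
  group=$(basename "$(dirname "$(dirname "$dev")")")
  printf 'IOMMU group %s: %s\n' "$group" "$(basename "$dev")"
done
```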
**Power & reset behavior of the T4**

The T4:

- Does not tolerate incomplete PCIe resets well
- Expects modern hot-reset semantics
- Dell 13G does not fully support that

Result: the first VM boot might work, but subsequent resets → fatal error.
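One way to probe that reset behavior without involving a VM at all is to ask the kernel for a function-level reset through sysfs. This is my own hedged sketch, not ChatGPT's advice: `0000:06:00.0` is a placeholder address, root is required, and on a flaky platform this can itself hang the box, so only try it on a machine you can afford to crash:

```shell
# Issue a function-level reset on the card outside any VM.
dev=/sys/bus/pci/devices/0000:06:00.0
if [ -f "$dev/reset" ]; then
  echo 1 > "$dev/reset" && echo "reset issued"
else
  echo "device does not expose a reset method"
fi
```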
**Why this escalates to BIOS corruption / NMI**

After a PCIe fatal error, Dell does NOT gracefully recover. Instead:

1. The PCIe fabric enters an error state
2. Firmware flags a hardware exception
3. Subsequent BIOS access → hang
4. Watchdog → NMI halt

This is why things got worse after the crash.
Now I have to admit this is beyond my level of understanding; I come from software.
But this seems compelling enough to me that I won't try to insert the GPU in my box anymore!
Though I badly need a GPU, and I don't want to have that server running 24/7…
Could someone knowledgeable enough in this field validate that?