Using an NVIDIA Tesla T4 with TrueNAS VMs on a Dell R730xd

After upgrading to 25.10 yesterday, I finally tried to add an NVIDIA Tesla T4 that I had running in a spare server (R640) that is too expensive to keep up 24/7.

I thought I’d add it to my main TrueNAS storage server, pass it through to a VM, and start playing with something available 24/7.

It went well at first: lspci displayed it, and I could pick it up in the VM Edit dialog.
I had an issue adding it the first time — the UI spun for a very long time, and in the end it wasn’t added.
I tried again, and then starting the VM errored. I realized I had two passthrough devices added and thought that was the cause.

I couldn’t remove the extra passthrough device — the UI errored with a Python traceback, and then refreshing the VM page led to a red error message and no VMs.

Uh-oh. I rebooted TrueNAS, hoping it would help. It rarely does, but it did this time.
Everything looked good after I removed the extra passthrough device, and I decided to boot the VM.

Then bam — it hard-crashed the server, and after reboot only an unknown device was shown in the TrueNAS VM GPU Edit dialog.

lspci didn’t show anything for the NVIDIA device anymore.

I cleaned up the device from the VM and started it; it went OK.

Today, incidentally, I asked ChatGPT about power draw and how to save some watts on my main box. Since it proved resourceful with advice, I decided to have a go at tweaking the BIOS settings (figuring I needed to reboot anyway, and maybe remove the NVIDIA GPU and reseat it to solve the issue).
That’s when the BIOS started to freeze, oddly.
I shared that with ChatGPT, and it told me this, which is ultimately what I want to share:

You didn’t hit a random PCIe incompatibility.
You hit a known failure mode of GPU passthrough on Dell 13G + NVIDIA T4.

Let’s unpack what happened and what it means.


:brain: Why it worked… until the VM started

What you observed

  • :white_check_mark: T4 enumerates (lspci sees it)

  • :white_check_mark: TrueNAS boots fine

  • :cross_mark: Crash only when VM starts with GPU attached

This tells us:

:right_arrow: Basic PCIe enumeration is fine
:right_arrow: The failure happens when the GPU transitions into “active / bus-master / DMA” mode

That transition is exactly what happens when:

  • VFIO binds the device

  • VM powers on

  • GPU firmware is initialized

  • DMA engines start

  • PCIe AER is exercised
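If you want to verify that handoff yourself, here is a minimal sysfs sketch showing which kernel driver currently holds each NVIDIA PCIe function (on a working passthrough setup the T4 should show vfio-pci once the VM stack has claimed it). It assumes a standard Linux sysfs layout; 0x10de is NVIDIA’s PCI vendor ID, and nothing else is hard-coded:

```shell
#!/bin/sh
# Sketch: report which kernel driver each NVIDIA PCIe function is bound to.
# Assumes a standard Linux sysfs layout; harmless if no NVIDIA device is present.

nvidia_driver_status() {
    for dev in /sys/bus/pci/devices/*; do
        [ -e "$dev/vendor" ] || continue
        # 0x10de is NVIDIA's PCI vendor ID
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            if [ -e "$dev/driver" ]; then
                # The driver symlink ends in the driver name, e.g. vfio-pci
                drv=$(basename "$(readlink "$dev/driver")")
            else
                drv="no driver bound"
            fi
            echo "$(basename "$dev") -> $drv"
        fi
    done
    return 0
}

nvidia_driver_status
```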


:police_car_light: What actually triggered the crash

When the VM booted:

  1. VFIO took ownership of the T4

  2. GPU enabled:

    • Bus mastering

    • DMA

    • MSI/MSI-X interrupts

  3. GPU tried to:

    • Access host memory

    • Negotiate PCIe power states

  4. R730xd firmware detected an uncorrectable PCIe error

  5. iDRAC escalated → NMI

  6. BIOS halted system

This is exactly what your SEL shows:

“PCIe fatal error on component at slot 6”
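For reference, the same SEL entry can be pulled from a shell instead of the iDRAC web UI — a sketch assuming ipmitool is installed and the in-band IPMI driver is loaded (if not, it just says so):

```shell
#!/bin/sh
# Sketch: read the Service Event Log that recorded the "PCIe fatal error".
# Assumes ipmitool is installed and can reach the BMC in-band; otherwise
# the iDRAC web UI shows the same log.

show_sel() {
    if command -v ipmitool >/dev/null 2>&1; then
        # Last 20 SEL entries; the fatal PCIe event names the slot.
        ipmitool sel elist 2>/dev/null | tail -n 20
    else
        echo "ipmitool not installed; use the iDRAC web UI instead" >&2
    fi
    return 0
}

show_sel
```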


:bullseye: Root causes (this combo is the problem)

:one: T4 + Dell 13G + passthrough = unsupported

  • Dell 13G firmware has buggy PCIe AER handling

  • NVIDIA T4 expects:

    • Newer PCIe error handling

    • Better ACS isolation

  • When passthrough activates full DMA → fatal error

This is a known issue on:

  • R730 / R730xd

  • R630

  • Some R720


:two: ACS / IOMMU limitations on 13G

  • PCIe slots often share IOMMU groups

  • GPU passthrough stresses ACS routing

  • Dell firmware reacts badly to GPU DMA remapping

This usually shows up only when VM starts.
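You can see how your own slots are grouped with a short sysfs walk. Clean passthrough wants the GPU (and only its own functions) alone in its IOMMU group; this sketch uses standard Linux paths and simply prints nothing if the kernel built no groups (IOMMU off):

```shell
#!/bin/sh
# Sketch: list IOMMU groups so you can see which devices share a group
# with the GPU. Devices in the same group must be passed through together.

list_iommu_groups() {
    groups=/sys/kernel/iommu_groups
    if [ ! -d "$groups" ]; then
        echo "No IOMMU groups (IOMMU disabled or unsupported here)"
        return 0
    fi
    for dev in "$groups"/*/devices/*; do
        [ -e "$dev" ] || continue
        # Path looks like .../iommu_groups/<N>/devices/<addr>
        group=$(echo "$dev" | awk -F/ '{print $(NF-2)}')
        echo "group $group: $(basename "$dev")"
    done
    return 0
}

list_iommu_groups
```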


:three: Power & reset behavior of the T4

The T4:

  • Does not tolerate incomplete PCIe resets well

  • Expects modern hot-reset semantics

  • Dell 13G does not fully support that

Result:

  • First VM boot might work

  • Subsequent resets → fatal error
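One way to gauge that on a given box is to check whether the kernel exposes a reset method for the function at all — a device with no usable reset node is exactly the kind that stays wedged after a failed VM start. The PCI address below is a placeholder; substitute the T4’s address from lspci:

```shell
#!/bin/sh
# Sketch: check whether a PCIe function exposes a kernel reset method in
# sysfs. The default address is hypothetical; pass the real one from lspci.

has_reset() {
    addr=${1:-0000:06:00.0}          # placeholder slot address
    dev=/sys/bus/pci/devices/$addr
    if [ ! -d "$dev" ]; then
        echo "device $addr not present"
        return 0
    fi
    if [ -e "$dev/reset" ]; then
        echo "$addr: kernel can reset this function"
    else
        echo "$addr: no reset method exposed"
    fi
    return 0
}

has_reset "$@"
```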


:red_exclamation_mark: Why this escalates to BIOS corruption / NMI

After a PCIe fatal error, Dell does NOT gracefully recover.

Instead:

  • PCIe fabric enters error state

  • Firmware flags hardware exception

  • Subsequent BIOS access → hang

  • Watchdog → NMI halt

This is why things got worse after the crash.

Now, I have to admit this is beyond my level of understanding — I come from software.
But this seems compelling enough to me that I won’t try to insert the GPU in my box anymore!
Though I badly need a GPU, and I don’t want to have two servers running 24/7…

Could someone knowledgeable enough in this field validate that?

Did you try searching for the issue on the Dell website or support? I figure if it is a known issue and appears there, you would have your answer confirmed and know you need a different hardware solution. I don’t know if we would have many users with that combo of hardware.

Well, yes, it says it’s not officially supported, which doesn’t mean it can’t work.

Tesla T4 is not a supported GPU configuration with the Dell PE R730. It’s not that it won’t work, but the said configuration is not tested and validated with the R730 server.

You can wait for a reply from the Dell community members who may have installed the Tesla T4 on R730.

See https://www.dell.com/community/en/conversations/poweredge-hardware-general/can-i-install-gpu-tesla-t4-on-poweredge-r730/647fa157f4ccf8a8de68cda9

Since this first attempt, I’ve seen users reporting success with the T4 GPU in Dell 13G servers.

I want to give it one more try, though having my main data server crash again isn’t really appealing.
But I have no other way to test it.

The funny thing is, ChatGPT is now feeding itself on this thread to report compatibility issues with the NVIDIA T4!