Intel Arc GPUs A380 and A750 do not show up in my new build

After a (very) long read I am currently building a new NAS to replace my 15-year-old and quite reliable Ubuntu server. The main items have arrived and I am testing the various aspects of TrueNAS that I intend to use.
The purpose of the system is documents and media for family and friends, VMs for myself to develop and experiment with, and maybe some more apps that prove to be useful (like Passbolt).

The system consists of:

  • case: Jonsbo N5
  • motherboard: Supermicro H12SSL-NT
  • processor: AMD EPYC 7642 (48 cores)
  • cooler: Arctic Freezer 4U-M
  • memory: 8x Samsung M393A8G40AB2-CWE 64GB
  • power supply: Seasonic Prime TX 850
  • boot pool: 2x Kingston KC3000 PCIe 4.0 NVMe M.2 SSD 512GB (mirror)
  • data pool: 2x Asus Hyper M.2 x16 Gen 4 with 8x WD Black SN850X 2TB (RAIDZ2)
  • backup pool: 2x Seagate HDD 3.5" EXOS X16 16TB (mirror)
  • apps GPU: Sparkle Intel Arc A380 ELF 6GB
  • VM GPU: Sparkle Intel Arc A750 ROC OC 8GB

Nextcloud is running, Plex is yet to be tried, but I am currently stuck on the two GPUs. Via lspci | grep VGA both appear to be there, but in the GUI I cannot find them.

43:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
c3:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A750] (rev 08)
c7:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05)


Also, the BMC of the board shows nothing under GPU. But since I am a bit disappointed by the functionality of this BMC, maybe that is just the way it is.

I know that there are quite a few people who successfully use an Intel Arc GPU, so I hope the solution is quite trivial.
Any help is appreciated.

Earlier SCALE versions would show your GPU, but might not allow it to actually be isolated if there were certain other devices on the PCIe bus that prevented it. Perhaps that was “fixed” by not showing those GPUs in the first place.

If you’re just looking for hardware transcoding, you should be able to pass the devices through to your Plex container and still utilize the GPU. You don’t need to isolate the GPU completely.

What devices show up when you run the command:
ls -la /dev/dri
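
For each Arc card you would expect both a card* and a renderD* node. To map those nodes back to the PCI addresses from lspci (handy for telling the A380 and the A750 apart), the by-path symlinks can help - a quick check, assuming udev has created them on your system:

ls -la /dev/dri/by-path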

Thank you very much for your response.
The output you asked for seems to indicate that both GPUs are identified:

Screenshot from 2025-02-05 19-33-09

I managed to activate the Plex app and both GPUs show up successfully:

Screenshot from 2025-02-05 19-31-35

Screenshot from 2025-02-05 19-31-23

So for apps all is well. But the Arc A750 is meant to serve VMs and should be isolated for that reason. Any suggestions?

@Zjoz Have you enabled IOMMU in the BIOS of your system under Advanced → NB Configuration?

You can output your IOMMU groups with the following block of code:

for d in $(find /sys/kernel/iommu_groups/ -type l | sort -n -k5 -t/); do
    # extract the IOMMU group number from the path
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    # describe the device at that PCI address, including vendor/device IDs
    lspci -nns "${d##*/}"
done

But this would normally not prevent the GPUs from showing up for isolation - only cause an error when the VM was booted.

Lastly, does sudo intel_gpu_top show both cards?
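
If it only reports one of them by default, you can list what it detects and point it at a specific card - a sketch, assuming a reasonably recent igt-gpu-tools build (adjust the cardN number to whatever ls -la /dev/dri shows for each Arc device):

sudo intel_gpu_top -L
sudo intel_gpu_top -d drm:/dev/dri/card1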


@HoneyBadger Thanks for jumping in.

I found the following relevant sections in the output of your code:

IOMMU Group 0 c0:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 0 c0:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 0 c1:00.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa0] (rev 01)
IOMMU Group 0 c2:01.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa4]
IOMMU Group 0 c2:04.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa4]
IOMMU Group 0 c3:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A750] [8086:56a1] (rev 08)
IOMMU Group 0 c4:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f90]
IOMMU Group 2 c0:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 2 c0:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 2 c5:00.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa1] (rev 01)
IOMMU Group 2 c6:01.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa4]
IOMMU Group 2 c6:04.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa4]
IOMMU Group 2 c7:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05)
IOMMU Group 2 c8:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f92]

I do not know what to conclude from that, but it looks good to me.
The BIOS settings for the IOMMU were on auto and I left it as such.

Furthermore, I tried intel_gpu_top before, but that gave a completely empty table. I repeated it just now with the same result.

Any ideas what the issue could be? How should I proceed from here?

I don't have a solution, but just wanted to mention that I have the same GPU (A750) in my server, and that one shows up correctly for me in the isolate GPU section. I only have one card though…
I'm running on an older Supermicro X10, but I get the same output for the IOMMU group as you.
I haven't been able to check my BIOS settings yet.

@CyberFluffy

I pulled the A380, just to be sure (and out of a bit of desperation).
Still no GPU to isolate, and no output from intel_gpu_top.

I'm guessing this is the expected output?
(screenshot of intel_gpu_top output)

Not helping much by showing that it works for me, I know… But I will check what my BIOS setting is when I get home, just in case it can help.

I am new to TrueNAS myself, so I have a pretty much clean install. I installed 24.10.1, then got the card later on, installed it, and upgraded to 24.10.2.

@CyberFluffy Strange that it pops up for you to isolate, but you do not see anything with intel_gpu_top.
@HoneyBadger Can you explain this?

Your IOMMU Groups 0 and 2 (respectively) that contain your GPUs are sharing space with host bridge and PCI bridge devices, which are marked as “system critical” and can’t be cleanly isolated.

midclt call device.get_gpus will result in a similar (but less-readable JSON) output indicating the exact devices that it’s sharing with.
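
If jq happens to be installed on your system (an assumption - it may not be present on every release), you can trim that JSON down to the relevant fields:

midclt call device.get_gpus | jq '.[] | {description, uses_system_critical_devices, critical_reason}'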

By comparison, a snippet of the same output on my system:

IOMMU Group 8 80:04.7 System peripheral [0880]: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMA Channel 7 [8086:2f27] (rev 02)
IOMMU Group 9 81:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
IOMMU Group 10 ff:08.0 System peripheral [0880]: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0 [8086:2f80] (rev 02)

In this case, the Tesla P4 is in its own IOMMU group (9) with no other devices and shows up in the drop-down.

Try forcing it to Enabled to see if the results of the IOMMU group dump change. If it shows only the “VGA compatible controller” and “Audio device” together, then it’s able to be passed through.

It should look like @CyberFluffy's result - if you prepend sudo to the command, does the output change?

I forced it to Enabled but got the same results for the IOMMU groups.

Confirmed what you indicated:

[
  {
    "addr": {
      "pci_slot": "0000:43:00.0",
      "domain": "0000",
      "bus": "43",
      "slot": "00"
    },
    "description": "ASPEED Technology, Inc. ASPEED Graphics Family",
    "devices": [
      {
        "pci_id": "1A03:2000",
        "pci_slot": "0000:43:00.0",
        "vm_pci_slot": "pci_0000_43_00_0"
      }
    ],
    "vendor": null,
    "uses_system_critical_devices": true,
    "critical_reason": "Critical devices found in same IOMMU group: 0000:43:00.0",
    "available_to_host": true
  },
  {
    "addr": {
      "pci_slot": "0000:c3:00.0",
      "domain": "0000",
      "bus": "c3",
      "slot": "00"
    },
    "description": "Intel Corporation DG2 [Arc A750]",
    "devices": [
      {
        "pci_id": "8086:56A1",
        "pci_slot": "0000:c3:00.0",
        "vm_pci_slot": "pci_0000_c3_00_0"
      }
    ],
    "vendor": "INTEL",
    "uses_system_critical_devices": true,
    "critical_reason": "Critical devices found in same IOMMU group: 0000:c3:00.0",
    "available_to_host": true
  },
  {
    "addr": {
      "pci_slot": "0000:c7:00.0",
      "domain": "0000",
      "bus": "c7",
      "slot": "00"
    },
    "description": "Intel Corporation DG2 [Arc A380]",
    "devices": [
      {
        "pci_id": "8086:56A5",
        "pci_slot": "0000:c7:00.0",
        "vm_pci_slot": "pci_0000_c7_00_0"
      }
    ],
    "vendor": "INTEL",
    "uses_system_critical_devices": true,
    "critical_reason": "Critical devices found in same IOMMU group: 0000:c7:00.0",
    "available_to_host": true
  }
]

And I did use sudo intel_gpu_top and that gave the same output as the screenshot of @CyberFluffy.

I hope you have some further suggestions on how to separate the devices in the IOMMU groups.

I missed the group number part at first; I see I have a different ID there myself… But at least I learned a little about what IOMMU is today :smiley:

IOMMU Group 73 03:04.0 PCI bridge [0604]: Intel Corporation Device [8086:4fa4]
IOMMU Group 74 04:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A750] [8086:56a1] (rev 08)
IOMMU Group 75 05:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f90]
IOMMU Group 76 06:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 [8086:1528] (rev 01)

But I did check my BIOS, and it's all on Auto and PCIe Gen3 for my stuff under the North Bridge settings…

And I have absolutely no load on my GPU yet, as I haven't been able to set things up, being new to TrueNAS. I need more spare time ^^

@HoneyBadger
I delved a bit into this IOMMU phenomenon and read something about a pcie_acs_override kernel option. Could that be a solution?

I also left out one of the GPUs and shifted the other one to different PCIe slots. Each time this resulted in exactly the same combination of devices in the GPU's group (with a different group number, of course).
Could it be that this is not an issue in itself? The audio device that is in the same group seems to belong to the GPU. And could the host and PCI bridges not also 'belong' to this device? Is the check for whether the GPU can be isolated perhaps too strict? And what would happen if the isolation were forced (if that is possible at all)?

I opted not to mention the pcie_acs_override functionality specifically because it does pose a security and stability risk - you’re basically removing all of the barriers between memory access to PCI devices, and trusting that your guest VM doesn’t do anything deliberately or inadvertently bad to host memory addresses.
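
For reference only, using it would typically mean adding something like the line below to the kernel command line - a sketch, and it only does anything if the running kernel actually carries the ACS override patch, which I'm not asserting for TrueNAS SCALE:

pcie_acs_override=downstream,multifunction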

Audio and video devices together in the same IOMMU group is fine - the middleware specifically calls out PCI bridge devices as being critical. I’ll try to take a deeper look into the code to see if there’s any differentiation being made of “root” vs “non-root” bridge devices.

You may be able to set up the devices for manual passthrough under VM → Devices → Add → PCI Passthrough Device, but here be dragons, and I strongly recommend you disable VM autostart for the VM in question; if the machine booting and isolating the device causes a system hang, VM autostart will mean that it starts again on a reboot, making it hang again … and around you go.

I came upon that option while getting 'educated' by the following video. At 17:15 the option pcie_acs_override=downstream,multifunction is used, and at 18:30 pcie_acs_override=id<...> to break up devices below a bridge or so. I am new to this and maybe you already know these options. But are these less dangerous methods to solve the issue?

Isolating the complete group containing my GPU could be a problem though, if a host bridge is in that group. So could the group be forced to split below this bridge?

This sounds like an option to use when the less dangerous ones fail. Two of my reasons for investing quite an amount in this new build are to do video processing with a really capable VM and to play some serious games in a VM as well. It would be a real pity (to say the least :cry:) if I could not meet these goals.

@HoneyBadger
I delved some more and found instructions to pass a GPU through for use with KVM. They use the following kernel options

amd_iommu=on iommu=pt vfio-pci.ids=<device-address>,<device-address>

while enabling the vfio-pci kernel module.
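
Applied to the A750 and its audio function from the IOMMU output above, that would look roughly like this - a sketch only, assuming a stock Debian-style GRUB setup (TrueNAS SCALE manages its own boot configuration, so this is not an officially supported change and may be overwritten by updates):

# /etc/default/grub (illustrative)
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt vfio-pci.ids=8086:56a1,8086:4f90"

# load vfio-pci early and rebuild the boot configuration
echo vfio-pci | sudo tee /etc/modules-load.d/vfio.conf
sudo update-grub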

We’re doing the same thing (binding the vfio-pci driver to the device) in the middleware - but doing it manually might result in a runtime error when you power on the VM with a “cannot reset/isolate device” if the PCI bridge is truly a system-critical one.
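
A quick way to check which driver actually ends up bound to the card after boot (standard lspci flags; the -s address is the A750 from your earlier output):

lspci -nnk -s c3:00.0

The "Kernel driver in use" line should read vfio-pci if the binding took effect, or i915 if the host still owns the card.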

While interrupting the GRUB boot process I saw that these options (amd_iommu=on iommu=pt) are indeed used. I added the vfio-pci.ids=<device-address> option specifically for my Arc A750 GPU, but sadly to no avail.
I also installed a Windows VM and added the GPU as a PCI Passthrough Device. Again without success.
Are there any options left? I am growing rather sad about this. Somehow it should be possible to break up this obstinate IOMMU group, don't you think?