System crash when attempting to start a VM with Hailo PCIe device

I installed TN on a mini PC with an N200, and today I updated to 25.04.2. I want to install a VM with vanilla Debian, so I configured the VM, attached a Hailo PCIe card as a device, and uploaded the ISO, but when I try to start it the system crashes and no longer responds to the UI or the console. Here are the logs I captured before the crash.

TNlog.txt (9.9 KB)

If I remove the device, the VM starts normally. The purpose of creating the VM was to use that device, since the drivers are not available in TN. Is there anything I can do to make it work?

The other option for me is to enable the developer mode and compile the drivers, but I was discouraged to do that.

Iirc passing a PCIe device to a VM requires passing its whole IOMMU group to the VM.

You can list your IOMMU groups using this script:

#!/bin/bash
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"
done
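In case the substitutions in that script look opaque, here is what the two parameter expansions do on a sample sysfs path (the path is just an illustration of what the glob produces):

```shell
# Sample path, as produced by the glob in the script above
d=/sys/kernel/iommu_groups/7/devices/0000:00:16.0

n=${d#*/iommu_groups/*}   # strip everything up to and including "iommu_groups/"
n=${n%%/*}                # keep only the group number
echo "$n"                 # -> 7

echo "${d##*/}"           # -> 0000:00:16.0 (the address handed to lspci -nns)
```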

Maybe the MiniPC’s PCIe configuration doesn’t allow the passthrough of just this one device…

It does not work. If I run the script as truenas_admin with sudo it returns:

truenas_admin@truenas[~]$ sudo ./iommu.sh
[sudo] password for truenas_admin:
sudo: unable to execute ./iommu.sh: Permission denied

If I execute it as root it returns:

root@truenas[/home/truenas_admin]# ./iommu.sh
sudo: process 9976 unexpected status 0x57f
zsh: killed     ./iommu.sh

never seen this…

Did you set the executable bit on the script (chmod +x)?
You can also just paste the content right into a bash window (it might work with zsh too, though I am not sure the for-loop and string-substitution syntax match between bash and zsh).

sure:
-rwxr-xr-x 1 truenas_admin truenas_admin 160 Jul 31 23:38 iommu.sh

I also tried with bash but the result is the same:

root@truenas[/home/truenas_admin]# bash
root@truenas:/home/truenas_admin# ./iommu.sh
sudo: process 11492 unexpected status 0x57f
Killed

Paste the output of cat iommu.sh please. There is something wrong here.
Sudo shouldn’t be called in there.

Maybe it’s because I become root with sudo su:

root@truenas:/home/truenas_admin# cat iommu.sh
#!/bin/bash
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"
done

That could be the issue. Running sudo su changes the user but keeps your non-privileged user’s environment. Try sudo -i or sudo su - instead (although that second one is broken on some versions of TN Scale).
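As a rough illustration of the difference (not TrueNAS-specific): a command started with the caller’s environment inherits its variables, while a login-style invocation such as sudo -i starts from a mostly clean one. env -i simulates that reset:

```shell
# Variable is inherited when the caller's environment is passed along...
MYVAR=from-caller sh -c 'echo "inherited: $MYVAR"'       # -> inherited: from-caller

# ...but gone when the environment is cleared, which is roughly what a
# login shell (sudo -i) does compared to plain "sudo su"
MYVAR=from-caller env -i sh -c 'echo "reset: [$MYVAR]"'  # -> reset: []
```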

no changes:

truenas_admin@truenas[~/hailort]$ sudo -i
root@truenas[~]# bash
root@truenas:~# cd /home/truenas_admin/
root@truenas:/home/truenas_admin# ./iommu.sh
sudo: process 12018 unexpected status 0x57f
Killed

sudo su - does not work

That is an extremely weird error. Just to be sure, can you try the following:

  • paste the text (all the lines at once) into a bash as root without saving them in a file first
  • run just lspci in the shell you are trying to run the script in
  • check if the for loop works by running:
for d in /sys/kernel/iommu_groups/*/devices/*; do
 echo $d
done

It works:

root@truenas:~# lspci
...
07:00.0 Co-processor: Hailo Technologies Ltd. Hailo-8 AI Processor (rev 01)
root@truenas:~# for d in /sys/kernel/iommu_groups/*/devices/*; do
 echo $d
done
...
/sys/kernel/iommu_groups/7/devices/0000:00:16.0
/sys/kernel/iommu_groups/8/devices/0000:00:1a.0
/sys/kernel/iommu_groups/9/devices/0000:00:1c.0

I cut several lines; the co-processor should be in group 7, right?

What do you mean by “I cut several lines”?

That still doesn’t explain the weird sudo error, but if that’s the only device in that IOMMU group, then that is not the problem. It still doesn’t bring us any closer to why your machine crashed, though :confused:

It means that the output was much longer than that; see the file.
lspci.txt (4.0 KB)

Okay, sorry, didn’t get that.

Okay, so your Hailo card is in its own IOMMU group, 24.
That’s not the problem then.
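For future reference, the group of a single device can also be read straight from its sysfs symlink, without looping over everything. The PCI address 0000:07:00.0 below is taken from the lspci output above; on a machine without that device the command simply prints nothing:

```shell
dev=0000:07:00.0   # Hailo card's address from the lspci output above
# The iommu_group entry is a symlink whose target ends in the group
# number, e.g. ".../iommu_groups/24"
link=$(readlink "/sys/bus/pci/devices/$dev/iommu_group" 2>/dev/null || true)
echo "${link##*/}"
```

Listing /sys/kernel/iommu_groups/<N>/devices/ for that group number then shows whether anything else shares the group.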

In case there was more confusion:

Did you mean TrueNAS or the VM by “system” here? I understood that your whole TrueNAS crashed.

If “only” the VM crashes: have you rebooted the TrueNAS since the last failed attempt?

You understood correctly: the whole TN crashed. I was connected via the UI and ssh in addition to the system console, and all of them became unresponsive within 2 seconds; I had to power off and reboot.

I think at this point you should probably create a bug ticket at the iXsystems Jira. Include a debug file (you can generate it under advanced settings).

Do you mean here?

It is the ‘Save Debug’ button, right?

After the failure I enabled developer mode to try compiling the driver; will that show up in the debug file? The mode was not enabled when the system crashed. I may reinstall the system and produce the debug file. As I told you, I’m still in the sandbox phase, so the system is still empty.

yes

correct

What do you mean it was not enabled then? Enabling developer mode applies to the current boot environment, and within that boot environment it is a permanent change afaik.

If it was enabled on 25.04.1 and you updated to 25.04.2 now without enabling it again, that should be fine.

Today I first updated to 25.04.2; after that I did the tests with the VM, got the incident, and started this thread. Until then, developer mode was not enabled.

One hour ago I enabled it to try the other way. Do you mean that if I now reinstall the system from scratch, i.e. booting from a USB key and reinstalling, developer mode will still be enabled?

No, if you reinstall, developer mode will be disabled.
I am not sure how iXsystems analyzes the debug files. There may well be a marker in there that causes some bot to say “developer mode active - rejected”, no idea…