HP ProLiant MicroServer Gen8 unstable on v25, OK on v24 and Debian 12: kernel issue?

TL;DR: An HPE ProLiant MicroServer Gen8 crashes with “Unrecoverable System Error (NMI)” on TrueNAS SCALE v25.04 but not on v24.10 or Debian 12. Different kernel = cause?

Hello,

For some time now I’ve been trying to get TrueNAS SCALE working on an HPE ProLiant MicroServer Gen8 (CPU: E3-1220L V2, RAM: 16GB PC3L 12800E, Memtest86+ OK) with an extra PCI Express 9211-8i SAS card (to extend the existing storage provided by the integrated HPE Dynamic Smart Array B120i controller).

I get “Unrecoverable System Error (NMI)” errors and the server reboots. The symptoms are:

The server reboots, the hardware “Health LED” blinks red, and the iLO’s “Integrated Management Log” (BMC tool) page says:

  • Class: System Error, Description: Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible
  • Class: OS, Description: User Initiated NMI Switch
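
(Side note: the same events can usually also be read from the OS side through iLO’s IPMI interface, if it is enabled. A minimal sketch, assuming ipmitool is installed; on iLO machines the System Event Log largely mirrors the IML:)

```
# Load the IPMI kernel drivers, then dump the System Event Log (SEL)
modprobe ipmi_si ipmi_devintf
ipmitool sel elist
```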

I’ll detail all my incremental attempts below, but I am at a point where it looks like it works with v24.10 “Electric Eel” but fails with v25.04 “Fangtooth”.

On my last attempt, I managed to have it running for a full week (without any crash/reboot) on TrueNAS-SCALE v24.10.2.4 (Linux truenas 6.6.44-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 6 20:07:31 UTC 2025 x86_64 GNU/Linux) (thanks to CertainBumblebee769 on Reddit),
whereas the previous attempt on TrueNAS-SCALE v25.04.2.3 (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 20 13:31:09 UTC 2025 x86_64 GNU/Linux) failed after 4 days with a crash/reboot.

It took me a while because I suspected my PCI Express 9211-8i SAS card to be faulty (I also had to downgrade its firmware), or the SSD support, so I’ve tested with/without the card, with/without SSDs, on vanilla Debian (kernel v6.1.140-1) and on TrueNAS v24/v25.

Those 2 tests are without my SAS card; I’m currently testing with the SAS card, then I’ll add the storage HDDs.

If it works fine, I’ll have built a working v24.10 setup, but I’d like to have a v25.04 one :sweat_smile:.

Let’s say it works: the issue would then most likely be kernel-related? Is it possible to run TrueNAS SCALE v25.04 on a v6.6 kernel? Or on a version between 6.6 and 6.12 (to find the latest working one)?

Thanks

Recap:

  • TrueNAS SCALE v25.04.2.3 runs v6.12.15 (not working)
  • TrueNAS SCALE v24.10.2.4 runs v6.6.44 (working)
  • Debian v12.11 runs v6.1.140-1 (working)

Server firmware/BIOS are up-to-date:

  • System ROM: J06 04/04/2019
  • System ROM Date: 04/04/2019
  • Backup System ROM: J06 11/02/2015
  • iLO Firmware Version: 2.82 Feb 06 2023
  • Server Platform Services (SPS) Firmware: 2.2.0.31.2
  • System Programmable Logic Device: Version 0x06
  • System ROM Bootblock: 02/04/2012
  • Embedded Flash/SD-CARD: Controller firmware revision 2.10.00
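
(Those values are from the iLO pages; to cross-check them from a running Linux system, a quick sketch using the standard dmidecode tool, run as root:)

```
dmidecode -s bios-version       # System ROM family, expected: J06
dmidecode -s bios-release-date  # expected: 04/04/2019
```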
Here are all my incremental attempts (:wrench: highlights the change):
  • Test #1
    • Setup:
      • One 3.5" HDD on B120i
      • No 9211-8i PCIe SAS card
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s HDD
    • Duration: 3 days
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #2
    • Setup:
      • One 2.5" (:wrench:) HDD on B120i
      • No 9211-8i PCIe SAS card
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s HDD
    • Duration: 3 days
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #3
    • Setup:
      • One 2.5" HDD on B120i
      • 9211-8i PCIe SAS card inserted (:wrench:)
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s HDD
    • Duration: 3 days
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #4
    • Setup:
      • One 2.5" HDD on B120i
      • 9211-8i PCIe SAS card inserted
      • One 3.5" HDD powered and SATA-connected to the PCIe SAS card (:wrench:)
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s HDD
    • Duration: 3 days
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #5
    • Setup:
      • One 2.5" SSD (:wrench:) on B120i
      • 9211-8i PCIe SAS card inserted
      • One 3.5" HDD powered and SATA-connected to the PCIe SAS card
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s SSD
    • Duration: 3 days
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #6
    • Setup:
      • One 2.5" SSD on B120i
      • 9211-8i PCIe SAS card inserted
      • Four (:wrench:) 3.5" HDDs powered and SATA-connected to the PCIe SAS card
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s SSD
    • Duration: Was OK while idle, but failed when it started to process data on those HDDs (disk I/O)
    • Verdict: :red_circle: kernel errors (“kernel: DMAR: ERROR: DMA PTE for vPFN 0xf1f80 already set (to f1f80003 not 120d5c001)”), No reboot
  • Test #6a
    • Setup:
      • One 2.5" SSD on B120i
      • 9211-8i PCIe SAS card inserted
      • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card
      • Debian 12.11 (kernel v6.1.140-1) installed on B120i’s SSD
      • Added intel_iommu=off to GRUB’s GRUB_CMDLINE_LINUX_DEFAULT (source) (:wrench:) (see the GRUB sketch after this test list)
    • Duration: (Sadly, I didn’t write it down)
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #7
    • Setup:
      • Two 2.5" SSDs on B120i
      • 9211-8i PCIe SAS card inserted
      • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card
      • TrueNAS SCALE v25.04.2.3 (:wrench:) (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 20 13:31:09 UTC 2025 x86_64 GNU/Linux) installed on SSDs
      • ZFS Data-pool on the 9211-8i HDDs (:wrench:)
    • Duration: 42 hours
    • Verdict: :red_circle: NMI errors, Server reboot
  • Test #8
    • Setup:
      • One (:wrench:) SSD on B120i
      • 9211-8i PCIe SAS card inserted
      • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card
      • TrueNAS-SCALE v25.04.2.3 (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 20 13:31:09 UTC 2025 x86_64 GNU/Linux) installed on SSD
      • ZFS Data-pool on 4 9211-8i HDDs
      • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied (:wrench:)
    • Duration: 19 hours
    • Verdict: :red_circle: NMI errors, Server reboot
  • Test #9
    • Setup:
      • One SSD on B120i
      • No 9211-8i PCIe SAS card (:wrench:)
      • Four 3.5" HDDs powered but not SATA-connected (:wrench:)
      • TrueNAS-SCALE v25.04.2.3 (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 20 13:31:09 UTC 2025 x86_64 GNU/Linux) installed on SSD
      • ZFS Data-pool on 4 HDDs, but offline
      • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied
    • Duration: 4 days and 5 hours
    • Verdict: :red_circle: NMI errors, Server reboot
  • Test #10
    • Setup:
      • Two SSDs on B120i
      • No 9211-8i PCIe SAS card
      • No HDD
      • TrueNAS-SCALE v24.10.2.4 (Linux truenas 6.6.44-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 6 20:07:31 UTC 2025 x86_64 GNU/Linux) installed on SSDs
    • Duration: 7 days
    • Verdict: :green_circle: No crash, no reboot, no NMI error
  • Test #11
    • Setup:
      • Two SSDs on B120i
      • 9211-8i PCIe SAS card inserted (:wrench:)
      • No HDD
      • TrueNAS-SCALE v24.10.2.4 (Linux truenas 6.6.44-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 6 20:07:31 UTC 2025 x86_64 GNU/Linux) installed on SSDs
      • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied (:wrench:)
    • Duration: ? (pending)
    • Verdict: ? (pending)
  • Test #12
    • Setup:
      • Two SSDs on B120i
      • 9211-8i PCIe SAS card inserted
      • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card (:wrench:)
      • TrueNAS-SCALE v24.10.2.4 (Linux truenas 6.6.44-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 6 20:07:31 UTC 2025 x86_64 GNU/Linux) installed on SSDs
      • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied
    • Duration: ? (not started)
    • Verdict: ? (not started)
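
For reference, a minimal sketch of the GRUB change described in Test #6a (assuming a stock, single-line GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub):

```
# Append intel_iommu=off to the default kernel command line,
# then regenerate /boot/grub/grub.cfg
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 intel_iommu=off"/' /etc/default/grub
sudo update-grub
# After the next reboot, confirm it reached the kernel:
grep -o 'intel_iommu=[a-z]*' /proc/cmdline
```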

Hi,

I have a Gen8 MicroServer with the same Celeron and 16GB of ECC memory; I do not remember which brand/type, likely Crucial or Kingston. I bought it later, so it is not HPE.

The firmware versions you have listed are the newest, AFAIK.

It boots via an SD card with GRUB pointing to an SSD on the ODD SATA port. 4 HDDs are connected to the onboard SATA in AHCI mode; no PCIe card in use.

I updated it to the newest version of Fangtooth, 25.04.2.4, just now, from 25.04.1.

It has always been a kind of cold-storage backup system for me, so it has never been running for several days on end; although it is rather old hardware, it has never clocked that kind of uptime.

Not sure; wouldn’t a kernel issue prevent it from working at all?

Edit:

Why was that? Usually one has to make sure to use the latest versions?

And those types of cards can get quite hot; they do need airflow.

My 2x 8GB of RAM are SK Hynix ones, with ECC.

It looks like we have the same setup, except for the boot device, where I use the B120i.

Because it was running v7.39.02.00, which is said to be incompatible with the Gen8; v7.39.00.00 seems to be the last good one.

This I don’t know; your assumption looks fairly right, but I don’t really know. I guess some issues might only trigger in special/rare cases?

Ah, strictly speaking, you do not need the BIOS, just the HBA flashed with the IT-mode P.20 firmware. The BIOS is for configuring the disks, but we want the HBA to just give TrueNAS full access so that it can do its thing. I think I remember something about some P.20 versions being buggy.


Or just under load, if the drivers are buggy?

My Gen8 has been running for 25 hours now without any errors or reboots.

Quick update:

My 11th test ended without issue:

Test #11

  • Two SSDs on B120i
  • 9211-8i PCIe SAS card inserted (:wrench:)
  • No HDD
  • TrueNAS-SCALE v24.10.2.4 (Linux truenas 6.6.44-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 6 20:07:31 UTC 2025 x86_64 GNU/Linux) installed on SSDs
  • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied (:wrench:)

From 2025-09-11 22:59:11 to 2025-09-22 10:06

Duration: 10 days and 11 hours

Verdict: :green_circle: No crash, no reboot, no NMI error
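
(For the record, the applied option can be double-checked on the running system. A small sketch; it assumes system.advanced.config exposes the kernel_extra_options field and that jq is available, which seems to be the case on SCALE:)

```
# What the kernel actually booted with:
grep -o 'intel_iommu=[a-z]*' /proc/cmdline
# What the TrueNAS middleware has stored:
midclt call system.advanced.config | jq -r '.kernel_extra_options'
```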

My last test (12th) is still running (for 27 days now):

Test #12

  • Two SSDs on B120i
  • 9211-8i PCIe SAS card inserted
  • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card (:wrench:)
  • TrueNAS-SCALE v24.10.2.4 (Linux truenas 6.6.44-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Aug 6 20:07:31 UTC 2025 x86_64 GNU/Linux) installed on SSDs
  • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied
    (both “intel_iommu=on” and “intel_iommu=off” are present, in this order, in /boot/grub/grub.cfg)

From 2025-09-22 12:07:30 to ???

I’ve found this bug report on Debian for the ProLiant Gen8:

It suggests adding intel_idle.max_cstate=2 to the kernel command line.
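
(Before and after applying it, the C-states exposed by the intel_idle driver can be inspected through sysfs; a sketch, assuming the usual layout:)

```
# List the idle states the driver offers on CPU 0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
# Show the current intel_idle max_cstate cap
cat /sys/module/intel_idle/parameters/max_cstate
```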

I’ll try upgrading my v24.10 to v25.04 (or v25.10) and test it.

Sorry I didn’t spot your posts on this earlier! This problem’s been around for quite a while now, and is well known.

You can fix it by applying the “intel_iommu=off” flag to the kernel, or turning off a couple of Intel virtualisation features in the BIOS:

My “Test #12” setup (running ElectricEel-24.10.2.4, with intel_iommu=off visible in /proc/cmdline) had been running for about 89 days, so I’ve updated it to v25.04.2.6 using the WebUI.

Test #13

  • Two SSDs on B120i

  • 9211-8i PCIe SAS card inserted

  • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card

  • TrueNAS-SCALE v25.04.2.6 (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Oct 29 14:40:06 UTC 2025 x86_64 GNU/Linux) installed on SSDs (:wrench:)

  • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied
    (both “intel_iommu=on” and “intel_iommu=off” are present, in this order, in /boot/grub/grub.cfg)

So far it’s been running fine for 2 days :crossed_fingers:.

I need to double check, but I’m pretty sure I did not disable “Intel Virtualisation Technology” and “Intel VT-d” in the BIOS.

There is also another possible solution: setting intel_idle.max_cstate to 2.

Source:

Test #13 (TrueNAS-SCALE v25.04.2.6) crashed after only 2 days and 8 hours.

From 2025-12-20 17:34:37 to 2025-12-23 11:54:40

Duration: 2 days and 8 hours

Verdict: :red_circle: NMI errors, Server reboot


So I’m trying the intel_idle.max_cstate=2 fix:

Test #14

  • Two SSDs on B120i

  • 9211-8i PCIe SAS card inserted

  • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card

  • TrueNAS-SCALE v25.04.2.6 (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Oct 29 14:40:06 UTC 2025 x86_64 GNU/Linux) installed on SSDs

  • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off intel_idle.max_cstate=2"}' applied (:wrench:)
    (both “intel_iommu=on” and “intel_iommu=off” are present, in this order, in /boot/grub/grub.cfg)

No luck: Test #14 (TrueNAS-SCALE v25.04.2.6) rebooted. iLO mentions a “User Initiated NMI Switch” but no “Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible”.

Test #14 (TrueNAS-SCALE v25.04.2.6) crashed after only 3 days and 4 hours.

From 2025-12-23 02:09:31 to 2025-12-26 06:25:00

Duration: 3 days and 4 hours

Verdict: :red_circle: Server reboot


I am now testing with intel_iommu=off but without the default intel_iommu=on (from /etc/default/grub.d/truenas.cfg) and without intel_idle.max_cstate=2 (which might not be the right value anyway, as grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name only lists POLL, C1, C1E, C3 and C6 for my Intel Xeon E3-1220L V2 CPU):

Test #15

  • Two SSDs on B120i

  • 9211-8i PCIe SAS card inserted

  • Four 3.5" HDDs powered and SATA-connected to the PCIe SAS card

  • TrueNAS-SCALE v25.04.2.6 (Linux truenas 6.12.15-production+truenas #1 SMP PREEMPT_DYNAMIC Wed Oct 29 14:40:06 UTC 2025 x86_64 GNU/Linux) installed on SSDs

  • Fix midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}' applied (:wrench:)
    (“intel_iommu=on” removed from /boot/grub/grub.cfg using sed -i -E 's,(\Wlinux\W/ROOT/.*) intel_iommu=on(.*$),\1\2,' /boot/grub/grub.cfg)
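
A check to run after the reboot (a sketch; note that, AFAIK, duplicate intel_iommu= occurrences are parsed in order so the last one wins, and that /boot/grub/grub.cfg is regenerated on updates, meaning this manual sed edit will not survive an upgrade):

```
# Confirm that only intel_iommu=off reached the kernel this time
tr ' ' '\n' </proc/cmdline | grep '^intel_iommu='
```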