Random reboots: best practices to debug/understand what is happening?

Dear all,
I’ve been noticing random spontaneous reboots of my TrueNAS (SCALE Dragonfish-24.04.2), but journalctl shows nothing really relevant just before a reboot: only normal stuff like DHCP offers or containers starting.
The only thing I could spot is that, after the reboot, the log contains these entries:

Oct 15 22:34:18 truenas kernel: BERT: Error records from previous boot:
Oct 15 22:34:18 truenas kernel: [Hardware Error]: event severity: fatal
Oct 15 22:34:18 truenas kernel: [Hardware Error]:  Error 0, type: fatal
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   section_type: Firmware Error Record Reference
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   Firmware Error Record Type: SOC Firmware Error Record Type2
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   Revision: 2
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   Record Identifier: 8f87f311-c998-4d9e-a0c4-6065518c4f6d
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   00000000: 11048101 00000080 00000000 fe013ddc  .............=..
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   00000010: 00000000 f9e7cb6e 0000200c f9e7e665  ....n.... ..e...
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   00000020: 0000200d f9efb172 00002012 f9eff1da  . ..r.... ......
Oct 15 22:34:18 truenas kernel: [Hardware Error]:   00000030: 00002013 f9f08609 00002624 0419b127  . ......$&..'...

While checking the logs for other strange behavior, I also found a couple of segfaults, mainly in python or systemd processes; maybe they are related:

Oct 15 23:07:22 truenas systemd-coredump[16583]: [🡕] Process 16566 (python3) of user 0 dumped core.
                                                 
                                                 Module _message.abi3.so without build-id.
                                                 Module libsystemd.so.0 from deb systemd-252.19-1~deb12u1.amd64
                                                 Module libudev.so.1 from deb systemd-252.19-1~deb12u1.amd64
                                                 Stack trace of thread 16566:
                                                 #0  0x000000000050c830 n/a (python3.11 + 0x10c830)
                                                 #1  0x000000000050beff n/a (python3.11 + 0x10beff)
                                                 #2  0x000000000050b946 n/a (python3.11 + 0x10b946)
                                                 #3  0x000000000050aeae n/a (python3.11 + 0x10aeae)
                                                 #4  0x00000000006303e7 n/a (python3.11 + 0x2303e7)
                                                 #5  0x00000000004fb023 _PyObject_GC_New (python3.11 + 0xfb023)
                                                 #6  0x000000000053aa52 n/a (python3.11 + 0x13aa52)
                                                 #7  0x00000000005313ad _PyEval_EvalFrameDefault (python3.11 + 0x1313ad)
                                                 #8  0x000000000055c931 _PyFunction_Vectorcall (python3.11 + 0x15c931)
                                                 #9  0x0000000000512b14 n/a (python3.11 + 0x112b14)
                                                 #10 0x000000000052f8a2 _PyEval_EvalFrameDefault (python3.11 + 0x12f8a2)
                                                 #11 0x000000000055c931 _PyFunction_Vectorcall (python3.11 + 0x15c931)
                                                 #12 0x000000000052f8a2 _PyEval_EvalFrameDefault (python3.11 + 0x12f8a2)
                                                 #13 0x000000000052360b PyEval_EvalCode (python3.11 + 0x12360b)
                                                 #14 0x0000000000647497 n/a (python3.11 + 0x247497)
                                                 #15 0x0000000000644d4f n/a (python3.11 + 0x244d4f)
                                                 #16 0x000000000056f01d PyRun_StringFlags (python3.11 + 0x16f01d)
                                                 #17 0x000000000063e4d6 PyRun_SimpleStringFlags (python3.11 + 0x23e4d6)
                                                 #18 0x000000000064fb54 Py_RunMain (python3.11 + 0x24fb54)
                                                 #19 0x00000000006275c7 Py_BytesMain (python3.11 + 0x2275c7)
                                                 #20 0x00007f2f4d3a824a __libc_start_call_main (libc.so.6 + 0x2724a)
                                                 #21 0x00007f2f4d3a8305 __libc_start_main_impl (libc.so.6 + 0x27305)
                                                 #22 0x0000000000627461 _start (python3.11 + 0x227461)
                                                 ELF object binary architecture: AMD x86-64
...

Any ideas what it could be?
What are the best practices to further investigate?

Thanks in advance,
Tent.

My HW → i3-N305 with 8 cores
MemTotal: 32642964 kB
igc 0000:01:00.0 enp1s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
2x Micron/Crucial Technology P2 NVMe PCIe SSD (rev 01) for data
1x SATA Samsung 1TB EVO as boot drive
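
For anyone wanting to check whether the segfaults cluster on one program or are spread across unrelated ones (the latter would hint more at hardware than at a single buggy package), here is a minimal Python sketch that tallies core dumps per process name. It assumes the journal has been saved as plain text lines; the function name `tally_coredumps` and the regular expression are illustrative and only match the message format shown above.

```python
import re
from collections import Counter

# Matches the first line of a systemd-coredump journal entry, e.g.
# "... systemd-coredump[16583]: [..] Process 16566 (python3) of user 0 dumped core."
COREDUMP = re.compile(
    r"systemd-coredump\[\d+\]: .*Process \d+ \(([^)]+)\) of user \d+ dumped core"
)


def tally_coredumps(journal_lines):
    """Count core dumps per process name in an iterable of journal lines."""
    counts = Counter()
    for line in journal_lines:
        match = COREDUMP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines taken from the logs in this thread; only the first is a core dump.
sample = [
    "Oct 15 23:07:22 truenas systemd-coredump[16583]: [🡕] Process 16566 (python3) of user 0 dumped core.",
    "Oct 15 22:34:18 truenas kernel: BERT: Error records from previous boot:",
]

coredump_counts = tally_coredumps(sample)
for name, count in coredump_counts.most_common():
    print(f"{name}: {count}")
```

On a real system, the same function could be fed the lines of a saved journal file instead of `sample`.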

Based on a Google search of the Record Identifier (8f87f311-c998-4d9e-a0c4-6065518c4f6d), it appears the most likely culprit is a memory issue. You may want to run memtest or swap out the RAM for known-good modules and see if the problem persists.
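
If it helps with searching, here is a minimal Python sketch that pulls the severity, section type and Record Identifier out of the `[Hardware Error]` lines. It assumes the kernel log was saved as text (for example from `journalctl -k`); the function name `summarize_hardware_errors` is made up for illustration, and the parsing only covers the line format shown in this thread. It does not interpret the identifier, it only extracts it.

```python
import re

# Captures the text after "kernel: [Hardware Error]:" on a journal line.
HW_LINE = re.compile(r"kernel: \[Hardware Error\]:\s*(.*)$")


def summarize_hardware_errors(journal_lines):
    """Collect a few fields from [Hardware Error] lines in journal text."""
    summary = {"severity": None, "section_type": None, "record_id": None, "lines": 0}
    for line in journal_lines:
        match = HW_LINE.search(line)
        if not match:
            continue  # not a hardware error line
        summary["lines"] += 1
        text = match.group(1)
        if text.startswith("event severity:"):
            summary["severity"] = text.split(":", 1)[1].strip()
        elif text.startswith("section_type:"):
            summary["section_type"] = text.split(":", 1)[1].strip()
        elif text.startswith("Record Identifier:"):
            summary["record_id"] = text.split(":", 1)[1].strip()
    return summary


# Lines copied from the first post (hex dump omitted).
sample = [
    "Oct 15 22:34:18 truenas kernel: BERT: Error records from previous boot:",
    "Oct 15 22:34:18 truenas kernel: [Hardware Error]: event severity: fatal",
    "Oct 15 22:34:18 truenas kernel: [Hardware Error]:  Error 0, type: fatal",
    "Oct 15 22:34:18 truenas kernel: [Hardware Error]:   section_type: Firmware Error Record Reference",
    "Oct 15 22:34:18 truenas kernel: [Hardware Error]:   Firmware Error Record Type: SOC Firmware Error Record Type2",
    "Oct 15 22:34:18 truenas kernel: [Hardware Error]:   Revision: 2",
    "Oct 15 22:34:18 truenas kernel: [Hardware Error]:   Record Identifier: 8f87f311-c998-4d9e-a0c4-6065518c4f6d",
]

hw_summary = summarize_hardware_errors(sample)
print(hw_summary)
```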

Just a quick confirmation: memtest found a lot of errors on the 32GB DDR5-4800 SODIMM from Crucial… will file an RMA… hope these things do not happen often…

This is why ECC memory exists…
Also check RAM cooling in your N305 system. There was a thread about that recently.

Well… if the memory had been ECC, it would just have “masked” the fact that it was defective even more, no?

And about the cooling: do you mean that when the temperature rises it might cause memory errors? That would be even worse, no?

It would detect and log errors, but not crash the system.

Well, yes, fair… Anyway, from my understanding, if I buy ECC DDR5 it would not be supported by my system because it’s an i3 Alder Lake, correct? I’d need to have an i5 (but that one would run even hotter)…

Also, since noise should not be a big deal, I could put a nice fan on top of the fans of the system if that is beneficial, ideally one with controllable speed. What fans do you guys usually use?

Another, more curious question about DDR5 SODIMMs: is it normal that I can’t find 64GB ones? The maximum seems to be 48GB. (I have only one memory slot, so an upgrade to 64GB would be sweet…)

Intel states that the max memory size for the N305 is 16GB.
While that doesn’t necessarily mean bigger won’t work, you’re outside the official spec and relying on your motherboard to deal with it well.

Really? The vendor of the full system states 32GB, and it works…

Actually, my question was more like: is it normal that neither Amazon nor others list 64GB DDR5 SODIMMs? I guess they are still not on the market?

Back to the heat topic: I could not find the thread about Alder Lake i3 and heat issues, but do you think heat could damage the DDR5 or cause the errors?

The thread was about confined DDR5 memory modules getting hot and causing errors. When cooled sufficiently or downclocked the errors went away.

As reported in the thread @Okedokey references, I was having similar issues with random reboots and escalating instability.

When running DDR5 RAM at 4800MHz, the system would fail memtest. I replaced the RAM, and the replacement RAM also failed memtest. I returned the original N100 motherboard and replaced it with an N100/i3-N305 motherboard. It behaves exactly the same.

For both motherboards, capping the max memory speed in BIOS to 4600MHz allowed memtest to run for >24 hours without error.

OMG, that sounds really bad if true… aaaand: in my BIOS I see no trace of settings to change the DDR5 frequency or memory voltage…
I also have a TOPTON motherboard, integrated in a TOPTON case that is one complete heat dissipator, similar to this one → The Fanless King Intel Core i5 6x 2.5GbE System Review by CWWK

Update: it seems that if I’d like to tweak things in the BIOS, I will need to flash it with an “unofficial” version (check here: CWWK i5-1235U 6 port i226 report | ServeTheHome Forums). That sounds quite scary, and I’m also unsure because the CPU in that thread is a different/older one, not Alder Lake from what I understand… not sure I will risk it…