Dear all,
I’ve been noticing random spontaneous reboots of my TrueNAS (SCALE Dragonfish-24.04.2), but journalctl shows nothing really relevant just before each reboot, only normal activity such as DHCP offers or containers starting.
The only thing I could spot is that after the reboot there are these traces:
Oct 15 22:34:18 truenas kernel: BERT: Error records from previous boot:
Oct 15 22:34:18 truenas kernel: [Hardware Error]: event severity: fatal
Oct 15 22:34:18 truenas kernel: [Hardware Error]: Error 0, type: fatal
Oct 15 22:34:18 truenas kernel: [Hardware Error]: section_type: Firmware Error Record Reference
Oct 15 22:34:18 truenas kernel: [Hardware Error]: Firmware Error Record Type: SOC Firmware Error Record Type2
Oct 15 22:34:18 truenas kernel: [Hardware Error]: Revision: 2
Oct 15 22:34:18 truenas kernel: [Hardware Error]: Record Identifier: 8f87f311-c998-4d9e-a0c4-6065518c4f6d
Oct 15 22:34:18 truenas kernel: [Hardware Error]: 00000000: 11048101 00000080 00000000 fe013ddc .............=..
Oct 15 22:34:18 truenas kernel: [Hardware Error]: 00000010: 00000000 f9e7cb6e 0000200c f9e7e665 ....n.... ..e...
Oct 15 22:34:18 truenas kernel: [Hardware Error]: 00000020: 0000200d f9efb172 00002012 f9eff1da . ..r.... ......
Oct 15 22:34:18 truenas kernel: [Hardware Error]: 00000030: 00002013 f9f08609 00002624 0419b127 . ......$&..'...
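In case it helps anyone compare against their own system, here is a minimal sketch of how the `[Hardware Error]` lines above could be pulled out of a saved log and reduced to their key/value fields. It assumes the kernel log was exported as plain text (for example with `journalctl -k -b`); the names `extract_hardware_errors`, `summarize` and `SAMPLE` are my own and not part of any TrueNAS tool, and it does not attempt to decode the hex dump.

```python
import re

# Matches lines such as:
#   Oct 15 22:34:18 truenas kernel: [Hardware Error]: event severity: fatal
HW_ERROR = re.compile(r"kernel: \[Hardware Error\]:\s*(?P<body>.*)$")

# A few lines copied from the log above, used only as sample input.
SAMPLE = """\
Oct 15 22:34:18 truenas kernel: BERT: Error records from previous boot:
Oct 15 22:34:18 truenas kernel: [Hardware Error]: event severity: fatal
Oct 15 22:34:18 truenas kernel: [Hardware Error]: section_type: Firmware Error Record Reference
Oct 15 22:34:18 truenas kernel: [Hardware Error]: Record Identifier: 8f87f311-c998-4d9e-a0c4-6065518c4f6d
Oct 15 22:34:18 truenas kernel: [Hardware Error]: 00000000: 11048101 00000080 00000000 fe013ddc .............=..
"""


def extract_hardware_errors(log_text):
    """Return the text after '[Hardware Error]:' for each matching line."""
    records = []
    for line in log_text.splitlines():
        match = HW_ERROR.search(line)
        if match:
            records.append(match.group("body").strip())
    return records


def summarize(records):
    """Collect 'key: value' fields, skipping hex dump rows."""
    fields = {}
    for body in records:
        # Hex dump rows start with an 8-digit offset like '00000000:'.
        if re.match(r"^[0-9a-f]{8}:", body):
            continue
        key, sep, value = body.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    for key, value in summarize(extract_hardware_errors(SAMPLE)).items():
        print(f"{key}: {value}")
```

Running the same functions over the full exported log after each reboot would make it easy to see whether the severity, record type and identifier are the same every time.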
While checking the logs for other strange behavior, I also found a couple of segfaults, mainly in python or systemd processes; maybe they are related:
Oct 15 23:07:22 truenas systemd-coredump[16583]: [🡕] Process 16566 (python3) of user 0 dumped core.
Module _message.abi3.so without build-id.
Module libsystemd.so.0 from deb systemd-252.19-1~deb12u1.amd64
Module libudev.so.1 from deb systemd-252.19-1~deb12u1.amd64
Stack trace of thread 16566:
#0 0x000000000050c830 n/a (python3.11 + 0x10c830)
#1 0x000000000050beff n/a (python3.11 + 0x10beff)
#2 0x000000000050b946 n/a (python3.11 + 0x10b946)
#3 0x000000000050aeae n/a (python3.11 + 0x10aeae)
#4 0x00000000006303e7 n/a (python3.11 + 0x2303e7)
#5 0x00000000004fb023 _PyObject_GC_New (python3.11 + 0xfb023)
#6 0x000000000053aa52 n/a (python3.11 + 0x13aa52)
#7 0x00000000005313ad _PyEval_EvalFrameDefault (python3.11 + 0x1313ad)
#8 0x000000000055c931 _PyFunction_Vectorcall (python3.11 + 0x15c931)
#9 0x0000000000512b14 n/a (python3.11 + 0x112b14)
#10 0x000000000052f8a2 _PyEval_EvalFrameDefault (python3.11 + 0x12f8a2)
#11 0x000000000055c931 _PyFunction_Vectorcall (python3.11 + 0x15c931)
#12 0x000000000052f8a2 _PyEval_EvalFrameDefault (python3.11 + 0x12f8a2)
#13 0x000000000052360b PyEval_EvalCode (python3.11 + 0x12360b)
#14 0x0000000000647497 n/a (python3.11 + 0x247497)
#15 0x0000000000644d4f n/a (python3.11 + 0x244d4f)
#16 0x000000000056f01d PyRun_StringFlags (python3.11 + 0x16f01d)
#17 0x000000000063e4d6 PyRun_SimpleStringFlags (python3.11 + 0x23e4d6)
#18 0x000000000064fb54 Py_RunMain (python3.11 + 0x24fb54)
#19 0x00000000006275c7 Py_BytesMain (python3.11 + 0x2275c7)
#20 0x00007f2f4d3a824a __libc_start_call_main (libc.so.6 + 0x2724a)
#21 0x00007f2f4d3a8305 __libc_start_main_impl (libc.so.6 + 0x27305)
#22 0x0000000000627461 _start (python3.11 + 0x227461)
ELF object binary architecture: AMD x86-64
...
Any ideas what it could be?
What are the best practices to investigate this further?
Thanks in advance,
Tent.
My HW → i3-N305 with 8 cores
MemTotal: 32642964 kB
igc 0000:01:00.0 enp1s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
2x Micron/Crucial Technology P2 NVMe PCIe SSD (rev 01) for data
1x SATA Samsung EVO 1 TB as boot drive