Two new Intel servers randomly rebooting

thaddeusf · January 14, 2025, 4:32pm

I have two new Intel 13th Gen, B760 systems - one with i7, 16 core, 64MB, the other with and i5, 14 core, 32MB. Both servers are identically configured with 8x HDDs.

Both servers are randomly rebooting. There is no indication of any errors that I can see. Can someone please guide me of where to look for details of these restarts? Trial and error with BIOS settings has not been successful. I’m running Dragonfish-24.04.2.5.

Thanks.

neofusion · January 14, 2025, 5:30pm

Unexpected reboots are usually HW related.

I would start by testing the RAM with something like memtest86. Let it run overnight.
Next up is to check if it could some form of power issue, it could be bad PSU’s or possibly unclean power in your building. A UPS may help with an unstable power grid.
My last guess would be faulty boot drives.

There are more things that can cause these issues, but the above would be my first things to investigate.

thaddeusf · January 14, 2025, 8:57pm

Thanks for the suggestion. Let me further clarify that the two servers are at two completely physical locations. One server is on an enterprise UPS, the other is not. These two servers independently having the same behavior seems unlikely a hardware flaw. Perhaps a BIOS setting, but…

However, my real question is how to enable logging or have some visibility of what the software is reporting. Thanks.

ChrisRJ · January 14, 2025, 9:16pm

It is very likely hardware. So the very first thing is to understand exactly what is in those machines.

Exact motherboard, CPU, RAM, PSU, disks, case, fans please.

thaddeusf · January 19, 2025, 6:07pm

Hello. I’ve gone ahead and updated both servers to Electric Eel. No changes. As for the hardware, both servers include:

ASUS ROG Strix B760-I Gaming WiFi 6E Intel® B760(13th and 12th Gen) LGA 1700 mini-ITX motherboard
One server includes:
Intel Core i5-13500 Desktop Processor 14 cores
The other includes
Intel Core i7-13700 Desktop Processor 16 cores
Both
CORSAIR VENGEANCE DDR5 RAM 64GB (2x32GB) 6000MHz CL40 Intel XMP iCUE Compatible Computer Memory
JONSBO N3 Mini-ITX NAS PC Chassis, ITX Computer Case, 8HHD+1-SSDD isk Bays NAS Mini Aluminum with Steel Plate Case

The more I watch the behavior of the systems, but more I suspect the B760 network driver. I cannot imagine TrueNAS does not have system logging which would help diagnose where the problem originates from. Can anyone point me to this?

Thanks

ChrisRJ · January 20, 2025, 5:50pm

We still don’t know about PSUs and drives.

In my experience logging is not the most promising approach for this kind of issues. It may help, but other approaches are more promising.

“Random reboots” is extremely vague. The first thing I would do is starting some kind of journal. What did you do when it happened? Were other consumers of electricity turned on or off at the moment? Was a scrub running? Did you transfer data? If yes, how exactly? Is it always the same time? Do you have a large building site in the vicinity?

The hardware is less than ideal for the usage. Is it new? Has it worked better with other OSes? If no, I suggest you install Windows and test it there also.

thaddeusf · January 20, 2025, 7:15pm

Hello ChrisRJ.

Thank you for your reply. Let me first further clarify the situation and then I’ll answer your questions. Both of these servers have been running for many months without any rebooting issues. Meaning in the same environment they were running with a fairly typical configuration with the following services:

TrueNAS electric eel.
SMB service with 8 shares
Plex application (up to date)
syncthing application (up to date). Replicating between the two servers.
tailscale application (up to date)

The servers started showing this rebooting behavior after upgrading past 23.10 (Cobia). There was no hardware change in any way to this hardware.

To answer your questions:

What do you do when it happens? Increasingly it appears to be aligned with slightly heavier network usage. This is not heavy usage by the way, it might be a single client transferring files or watching plex.
…other consumers of electricity…? The server in my location is behind a large UPS and also a large bank of home batteries. The UPS is reporting the AC is healthy and is reporting no errors.
Was a scrub running? I have the default scrub schedule running without errors. There seems to be no correlation in the timing.
Did I transfer data? Syncthing will transfer to the other server when seeing changes to a specific folder, but the reboots are independent of any transfers. I’ve on purpose not changed content to see if there is a correlation.
Do I have a large building…? There is no large building in this area.

You mention ‘hardware is less than ideal’, why is that? The systems are newer, which might be your point, but I would the specifications of these systems are extremely generous for TrueNAS.

Lastly, the HDDs in the locale server are 8 x Western Digital 18TH SAS DC HC550.

I certainly understand need to have clarification as part of the diagnosis of problems, but I’m trying to understand if there are any tools available to show the source of errors on the server.

Thanks.

etorix · January 20, 2025, 7:25pm

Gamer motherboard, not necessarily engineered for stable 24/7 operation.
We’d prefer to see server-grade or workstation-grade hardware, even if older.
And no 2.5G NIC (yes, even Intel!).

SAS drives? You had not described that.
What’s the HBA, which firmware does it run and how is it cooled?

thaddeusf · January 20, 2025, 8:32pm

I’m not a huge fan of gamer systemboards either, but the case (linked above), supports only ITX format. The systemboard supports 2.5Gb, but I have considered installing PCIe 10Gb cards and would immediately do this if I could find evidence the B760 Ethernet was causing the crashes.

The WD HDD are as listed:
Disk Size: 16.37 TiB
Transfer Mode: Auto
Serial: 3WGS5YHJ
Model: WUH721818AL5204
Rotation Rate: 7200 RPM
Type: HDD
HDD Standby: ALWAYS ON
Description: N/A

I don’t know how to retrieve the firmware version from console, but on the drive, it shows FW:232 with a date of 02-SEP-2020, P/N: 0F38953. It is cooled as shown in the case description.

Yes, I know I keep pointing this out, but the server(s) have been running for months without issue, but it appears 24.x introduced the problem.

Looking for tool/procedure to see what the kernel is reporting that causes the reboot.

Thanks

etorix · January 20, 2025, 8:45pm

What’s the HBA?
You can check its firmware with either
sudo sas2flash -list or
sudo sas3flash -list

thaddeusf · January 20, 2025, 9:09pm

 Adapter Selected is a Avago SAS: SAS3008(C0)

    Controller Number              : 0
    Controller                     : SAS3008(C0)
    PCI Address                    : 00:01:00:00
    SAS Address                    : 500605b-b-1218-0145
    NVDATA Version (Default)       : 0e.01.00.08
    NVDATA Version (Persistent)    : 0e.01.00.08
    Firmware Product ID            : 0x2721 (IR)
    Firmware Version               : 16.00.10.00
    NVDATA Vendor                  : LSI
    NVDATA Product ID              : SAS9311-8i
    BIOS Version                   : 08.37.00.00
    UEFI BSD Version               : 18.00.00.00
    FCODE Version                  : N/A
    Board Name                     : SAS9311-8i
    Board Assembly                 : N/A
    Board Tracer Number            : N/A

    Finished Processing Commands Successfully.
    Exiting SAS3Flash.

Fleshmauler · January 20, 2025, 9:35pm

HBA flashed to IR mode instead of IT - not saying it is without a doubt the cause of your problems, but worth flashing to IT asap

richsmif · January 20, 2025, 9:52pm

I have also had this in the last week or so, randomly rebooting. Full hard reset.
(as if hardware failure) Anything from 30 seconds to 5 minutes. Related to data transfer. If i did “nothing”, it would hold. As soon as start moving data, bang reboot. Nothing wrong with the hardware. I’ve reinstalled 24.10.1 and so far its holding. Slowing adding drives back into the setup. Sadly i couldn’t pin point it, i also have a HBA 3008. Something is a foot and not worked it out yet.

neofusion · January 20, 2025, 10:04pm

I recommend you post your own thread and describe your hardware in detail there.
This issue can stem from so many different causes.

As a counter point, I also have 3008 but have never had an unexpected reboot.

thaddeusf · January 20, 2025, 10:21pm

Just for completeness, I changed out the card so now both of the servers are in IT mode with the 3008.

Adapter Selected is a Avago SAS: SAS3008(C0)

    Controller Number              : 0
    Controller                     : SAS3008(C0)
    PCI Address                    : 00:01:00:00
    SAS Address                    : 56c92bf-0-003b-cf08
    NVDATA Version (Default)       : 0e.00.00.00
    NVDATA Version (Persistent)    : 0e.00.00.00
    Firmware Product ID            : 0x2221 (IT)
    Firmware Version               : 14.00.02.00
    NVDATA Vendor                  : INSPUR
    NVDATA Product ID              : INSPUR 3008IT
    BIOS Version                   : 08.33.00.00
    UEFI BSD Version               : 16.00.00.00
    FCODE Version                  : N/A
    Board Name                     : INSPUR 3008IT
    Board Assembly                 : INSPUR
    Board Tracer Number            : QTHC06EA1104A40

    Finished Processing Commands Successfully.
    Exiting SAS3Flash.

thaddeusf · January 20, 2025, 10:22pm

Also, I was reminded another downside of this systemboard - there isn’t an extra PCIe slot available for a 10Gb card.

etorix · January 20, 2025, 10:27pm

That’s bad. The right firmware is P16.00.12.00.
.12 corrects a bug with SSDs, so .10 is probably good enough for HDDs, and IR should be acceptable for ZFS (3% penalty performance over IT according to jgreco). But P14 is not good at all: The driver in TrueNAS wants P16.

Fleshmauler · January 20, 2025, 10:36pm

I just realized you’re running 13th gen cpus… Which intel has acknowledged suffered manufacturing defects on certain batches (yey internal rust) as well as having microcode updates required… I have suffered at least a dozen hours to get my 13900k stable before this was a known fact with manually setting clocks & voltage curves…

Stupid questions time: is bios fully up to date? If you load up a quick windows os & run cinebench (r23 shows issues faster, but 2024 is fine to test, will just take longer to show symptoms)… Do the systems crash?

thaddeusf · January 20, 2025, 10:52pm

Ok. I will put the previous IR card back in until I can get the replacement flashed to P16+

When I rebooted the system, I noticed this message…

pinctrl)alderlake: module verification failed: signature and/or required key missing - tainting kernel

Can’t attach picture, apparently.

Fleshmauler · January 20, 2025, 10:58pm

85% sure that’d be igpu related… Which reminds me that igpu shares system ram.

Any chance that manually setting arc max size a few gigs lower than default max make a difference? Anither thing to test at least. Don’t have command on hand since not at home, but it won’t survive a reboot or deployment/shutdown of a vm; so minimal chance of borking things badly.