I have two new Intel 13th Gen, B760 systems - one with an i7 (16 cores, 64 GB RAM), the other with an i5 (14 cores, 32 GB RAM). Both servers are identically configured with 8x HDDs.
Both servers are randomly rebooting. There is no indication of any errors that I can see. Can someone please guide me on where to look for details of these restarts? Trial and error with BIOS settings has not been successful. I’m running Dragonfish-24.04.2.5.
I would start by testing the RAM with something like memtest86. Let it run overnight.
Next up is to check whether it could be some form of power issue: a bad PSU, or possibly unclean power in your building. A UPS may help with an unstable power grid.
My last guess would be faulty boot drives.
There are more things that can cause these issues, but the above would be my first things to investigate.
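If you want to rule out the drives (boot or data) from the shell, SMART self-tests are a quick sanity check. The device names below are only examples; adjust them to your system:
smartctl -a /dev/sda            # overall health, firmware and error counters
smartctl -t long /dev/sda       # start an extended self-test
smartctl -l selftest /dev/sda   # read the result once it finishes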
Thanks for the suggestion. Let me further clarify that the two servers are at two completely separate physical locations. One server is on an enterprise UPS, the other is not. These two servers independently showing the same behavior makes a hardware flaw seem unlikely. Perhaps a BIOS setting, but…
However, my real question is how to enable logging or have some visibility of what the software is reporting. Thanks.
JONSBO N3 Mini-ITX NAS PC Chassis, ITX Computer Case, 8 HDD + 1 SSD Disk Bays, NAS Mini Aluminum with Steel Plate Case
The more I watch the behavior of the systems, the more I suspect the B760 network driver. I cannot imagine TrueNAS does not have system logging which would help diagnose where the problem originates. Can anyone point me to this?
In my experience logging is not the most promising approach for this kind of issue. It may help, but I would pursue other angles first.
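That said, if you do want to check, SCALE is Debian-based and keeps a systemd journal, so from the shell something like this should show whatever made it to disk before the last reset:
journalctl --list-boots    # find the boot before the crash
journalctl -b -1 -k        # kernel messages from the previous boot
journalctl -b -1 -p err    # only messages at error priority or worse
Be aware that a hard reset often leaves nothing useful, because the panic never gets flushed to the journal. In that case the netconsole module can stream kernel messages to another machine over UDP; the IPs, interface name and MAC below are placeholders you would need to adjust:
modprobe netconsole netconsole=6665@192.168.1.50/eno1,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
nc -u -l 6666    # run this on the receiving machine (192.168.1.10 in this example)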
“Random reboots” is extremely vague. The first thing I would do is start some kind of journal. What did you do when it happened? Were other consumers of electricity turned on or off at that moment? Was a scrub running? Did you transfer data? If yes, how exactly? Is it always at the same time? Do you have a large building site in the vicinity?
The hardware is less than ideal for this usage. Is it new? Has it worked better with other OSes? If not, I suggest you install Windows and test it there as well.
Thank you for your reply. Let me first clarify the situation further, and then I’ll answer your questions. Both of these servers had been running for many months without any rebooting issues, in the same environment, with a fairly typical configuration and the following services:
TrueNAS Electric Eel.
SMB service with 8 shares
Plex application (up to date)
Syncthing application (up to date), replicating between the two servers.
Tailscale application (up to date)
The servers started showing this rebooting behavior after upgrading past 23.10 (Cobia). There was no hardware change of any kind.
To answer your questions:
What do you do when it happens? Increasingly it appears to be aligned with slightly heavier network usage. This is not heavy usage, by the way; it might be a single client transferring files or watching Plex.
…other consumers of electricity…? The server at my location is behind a large UPS and also a large bank of home batteries. The UPS reports that the AC is healthy and shows no errors.
Was a scrub running? I have the default scrub schedule running without errors. There seems to be no correlation in the timing.
Did I transfer data? Syncthing transfers to the other server when it sees changes to a specific folder, but the reboots are independent of any transfers. I’ve deliberately left content unchanged to see if there is a correlation.
Do I have a large building site…? There is no large building site in this area.
You mention the ‘hardware is less than ideal’; why is that? The systems are newer, which might be your point, but I would say the specifications of these systems are extremely generous for TrueNAS.
Lastly, the HDDs in the local server are 8 x Western Digital 18TB SAS DC HC550.
I certainly understand the need for clarification as part of diagnosing problems, but I’m trying to understand whether there are any tools available to show the source of errors on the server.
Gamer motherboard, not necessarily engineered for stable 24/7 operation.
We’d prefer to see server-grade or workstation-grade hardware, even if older.
And no 2.5G NIC (yes, even Intel!).
SAS drives? You had not mentioned that.
What’s the HBA, which firmware does it run and how is it cooled?
I’m not a huge fan of gamer system boards either, but the case (linked above) supports only the ITX format. The board’s onboard NIC is 2.5 Gb, but I have considered installing PCIe 10 Gb cards and would do so immediately if I could find evidence the B760 Ethernet was causing the crashes.
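Would something like this be the right way to check whether the onboard NIC is throwing errors? I believe the 2.5 Gb port on these boards is an Intel I225/I226 handled by the igc driver, and the interface name here is just a guess on my part:
ethtool -i enp2s0        # which driver is bound and its firmware
ip -s link show enp2s0   # RX/TX error and drop counters
dmesg | grep -i igc      # any resets or errors logged by the driver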
The WD HDDs are as listed:
Disk Size: 16.37 TiB
Transfer Mode: Auto
Serial: 3WGS5YHJ
Model: WUH721818AL5204
Rotation Rate: 7200 RPM
Type: HDD
HDD Standby: ALWAYS ON
Description: N/A
I don’t know how to retrieve the firmware version from the console, but the drive label shows FW:232 with a date of 02-SEP-2020, P/N: 0F38953. It is cooled as shown in the case description.
Yes, I know I keep pointing this out, but the server(s) have been running for months without issue; it appears 24.x introduced the problem.
Looking for a tool/procedure to see what the kernel is reporting at the moment of the reboot.
I have also had this in the last week or so: random reboots, a full hard reset (as if a hardware failure), anywhere from 30 seconds to 5 minutes in. It is related to data transfer: if I did “nothing”, it would hold; as soon as I start moving data, bang, reboot. Nothing wrong with the hardware. I’ve reinstalled 24.10.1 and so far it’s holding, and I’m slowly adding drives back into the setup. Sadly I couldn’t pinpoint it; I also have an HBA 3008. Something is afoot and I haven’t worked it out yet.
That’s bad. The right firmware is P16.00.12.00.
The .12 release corrects a bug with SSDs, so .10 is probably good enough for HDDs, and IR mode should be acceptable for ZFS (a 3% performance penalty over IT mode, according to jgreco). But P14 is not good at all: the driver in TrueNAS wants P16.
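If you want to confirm what the card and drives are actually running from the SCALE shell, the mpt3sas driver prints the HBA firmware when it loads, and smartctl shows the per-drive firmware. The device name is only an example, and sas3flash will only work if the utility is installed:
dmesg | grep -i mpt3sas    # look for the FWVersion line for the 3008
smartctl -i /dev/sda       # drive model, serial and firmware revision
sas3flash -list            # HBA firmware/BIOS versions, if present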
I just realized you’re running 13th gen CPUs… which Intel has acknowledged suffered manufacturing defects in certain batches (yay, internal rust) as well as requiring microcode updates… I spent at least a dozen hours getting my 13900K stable, manually setting clocks and voltage curves, before this was a known fact…
Stupid question time: is the BIOS fully up to date? If you load up a quick Windows OS and run Cinebench (R23 shows issues faster, but 2024 is fine to test, it will just take longer to show symptoms), do the systems crash?
I’m 85% sure that’d be iGPU related… which reminds me that the iGPU shares system RAM.
Any chance that manually setting the ARC max size a few gigs lower than the default makes a difference? Another thing to test, at least. I don’t have the command on hand since I’m not at home, but it won’t survive a reboot or the deployment/shutdown of a VM, so there’s minimal chance of borking things badly.
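From memory it should be roughly this on SCALE (OpenZFS on Linux); the value is in bytes, double-check the path before running it, and as noted it resets at the next reboot:
cat /sys/module/zfs/parameters/zfs_arc_max                   # 0 means the built-in default
echo 25769803776 > /sys/module/zfs/parameters/zfs_arc_max    # example: cap ARC at 24 GiB
arc_summary | head -n 40                                     # confirm the new target size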