So I had a power cut, and the server first complained of Secure Boot issues. I went into the BIOS and made sure that was okay, and it booted after. But it shows something changed with the power cut.
TrueNAS boots and looks okay but then randomly (say within an hour) resets. I can see the command line reboot again but I cannot find out why or what triggered it.
If I look at console.log at the point of reboot it shows the time and nothing else apart from the details of the reboot. Doesn’t seem to show an error.
Because I lose network at this point I’d considered the 10GbE card, but I switched to the motherboard 1GbE and it does the same thing.
So how do I troubleshoot this? Are there other places where I can see logs? Is there any way I can understand what could be triggering this?
So this is really about approaches in troubleshooting random resets…
Prior to the power cut it was fine; I think it was up for 100 days or so.
Some things that have bitten people are power save options, or over-clocking. Both generally not suitable for a server. And in some cases those trigger failures.
Since you lost power, you might check and make sure your motherboard battery is still good. If not, replace it. And then check your BIOS settings for odd things, possibly resetting them to defaults.
Please note that ZFS was specifically designed to withstand TONS of graceless (aka unexpected) power losses without any data loss. (Except for any data in flight, like any other file system.) So it is unlikely you have any pool corruption causing your random reboots.
Of course, that said, hardware RAID and hardware failures could lead to ZFS data loss. (Again, a full hardware list is helpful, as I am just stating general information here…)
But what I am trying to work out is whether TrueNAS is rebooting or the server itself. And what I’m hoping to do is to see some kind of error log in TrueNAS before it reboots. The only thing I am aware of is console.log, so I don’t know whether there are other places I should be looking?
The BIOS is a good call. There was an immediate error about Secure Boot after the power cut, and I went in and made sure the settings were good. It was set to secure boot for Windows; I switched to Other OS and then the server booted.
I don’t believe I have any data loss, and I don’t see errors when the server is running.
I’m on the latest stable TrueNAS version, but I was on the previous one and it did it with that too. I upgraded just in case there was a software issue.
So where can I find any errors beyond console.log? I have a monitor plugged into the HDMI of the server so I see a/the console on there but I cannot catch anything being said before it reboots.
However, you don’t list if you are using TrueNAS Core (based on FreeBSD) or TrueNAS SCALE (based on Debian Linux). That would affect where to find which logs.
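Whichever log file turns out to be the right one, a general way to tell an OS-initiated reboot from a hardware reset is to look at what was written just before each boot. If the last lines are normal shutdown messages, the OS restarted itself; if the log simply stops mid-activity and the next thing is a fresh boot, that points more towards power or hardware. Here is a rough Python sketch of that idea. It is not TrueNAS-specific: the function name is made up for illustration, and the `boot_marker` string (some text that only appears once at the start of a boot in your log) is something you would have to pick yourself by looking at your own log.

```python
from collections import deque


def context_before_boots(lines, boot_marker, context=5):
    """For every line containing boot_marker, collect the lines just before it.

    lines       -- any iterable of log lines (an open file works)
    boot_marker -- ASSUMPTION: a substring that marks the start of a boot
                   in your particular log; choose it by inspecting the file
    context     -- how many preceding lines to keep per boot
    """
    recent = deque(maxlen=context)  # rolling window of the latest lines
    results = []
    for raw in lines:
        line = raw.rstrip("\n")
        if boot_marker in line:
            # Snapshot what was logged right before this boot started.
            results.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return results
```

To use it, open whichever log you are checking (for example with `open(path, errors="replace")`), pass the file object in along with your chosen marker, and print each returned group. An empty or unremarkable group before a boot is itself a hint that nothing in software saw the reset coming.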
Hi Nick, what points to that? I had no issues in the past, and nothing changed hardware-wise apart from a power cut. Would that really fry a PSU and make it reboot? It may be possible, but before going down that route I’m keen to understand the thought process.
Just a bit of deduction. I could still be wrong but I’m pretty confident.
You had a working system with no issues, then had an unscheduled power outage, and now you began having issues with the system randomly restarting.
I’ve seen systems randomly crash with a bad PSU in general, and input voltage doing weird things can absolutely damage capacitors and components. Occam’s razor and all that: it’s probably your PSU, because the only change was a power problem.
Good point. I’ve swapped the CMOS battery and reset and re-flashed the BIOS. So PSU could be the one. It’s not something where I can see an error (voltage drop or something) then a reset? Gonna be a swap and see.
I have a drive now throwing lots of errors as well. I assume that really is the drive and have a replacement on the way. (As well as a UPS so this doesn’t happen again.)
I have a new PSU coming tomorrow, so will swap and see.
So I just wanted to follow up. I rebuilt the server with a new PSU (forgot that different brands’ cabling would be different) and so far, touch wood, that has solved the issue.
So huge thanks to everyone and @NickF1227 for pushing me down this route.
Now have a UPS as well, so never again.
The CRC errors would have been a cable too. The drive itself seems okay.
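For anyone wanting to check the same thing: assuming the CRC errors here were the SMART counter that `smartctl -A` reports as attribute 199 (usually named `UDMA_CRC_Error_Count` on SATA drives), a small Python sketch like the one below can pull the raw count out of that output. The function name is made up for illustration, and it assumes the usual ten-column ATA attribute table; SAS drives and some vendors report this differently. A count that stops climbing after a cable swap suggests the cable rather than the drive.

```python
def udma_crc_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if not present.

    ASSUMPTION: the text is the attribute table printed by `smartctl -A`,
    with columns ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE,
    UPDATED, WHEN_FAILED, RAW_VALUE.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # The attribute ID is the first column, the raw value the tenth.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None
```

Run it against saved output before and after reseating or replacing the cable and compare the two numbers; the counter is cumulative, so only new increments matter.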