Post Powercut - Server randomly reboots, cannot work out why

Paul.Inventome · May 18, 2024, 4:35pm

Hi All,

We had a power cut, and afterwards the server first complained of Secure Boot issues. I went into the BIOS and made sure that was okay, and it booted after that. But it shows something changed with the power cut.

TrueNAS boots and looks okay but then randomly (say within an hour) resets. I can see the console go through the boot sequence again, but I cannot find out why or what triggered it.

If I look at console.log at the point of reboot, it shows the time and nothing else apart from the details of the reboot. It doesn’t seem to show an error.

Because I lose network at this point, I suspected the 10GbE card, but I switched to the motherboard 1GbE port and it does the same thing.

So how do I troubleshoot this? Are there other places where I can see logs? Is there any way I can understand what could be triggering this?

So this is really about approaches in troubleshooting random resets…

Prior to the power cut it was fine; I think it had been up for 100 days or so.

cheers
Paul

Arwen · May 18, 2024, 8:56pm

A full hardware list is helpful.

Some things that have bitten people are power-saving options and overclocking. Neither is generally suitable for a server, and in some cases they trigger failures.

Since you lost power, you might check that your motherboard battery is still good. If not, replace it. Then check your BIOS settings for odd things, possibly resetting them to defaults.

Please note that ZFS was specifically designed to withstand TONS of graceless (aka unexpected) power losses without any data loss. (Except for any data in flight, like any other file system.) So it is unlikely you have any pool corruption causing your random reboots.

Of course, that said, hardware RAID and hardware failures could lead to ZFS data loss. (Again, a full hardware list is helpful, as I am just stating general information here…)

Paul.Inventome · May 19, 2024, 8:06am

Thanks for replying. I’d assumed my signature had moved over from the old forum, but it hasn’t. So the hardware spec is:

Xeon build on an X99 board (X99-WS/IPMI)
32GB RAM
Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
LSI HBA card (4xSATA)
9x16TB Red Pro NAS
5x4TB SSD
4x2TB NVMe on OWC NVMe card Accelsior 4M2 PCIe x8 (with PCIe switch)
10GbE Intel network card

But what I am trying to work out is whether TrueNAS is rebooting or the server itself. What I’m hoping to do is see some kind of error log in TrueNAS before it reboots. The only thing I am aware of is console.log, so I don’t know whether there are other places I should be looking.

The BIOS is a good call. There was an immediate error about Secure Boot after the power cut, and I went in and made sure the settings were good. It was set to Secure Boot for Windows; I switched it to Other OS and then the server booted.

I don’t believe I have any data loss, and I don’t see errors when the server is running.

I’m on the latest stable TrueNAS version, but I was on the previous one and it did it with that too. I upgraded just in case there was a software issue.

So where can I find any errors beyond console.log? I have a monitor plugged into the HDMI of the server, so I see a/the console on there, but I cannot catch anything being said before it reboots.

Thanks!
Paul

Arwen · May 19, 2024, 4:44pm

Sorry, I don’t have much to add.

However, you don’t say whether you are using TrueNAS Core (based on FreeBSD) or TrueNAS SCALE (based on Debian Linux). That would affect where to find which logs.

Which one?
And version?

NickF1227 · May 19, 2024, 9:56pm

Sounds like maybe your power supply needs to be replaced.

Paul.Inventome · May 20, 2024, 8:55am

I’m on TrueNAS Core.

My feeling is that the issue is outside of TrueNAS, but if you can point me to any other logs that may give a bit more info I’d really appreciate that.
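
One way to tell an OS-initiated reboot from a hard reset is to check whether each boot in the system log is preceded by a clean shutdown message. The sketch below is only an illustration in Python: the two marker strings are assumptions based on what syslogd typically writes on FreeBSD, and the function name `classify_boots` is made up for this example. Check your own `/var/log/messages` and adjust the markers before trusting the result.

```python
from typing import Iterable, List, Tuple

# ASSUMPTION: marker strings as typically written by syslogd on FreeBSD.
# Verify against your own log and adjust if they differ.
BOOT_MARKER = "kernel boot file is"
SHUTDOWN_MARKER = "exiting on signal 15"


def classify_boots(lines: Iterable[str]) -> List[Tuple[str, bool]]:
    """Return (boot_line, was_clean) for every boot found in the log.

    A boot counts as clean only if a shutdown marker was seen since the
    previous boot. A boot with no shutdown before it suggests the machine
    lost power or was reset by hardware rather than by the OS.
    """
    results: List[Tuple[str, bool]] = []
    saw_shutdown = False
    for line in lines:
        if SHUTDOWN_MARKER in line:
            saw_shutdown = True
        elif BOOT_MARKER in line:
            results.append((line.rstrip("\n"), saw_shutdown))
            saw_shutdown = False  # start watching for the next shutdown
    return results
```

To try it, open the log file and pass the file object straight in, e.g. `classify_boots(open("/var/log/messages", errors="replace"))`. The first boot in a rotated log will always show as unclean because the preceding shutdown line is in the older file. A run of unclean boots points away from TrueNAS itself and towards hardware (PSU, board, power).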

Paul.Inventome · May 20, 2024, 8:57am

Hi Nick, what points to that? I’ve had no issues in the past, and nothing changed hardware-wise apart from the power cut. Would that really fry a PSU and make it reboot? It may be possible, but before going down that route I’m keen to understand the thought process.

Thanks
Paul

NickF1227 · May 20, 2024, 2:07pm

Just a bit of deduction. I could still be wrong but I’m pretty confident.

You had a working system with no issues, then had an unscheduled power outage, and now you have begun having issues with the system randomly restarting.

I’ve seen systems randomly crash with a bad PSU in general, and input voltage doing weird things can absolutely damage capacitors and components. Occam’s razor and all that: it’s probably your PSU, because the only change was a power problem.

Paul.Inventome · May 20, 2024, 5:14pm

Good point. I’ve swapped the CMOS battery, and reset and re-flashed the BIOS, so the PSU could be the one. I take it this isn’t something where I would see an error (a voltage drop or something) and then a reset? It’s going to be a swap and see.

I have a drive now throwing lots of errors as well. I assume that really is the drive, and I have a replacement on the way. (As well as a UPS so this doesn’t happen again.)

I have a new PSU coming tomorrow, so I will swap it and see.

thanks
Paul
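
On the question of spotting a voltage drop before a reset: if the board's BMC exposes voltage sensors, the output of `ipmitool sensor` can be checked for rails that are out of range. The Python below is a rough sketch only. It assumes `ipmitool` is available and that the output uses the usual pipe-separated columns (name, reading, unit, status, then the lower and upper thresholds); the names `VoltageReading`, `parse_voltages` and `suspicious` are invented for the example.

```python
from typing import List, NamedTuple, Optional


class VoltageReading(NamedTuple):
    name: str
    volts: float
    status: str
    lower_critical: Optional[float]  # None when the BMC reports "na"


def _to_float(field: str) -> Optional[float]:
    """Convert a sensor field to float, returning None for 'na' or junk."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_voltages(sensor_output: str) -> List[VoltageReading]:
    """Pick the voltage rows out of `ipmitool sensor` style output.

    ASSUMED column order: name | reading | unit | status | lnr | lcr | ...
    """
    readings: List[VoltageReading] = []
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6 or fields[2] != "Volts":
            continue
        volts = _to_float(fields[1])
        if volts is None:  # sensor present but not reporting a value
            continue
        readings.append(
            VoltageReading(fields[0], volts, fields[3], _to_float(fields[5]))
        )
    return readings


def suspicious(readings: List[VoltageReading]) -> List[VoltageReading]:
    """Return rails whose status is not 'ok' or that sit below lower critical."""
    return [
        r for r in readings
        if r.status != "ok"
        or (r.lower_critical is not None and r.volts < r.lower_critical)
    ]
```

A single snapshot will usually miss a brief sag, so a clean reading does not clear the PSU. The BMC's own event log (`ipmitool sel list`) may be a better place to look for threshold events recorded around the time of a reset.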

Paul.Inventome · May 23, 2024, 1:53pm

So I just wanted to follow up. I rebuilt the server with a new PSU (I forgot that different brands’ cabling would be different) and so far, touch wood, that has solved the issue.

So huge thanks to everyone and @NickF1227 for pushing me down this route.

I now have a UPS as well, so never again.

The CRC errors would have been a cable too. The drive itself seems okay.

thanks
Paul
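
For anyone landing here with the same question of cable versus drive, the SMART attribute table can help separate the two: a rising `UDMA_CRC_Error_Count` usually points at the link (cable, connector, backplane), while reallocated or pending sectors point at the disk itself. Here is a small Python sketch of that rule of thumb, assuming the standard ATA attribute table printed by `smartctl -A`. The function names are made up for illustration, attribute names vary between vendors, and this is a heuristic, not a diagnosis.

```python
from typing import Dict

# Attributes that usually indicate a failing disk surface (names can vary by vendor).
MEDIA_ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
CRC_ATTR = "UDMA_CRC_Error_Count"


def parse_smart_raw(smartctl_output: str) -> Dict[str, int]:
    """Map attribute name -> raw value from `smartctl -A` style ATA output.

    Expects the usual 10-column table; rows whose raw value does not start
    with an integer are skipped.
    """
    raw: Dict[str, int] = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                continue
    return raw


def likely_cause(raw: Dict[str, int]) -> str:
    """Very rough rule of thumb: media errors first, then link errors."""
    if sum(raw.get(name, 0) for name in MEDIA_ATTRS) > 0:
        return "drive"
    if raw.get(CRC_ATTR, 0) > 0:
        return "cable or connection"
    return "no obvious fault"
```

Feed it the text from `smartctl -A` for the device in question (the device path depends on your system). The CRC counter is cumulative and does not reset, so what matters after swapping a cable is whether the number keeps climbing.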