My current system:
Hardware: Dell PowerEdge T20 - E3-1225 v3 3.2GHz; 12GB ECC RAM
Software: TrueNAS-13.0-U6.2
Issue: My TrueNAS server shows an unscheduled reboot every 65 min.
The reboots are 65 min +/- 2 sec or so.
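In case it is useful, the reboot times can be pulled from the shell to confirm the interval; something like this should work (assuming the logs are surviving the resets):

    # recent reboot entries from wtmp - the timestamps show the ~65 min spacing
    last reboot | head
    # look for any panic/watchdog/shutdown messages logged right before each reset
    grep -iE "panic|watchdog|shutdown" /var/log/messages | tail -50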
I first noticed the issue in the middle of last week. At the time, my server was running version 13.0-U6.1. I tried installing the U6.2 update thinking it might fix the problem, but there was no change.
I approached it as though it were a hardware problem. I have swapped out the power supply (no change). I have also booted into the BIOS and run its diagnostics test; it does not find any issues.
Just for some background, the original install for the server was FreeNAS version 9.10 back in 2017. I have migrated to version 11, then 12, and now 13. The system has been pretty rock-solid until this recent issue.
I only have 2 services enabled (SSH and SMB). I have 1 plugin installed (Nextcloud). I have tried stopping the Nextcloud jail, but there is no change in behavior.
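For reference, the jail can also be checked and stopped from the shell; something like the following, assuming the iocage backend and that the jail really is named nextcloud (the list command will show the actual name):

    iocage list            # confirm the jail name and current state
    iocage stop nextcloud  # stop it so it is out of the picture while testing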
I grabbed a screenshot of the local display on one of the reboots - not sure if this info is helpful.
The pool is obviously offline. The server has been stable, though, and hasn't rebooted.
Any ideas on how to narrow down which drive(s) may be causing the issue, or whether there is some other issue related to the pool?
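For what it's worth, the only per-drive check I have thought of so far is SMART, roughly like this for each disk (the device name is just an example; on this box they may be ada0-ada3 or da0-da3):

    # health summary plus the error counters that usually matter
    smartctl -a /dev/ada0 | grep -iE "result|reallocated|pending|uncorrect|crc"
    # long surface self-test on one disk (takes hours on a 5TB drive)
    smartctl -t long /dev/ada0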
I can see that there is a ZFS update available for the pool. I have thought about trying to upgrade the pool, but I don't have much confidence that it will actually solve the problem.
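If I do go ahead with it, my understanding is that the pool upgrade itself is just the following from the shell (tank is a placeholder for my pool name, and the upgrade is one-way, so I would back up the config first):

    zpool status tank    # shows the "newer features are available" hint
    zpool upgrade tank   # enable the new feature flags (cannot be undone)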
I decided to try unplugging 1 of the 5TB drives at a time and booting the system.
I went through all 4 drives one at a time.
In each case, the system booted and showed the pool as degraded, but still working (all Samba shares accessible, etc.).
In each case, the system still rebooted after 65 min.
So: still the same issue, and I was not able to narrow the problem down to a single drive.
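For each pass, the pool state can be confirmed from the shell with something like this (pool name is a placeholder):

    # pool health, which device is missing, and per-disk read/write/cksum error counters
    zpool status -v tank
    # per-disk I/O while the pool is in use, to spot a drive that stalls (Ctrl-C to stop)
    zpool iostat -v tank 5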
I tried running a manual scrub, but a scrub can't complete within 65 minutes. I tried pausing the scrub before the system rebooted, but the progress is not saved; a new scrub starts after every reboot.
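For reference, the scrub commands involved are roughly these (pool name is again a placeholder); my understanding is that a paused scrub is supposed to keep its progress, which is why the restart after every reboot surprised me:

    zpool scrub tank      # start a scrub
    zpool scrub -p tank   # pause it, retaining progress
    zpool scrub tank      # resume a paused scrub where it left off
    zpool scrub -s tank   # cancel it entirely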
If any set of four drives is fine but the whole five together fail, it could be an issue with the power supply, though it's hard to conceive why the PSU would wait exactly 65 minutes to show its discontent.
Given that you have eliminated the PSU and the HDDs from the equation, the only thing left is the CPU / motherboard? I presume you have been watching the CPU / motherboard temps and they are fine? Was the system cooler over the three days that it did operate without rebooting?
To me, this looks like a "thermal" issue; at the same time, I see nothing wrong with trying a 13.3 upgrade to see what happens. Your pool data should be safe as long as you have a backup of the config file.
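If you want numbers rather than the GUI graphs, the core temperatures can be read directly from the shell, something like this (assuming the coretemp readings are exposed, which they normally are on these Xeons):

    # per-core CPU temperatures
    sysctl dev.cpu | grep temperature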
I have completed enough testing that I'm fairly positive it is a problem specific to the 4x 5TB pool and/or one of the drives in the pool.
I have done both of the following to help confirm:
I have disconnected the 4x 5TB drives in the pool and rebooted the server. When I do this, the pool is obviously offline, but the server is then stable and does not reboot.
I have booted the server to a "live" Linux distro on a USB drive. Again, the server is stable and does not reboot.
CPU temps all look good (avg temps 26C, hottest core 29C). No difference noted during the 3 days of uptime.
I would be glad to grab and share debug files. I'm not sure which files I need to provide; if you can tell me what files/info would be valuable, I will capture and share them.
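My plan is to grab the standard debug bundle from System > Advanced > Save Debug in the GUI, or from the shell with something like the following (this is my understanding of the CLI tool and log locations, so correct me if I have it wrong):

    freenas-debug -A             # collect the full debug output
    ls /data/crash/              # any saved kernel crash dumps
    tail -200 /var/log/messages  # system log around the reboot times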