Random reboots since upgrading from Core to Scale

I upgraded from 13.3 to 24.04, then to 24.10 yesterday afternoon. Since moving to SCALE I have been having random reboots on both 24.04 and 24.10. I haven’t found any culprits in the logs, and others experiencing the same thing seem to have Ryzen CPUs, which I do not. Is there any additional debugging I can enable to help track down why it keeps rebooting without cause?

Many of the reboots, but not all, also show the following kernel entries in the syslog:

/var/log/syslog:Nov 8 16:42:05 Onyx kernel: perf: interrupt took too long (2588 > 2500), lowering kernel.perf_event_max_sample_rate to 77250
/var/log/syslog:Nov 8 17:19:45 Onyx kernel: perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
/var/log/syslog:Nov 8 19:37:20 Onyx kernel: perf: interrupt took too long (2799 > 2500), lowering kernel.perf_event_max_sample_rate to 71250
/var/log/syslog:Nov 8 21:31:49 Onyx kernel: perf: interrupt took too long (2667 > 2500), lowering kernel.perf_event_max_sample_rate to 75000
/var/log/syslog:Nov 8 21:51:06 Onyx kernel: perf: interrupt took too long (2574 > 2500), lowering kernel.perf_event_max_sample_rate to 77500
/var/log/syslog:Nov 9 01:08:12 Onyx kernel: perf: interrupt took too long (2766 > 2500), lowering kernel.perf_event_max_sample_rate to 72250
/var/log/syslog:Nov 9 02:08:57 Onyx kernel: perf: interrupt took too long (2634 > 2500), lowering kernel.perf_event_max_sample_rate to 75750
/var/log/syslog:Nov 9 02:36:05 Onyx kernel: perf: interrupt took too long (2680 > 2500), lowering kernel.perf_event_max_sample_rate to 74500
/var/log/syslog:Nov 9 06:41:57 Onyx kernel: perf: interrupt took too long (2554 > 2500), lowering kernel.perf_event_max_sample_rate to 78250
/var/log/syslog:Nov 9 07:25:13 Onyx kernel: perf: interrupt took too long (2560 > 2500), lowering kernel.perf_event_max_sample_rate to 78000

The larger gaps yesterday are from when I went back to 13.3 for a bit, just to make sure it all still worked.
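For anyone who wants to line these entries up against the reboot times, here is a minimal Python sketch that parses the perf messages and reports the gap between consecutive entries. It assumes the exact syslog format pasted above and an English locale for the month abbreviation. The function names `parse_perf_lines` and `gaps` are made up for illustration, and the `year` argument is an assumption because syslog timestamps in this format do not carry a year.

```python
import re
from datetime import datetime

# Matches the "perf: interrupt took too long" lines as they appear in the paste above.
PERF_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: perf: "
    r"interrupt took too long \((?P<took>\d+) > (?P<limit>\d+)\), "
    r"lowering kernel\.perf_event_max_sample_rate to (?P<rate>\d+)"
)


def parse_perf_lines(lines, year=2024):
    """Return sorted, de-duplicated (timestamp, took, new_rate) tuples.

    `year` is an assumption: the log lines themselves do not include one.
    """
    seen = set()
    for line in lines:
        m = PERF_RE.search(line)
        if not m:
            continue  # ignore unrelated log lines
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        seen.add((ts, int(m["took"]), int(m["rate"])))
    return sorted(seen)


def gaps(entries):
    """Yield (previous_ts, next_ts, gap) for each pair of consecutive entries."""
    for (a, *_), (b, *_) in zip(entries, entries[1:]):
        yield a, b, b - a
```

Feeding it the output of a grep over the syslog and printing the gaps makes it easier to see which quiet periods match time spent on 13.3 and which match an unexpected reboot. It only reorganizes what is already in the log; it does not explain the reboots by itself.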

Which specific TrueNAS SCALE version?

An SMB bug with 24.10.0.2 was fixed yesterday.

Versions 24.04.2.4, 24.10.0.1, and 24.10.0.2 have all shown the same behavior of randomly rebooting. It rebooted twice in a row after starting this thread so I switched back to 13.3 and it has been up for a bit over 19 hours without any issues. I will try 24.10.0.2 again a bit later today.

Given multiple software versions… I would assume it’s a hardware issue.

I’d start with non-ECC DIMMs. Remove/replace one at a time.

I’d assume the same if it wasn’t completely stable on 13.3. Bad RAM would affect every OS, not just one branch. I’ll run a memory test to see if anything is amiss.

I saw your other posts in a thread with similar issues.

It is true that the background load of 24.10 is lower than previous versions.
We have seen processors that are poor at moving from low power mode to full performance mode.

Linux enables low power mode through control of C-states… so it is possible that BIOS settings may need to be tweaked.
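One way to check what the kernel is actually doing with C-states before and after a BIOS change is to read the cpuidle entries in sysfs. A minimal Python sketch, assuming the standard Linux cpuidle layout under `/sys/devices/system/cpu/cpu0/cpuidle` (each `stateN` directory holding `name`, `usage`, `time`, and `disable` files). The helper names `read_cstates` and `_read_field` are only illustrative, and the directory may be missing entirely if cpuidle is not exposed on a given system.

```python
from pathlib import Path


def _read_field(state_dir, name):
    """Read one sysfs attribute file, or return None if it is absent."""
    f = state_dir / name
    return f.read_text().strip() if f.is_file() else None


def read_cstates(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return one dict per idle state the kernel exposes for a single CPU."""
    base = Path(cpu_dir)
    if not base.is_dir():
        return []  # cpuidle not exposed on this system

    # Sort numerically so state10 comes after state2.
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    return [
        {
            "state": d.name,
            "name": _read_field(d, "name"),        # e.g. POLL, C1, C2
            "usage": _read_field(d, "usage"),      # number of times entered
            "time_us": _read_field(d, "time"),     # total time spent, microseconds
            "disabled": _read_field(d, "disable"), # "1" if disabled via sysfs
        }
        for d in state_dirs
    ]
```

If only a shallow state shows a growing `usage` count after disabling C-states in the BIOS, the setting has probably taken effect; if deeper states still accumulate time, the firmware setting may not be doing what its label suggests.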

I don’t know if this is the issue, but found some links:

https://www.reddit.com/r/AMDHelp/comments/nly59n/disabling_global_c_state_control_to_fix_idle/

I inferred that from those threads and disabled all C-state control in the BIOS, as well as the cool and quiet features. Unfortunately, it has not made a difference.

It’s an old board… I’m not sure whether the graphics driver is supported on modern Linux.

That might be worth digging into or disabling?

I’ll see if I can disable it. It’s a headless unit anyhow; it doesn’t usually have a monitor hooked up unless I need to get into the BIOS.