Random reboots after upgrade to 24.10

Hi everyone. Yesterday I upgraded from the latest version of TrueNAS SCALE Dragonfish to 24.10.
The transition went well, apart from a couple of applications which for unspecified reasons did not migrate. Being a novice in Linux (but with 40 years of computer experience) I simply reinstalled them. Unfortunately, however, the real problem emerged shortly thereafter. In the last 12 hours the system has spontaneously rebooted 5 times, sometimes every few minutes, sometimes after hours. Sometimes just using Plex, sometimes with the system doing absolutely nothing.
In the previous six months of continuous uptime I had experienced, I believe, two restarts. The system is not mission critical, but such a frequency of reboots is unacceptable. Since the problem appeared with the transition to 24.10, I think it is reasonable to exclude hardware problems.

Do you have any ideas on what I could check to understand what’s going on (remember I’m NOT an expert in linux)?

Is there a possibility to go back to the previous version and in this case what would happen to all my applications?

The system is a Ryzen 1700X, Asus Prime X370 motherboard, GeForce (old, not supported by the drivers), 16GB of RAM, no ECC (checked and works well).

Thanks to anyone who knows and can help me!

I just migrated from CORE to SCALE today and am also seeing random reboots. I have looked through several logs and I don’t see anything. It just reboots without warning or any logical reason. I’ve seen another thread with the same issue today as well, so we aren’t alone.

Looks like I’m making at least some sense of it. I did find errors in syslog that correspond to the crashes. Can you check if you are seeing something similar?

syslog:Nov 8 16:42:05 Onyx kernel: perf: interrupt took too long (2588 > 2500), lowering kernel.perf_event_max_sample_rate to 77250
syslog:Nov 8 17:19:45 Onyx kernel: perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
syslog:Nov 8 19:37:20 Onyx kernel: perf: interrupt took too long (2799 > 2500), lowering kernel.perf_event_max_sample_rate to 71250
syslog:Nov 8 21:31:49 Onyx kernel: perf: interrupt took too long (2667 > 2500), lowering kernel.perf_event_max_sample_rate to 75000
syslog:Nov 8 21:31:49 Onyx kernel: perf: interrupt took too long (2667 > 2500), lowering kernel.perf_event_max_sample_rate to 75000
syslog:Nov 8 21:51:06 Onyx kernel: perf: interrupt took too long (2574 > 2500), lowering kernel.perf_event_max_sample_rate to 77500
syslog:Nov 9 01:08:12 Onyx kernel: perf: interrupt took too long (2766 > 2500), lowering kernel.perf_event_max_sample_rate to 72250

Hi, I would like to do that and I already spent hours but for a complete linux noob it is not easy. :sweat_smile:
/var/log/ I guess the logs are here? But which one should I check? kern.log? If this is the one, I see nothing after the reboot (I mean… a bunch of stuff during the reboot and then nothing until the next one).

For first-gen Ryzen there were some BIOS settings that had to be disabled.
For older BIOS versions those settings were: ErP Ready, AMD Cool'n'Quiet and Global C-state Control. Newer BIOS versions have an option called Power Supply Idle Control, which had to be set to Typical Current Idle instead of the low-power setting.

Yes, I’m aware of this. My platform worked for six months with everything enabled (it was my old rig for many years, and when I started using it as a TrueNAS server I didn’t change a thing). Now I have tried changing these settings, following an old post/solution:

Precision Boost Overdrive (can’t find it)
Core Performance Boost (disabled)
Global C-State Control (disabled)
PSS Support (can’t find it)
D.O.C.P. (disabled, which lowered my memory frequency from 3200 to 2400 with awful timings…)

Waiting to see if something is going to change.

Try this command, it will search for that phrase in every log file in /var/log.

grep "kernel: perf: interrupt took too long" /var/log/*
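
If the plain grep gets noisy, here is a rough variant you could try. It is only a sketch: the function name find_perf_warnings and its default directory are my own invention, not anything that ships with TrueNAS, and it assumes GNU grep. It skips subdirectories (so no "Is a directory" complaints) and prints each distinct warning line once, since the same kernel message usually lands in more than one log file.

```shell
#!/bin/sh
# Hypothetical helper: the name and the default directory are assumptions,
# not part of TrueNAS. Prints each distinct "interrupt took too long"
# line found in the top level of a log directory.
find_perf_warnings() {
    log_dir="${1:-/var/log}"
    # -d skip : silently skip directories instead of reporting an error
    # -h      : omit file names, so identical lines from different files
    #           become exact duplicates that sort -u can collapse
    grep -d skip -h "perf: interrupt took too long" "$log_dir"/* | sort -u
}

# Example call (commented out so that sourcing this file has no side effects):
# find_perf_warnings /var/log
```

Note that sort -u orders the lines as plain text, not by date, so treat the output as a deduplicated list rather than a timeline.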

yes… I have the error:

/var/log/kern.log:Nov 6 00:14:18 truenas kernel: perf: interrupt took too long (2625 > 2500), lowering kernel.perf_event_max_sample_rate to 76000
/var/log/kern.log.1:Nov 1 13:52:50 truenas kernel: perf: interrupt took too long (7503 > 7010), lowering kernel.perf_event_max_sample_rate to 26500
grep: /var/log/libvirt: Is a directory
/var/log/messages:Nov 6 00:14:18 truenas kernel: perf: interrupt took too long (2625 > 2500), lowering kernel.perf_event_max_sample_rate to 76000
/var/log/messages.1:Nov 1 13:52:50 truenas kernel: perf: interrupt took too long (7503 > 7010), lowering kernel.perf_event_max_sample_rate to 26500
/var/log/messages.1:Nov 1 13:52:50 truenas kernel: perf: interrupt took too long (7503 > 7010), lowering kernel.perf_event_max_sample_rate to 26500

But from what I understand, these don’t correspond directly to the crashes, which are more frequent and all started a couple of days ago…

Yes, this is the most likely cause, in my mind.

There have been numerous posts by people since 24.10.X running older Ryzens specifically complaining about “random reboots when idle”, a key symptom of the older Ryzen power issue at idle.

Update the BIOS and set Power Supply Idle Control to Typical.

I wouldn’t change the other settings. The above should resolve the crashes.
PBO, CPB and so on are likely red herrings.


I don’t have a Ryzen CPU. I’m using an AMD FX-6200.
I also saw the same in 24.04 while I was step upgrading from 13.3.
Core had been rock solid for about 5 years on this hardware, aside from a failed power supply last year. It was replaced with a Corsair 750W gold.

Then you should probably make your own thread and post full system details.

Will do. Was hoping to find some commonalities to help narrow down the issue.

I decided to try this solution and wait a few days to see how it went. Not only have I had no more random reboots, but the errors on all the disks in my pool seem to have been resolved as well (checksum errors, sometimes on the order of 3/4 per disk, which I was unable to explain).
I was also able to restore normal RAM performance. However, I left the C-states disabled: I didn’t notice any change in temperature so, since everything is fine now, I’ll leave things as they are.
Thank you all!
