CORE system randomly crashing...where to start diagnosis?

I recently moved the six HDDs from my old CORE system to a new motherboard, CPU, HBA card, PSU, etc. This ‘new’ computer had previously been running Win10 for years with (as far as I can tell) no problems.

Now, it appears to crash maybe once every 2 days, with no rhyme or reason.

After a crash, it’s inaccessible via the web GUI, of course. On the machine itself, the internal fans and LEDs are still on, but there is no output to the monitor, I can’t SSH in, and the LEDs on the attached USB keyboard are no longer lit. So far I’ve bypassed the onboard Realtek NIC, removed the graphics card and switched to onboard graphics, and run memtest overnight, which found no errors.

When I force a reboot, TrueNAS recognizes that it’s been the victim of an unscheduled reboot, but doesn’t give me any more info than that.

I did notice that after the latest restart, my pool was degraded because it was unable to read one of the HDDs. I shut the machine down, unplugged and re-seated all the SATA data and power cables, then rebooted, and the pool was healthy again.

Where would I even start to diagnose this? I can’t recreate it at all, it seems random.

TrueNAS CORE 13.0-U6.1
AMD Ryzen 5 1600
Gigabyte B450M DS3H
80GB DDR4 RAM
Dell H200 6Gbps SAS HBA (= LSI 9211-8i), IT mode
10Gtek 10/100/1000Mbps Gigabit NIC (Intel 82576)
6 HDDs in RAIDZ1
1 NVMe boot drive
1 flash drive as cloned boot pool

If I remember correctly, first-generation AMD Ryzen CPUs had a power-saving bug, so I would start in the BIOS and disable all CPU-related power saving. As for why MS Windows did not have problems, the bug may have been worked around in the OS, while FreeBSD may not have such a workaround.

But, you can research 1st gen Ryzen and verify that issue.


As Arwen was alluding to, older Ryzens were known to be unstable at idle. If you can find a BIOS update, install that, then look for “Power Supply Idle Control” and set it to “Typical”. If you can’t find that setting, try disabling at least the C6 sleep state in the BIOS and see if it helps.

If that doesn’t help, investigate the usual suspects:

  • Test your RAM with memtest86 or similar. The amount suggests mixed kits; have you confirmed they run stable together? Disable your memory overclock if you’re running one. Try with a single kit.
  • Lastly, do you trust the PSU? Do you have a different one you could try? (careful, don’t mix modular cables from different makers)

Thanks, I had no idea Ryzens had such issues, but there were several settings in my BIOS I was able to change (disabled Cool’n’Quiet and the C6 state, set power idle to “Typical”). I guess the only thing to do now is wait and see if it crashes again.
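
Since the system logs nothing useful about the hang, one low-tech option while waiting is a heartbeat file: a small script appends a timestamp every minute, and after a forced reboot the last entry before the gap shows roughly when the machine froze (for example, whether it was sitting idle at the time). The following is only a sketch under assumptions: the function names are made up for illustration and nothing here is a TrueNAS feature.

```python
import os
import time
from datetime import datetime, timedelta


def write_heartbeat(path, now=None):
    """Append one timestamp line and force it to disk so it survives a hard hang."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    with open(path, "a") as f:
        f.write(stamp + "\n")
        f.flush()
        os.fsync(f.fileno())


def find_gaps(path, interval_s=60, slack=3):
    """Return (last_seen, next_seen) pairs spaced more than interval_s * slack apart."""
    with open(path) as f:
        stamps = [datetime.fromisoformat(line.strip()) for line in f if line.strip()]
    limit = timedelta(seconds=interval_s * slack)
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > limit]


def run(path, interval_s=60):
    """Write a heartbeat every interval_s seconds until the process or machine stops."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)
```

To use it, start `run("/mnt/tank/heartbeat.log")` in the background (the path is hypothetical; any dataset that survives a reboot would do), then call `find_gaps` on the same file after the next forced reboot. This only tells you when the hang happened, not why.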

There are 3 options that need to be disabled for 1st gen Ryzen:

  1. AMD Cool&Quiet
  2. Global C-State Control
  3. ErP Ready

I did my fair share of diagnosing random crashes with my first gen Ryzen and TrueNAS when I started out 3 years ago. Since then I’ve switched to a 3700X and SCALE and don’t have any problems anymore.
