TrueNAS Core halting every few days

This has been happening ever since I adopted TrueNAS Core. Once every few days the server just stops responding. I’ve replaced the CPU and RAM.

Nothing in /data/crash or /var/crash. /var/log/messages and dmesg don’t show anything obvious. When it crashes there is almost no load on the system.

Any ideas? When it crashes, the system doesn’t respond at all. I can’t load the web UI or ping it.

Hard to know. Shooting in the dark.

It could be an overheating issue; that has come up before on the old forums.

I also set up email alerts, but I don’t get anything until after I force a reboot, and then I just get an unscheduled reboot notification.
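
Since the alerts only arrive after the reboot, one way to narrow down when the box actually froze is a heartbeat file: append a timestamp every minute, then look for the gap afterwards. A minimal sketch, assuming a cron job calls `append_heartbeat` once a minute; the function names and the log path are made up for illustration and are not part of TrueNAS.

```python
import os
from datetime import datetime, timedelta


def append_heartbeat(path):
    """Append the current time to the heartbeat file and force it to disk."""
    with open(path, "a") as f:
        f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard freeze


def load_heartbeats(path):
    """Read the heartbeat file back into a list of datetime objects."""
    with open(path) as f:
        return [datetime.fromisoformat(line.strip()) for line in f if line.strip()]


def find_gaps(stamps, max_gap=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs that are further apart than max_gap."""
    ordered = sorted(stamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


if __name__ == "__main__":
    # Hypothetical path; put it on a dataset that survives reboots.
    log_path = "/mnt/tank/heartbeat.log"
    append_heartbeat(log_path)
    for last_seen, next_seen in find_gaps(load_heartbeats(log_path)):
        print(f"no heartbeat between {last_seen} and {next_seen}")
```

The first timestamp of each gap is roughly when the system stopped responding, which can then be compared against /var/log/messages and any scheduled jobs.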

Doesn’t appear to be overheating. Shouldn’t it overheat when I’m using it and not at idle?

Anything on the console when it does that?

Nope.

You should list your hardware in detail if you want to receive any help beyond guesswork.

I suggest running memtest86+.

I started a memtest. I don’t think it will fail; the memory is brand new, and it did this with the old memory too.

Motherboard: ASRock AB350M Pro4
CPU: Ryzen 1700
RAM: 4x16GB TeamGroup Elite, TED432G3200C22DC01, DDR4-3200 running at 2666
Boot drive: a 500 GB WD SSD, I think
ZFS pool: a RAIDZ made up of 3x 4 TB WD Reds
Network: re0 is a TX201 that connects directly to another TX201 in my desktop.

I can get you more specific info once memtest is done.

https://www.reddit.com/r/freenas/comments/f50qet/entire_system_freezes_after_about_24_hours/

Thanks. I disabled C-states. I will let you know if it works or not.

Did this fix it for you? I had the same issue with a 1700 on TrueNAS Core (using an ASRock X370 Taichi, but I think the motherboard is irrelevant). Something in the first-gen Zen architecture doesn’t play well with TrueNAS Core (or maybe FreeBSD in general?). After upgrading to a Zen+ CPU (2700X) on the same motherboard, all my issues went away.

It could be anything.
Temperature, a RAM error, a PSU problem, a faulty SATA cable, an HBA problem, even a microcrack in the motherboard.
A couple of years ago I ran into a problem like this and changed all the components. Nothing helped.
The only thing that helped was replacing the whole server.

The problem with first-gen Ryzen is the power-saving options in the BIOS.
When Global C-states, ErP Ready and AMD Cool&Quiet are enabled and the CPU is mostly idle, those options cut power to certain components. BSD doesn’t like that, and the whole system freezes. With my 1600X it happened at around the 72-hour uptime mark; rebooting would fix it, but the freeze would return 72 hours later.
Disabling those three settings made my NAS stable, with a maximum uptime of around 44 days before I rebooted for a TrueNAS update.
Since then I’ve upgraded to a 3700X and switched to SCALE.
I don’t know if it’s the third-gen Ryzen or the Debian kernel having better support for AMD CPUs, but I no longer have to disable the power-saving options, and my SCALE box has not frozen in over 2 years.
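
If you want to confirm from the OS side that the BIOS change took effect, FreeBSD reports per-core C-state residency through the `dev.cpu.N.cx_usage` sysctl. Here is a rough sketch that parses that output and flags any core still spending time below C1. The sample line format is what I would expect on a typical FreeBSD box and is an assumption, as are the helper names; check it against the real output of `sysctl dev.cpu | grep cx_usage` on your system.

```python
import re


def parse_cx_usage(line):
    """Turn 'dev.cpu.0.cx_usage: 100.00% 0.00% last 120us' into [100.0, 0.0].

    The first value is the share of idle time in C1, the next in C2, and so on.
    """
    _, _, value = line.partition(":")
    return [float(p) for p in re.findall(r"(\d+(?:\.\d+)?)%", value)]


def deep_cstates_used(line, threshold=0.0):
    """Return True if any state deeper than C1 shows usage above threshold."""
    usage = parse_cx_usage(line)
    return any(p > threshold for p in usage[1:])


if __name__ == "__main__":
    # Sample lines in the assumed format; paste your own sysctl output here.
    samples = [
        "dev.cpu.0.cx_usage: 100.00% 0.00% last 120us",
        "dev.cpu.1.cx_usage: 12.50% 87.50% last 300us",
    ]
    for s in samples:
        print(s.split(":")[0], "deep C-states in use:", deep_cstates_used(s))
```

If every core reports only a single C1 figure, or 0% for everything past C1, the deeper states are not being entered.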

This mirrors my own experience with modern AMD on the BSD side. It tended to have a few more quirks and rough edges around C-states that could get you into trouble. I avoided AMD processors on my personal systems until SCALE for this reason.

I didn’t want to jinx it by reporting in too soon. But it’s been up for more than 11 days, which is several times longer than my previous best. I think this fixed it.

I will say it had one negative effect: the idle temperature is 10 °C higher now. It used to idle at 45 °C and now idles at 55 °C. I’m assuming that’s because the cores no longer get parked when idle.