TrueNAS SCALE keeps freezing and needs reboots

TrueNAS SCALE (latest version) hangs daily and I have to reboot it. I cannot access the web GUI, and the CLI becomes unresponsive on the following screen; I cannot type anything. All the apps go down when this happens.

Dragonfish-24.04.1.1
Supermicro SYS-4028GR-TRT 4U
2X Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
440.8GiB total available (ECC)
LSI 9200-8E
Supermicro 45 Bay JBOD Expansion Server Shelf 847E16-RJBOD1
Eight eBay WL 22TB OEM enterprise SATA 7200RPM HDDs, comparable to the ST22000NM001E (so essentially eight 22TB Seagate Exos drives), plus nine 8TB drives, all configured in the same ZFS pool.

I’d guess it’s a hardware issue… there are no other reports like this.

You have a curious amount of RAM. Either way, even with ECC RAM it would be a good idea to run memtest (probably a few days with that amount).

Look at the PSU/power situation: could something be disrupting power in a way that causes instability?

Check your logs, see if anything of interest is being reported. Can you link the freezes to scheduled tasks or is it freezing when essentially idle?
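For the log check, here is a rough Python sketch that asks `journalctl` for error-level messages from the boot before the current one, which is usually the boot that froze. It assumes the systemd journal is kept across reboots on your install (it may not be) and that you run it as root from the shell. The helper names are made up for illustration.

```python
import subprocess


def previous_boot_errors_cmd(boot_offset=-1, priority="err"):
    """Build a journalctl command for messages at `priority` or worse.

    boot_offset=-1 means the boot before the current one.
    """
    return [
        "journalctl",
        "-b", str(boot_offset),  # select an earlier boot
        "-p", priority,          # err and more severe
        "--no-pager",
        "-o", "short-iso",       # ISO timestamps are easier to compare
    ]


def tail_lines(text, count=50):
    """Return the last `count` lines of `text`."""
    return text.splitlines()[-count:]


def main():
    cmd = previous_boot_errors_cmd()
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as exc:
        # journalctl is missing or not executable on this machine
        print(f"could not run journalctl: {exc}")
        return
    if result.returncode != 0:
        # Typically means no journal data exists for that boot
        print(result.stderr.strip())
        return
    for line in tail_lines(result.stdout):
        print(line)


if __name__ == "__main__":
    main()
```

The last lines before the freeze are the interesting ones. If nothing at all is logged near the hang, that in itself points more toward a hard lockup than a software crash.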

How quickly does this happen after a reboot?

Can you preemptively attach a monitor and keyboard to it so you can see if there’s any output there when it stops responding over the network?

Guys, this is ECC RAM. Not only should TrueNAS push through errors if there were a problem, but there are no RAM issues reported in the IPMI. I don’t think it’s the RAM. I have four 1000W PSUs, as this is designed as an 8x GPU server, and they are barely being taxed. There are no power issues in IPMI, with all voltages and wattages correct, and it’s on a practically dedicated circuit.

I have an Emporia home energy monitor and the circuit is well within limits. Everything seems essentially idle; I haven’t manually scheduled any tasks.

I haven’t used the system, and it hasn’t had an issue since I posted the original post. Otherwise it can happen more frequently; I have seen it maybe twice a day. I’m not sure, but it might have to do with this really flaky container GPU passthrough for Plex and it trying to transcode. I don’t think the issues started until I started trying to transcode.

IPMI is always connected and the screenshot is exactly the output when it stops responding. I will try to dig into the TrueNAS logs more later.

I have a server with 64 GB of RAM and I’m facing the same problems. The hardware is also ECC RAM and an Intel Xeon W-2123. Nothing ‘strange’ in there either.
I run a few small containers (Pi-hole, Borg-Server, SearXNG and an NPM) and a pool of 10.78 TiB.

The first sign of those freezes for me is usually SNMP acting up, spitting out something like the following messages:

<30>Jul 1 13:20:49 truenas snmpd[3362]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224

That’s the only thing I can find in my remote syslog.
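If anyone wants to check how long before a freeze this warning shows up, here is a small Python sketch that pulls these snmpd lines out of a saved remote syslog file. The regex is written against the single line quoted here, so other syslog formats may need adjusting. The function name and the file path argument are just placeholders.

```python
import re
import sys

# Matches the snmpd warning as it appears in the quoted syslog line.
# The two numbers are captured as-is; which one is "expected" is not assumed.
SNMP_WARNING = re.compile(
    r"^<(?P<pri>\d+)>"
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+snmpd\[\d+\]:\s+systemstats_linux: "
    r"unexpected header length in /proc/net/snmp\.\s+"
    r"(?P<left>\d+) != (?P<right>\d+)"
)


def find_snmp_warnings(lines):
    """Return (timestamp, host, left, right) for each matching line."""
    hits = []
    for line in lines:
        match = SNMP_WARNING.match(line)
        if match:
            hits.append((
                match.group("stamp"),
                match.group("host"),
                int(match.group("left")),
                int(match.group("right")),
            ))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_snmp.py /path/to/remote-syslog.log (path is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for stamp, host, left, right in find_snmp_warnings(handle):
            print(f"{stamp} {host}: {left} != {right}")
```

Comparing the first timestamp it prints with the time the box stopped responding would show whether the warning is an early sign or just noise.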


I found out the problem was literally just having a GPU in the system. The second I put in the 3050, the system would start generating enormous amounts of errors in the logs. GPU passthrough was a nightmare. The moment I took the GPU out of the system, it stopped doing this zombie-mode crashing thing.

I don’t have another system handy to test the GPU, but I am pretty sure this was some issue with TrueNAS.

I have 2 NAS: one on a Supermicro X9 with 2x Xeon 2797 v2, 512 GB ECC DDR3 RAM and no video card, the other on a Ryzen Pro 4750GE with 64 GB ECC UDIMM DDR4 and integrated video.
I tried turning off C-states, but it doesn’t help: both machines still freeze within 24 hours, and the logs are empty.
Meanwhile, any other Linux/FreeBSD or Windows system runs on them for months.
The problem is definitely somewhere in TrueNAS SCALE and it’s very annoying. I’ve never encountered anything like this before, but everyone pretends that everything is fine.