TrueNAS Scale consistently crashes after writing a certain quantity of data

Howdy, Forum!

I’m a newbie working to convert from a Synology DiskStation to TrueNAS Scale 24.10.0.2 on home lab hardware (bare metal). I recently began migrating my media library using rsync over SSH, which is when the crashes started. After rigorous testing, I’ve found that I can pretty reliably force a system reboot after copying between ~64 and ~80 gigabytes to separate file handles. Examples below. I’m not able to find an error in the system logs. My box has a BMC with an IPMI interface, so I set up a screen recording on the virtual terminal to see if anything got thrown to the console, and the answer was ‘no’. It goes straight from a running system to POST with no warning.

Some tests I’ve run (using my desktop’s /dev/urandom as a data source; a rough sketch of the commands follows the list):
Streaming data over SSH to a single file without limitation - Reboot after writing ~70 GB
Streaming data over SSH to multiple files with sizes between 1 and 20 GB - Reboot after writing ~70 GB
Streaming data over NFS to multiple files with sizes between 1 and 20 GB - Reboot after writing ~70 GB
Streaming ~5 GB of data over SSH to multiple files, with a 2-minute sleep between each copy - Reboot after writing ~70 GB
Streaming ~5 GB of data over SSH to multiple files, with a 60-minute sleep after 10 files - Reboot after writing ~70 GB
Streaming data over SSH to a single file, writing/overwriting between 1 and 20 GB to the same file handle - No reboot; halted the test after ~100 GB
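For reference, the tests were roughly this shape; the host, paths, and sizes below are placeholders rather than my exact commands:

# Single-file variant: stream random data straight into one file on the pool
dd if=/dev/urandom bs=1M count=70000 | ssh user@truenas 'cat > /mnt/tank/media/testfile.bin'

# Multi-file variant: ~5 GB per file, with a pause between copies
for i in $(seq 1 20); do
  dd if=/dev/urandom bs=1M count=5000 | ssh user@truenas "cat > /mnt/tank/media/test_$i.bin"
  sleep 120
done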

Tests I want to run but haven’t gotten to yet:
“Stress”-testing READ rather than WRITE
Local copy (dataset to dataset, and/or from a USB drive)

The target datasets are on a pool of 4 physical SSDs, making up 2 mirrored vdevs, both assigned as Data vdevs.
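In zpool terms, the layout is equivalent to something like the following (the pool was actually built through the TrueNAS UI; pool and device names here are placeholders):

zpool create <pool> mirror <ssd1> <ssd2> mirror <ssd3> <ssd4>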

I love weird edge-case problems like this, so I’m game to keep tinkering if anyone has ideas or suggestions. I’m also open to trying CORE instead of SCALE, after assessing any caveats. I mainly picked SCALE because Linux is in my wheelhouse, but I don’t think it matters much when dealing with an appliance-grade OS.

Edit: Including hardware specs, per @SmallBarky
Proc: AMD EPYC 4464P
Board: ASRock Rack B650D4U (using on-board NICs and SATA)
RAM: 2x 32 GB ECC unbuffered DIMMs
Boot Disks: 2x 500 GB NVMe
Data Pool: 2x Samsung SSD 870 (4 TB), 1x TEAM T2532TB (2 TB), 1x SPCC Solid State (2 TB)

Give us hardware details like in my sig / Detail section. I was going to ask about the NIC, but since you mentioned BMC IPMI, I am guessing server hardware and a server-grade NIC.

Have you run hardware tests like a long SMART test and a RAM test? Any chance of overheating in the server, HBA, or NIC?
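For example, from a shell on the NAS (the device name is a placeholder):

sudo smartctl -t long /dev/sda   # starts a long self-test running in the drive's background
sudo smartctl -a /dev/sda        # once it finishes, check the self-test log and error counters

RAM is easiest to check by booting memtest86 from a USB stick rather than from inside the OS.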

Updated original post with hardware.
I’ve run one SMART test, on the SPCC Solid State (it threw an error during pool creation and does not seem to report its temperature correctly, but it came back clean from testing).

Overheating may be a concern, as I’m all air-cooled, and my case design necessitated putting the SSDs off in a corner away from the blowers. However, during the ‘overwrite the same file’ test, each SSD pushes ~150 MiB/s, same as in all the other tests, so I would think that if overheating were the issue, that test would have triggered a crash too.
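One way I could rule temperature in or out is to log drive temps during a run, something like this (the device list is a placeholder, and the SPCC may not report anything useful given its broken temperature reporting):

sudo watch -n 30 'for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do smartctl -A $d | grep -i temperature; done'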

Stick with Scale. All the development is going there.

If you have a good, current backup of your data elsewhere, you could try removing the SPCC from its mirror and running your tests again. The pool will be in a degraded state, and you are at risk of the entire pool dying if the TEAM device goes down.

It is RISKY but may be worth a try. I only mention it because your pool is 6 TB or less, so it shouldn’t take much time to reload the data.
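If you’d rather do it from a shell than the UI, it’s roughly this (pool and device names are placeholders):

sudo zpool offline <pool> <device-or-guid>   # pool keeps running, but degraded
sudo zpool status <pool>                     # confirm the disk shows as OFFLINE
sudo zpool online <pool> <device-or-guid>    # bring it back and resilver when done testing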


Thanks for the suggestion. I set the disk offline and tested again; it still crashed. I’m performing long SMART tests on the other disks in the pool out of curiosity, and will continue to poke at it.

I kicked off a memtest86 run and almost immediately my system crashed, so I guess I’m looking at a bad DIMM. I suppose ~70 GB worth of writes is what it takes for the filesystem cache (ZFS’s ARC, I believe) to land on a damaged cell. I’d have expected it to be more random; I thought modern memory allocation schemes were supposed to be somewhat randomized, though interleaving across DIMMs might make a difference.
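For anyone wanting to watch the same thing on their own box, the ARC’s current size can be read from the standard ZFS-on-Linux kstat while a copy is running:

# prints the current ARC size in GiB
awk '/^size / {printf "%.1f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats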

Anyway, thanks to everyone who considered this issue.


Hey @UrchinSlacks

Your board does have IPMI, so you might be able to pull something from sudo ipmitool sel list that could help direct you to the faulted DIMM if it’s failing in a way that ECC will grab and log.
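For example (exact event wording varies by board and BMC firmware):

sudo ipmitool sel elist                               # SEL entries with readable timestamps
sudo ipmitool sel elist | grep -iE 'ecc|memory|dimm'  # narrow it down to memory-related events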


Sweet, that could save me interminable hours of testing DIMM/slot combinations. I’ll give it a try.