ZFS Cache is using all RAM, Dataset locks, ZFS unhealthy

I’m currently copying all my data onto my new, fresh TrueNAS box. The ZFS cache is using a lot of RAM while writing - that’s the good part.

But it uses 100% of the RAM while I’m copying my data via SMB, and then suddenly the dataset locks and ZFS goes unhealthy due to some errors in the last-written files. After that, the cache is empty.

That can’t be right, right? How can I limit the maximum value? I thought TrueNAS (v25, btw) uses 50% by default. The problem is that the cache doesn’t get flushed after smaller write sessions. I have 128 GB of RAM and I’m writing maybe 60 GB per task, so I can run two tasks and then it crashes again.

What did I miss while setting up the pool and datasets?

The 50% limit was removed in 24.04 or 23.10 (can’t remember which one). Now it’s the same behavior as CORE, so I believe around 90%.
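If you want to double-check what cap (if any) is actually in effect and how big the ARC currently is, something like this should work from a SCALE shell (these are the standard OpenZFS paths on Linux; a zfs_arc_max of 0 means no explicit cap):

cat /sys/module/zfs/parameters/zfs_arc_max
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats
arc_summary | head -n 25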

Okay, thanks. But it’s definitely 100% for me. The cache might want to use 200%.

AFAIK TrueNAS should reduce the cache size when other processes need RAM. So why is my system ignoring this?

Yes, it should adjust itself, and on my system it does… so no idea why yours doesn’t.
Is it a vanilla installation, or did you do some manual tweaking of ZFS parameters?
Did you try to manually cap the ARC size and see if your system still locks up?

I don’t see a strong correlation between ARC utilization and the symptoms. Perhaps describe the symptoms without ascribing them to the ARC, and give the exact errors presented by the system. Also provide full hardware details.
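If it helps narrow things down, the exact errors can usually be captured from a shell while the problem is happening; something along these lines should do it:

zpool status -v
dmesg | tail -n 100
zpool events -v | tail -n 100

zpool status -v lists the files the errors were logged against, and dmesg / zpool events should show whether the disks, the HBA, or something else is complaining.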

It’s the latest version of TrueNAS SCALE and a normal pool, without any cache VDEV or similar. I didn’t change anything - just set up the pool (8x 6 TB in RAIDZ2) and created some datasets.

I’m trying to limit the RAM usage, but I’m still searching for the right config file atm.
Is it

vfs.zfs.arc_max=2147483648

in /etc/sysctl.conf?

echo SIZE IN BYTES >> /sys/module/zfs/parameters/zfs_arc_max

For example:

echo 47191459840 >> /sys/module/zfs/parameters/zfs_arc_max

To make the setting persist you will need to set up a post-init script to run it at boot. You can reset it to ‘0’ to return to the default behavior.
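A minimal sketch of the persistence part (the script path and the 120 GiB figure are only examples; in the SCALE web UI you would add it as a post-init entry in the Init/Shutdown Scripts section of the Advanced settings):

#!/bin/sh
# /root/set-arc-max.sh - example post-init script (path is arbitrary)
# Cap the ZFS ARC at 120 GiB; write 0 instead to return to the default.
echo 128849018880 > /sys/module/zfs/parameters/zfs_arc_max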

root@truenas[/mnt/Asgard/USERS/odin]# cat /sys/module/zfs/parameters/zfs_arc_max
0
root@truenas[/mnt/Asgard/USERS/odin]# echo 128849018880 >> /sys/module/zfs/parameters/zfs_arc_max
root@truenas[/mnt/Asgard/USERS/odin]# cat /sys/module/zfs/parameters/zfs_arc_max                
128849018880

That should cap it at 120 GiB of my 124.9 GiB. Let me try that.
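For reference, that figure is just 120 GiB expressed in bytes; a quick way to sanity-check it from a shell:

echo $((120 * 1024**3))
128849018880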

You shouldn’t have to limit the ZFS ARC size to prevent locks or crashes.

Something else is to blame.

Is this ECC RAM? Have you run memtests overnight?

No, it’s standard DDR5 RAM for now. I couldn’t find any ECC RAM without bleeding my bank account dry.


Here: a reboot right after 10 am, then I started copying all my data onto the NAS. The free RAM runs down to zero like a burn-down chart.
The NAS “ejects” and locks all datasets, the web GUI stops responding, and once all datasets are locked the cache is empty and I can restart.

I’ve now started a 390 GB transfer with >150k files. The cache is at 80 GB right now. I’m curious…

If it’s a RAM defect, shouldn’t the NAS freeze completely and have to be rebooted?

Running a memtest overnight will rule this out.

Bad RAM, bad RAM slots, or a bad memory controller doesn’t have one particular way of manifesting. Random and strange things can occur. If you can run a memtest overnight to rule out bad RAM, you’ll at least eliminate that possibility.

I recently dealt with the symptoms of a bad RAM stick, which I initially thought was the graphics card because of what appeared to be GPU lockups and the display not responding.

It happened again, even with the max set to 120 GB. The cache is sucking up every single kB.

I will run a memtest. Any recommendations for the testing method? Is one run sufficient?

Overnight is the best. It allows multiple passes, and you won’t have to stay awake to monitor it. The next morning you can see if it failed.

If it’s not bad RAM, you can then start to investigate other culprits.

For testing purposes, drop it to JEDEC speeds - so no XMP, no EXPO, no custom tuning. I’ve seen systems pass everything else in memtest but spectacularly fail the bit fade test.

Right now memtest v10.1 Free is running test 8 in pass 1 of 2.
Already 44 errors (Random Number Sequence).

But it’s not ECC RAM, so is this normal, or is my RAM faulty?
Should I replace all 4 sticks?

We really need detailed info on your system. Expand ‘My system specs’ on LarsR’s post #8 above; that is the kind of detail we like. Hard drive models would help, as we are looking to make sure the drives are CMR and not SMR types.
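If it’s quicker, the disk models can also be pulled straight from a shell, for example:

lsblk -d -o NAME,MODEL,SIZE
smartctl -i /dev/sda

lsblk lists every disk with its model string in one go; smartctl -i per disk adds details like the rotation rate, and the model numbers can then be checked against WD’s spec sheets to confirm CMR vs SMR.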

I copied 13 TB of data from Windows 11 to TrueNAS Fangtooth 25.04.1 and had no problems. I was using Windows Robocopy to make the backup onto an SMB share.

Even one error is too many in my mind. Are you running with XMP/EXPO?

Test 9 also gave some errors in pass 1 of 2 (Modulo 20).

Here are my specs:

  • Mainboard: ASRock B850M Pro-A
  • CPU: Ryzen 5 7600
  • RAM: 4x 32 GB DDR5 5200 non-ECC
  • HBA: Broadcom 9500-8i
  • PSU: bequiet PurePower 13M 750W
  • HDD: 8x 6TB WD Red Plus in RAIDZ2 (WD60EFRX, EFZX and EFPX)
  • Boot: 2x 250GB WD Red SN700 (mirrored)

Here are the drives (the two boot NVMes plus the 8 WD Reds):

  1. nvme1n1 WD Red SN700 250GB 232.89 GiB boot-pool
  2. nvme0n1 WD Red SN700 250GB 232.89 GiB boot-pool
  3. sda WDC_WD60EFRX-68L0BN1 5.46 TiB Asgard
  4. sdb WDC_WD60EFZX-68B3FN0 5.46 TiB Asgard
  5. sdc WDC_WD60EFPX-68C5ZN0 5.46 TiB Asgard
  6. sdd WDC_WD60EFPX-68C5ZN0 5.46 TiB Asgard
  7. sde WDC_WD60EFRX-68L0BN1 5.46 TiB Asgard
  8. sdf WDC_WD60EFPX-68C5ZN0 5.46 TiB Asgard
  9. sdg WDC_WD60EFRX-68L0BN1 5.46 TiB Asgard
  10. sdh WDC_WD60EFZX-68B3FN0 5.46 TiB Asgard

I’m not sure about XMP. I didn’t activate it, but let me check the BIOS.