ZFS Cache is using all RAM, Dataset locks, ZFS unhealthy

I’m currently copying all my data onto my new, fresh TrueNAS box. The ZFS cache is using a lot of RAM while writing - that’s the good part.

But it uses 100% of the RAM while I’m copying my data via SMB, and then suddenly the dataset locks and ZFS goes unhealthy due to some errors in the last-written files. After that, the cache is empty.

That can’t be right, right? How can I limit the maximum value? I thought TrueNAS (v25, btw) uses 50% by default. The problem is that the cache doesn’t get flushed after smaller write sessions. I have 128 GB of RAM and I’m writing maybe 60 GB per task, so I can run two tasks and then it crashes again.

What did I miss while setting up the pool and datasets?

The 50% limit was removed in 24.04 or 23.10 (can’t remember which one). Now it’s the same behavior as CORE, so I believe around 90%.
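If you want to double-check what cap (if any) is actually in effect and how big the ARC currently is, something like this should work from a SCALE shell (these are the standard OpenZFS paths on Linux; a zfs_arc_max of 0 means no explicit cap):

cat /sys/module/zfs/parameters/zfs_arc_max
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats
arc_summary | head -n 25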

Okay, thanks. But it’s definitely 100% for me. The cache might want to use 200%.

AFAIK TrueNAS should reduce the cache size when other processes need RAM. So why is my system ignoring this?

Yes, it should adjust itself, and on my system it does… so no idea why yours doesn’t.
Is it a vanilla installation, or did you do some manual tweaking of ZFS parameters?
Did you try to manually cap the ARC size and see if your system still locks up?

I don’t see a strong correlation between ARC utilization and the symptoms. Perhaps describe the symptoms without ascribing them to the ARC, and give the exact errors presented by the system. Also provide full hardware details.
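If it helps narrow things down, the exact errors can usually be captured from a shell while the problem is happening; something along these lines should do it:

zpool status -v
dmesg | tail -n 100
zpool events -v | tail -n 100

zpool status -v lists the files the errors were logged against, and dmesg / zpool events should show whether the disks, the HBA, or something else is complaining.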

It’s the latest version of TrueNAS SCALE and a normal pool, without any cache VDEV or similar. I didn’t change anything - just set up the pool (8x 6 TB in RAIDZ2) and created some datasets.

I’m trying to limit the RAM usage, but I’m still searching for the right config file atm.
Is it

vfs.zfs.arc_max=2147483648

in /etc/sysctl.conf?

echo SIZE IN BYTES >> /sys/module/zfs/parameters/zfs_arc_max

For example:

echo 47191459840 >> /sys/module/zfs/parameters/zfs_arc_max

To make the setting persist you will need to set up a post-init script to run it at boot. You can reset it to ‘0’ to return to the default behavior.
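A minimal sketch of the persistence part (the script path and the 120 GiB figure are only examples; in the SCALE web UI you would add it as a post-init entry in the Init/Shutdown Scripts section of the Advanced settings):

#!/bin/sh
# /root/set-arc-max.sh - example post-init script (path is arbitrary)
# Cap the ZFS ARC at 120 GiB; write 0 instead to return to the default.
echo 128849018880 > /sys/module/zfs/parameters/zfs_arc_max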

root@truenas[/mnt/Asgard/USERS/odin]# cat /sys/module/zfs/parameters/zfs_arc_max
0
root@truenas[/mnt/Asgard/USERS/odin]# echo 128849018880 >> /sys/module/zfs/parameters/zfs_arc_max
root@truenas[/mnt/Asgard/USERS/odin]# cat /sys/module/zfs/parameters/zfs_arc_max                
128849018880

That should cap it at 120 GiB of my 124.9 GiB. Let me try that.
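For reference, that figure is just 120 GiB expressed in bytes; a quick way to sanity-check it from a shell:

echo $((120 * 1024**3))
128849018880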

You shouldn’t have to limit the ZFS ARC size to prevent locks or crashes.

Something else is to blame.

Is this ECC RAM? Have you run memtests overnight?

No, it’s standard DDR5 RAM for now. I couldn’t find any ECC RAM without bleeding my bank account dry.


Here: a reboot right after 10 am, then I started copying all my data onto the NAS. The free RAM runs down to zero like a burn-down chart.
The NAS “ejects” and locks all datasets, the web GUI stops responding, and once all datasets are locked the cache is empty and I can restart.

I’ve now started a 390 GB transfer with >150k files. The cache is at 80 GB right now. I’m curious…

If it’s a RAM defect, shouldn’t the NAS freeze completely and have to be rebooted?

Running a memtest overnight will rule this out.

Bad RAM, bad RAM slots, or a bad memory controller doesn’t have one particular way of manifesting. Random and strange things can occur. If you can run a memtest overnight to rule out bad RAM, you’ll at least eliminate that possibility.

I recently dealt with the symptoms of a bad RAM stick, which I initially thought was the graphics card because of what appeared to be GPU lockups and the display not responding.

It happened again, even with the max set to 120 GB. The cache is sucking up every single kB.

I will run a memtest. Any recommendations for the testing method? Is one run sufficient?

Overnight is the best. It allows multiple passes, and you won’t have to stay awake to monitor it. The next morning you can see if it failed.

If it’s not bad RAM, you can then start to investigate other culprits.

For testing purposes, drop it to JEDEC speeds - so no XMP, no EXPO, no custom tuning. I’ve seen systems pass everything else in memtest but spectacularly fail the bit fade test.

Right now memtest v10.1 Free is running test 8 in pass 1 of 2.
Already 44 errors (Random Number Sequence).

But it’s not ECC RAM, so is this normal, or is my RAM faulty?
Should I replace all 4 sticks?

We really need detailed info on your system. Expand ‘My system specs’ on LarsR’s post #8 above; that is the kind of detail we like. Hard drive models would help, as we are looking to make sure the drives are CMR and not SMR types.
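If it’s quicker, the disk models can also be pulled straight from a shell, for example:

lsblk -d -o NAME,MODEL,SIZE
smartctl -i /dev/sda

lsblk lists every disk with its model string in one go; smartctl -i per disk adds details like the rotation rate, and the model numbers can then be checked against WD’s spec sheets to confirm CMR vs SMR.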

I copied 13 TB of data from Windows 11 to TrueNAS Fangtooth 25.04.1 and had no problems. I was using Windows Robocopy to make the backup onto an SMB share.

Even one error is too many in my mind. Are you running with XMP/EXPO?

Test 9 also gave some errors in pass 1 of 2 (Modulo 20).

Here are my specs:

  • Mainboard: ASRock B850M Pro-A
  • CPU: Ryzen 5 7600
  • RAM: 4x 32 GB DDR5 5200 non-ECC
  • HBA: Broadcom 9500-8i
  • PSU: bequiet PurePower 13M 750W
  • HDD: 8x 6TB WD Red Plus in RAIDZ2 (WD60EFRX, EFZX and EFPX)
  • Boot: 2x 250GB WD Red SN700 (mirrored)

Here are the drives (the two boot NVMes plus the 8 WD Reds):

  1. nvme1n1 WD Red SN700 250GB 232.89 GiB boot-pool
  2. nvme0n1 WD Red SN700 250GB 232.89 GiB boot-pool
  3. sda WDC_WD60EFRX-68L0BN1 5.46 TiB Asgard
  4. sdb WDC_WD60EFZX-68B3FN0 5.46 TiB Asgard
  5. sdc WDC_WD60EFPX-68C5ZN0 5.46 TiB Asgard
  6. sdd WDC_WD60EFPX-68C5ZN0 5.46 TiB Asgard
  7. sde WDC_WD60EFRX-68L0BN1 5.46 TiB Asgard
  8. sdf WDC_WD60EFPX-68C5ZN0 5.46 TiB Asgard
  9. sdg WDC_WD60EFRX-68L0BN1 5.46 TiB Asgard
  10. sdh WDC_WD60EFZX-68B3FN0 5.46 TiB Asgard

I’m not sure about XMP. I didn’t activate it, but let me check the BIOS.