ZFS_ARC_MAX issue - out-of-memory errors in kernel with Scale

Please report this as a separate bug and start a separate thread.

Frankly, 2.5 TB is a very large single file, and I’m not sure what the limits are with either Samba or ZFS.

It’s important to specify your RAM size and record size… my guess is that file size limits will depend on RAM.
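For reference, a quick sketch of how to collect those two numbers (tank/backups is a placeholder dataset name; the zfs command is only present where ZFS is installed):

```shell
# Installed RAM, straight from the kernel.
grep MemTotal /proc/meminfo

# Per-dataset record size and compression ('tank/backups' is a placeholder --
# substitute your own pool/dataset). No-op on systems without ZFS.
command -v zfs >/dev/null && zfs get recordsize,compression tank/backups || true
```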

I will get a further bug report submitted for the audit logs.

ZFS supports files up to 16 exbibytes; I don’t think I am at risk of exceeding that.
As for Samba, I took it out of the mix by attempting to delete the same file from a shell on the NAS, with the same result.

A file deletion should not be a particularly I/O-intensive operation, and there is no reason it should need a large amount of RAM to complete. As for creating the file, TrueNAS doesn’t know it’s going to be 2.5 TB; it crashes randomly, but usually before 20–30 GB have been written. Something about certain types of data, usually backup files from various systems, is not playing nicely, in this case Veeam and previously Proxmox.

The system currently has 32 GB of ECC RAM with more on the way, no VMs, and is supporting a small 24 TB RAID-Z2 pool plus a 4 TB NVMe pool.

Is that a single 2.5 TiB file, or a few million files within a directory? For example, macOS sparsebundle volumes are actually directories containing tons of files. A single file should not generate very many audit messages, but if the backup client is writing to millions of files at once, then you’ll end up with millions of audit entries. One workaround in this case is to whitelist the particular account used for backups so that it bypasses auditing over SMB (this is exposed in the UI).

Prefer a separate thread… this is highly unusual and may be hardware-related. Full hardware specs would be useful, and check for any other errors.

In response to @awalkerix, it was just a single 2.5 TB Veeam compressed file.
I have not been able to replicate this easily, but making a backup that big without causing snapshot headaches is difficult. I think it was just due to memory pressure at the time, which crashed everything.

I am still seeing intermittent OOM errors during intensive operations, albeit less frequently, with the zfs_arc_shrinker_limit and ARC percent changes. Pushing up to 24.04.2 tonight. It occurs approximately once every 3–4 days and does not coincide with any scheduled tasks, only with heavy usage of the pool. While it may be a red herring, it does seem to occur more with pre-compressed content.

@Joel_Gray I was following your issue for a while, and I was hoping this would be resolved in 24.04.2.

Unfortunately, I can confirm the issue is still there. In my case it is not NFS but SMB that causes the OOM killer to kill the SMB process and other high-memory processes.

I have created a new issue [NAS-129987] - iXsystems TrueNAS Jira. Fingers crossed this can be fixed soon.


@sra thanks for logging the ticket. For reference, on the latest version I can reproduce the same issues you are facing with SMB as well, still with no VMs etc. running.

I have unfortunately gone from an extremely stable CORE installation that never had so much as a hiccup to fatal issues occurring almost daily under any sort of demanding load.

I am certain this is not hardware-induced; the same memory behaviour is observable across the different workload failures.

The NFS case was resolved, at the expense of huge ARC dips, by the zfs shrinker setting with no limit on it, but that didn’t resolve the SMB use case, and running the default settings on the latest release still causes a lot of problems with both.

Hopefully some real testing can be done around this, as in my opinion it should not have been considered a stable release candidate.

Joel, do you have a ticket you are working on with us directly? This is a very active investigation on our part, and we suspect there is a combination of factors in play here that leads to this behavior, especially since there is a really low number of reports and we don’t have a reproduction case when we put systems through our own performance stress testing. Once we nail down what the specific variables in play are, it’ll make resolution that much easier :slight_smile:

I was having the same issue with my Samba service being killed, but adding echo "0" > /sys/module/zfs/parameters/zfs_arc_shrinker_limit to an init script seems to have fixed it.
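For anyone wanting to try the same, a minimal sketch (the sysfs path matches current OpenZFS; the guard makes it a harmless no-op where ZFS isn’t loaded). On SCALE this can be added in the UI under System Settings → Advanced → Init/Shutdown Scripts as a Post Init command:

```shell
# Zero the ARC shrinker limit so the kernel can reclaim ARC without the
# per-call cap. Guarded so this does nothing if the ZFS module is absent.
PARAM=/sys/module/zfs/parameters/zfs_arc_shrinker_limit
if [ -w "$PARAM" ]; then
    echo 0 > "$PARAM"
    cat "$PARAM"    # confirm the new value
else
    echo "zfs module parameters not present; nothing to do"
fi
```

Note this resets on reboot, which is why it needs to live in an init script rather than being run once by hand.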

I run TrueNAS in a VM on a Dell R730xd running Proxmox with 384 GB of RAM. I have 96 GB allocated to TrueNAS, and there are no VMs or apps running within it. Disks and NVMe drives are passed through directly to the TrueNAS VM.

The service would be killed during backups of Proxmox VMs to a share on TrueNAS. Since the data never leaves the machine, the read/write load on the system isn’t bottlenecked by the network. No other network-based backup seemed to cause this, even over 10 Gbit.

I never had the issue on older versions of TrueNAS, or any issues for that matter.

Even though I haven’t had the issue in weeks (18 days, to be exact), should I still submit a ticket? I’m not sure whether the configuration it contains might help identify commonalities between setups having this issue.

@Milkysunshine In the OOM killer messages in your /var/log/kern.log you should see the amount of memory used by each process when it happened. We saw several reports where the killed Samba process occupied many (8–16) gigabytes of RAM. That should not be normal and may be a valid reason for it to be killed. Our services team has been notified and is collecting the data, so if you have one more data point, it would be good to know. Otherwise, if setting zfs_arc_shrinker_limit to 0 fixed the problem, then there is not much to do, since that change is part of the already-released TrueNAS 24.04.2.

cat kern.log | grep oom
Jun 22 08:31:01 zion kernel: DBENGINE invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-900
Jun 22 08:31:01 zion kernel:  oom_kill_process+0xf9/0x190
Jun 22 08:31:01 zion kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 22 08:31:01 zion kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/smbd.service,task=smbd[,pid=9436,uid=1000
Jun 22 08:31:01 zion kernel: Out of memory: Killed process 9436 (smbd[ total-vm:3987592kB, anon-rss:3881956kB, file-rss:3072kB, shmem-rss:16836kB, UID:1000 pgtables:7788kB oom_score_adj:0
Jun 22 08:31:03 zion kernel: oom_reaper: reaped process 9436 (smbd[, now anon-rss:0kB, file-rss:3072kB, shmem-rss:16836kB
Jun 29 08:54:05 zion kernel: smbd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Jun 29 08:54:05 zion kernel:  oom_kill_process+0xf9/0x190
Jun 29 08:54:05 zion kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 29 08:54:05 zion kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/smbd.service,task=smbd[,pid=9153,uid=1000
Jun 29 08:54:05 zion kernel: Out of memory: Killed process 9153 (smbd[ total-vm:5854592kB, anon-rss:5756240kB, file-rss:3840kB, shmem-rss:10900kB, UID:1000 pgtables:11444kB oom_score_adj:0
Jun 29 08:54:07 zion kernel: oom_reaper: reaped process 9153 (smbd[, now anon-rss:0kB, file-rss:3148kB, shmem-rss:10900kB

Is this useful?
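For anyone else comparing numbers, a hedged sketch for boiling log lines like the above down to just the killed process and its resident memory (the default log path and line format are assumptions taken from the sample):

```shell
# Summarize OOM kills from a kernel log whose lines look like the samples
# above. Pass a path as $1, or it falls back to /var/log/kern.log.
LOG=${1:-/var/log/kern.log}

oom_summary() {
    # Emit one "pid=<pid> name=<comm> anon_rss_kB=<kB>" line per kill event.
    grep 'Out of memory: Killed process' "$1" |
        sed -E 's/.*Killed process ([0-9]+) \(([^ ]+).*anon-rss:([0-9]+)kB.*/pid=\1 name=\2 anon_rss_kB=\3/'
}

[ -r "$LOG" ] && oom_summary "$LOG" || true
```

On the log above this would show smbd holding roughly 3.9 GB and then 5.8 GB of anonymous RSS at the moment of each kill.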

Can you send me a debug via private message.

My trust level is too low to send PMs

You can create a jira ticket and I’ll communicate with you through it. This is only to investigate root cause of the SMB process high memory usage.

Same here… Ever since the last update or the one before, TrueNAS SCALE has been crashing very frequently, usually without any visible message but sometimes with an OOM message of some sort.

I’ve been running TrueNAS on Proxmox for years without a hitch, but the last few weeks have been nothing but problems.

I have 24 GB committed to TrueNAS, and the host has plenty of memory free. I used to have a swap size of only 2 GB but have since upped it to 24 GB. Same result.

Intermittent crashes, usually when I’m writing a lot to the NAS.

I did do a complete fresh reinstall of the VM on this latest version.

Maybe check RES memory for smbd processes if you’re using the SMB service for loopback mounts from apps or connections from the Proxmox host. Some backup applications appear to queue up writes to the SMB server indefinitely, and over a very fast link (like loopback) this can end up with excessively large smbd queue depths waiting on writes to complete.
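A quick way to watch that, using standard procps ps options (adjust the process name if your Samba daemon runs under a different comm):

```shell
# List smbd workers, largest resident set first. RSS and VSZ are in kB;
# a worker holding many GB of RSS matches the reports in this thread.
ps -C smbd -o pid,rss,vsz,etime,args --sort=-rss || echo "no smbd processes running"
```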