ZFS_ARC_MAX issue - out-of-memory errors in kernel with Scale 24.04.1.1

Joel_Gray · June 9, 2024, 10:58am

I very recently swapped from Core to Scale along with a hardware change.

I will start by noting this is a fresh install, and I am running no apps.
I am seeing a strange behavior within scale causing the system to become unresponsive, or completely crash in some instances.

When performing a backup of a VM in proxmox to an NFS share based on a HDD dataset, with the source being a fast NVME stripe on the same machine, there is a stage of the backup process that reliably triggers out-of-memory messages ultimately causing middleware, nfsd, samba etc to be killed and the system to become unresponsive.

kernel: nfsd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0

This message is triggered by a large number of processes, not just nfsd.

My working theory after following a few threads in relation to changes to the 04.1.1 release is that ARC behavior has changed, but I cannot say this wasn’t an issue in prior version of scale as I did not run them.

I set zfs_arc_max to be 50% of my RAM capacity for testing, which does sort-of resolve the issue. With this parameter set, ssh remains up and barely responsive during the problematic part of the backup, and the backups will usually complete, however monitoring the free memory still shows a huge spike of at least 13gb of RAM usage up to the max available in my system (32gb), presumably by NFS. The GUI still crashes here, but most things shortly recover, and the backup does complete successfully in proxmox.

The same backup action used to cause no issues on Core on half of the available RAM of this system. All networking is 10gig. RAM is ECC and the pool is small so 32gb should suffice in this instance.

Something about the backup process from proxmox, specifically the end part where it states “backup is sparse” which I presume is where it is running compression, is where the ram usage usually takes off, and I am guessing ZFS cannot clear the space fast enough resulting in memory pressure?

From proxmox:
INFO: 100% (200.0 GiB of 200.0 GiB) in 3m 21s, read: 1.3 GiB/s, write: 8.0 KiB/s
INFO: backup is sparse: 181.15 GiB (90%) total zero data
–crash happens here without zfs_arc_max change–

I can replicate this reliably.

Whilst I’m a wintel engineer by profession, I am new to linux troubleshooting, so please go easy if I’m missing something.

Any thoughts on next steps are appreciated.

kris · June 9, 2024, 2:33pm

Can you get us a bug ticket and debug file from the system when it starts to become unstable? With no Apps, I would have to assume something is consuming memory and not releasing. (ZFS or NFS is likely). You can also kickoff an htop and watch it in real-time to see what is holding it.

sfatula · June 9, 2024, 7:02pm

As far as here on the forums, would be useful to see the complete Scale hardware list.

Joel_Gray · June 10, 2024, 1:06pm

Will do, I’ll be free to replicate the issue and pull debug files first thing tomorrow.

Cheers.

Spunky17 · June 10, 2024, 4:45pm

wondering if this is related to what i’m seeing. i’m getting similar symptoms with the GUI and SSH going unresponsive. my last crash i saw my ARC get up almost my entire memory footprint.

kris · June 10, 2024, 5:15pm

Hard to say. This htop screenshot doesn’t show a system that has run out of memory yet, it’s barely half used. ARC is by design supposed to use as much RAM as is available. Unused RAM is wasted RAM. But its also supposed to give it back and shrink as other services reserve more.

awalkerix · June 10, 2024, 5:21pm

How many VMs are you running on the server and how much memory have you commited to them?

Spunky17 · June 10, 2024, 5:48pm

I have 8 VMs with 64G commited to them.

Joel_Gray · June 11, 2024, 2:44pm

I think ARC using the whole of the available system memory is a good thing providing it works correctly.

If you check /var/log/messages are you seeing OOM-Killer invoked at all killing these processes due to memory pressure?

Joel_Gray · June 11, 2024, 3:05pm

Hi kris.

I spent some time replicating the issue tonight and got a bug ticket submitted.
I had to submit it post reboot as replicating the issue causes the system to crash to a grub prompt immediately now, so I cannot obtain it mid-issue. I hope that it is still of some use.

I’ve also replicated this behavior on a friends system running the same version of scale. We have similar setups.

If you need anything else let me know.
I obtained a htop maybe 15 seconds before the full system crash / ssh disconnection:
I think ARC was shrinking somewhat, but SSH wasn’t able to update quickly.

kris · June 11, 2024, 3:39pm

Thanks, got the ticket here. We’ll take a look at it!

Sid · June 14, 2024, 9:01pm

I believe I have the same issue.

Clean install, absolutely no additional appliances or VMs except NFS services.

After a few hours, i will get OOM errors for asyncio_loop & nfsd when doing heavy writes to pool through the NFS share.

Exact same setup worked perfectly on the same system, only started having issues since replacing the boot disk and doing a fresh install of TrueNAS Scale.

AMD Ryzen 5 3400G
48GB Ram
60 disks of 16TB each in a shelf + 1 x MIRROR | 2 wide | 465.76 GiB

gist.github.com

https://gist.github.com/SippieCup/0d1bb9ad13fd50d5f67f43a3525cd5b0

gistfile1.txt

Jun 02 16:37:28 furion kernel: kworker/3:3 invoked oom-killer: gfp_mask=0x2cc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN), order=0, oom_score_adj=0
Jun 02 16:37:28 furion kernel:  oom_kill_process+0xf9/0x190
Jun 02 16:37:28 furion kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 02 16:37:28 furion kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/middlewared.service,task=asyncio_loop,pid=11184,uid=0
Jun 02 16:37:28 furion kernel: Out of memory: Killed process 11184 (asyncio_loop) total-vm:3759084kB, anon-rss:302684kB, file-rss:4364kB, shmem-rss:0kB, UID:0 pgtables:1152kB oom_score_adj:0
Jun 02 16:37:28 furion systemd[1]: middlewared.service: Failed with result 'oom-kill'.
Jun 02 16:46:15 furion kernel: python3 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Jun 02 16:46:15 furion kernel:  oom_kill_process+0xf9/0x190
Jun 02 16:46:15 furion kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 02 16:46:15 furion kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/middlewared.service,task=asyncio_loop,pid=11150,uid=0

This file has been truncated. show original

Hope this helps.

kris · June 15, 2024, 11:21am

We believe (and have some reports that also seems to confirm) that this may be fixed in the upcoming 24.04.2. If you want to give a nightly a shot to confirm in your case, you can grab this update file and install it via the UI as a manual update.

https://update.sys.truenas.net/scale/TrueNAS-SCALE-Dragonfish-Nightlies/TrueNAS-SCALE-24.04.2-MASTER-20240614-013916.update

Migamix · June 15, 2024, 7:26pm

i think i should submit a bug report also, it appears setting the max doesnt always hold. i had mine as ~32G, a few days later it decided to ignore what i set in both command line and init, to 42+G, which caused an issue with launching VMs. look up my recent post to this for the OOM issues.
my short research has me flushing all cache, and it slowly rebuilding, but it needs to do as i say and not exceed what i set. (almost an apple/windows “we know best” scenario)

Joel_Gray · June 17, 2024, 2:54am

fast turnaround if so! thanks, I will see if I can get this file tested shortly.

das1996 · June 20, 2024, 10:01pm

Any progress/feedback? I’m seeing similar errors with 24.04.1.1.

Joel_Gray · June 23, 2024, 7:57am

I haven’t had a chance to test the nightly release yet.
The system has remained unstable with heavy NFS workloads but I am overhauling my pools currently, rebuilding and adding drives and I don’t want to be running the nightly version while doing so.

Has anyone else had a chance to try out the the linked 24.04.2 release?
When can we expect an official channel release?

Urworstnit3m3r · June 24, 2024, 2:41pm

I installed the patch but my issue seems to be different, about every 2 days NFS crashes and I have to reboot the server. the patch did not resolve that. I should note that this is a recent core to scale migrated server, I will probably have to do a backup and just install scale from the get go.

Jun 24 01:58:14 truenas-srv-01 kernel: task:nfsd            state:D stack:0     pid:6791  ppid:2      flags:0x00004000
Jun 24 01:58:14 truenas-srv-01 kernel: Call Trace:
Jun 24 01:58:14 truenas-srv-01 kernel:  <TASK>
Jun 24 01:58:14 truenas-srv-01 kernel:  __schedule+0x349/0x950
Jun 24 01:58:14 truenas-srv-01 kernel:  schedule+0x5b/0xa0
Jun 24 01:58:14 truenas-srv-01 kernel:  schedule_timeout+0x151/0x160
Jun 24 01:58:14 truenas-srv-01 kernel:  wait_for_completion+0x86/0x170
Jun 24 01:58:14 truenas-srv-01 kernel:  __flush_workqueue+0x144/0x440
Jun 24 01:58:14 truenas-srv-01 kernel:  ? __queue_work+0x1bd/0x410
Jun 24 01:58:14 truenas-srv-01 kernel:  nfsd4_destroy_session+0x1ce/0x2b0 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  nfsd4_proc_compound+0x356/0x680 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  nfsd_dispatch+0xee/0x200 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  svc_process_common+0x2f5/0x6f0 [sunrpc]
Jun 24 01:58:14 truenas-srv-01 kernel:  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  svc_process+0x131/0x180 [sunrpc]
Jun 24 01:58:14 truenas-srv-01 kernel:  nfsd+0x84/0xd0 [nfsd]
Jun 24 01:58:14 truenas-srv-01 kernel:  kthread+0xe5/0x120
Jun 24 01:58:14 truenas-srv-01 kernel:  ? __pfx_kthread+0x10/0x10
Jun 24 01:58:14 truenas-srv-01 kernel:  ret_from_fork+0x31/0x50
Jun 24 01:58:14 truenas-srv-01 kernel:  ? __pfx_kthread+0x10/0x10
Jun 24 01:58:14 truenas-srv-01 kernel:  ret_from_fork_asm+0x1b/0x30
Jun 24 01:58:14 truenas-srv-01 kernel:  </TASK>
Jun 24 01:58:14 truenas-srv-01 kernel: Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

the_man · June 24, 2024, 11:04pm

I experienced the same issue Sid had earlier in the thread. Tried the 24.04.2 Nightly, but it didn’t fix it.

After 6 hours of constant writes over an NFS share, Scale ran out of memory and terminated the asyncio_loop process and I was unable to log into web ui until I restarted.

I ended up going back to Core for the time being, so unfortunately I haven’t been able to capture any data following this occurrence.

Joel_Gray · June 25, 2024, 3:54am

Just to provide an update, I have also now had a chance to try the Nightly and am still seeing the out-of-memory errors under ongoing heavy NFS workloads.

Hopefully this can get some eyes on it, as I would suspect it will impact quite a lot of people’s workloads and is a difficult issue for less technical end users to identify.

I have more ram on the way, but suspect it will just be eaten up quickly as well given the issue isn’t present on Core.