The smbd issue where it could go OOM is fixed in 24.10.
This issue was never solved for me either.
On the latest release it’s still not just SMB but also NFS that causes these OOM crashes every time any real load is placed on the TrueNAS server. The underlying ZFS memory behaviour is still broken.
I’ve had to write it off and move to another solution for now, as there was no traction on getting it resolved, but I still have the TrueNAS server and will re-test 24.10 when it releases, just in case.
What solution did you end up going with? I just went back to Core hoping this would get fixed soon, but I’m losing confidence it will.
TrueNAS 24.10 was just released. We don’t have OOM reports for it so far. If somebody still does, please open a ticket with the relevant details.
I’m holding off until the next release; it looks like there may be a hotfix in the works already: SMB service randomly turns off due to going OOM - #12 by awalkerix
Really hoping this does the trick, I’m excited to get off TrueNAS Core.
Yeah, that fix will land in the hotfix release later this week. So far it’s the only OOM we’ve seen, and it wasn’t ZFS related at all; it was Samba-specific for that particular workload.
Hi,
TrueNAS SCALE EE 24.10 here.
Dec 21 13:49:45 truenas kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=df25152698d89dbaacac10f583933d03df2d9fe334e721676fa6a7bc8d3dcb90,mems_allowed=0,oom_memcg=/docker/df25152698d89dbaacac10f583933d03df2d9fe334e721676fa6a7bc8d3dcb90,task_memcg=/docker/df25152698d89dbaacac10f583933d03df2d9fe334e721676fa6a7bc8d3dcb90,task=ffmpeg,pid=3524723,uid=0
It might be related to the same issue.
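For what it’s worth, constraint=CONSTRAINT_MEMCG together with the task_memcg=/docker/… path means this kill was triggered by that one container hitting its own cgroup memory limit, not by the whole system running out of RAM. A quick way to pull every such event from the kernel log for comparison, using nothing TrueNAS-specific:

```sh
# List every OOM-kill event in the kernel log, with some context
# around each one to see which cgroup and task were involved
journalctl -k | grep -B 2 -A 10 -i 'oom-kill'
```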
@Momos Do you have any memory/ARC statistics from the time it happened? With TrueNAS no longer having swap partitions, any application that misbehaves or simply requires too many resources can end up in that position. All TrueNAS can and should do is try to shrink the ARC as much as possible, but there is only so much it can do.
Was there transcoding going on via Plex or Jellyfin around the time it happened?
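If it helps, here is a rough way to capture the ARC size against its limits, plus the overall memory picture, at a given moment (standard OpenZFS kstats and tools; treat it as a sketch):

```sh
# Current ARC size and its configured floor/ceiling, in bytes
grep -E '^(size|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats
# Overall system memory picture
free -h
# Friendlier ARC report, if you prefer
arc_summary -s arc
```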
Uncertain if this is related or just useless information, but I’ve noticed that every time I shut down or stop a VM, the ARC max resets from the value I set manually on boot. I’m a few versions behind because I don’t currently have time, though; maybe this is already patched out.
No. The Frigate app was running.
Happened again:
Dec 23 02:45:51 truenas kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=9704b43b3df9625462a22e37d267fa101cd2129fbd0689288628f80046bb7d34,mems_allowed=0,oom_memcg=/docker/9704b43b3df9625462a22e37d267fa101cd2129fbd0689288628f80046bb7d34,task_memcg=/docker/9704b43b3df9625462a22e37d267fa101cd2129fbd0689288628f80046bb7d34,task=python3,pid=1115451,uid=0
No idea what is causing it :( but the Frigate app crashes and doesn’t restart; it gets stuck at deploying. Then it works fine again for a few days/hours.
How much RAM did you give Frigate, and do you have other apps?
And again it happened.
Dec 23 12:39:30 truenas kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=c275e4dd561d2a2d387c61e519a482a4369d58be26f8a35a4462305637bbec17,mems_allowed=0,oom_memcg=/docker/c275e4dd561d2a2d387c61e519a482a4369d58be26f8a35a4462305637bbec17,task_memcg=/docker/c275e4dd561d2a2d387c61e519a482a4369d58be26f8a35a4462305637bbec17,task=ffmpeg,pid=276860,uid=0
This is my app list; the apps aren’t eating much from my system.
I gave the Frigate app 12 GiB, and during this last crash Frigate was still up and running. There does not seem to be a common denominator.
PS: a side question about memory allocation for apps in EE: should the sum of allocated memory stay under the total system memory, or does it allocate based on need? As far as I can tell, the second seems to hold true.
I don’t think I have a memory issue.
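On that side question: as far as I can tell the per-app limit behaves as a cgroup cap rather than a reservation (which matches the CONSTRAINT_MEMCG lines above), so the sum of the limits can exceed physical RAM. Since EE runs apps on Docker, one way to compare each app’s live usage against its configured limit is:

```sh
# One-shot snapshot of per-container memory usage vs. its limit
docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
```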
Maybe create an SSH session and run top -o RES to watch resident memory for applications before the OOM killer gets triggered. Typically the problem application will float to the top of that output.
I can run this in a PuTTY window from a Windows PC. Is there a way to autosave the results to a file every minute, for example? I can’t stare at the screen until something bad happens…
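One way to do that (an untested sketch; adjust the log path and interval as needed) is to run top in batch mode and append a timestamped snapshot to a file:

```sh
# Log the top memory consumers once a minute, with a timestamp,
# so there is a trail to read back after the next OOM kill.
LOG=/root/memwatch.log   # example path, put it wherever convenient
while true; do
    date >> "$LOG"
    top -b -n 1 -o RES | head -n 20 >> "$LOG"
    sleep 60
done
```

Running it under nohup (or inside a tmux/screen session) keeps it going after the PuTTY window is closed.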
And another crash:
Dec 23 19:15:28 truenas kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=c275e4dd561d2a2d387c61e519a482a4369d58be26f8a35a4462305637bbec17,mems_allowed=0,oom_memcg=/docker/c275e4dd561d2a2d387c61e519a482a4369d58be26f8a35a4462305637bbec17,task_memcg=/docker/c275e4dd561d2a2d387c61e519a482a4369d58be26f8a35a4462305637bbec17,task=python3,pid=3474063,uid=0
Of course I wasn’t watching the top screen.
Does the issue persist if you limit the memory to 8 GiB? Maybe that will work around it. It looks like you also have a VM consuming RAM, so it may be a combination of factors leading to resource exhaustion.
I’m sure it is a combination of factors.
Things I tried: setting the Frigate app memory to 6 GiB, then to 8 GiB, playing with the number of cores allocated from 4 to 8, and combinations of the above.
I think it might have something to do with HW acceleration not working as intended on the iGPU. The Frigate app shows only 1% on the Intel GPU, so most of the processing probably goes to the CPU.
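If libva-utils and intel-gpu-tools happen to be available on the host (they may not be on a stock SCALE install, so treat this as optional), two quick checks can confirm whether hardware decode is actually in play:

```sh
# Check that VAAPI can see the iGPU at all
vainfo
# Watch live per-engine utilisation; working hardware decode shows
# activity on the Video engine while Frigate's ffmpeg is running
intel_gpu_top
```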
Will keep testing and report.