SMB service suddenly stops without any errors, manual restart needed

Upgraded from Scale v22.x to latest v24 yesterday.
Have a very simple setup with only SMB shares.

Suddenly the SMB service stops after some time, and I cannot see any errors in the audit. The NFS service is still running, but has no shares.

After manual restart of the SMB service it will work for hours, and then just stop again.

How do I find out what causes the service to stop?

/var/log/samba4/log.smbd may have some info, though can’t say it’s something I’ve looked at before.


Nothing in that log, besides multiple warnings about a deprecated log setting. However, those warnings stop at 4 in the night and return after 10 am when I restart the SMB service, so at least I know what time the service fails… I just don't know why…

Syslog revealed out-of-memory errors, and that the memory manager had shut down services.
I did not have any problems with v22, although I was already pushing it to the limit with 8 GB. It is running in Proxmox, so I have now upped it to 12 GB (non-ballooning).
Same error after a couple of hours.
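For anyone hitting the same thing, a quick way to confirm it is the kernel OOM killer (a sketch; the sample log line below is fabricated for illustration, and the real log path varies between releases):

```shell
# Fabricated example of the kind of line the OOM killer writes; on the NAS,
# search the real logs instead (e.g. /var/log/syslog or `journalctl -k`).
printf 'kernel: Out of memory: Killed process 4321 (smbd)\n' > /tmp/oom-sample.log
grep -ci 'out of memory' /tmp/oom-sample.log    # counts matching lines; prints 1
# Real search on the NAS would look like:
#   grep -i 'out of memory' /var/log/syslog
```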

Found a bug report that might be related - NAS-128788
It states, however, that it was resolved in 24.04.1 - which is the version I am running…

Is there any way to downgrade to v23?

Make sure you are running at least 24.04.1

You could reduce zfs_arc_max to 50% of your RAM.

If you still have issues, try a 24.04.2 nightly to see if it’s a kernel issue. (New kernel in nightly)


Thanks, I followed the advice in TrueNAS - Issues - iXsystems TrueNAS Jira

And did this in the console…

ARC_PCT=50   # percent of RAM to allow for ARC (the snippet as posted left this unset)
ARC_BYTES=$(grep '^MemTotal' /proc/meminfo | awk -v pct=${ARC_PCT} '{printf "%d", $2 * 1024 * (pct / 100.0)}')
echo ${ARC_BYTES} > /sys/module/zfs/parameters/zfs_arc_max

I did, however, not take a note of the original value - so I can't revert to the original setting. Will this revert with the next update?

It seems, however, to have solved it, so now services have quite a lot of RAM available (and SMB has not crashed yet) and the ZFS cache a lot less.
I assume this is a performance hit for disk I/O?
But better than crashing services.

I think the default is 0, which means "use the built-in default" - up to 100% of your memory. And I think it resets on restart.
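A small sketch of what that means in bytes (the 12 GiB figure is just the amount mentioned earlier in the thread, and 0 as the reset value is the stock "auto-size" setting as far as I know):

```shell
# Restoring the default does not require knowing the previous value: writing
# 0 tells ZFS to size ARC itself again (run as root on the NAS):
#   echo 0 > /sys/module/zfs/parameters/zfs_arc_max
# For comparison, a 50% cap on a 12 GiB VM works out to:
ARC_BYTES=$((12 * 1024 * 1024 * 1024 * 50 / 100))
echo ${ARC_BYTES}    # 6442450944 bytes, i.e. 6 GiB
```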


SMB crashed again after a couple of hours.
Either you are right that a reboot resets the ARC setting to default (I did reboot after applying it, though memory management seemed different afterwards, since services had more memory), or there is another major bug.

Yes, a reboot does undo this. You should set a post-init task if you want it to persist on reboot. You don’t need to reboot for it to take effect.
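A minimal post-init sketch along those lines, reusing the same /proc/meminfo arithmetic as the snippet earlier in the thread (the percentage is an assumption, and the actual sysfs write is left commented so the script is safe to dry-run):

```shell
#!/bin/sh
# Hypothetical post-init script: cap zfs_arc_max at ARC_PCT of RAM each boot.
ARC_PCT=50   # assumed percentage; pick whatever value you settled on
ARC_BYTES=$(awk -v pct=${ARC_PCT} '/^MemTotal/ {printf "%d", $2 * 1024 * pct / 100}' /proc/meminfo)
echo "computed cap: ${ARC_BYTES} bytes"
# On the NAS itself (as root), apply it with:
#   echo ${ARC_BYTES} > /sys/module/zfs/parameters/zfs_arc_max
```

In SCALE, a script like this can be registered under System Settings → Advanced → Init/Shutdown Scripts with type "Post Init" so it reapplies on every boot.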

I wouldn’t think ARC using the majority of memory would be causing issues as it should resize dynamically as needed. I’ve demonstrated this in previous posts quite a few times, and I have never managed to cause an OOM condition despite attempts.

How much memory do you have?

As opposed to using zfs_arc_max, maybe take a look at zfs_arc_sys_free, this is the number of free bytes that ARC should leave as free memory on the system. By default I believe it’s 1/64th of the total memory capacity (so 128GiB would be 2GiB free), but you could nudge this to test whether free memory availability is actually the issue. This way if services start eating up memory ARC won’t grow beyond where you want it.
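To make the 1/64th figure concrete (a sketch; the divisor is my reading of the stated default, and the 8 GiB value is just the original RAM size from this thread):

```shell
# 1/64th of total RAM in bytes, for a hypothetical 8 GiB machine:
MEM_BYTES=$((8 * 1024 * 1024 * 1024))
SYS_FREE=$((MEM_BYTES / 64))
echo ${SYS_FREE}    # 134217728 bytes, i.e. 128 MiB kept free for the system
# Nudging it upward to test the free-memory theory (root, on the NAS):
#   echo $((SYS_FREE * 4)) > /sys/module/zfs/parameters/zfs_arc_sys_free
```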


Also, now that you roughly know when the service stops, have you checked Netdata stats to see what CPU usage, memory usage, etc. look like during this time?

Assuming smbd is getting killed by the OOM killer (should be visible in /var/log/messages), it would be a good idea to investigate what is using memory and triggering the OOM condition. You can probably review this with htop/top, sorting by RES. Some variants of this can be misleading, as they present separate entries for threads in multithreaded apps (making middlewared appear to take up staggeringly large amounts of memory, when in reality the memory is shared between the threads).
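For the sorting-by-RES step, a quick non-interactive alternative to htop (a sketch; `smbd` in the second command is just an example process name):

```shell
# Top memory consumers by resident set size (RSS, in KiB), largest first:
ps -eo pid,rss,comm --sort=-rss | head -n 10
# Sum RSS across all processes sharing a name (crude, and can overcount
# shared memory, which is exactly the multithreading caveat above):
ps -eo rss,comm | awk '$2 == "smbd" { total += $1 } END { print total+0, "KiB" }'
```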