TrueNAS reboots when starting docker stack

Ever since updating to 25.04, I have experienced reboots when starting my AI docker stack (Ollama, Open-WebUI, Speaches-AI, Paperless-AI).

This is only after my system has been up for a few days and my ZFS cache is at 50% of 64GB. After booting TrueNAS, I can start and stop the stack many times without issue.

I believe it's due to running out of memory, but any idea where to look to verify? I don't see anything of value in /var/log/messages.

I have set zfs_arc_max to 40% of my memory to see if that helps as well.
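For reference, a minimal sketch of how that cap works out in bytes (the sysfs path is the standard OpenZFS module-parameter location on Linux; on TrueNAS the supported route is a sysctl/init setting in the UI, so treat the write as illustrative only):

```shell
# 40% of 64GB of RAM, expressed in bytes for zfs_arc_max:
ARC_MAX=$((64 * 1024 * 1024 * 1024 * 40 / 100))
echo "$ARC_MAX"    # 27487790694 (~25.6 GiB)
# Apply at runtime as root (does not persist across reboots):
# echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
```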

ARC uses whatever memory is left over after all your apps run.

AI apps are notoriously memory-hungry and may also have memory leaks; it is possible (or perhaps likely) that an AI app will grow its memory usage to consume whatever is available.

If your apps use more memory than you actually have, then you get no ARC and reboots are probably the consequence. (And with a very small ARC you get terrible performance.)

If multiple AI apps try to use more memory than available they will clash with each other as well.

The solution is to set memory constraints on each of your docker containers such that the total memory used by your VMs and apps is (say) 8GB less than your system memory.
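As a hedged sketch of that budgeting (the container names and per-container limits below are placeholders, not recommendations), with the limits applied via `docker update --memory` or a `mem_limit:` entry in a compose file:

```shell
# Hypothetical per-container caps in GiB -- tune these for your own stack:
OLLAMA=24; OPENWEBUI=2; SPEACHES=4; PAPERLESS=2
TOTAL=$((OLLAMA + OPENWEBUI + SPEACHES + PAPERLESS))
RAM=64; HEADROOM=8   # keep at least 8GB outside the app limits
if [ "$TOTAL" -le $((RAM - HEADROOM)) ]; then
  echo "OK: ${TOTAL}GiB of limits leaves $((RAM - TOTAL))GiB for ARC and the OS"
else
  echo "Over budget: reduce the per-container limits"
fi
# Apply, e.g.: docker update --memory "${OLLAMA}g" --memory-swap "${OLLAMA}g" ollama
```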

Setting zfs_arc_max is not the solution, because it works on the opposite side of the problem from what is actually happening…


I have 64GB which is more than enough for my docker containers and it was never an issue on previous operating systems. The docker stack that causes issues is only using ~2GB memory when running.

When I have all my VMs and containers running, it takes up just over 50% of memory. In the brief moment the containers go down and release memory, I can see the ZFS memory usage go up before the containers try starting again and causing my system to reboot.

Since limiting the zfs_arc_max, restarting those containers no longer causes a reboot.
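One way to confirm the cap took effect is to compare `c_max` (the configured ceiling) with `size` (current ARC usage) in `/proc/spl/kstat/zfs/arcstats`; the snippet below feeds a sample of that file so it is self-contained:

```shell
# On a live system, replace the printf with: cat /proc/spl/kstat/zfs/arcstats
printf 'size 4 27487790694\nc_max 4 27487790694\n' |
  awk '/^(size|c_max) / { printf "%s = %.1f GiB\n", $1, $3 / 1024^3 }'
```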


Based on this description here is my theory (and it is only that)…

I have a feeling (though I am not 100% sure) that under FreeBSD the ARC was integrated with the kernel's memory management, so a memory request from an app would result in the ARC dynamically freeing memory.

However, OpenZFS is not as well integrated with the Linux kernel and its memory management. So ZFS adjusts the ARC size periodically to keep the free memory pool topped up.

Now if you start a bunch of containers that suddenly require a massive amount of memory, ARC may not be able to free up memory fast enough to stop the Linux free memory pool from becoming exhausted - and that might trigger a spontaneous reboot (though I cannot see why that is a sensible response).

There are a couple of OpenZFS tuneables which might help (and some that probably won’t but should be mentioned):

  • zfs_arc_sys_free - which defaults to 1/64 of installed memory (i.e. 1GB on your 64GB system), but you can change it to something higher (e.g. 4GB = 4294967296) and see if it helps.
  • zfs_arc_shrink_shift - the default value of this is 7, which corresponds to shrinking the ARC by 1/128 of its size each pass. Setting this to 6 will double the amount that the ARC shrinks by each time (1/64).
  • zfs_arc_pc_percent - TBH I don’t understand this one - the default is that it is disabled, and I think that changing this will slow down ARC memory being freed, so you probably don’t want to change this.
  • spl_kmem_* - there are several tuneables related to how the SPL kernel module allocates large memory blocks - and kmem_alloc warnings can be issued to the system console / dmesg if this happens too frequently or the allocations are too large - so you should probably see if you are getting any of these messages before the reboot (though a spontaneous reboot may cause these messages to be lost). My guess is that you shouldn’t mess with these unless a) you are getting kmem_alloc warnings and b) you absolutely know what you are doing.
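For example, the suggested 4GB value for zfs_arc_sys_free works out as below (the paths are the standard OpenZFS module-parameter locations on Linux; writing to them requires root and does not persist across reboots):

```shell
# 4GB in bytes for zfs_arc_sys_free:
ARC_SYS_FREE=$((4 * 1024 * 1024 * 1024))
echo "$ARC_SYS_FREE"    # 4294967296
# Inspect the current values, e.g.:
# cat /sys/module/zfs/parameters/zfs_arc_sys_free
# cat /sys/module/zfs/parameters/zfs_arc_shrink_shift
# Apply at runtime as root:
# echo "$ARC_SYS_FREE" > /sys/module/zfs/parameters/zfs_arc_sys_free
```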

This actually makes sense now that you have described what happens.

  1. Capping the ARC keeps a large amount of memory free for VMs and containers to use at will.
  2. An ARC size of c. 28GB is still pretty substantial, so you should get a decent (and more consistent) hit rate without allowing it to grow to c. 60GB when none of your apps and VMs are running.

There was recently a bug fixed where a type of VM memory wasn’t being tracked properly, thus causing the ARC to fail to yield.

Maybe this will help.

Interesting commentary starts here:

Probably in 25.04.1, so perhaps try again then?