Memory issues with Ubuntu VM after Core->Scale migration

I’ve had this issue since my main Ubuntu VM (hosting Docker containers) was migrated from bhyve to KVM during my CORE->SCALE migration:
The whole system, SCALE + VM, consistently hogs the entire host memory.

The VM has 24 GB of RAM and has never used it all in the past.
Now I get service interruptions because the memory gets entirely used up, as well as the 4 GB of swap:

kernel: systemd-journald[397]: Under memory pressure, flushing caches.
2024-09-20T06:03:58.263127+02:00 host kernel: message repeated 19 times: [ systemd-journald[397]: Under memory pressure, flushing caches.]
2024-09-20T06:03:59.525922+02:00 host kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 512 times, consider switching to WQ_UNBOUND
2024-09-20T06:04:08.692272+02:00 host kernel: systemd-journald[397]: Under memory pressure, flushing caches.

Concurrently, the SCALE host also uses up its 64 GB of RAM, though that doesn’t seem to cause any issue AFAIK.

I’m having a hard time identifying where this comes from: shutting down the VM does reclaim memory, but not that much (from 59 GB to 44 GB as indicated by htop), and the remaining RAM usage doesn’t seem to be attributable to anything.
I can’t find a process using much RAM (k3s is the top one with 0.7% usage; middlewared comes first when sorting by VIRT, with only 4 GB).

AFAIK only rebooting the SCALE host resolves the memory usage issue.
Restarting the VM just makes it start with very high memory usage right from boot (about 19 GB, where it normally starts at around 8 GB of RAM).

I upgraded TrueNAS SCALE to the latest patch less than 24h ago, and the RAM was all used up again by the next morning.

I can’t really afford to leave the VM off for 24h to see whether it is the source of the memory leak, but it seems obvious that it is, even though stopping it doesn’t reclaim the memory.
I don’t think a TrueNAS SCALE system with no VM would leak memory like this.
(Mine does cloud sync and snapshot syncs at night, and the VM doesn’t do any heavy data processing either.)

Are you mistaking the ARC filling up for a memory leak?
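One quick way to check that on the SCALE host is to read the ARC size straight from the OpenZFS kstats (arc_summary reports the same thing in more detail). A rough sketch, assuming the standard /proc/spl/kstat/zfs/arcstats path:

    # Rough sketch: read the current ZFS ARC size from the OpenZFS kstats.
    # Assumes /proc/spl/kstat/zfs/arcstats exists (standard on TrueNAS SCALE).
    def arc_size_bytes(path="/proc/spl/kstat/zfs/arcstats"):
        with open(path) as f:
            for line in f:
                fields = line.split()
                # Data lines look like: "size  4  51711947904"
                if len(fields) == 3 and fields[0] == "size":
                    return int(fields[2])
        return None

    size = arc_size_bytes()
    if size is None:
        print("arcstats not found or no 'size' field")
    else:
        print(f"ARC size: {size / 2**30:.1f} GiB")

The ARC is kernel memory, so it never shows up as any process’s RSS in htop.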

The ARC isn’t using up the VM’s memory; it’s the VM’s memory that is leaking.

You need to look inside the VM, not in SCALE, at its memory usage with htop, etc., to see what inside the VM is using it.

That’s what I’ve been doing for the past few days, roaming the web for ways to get information about the used memory.

But I can’t find any useful way to understand what’s going on in this VM.
htop, smem, /proc/meminfo, etc. all show very low usage compared with the 24 GB available, which is almost entirely used up along with the 4 GB of swap. At most I can account for 20% of the used memory in running processes.
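Here is roughly how I’ve been tallying it: compare MemTotal against what the usual tools can account for (free memory, buffers/cache, reclaimable slab, plus the summed RSS of every process). Whatever is left over is held by something that owns no userspace RSS. A rough sketch (shared, file-backed pages can be counted twice, so treat the result as an estimate only):

    # Rough tally of "unaccounted" memory inside the guest: MemTotal minus
    # free/cache/slab and the summed RSS of all processes. A large remainder
    # means memory is held by something that has no userspace RSS.
    import glob

    def meminfo_kb():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, val = line.split(":", 1)
                info[key] = int(val.split()[0])  # values are in kB
        return info

    def total_rss_kb():
        total = 0
        for status in glob.glob("/proc/[0-9]*/status"):
            try:
                with open(status) as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            total += int(line.split()[1])
                            break
            except OSError:
                pass  # process exited while scanning
        return total

    m = meminfo_kb()
    accounted = (m["MemFree"] + m["Buffers"] + m["Cached"]
                 + m.get("SReclaimable", 0) + total_rss_kb())
    print(f"MemTotal    : {m['MemTotal'] / 2**20:.1f} GiB")
    print(f"Accounted   : {accounted / 2**20:.1f} GiB")
    print(f"Unaccounted : {(m['MemTotal'] - accounted) / 2**20:.1f} GiB")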

If I stop all the containers, the VM goes back down to about 15 GB of usage, which is still a huge amount, since the system is then close to a vanilla Ubuntu 24.04. Swap, however, returns to almost 100% free.

I suppose it could be expected that the memory doesn’t show up as reclaimed in htop. It has to be a container that is “leaking” memory.
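To narrow down which container, per-container memory can be sampled before stopping everything. A small sketch shelling out to docker stats, assuming the Docker CLI is available inside the VM:

    # Sketch: sample per-container memory via `docker stats --no-stream`.
    # Assumes the Docker CLI is installed and the user can reach the daemon.
    import subprocess

    out = subprocess.run(
        ["docker", "stats", "--no-stream",
         "--format", "{{.Name}}\t{{.MemUsage}}"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in sorted(out.splitlines()):
        name, mem = line.split("\t")
        print(f"{name:30s} {mem}")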

It’s odd, though, that it started suddenly after migrating to SCALE. But it could be a coincidence, or a slight change in a container config.

Then again, migrating from bhyve to KVM must have some implications for the way the system reports memory usage and processes, and, why not, for its behavior too.

Months after the upgrade, I’m still stuck with an Ubuntu VM hogging memory.

After removing as many processes as possible, it still consumes >20 GB of RAM.

I’m unable to identify what is using the memory, whichever tool I use (top, htop, vmstat, /proc/meminfo, sar, smem… nothing shows up).

The only thing I can see is a process using 100% CPU at boot for ~30 s, which is exactly how long it takes for the used memory to climb to 22 GB in htop.
It’s a kworker/5:1+events_freezable.
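One way to get an idea of what a kworker is busy with is to sample its kernel stack via /proc/<pid>/stack (root required, and the file is only there if the kernel exposes stack traces). A rough sketch, with the kworker’s PID passed as an argument:

    # Sketch: repeatedly sample a kworker's kernel stack to see which
    # functions it is spending time in. Requires root; /proc/<pid>/stack
    # is only available when the kernel exposes stack traces.
    import sys, time

    pid = sys.argv[1]  # PID of the kworker thread, taken from htop

    for _ in range(20):
        try:
            with open(f"/proc/{pid}/stack") as f:
                print(f.read().strip())
                print("----")
        except OSError as e:
            print(f"could not read stack: {e}")
            break
        time.sleep(0.5)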

Dmesg/journalctl don’t show anything meaningful to me.
Maybe “workqueue: update_balloon_size_func hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND” has something to do with it, as well as “workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND”, but I see these as consequences rather than causes.

See the htop snapshot below for a look at the nearly empty running system and its RAM consumption:

If you have set a “minimum” amount of RAM in the VM config… don’t.

It uses “memory ballooning”.
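If you want to see what the balloon is doing from the SCALE side, libvirt exposes the balloon target and guest statistics. A sketch along these lines, assuming virsh is available on the host and the VM’s libvirt domain name is known (depending on how SCALE runs libvirtd, you may also need to point virsh at the right connection URI with -c):

    # Sketch: query libvirt's balloon statistics for a domain via virsh.
    # Replace DOMAIN with the VM's libvirt domain name (`virsh list --all`).
    import subprocess

    DOMAIN = "ubuntu-vm"  # hypothetical name, adjust to your VM

    out = subprocess.run(
        ["virsh", "dommemstat", DOMAIN],
        capture_output=True, text=True, check=True,
    ).stdout

    # "actual" is the current balloon target (what the guest may use) and
    # "rss" is what the QEMU process itself holds on the host, both in KiB.
    print(out)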


Very good catch, I must say; it solved the issue instantly. THANKS!!

I thought I had actually started using that option while searching for a solution to this memory usage, though. It must have been activated before that, then.
Is the option also available on TrueNAS CORE?
Or maybe the “memory ballooning” behavior arrived in TrueNAS SCALE 24.10?

Anyway, from what I’ve read about “memory ballooning”, it shouldn’t have caused an issue.
And strictly speaking it didn’t; it’s just that whatever amount of RAM I gave the VM, it always seemed to use it all. I set 10 GB as a minimum, but as shown, 22 GB were used at startup.

Is there any way to see that “memory ballooning” is at play from within a VM?

It’s a QEMU feature, so it’s been available in SCALE for a while.

It requires a virtio driver in the guest to work correctly (which would already be installed).
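From inside the guest, the simplest sign is whether the virtio_balloon driver is present and bound to a virtio device. A small sketch checking /proc/modules and the virtio bus in sysfs (if the driver is built into the kernel it won’t appear in /proc/modules, but the sysfs entry should still be there):

    # Sketch: check from inside the guest whether the virtio balloon
    # driver is loaded and bound to a device.
    import glob

    def balloon_module_loaded(path="/proc/modules"):
        try:
            with open(path) as f:
                return any(line.split()[0] == "virtio_balloon" for line in f)
        except OSError:
            return False

    # Devices bound to the balloon driver appear under its sysfs directory.
    bound = glob.glob("/sys/bus/virtio/drivers/virtio_balloon/virtio*")

    print("virtio_balloon module loaded:", balloon_module_loaded())
    print("balloon device(s) bound     :", bound or "none")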

In my experience it does not work well.

In any case, it feels a lot better to see the memory available again.
I wouldn’t be able to say exactly how it affected the services I run, though my first post suggests it was causing memory issues…