Anybody still experiencing freezing Linux VMs on CORE?

Hi all,

on the old forum we had quite a few users whose Linux VMs occasionally froze for no apparent reason. Looks like some good soul in the small bhyve community I am a part of found the root cause and a solution.

Linux uses the RDRAND instruction to generate randomness, and it doesn’t like it if the virtual CPU just pauses for a moment because the host has run out of entropy.

Whether this ever occurs depends on the CPU architecture and generation, and that’s why I, among others, could never reproduce the problem.

The solution is to add the virtio-rnd device to the VM. This is currently not supported by either middlewared or the UI, but if someone still experiencing the problem is willing to try, I have a gross hack that would add the device to the bhyve command line (my Python is not good enough to do it properly), but if successful we could file a ticket in JIRA.

Kind regards,
Patrick

Moved on to SCALE :confused:

I’ve been running a FreePBX VM on one of my CORE servers for a few years now and haven’t really ever seen a freeze.

My VMs only get light use, so I’ve not encountered freezes. But if there’s a simple way to generate the problem, I’m willing to test the change with bhyve running as a nested VM on a virtual instance of current CORE or BETA.

Someone posted this on the net last October: Bhyve CPU Lockup Issue | Thoughts | Yin Jun, Phua -- Assistant Professor at Tokyo Tech

If your host CPU supports the RDRAND instruction, you won’t experience the problem. That’s why it only hit some people. If the CPU doesn’t, the instruction is emulated in bhyve, and if the host runs out of entropy the virtual CPU might be paused for a couple of milliseconds. Linux as a guest doesn’t like that and locks up.

That’s how I understood the problem; I might not have got all the details correct, but I should have got the essential mechanism: the host pausing to collect entropy, and the guest CPU pausing in turn.

That’s why we need someone from the original crowd who can reproduce the issue.

The solution is to hack the middleware to add -s 28,virtio-rnd to the bhyve command line and I think that’s really all.

I did not understand the code well enough to correctly add another device. Any takers?

For a quick test - if we find a volunteer - I would hack the VNC configuration section in the middleware to just add the parameter mentioned above in front of the ones for VNC.
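For illustration, the resulting bhyve invocation would then look roughly like this. Everything apart from the -s 28,virtio-rnd slot is made up for the example - disk, network, VNC details and slot numbers differ per VM:

bhyve -c 2 -m 2G -A -H -w \
  -s 0,hostbridge \
  -s 3,virtio-blk,/dev/zvol/tank/vm/linuxvm-disk0 \
  -s 4,virtio-net,tap0 \
  -s 28,virtio-rnd \
  -s 29,fbuf,tcp=192.168.0.66:5900,w=1600,h=900 \
  -s 31,lpc -l com1,stdio \
  linuxvm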

As a home user, entropy in VMs is not something I’ve given much thought to. On Linux, when using libvirtd via virt-manager, all Linux VMs are created with a “RNG /dev/urandom” device which is non-blocking. Other OS VMs, such as FreeBSD or Windows, are not created with this device. When creating a TrueNAS VM, part of the first post-install boot sequence is to generate/seed entropy. I haven’t checked to see what happens on Proxmox.
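If anyone wants to see that device for themselves on a Linux/libvirt host, something like this works - a rough sketch, &lt;vmname&gt; and the exact XML will vary:

virsh dumpxml <vmname> | grep -A 3 "<rng"

Typical output for a virt-manager-created Linux guest looks like:

  <rng model='virtio'>
    <backend model='random'>/dev/urandom</backend>
  </rng>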

RRAND is Broadwell Intel onwards and equivalent AMD CPUs, so in theory the problem could occur on my kit, but I don’t know how you’d force it to happen.

According to the wiki, I think the instruction is RDRAND, and the Broadwell CPUs added RDSEED.

But the bit I found interesting was where it talked about Spectre-type mitigations which serialized multiple accesses to the instruction.

That would cause issues in VMs.
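If you want to check what your own kit actually reports, something along these lines should show it (untested here; the FreeBSD boot log path is just the usual default):

grep -oE 'rdrand|rdseed' /proc/cpuinfo | sort -u      # inside a Linux guest or any Linux box
grep -E 'RDRAND|RDSEED' /var/run/dmesg.boot           # on the FreeBSD/CORE host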

Yes, I muddled up RDRAND and RDSEED as various sources on the net do.
I’ve still not come across a simple test for “entropy starvation” in VMs.
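The closest thing I can think of is watching the kernel’s own estimate while the guest is busy - a rough, untested sketch, and only really meaningful on kernels from before the 5.18 random rework, where the number still moves around:

watch -n 1 cat /proc/sys/kernel/random/entropy_avail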

I did suffer from the “stall” on Broadwell CPUs, so that might check out.

You don’t have to hack the middleware vm.py file to test the addition of a virtio-rnd device to a VM created via the WebUI.

FreeNAS CORE is using libvirtd (when did that happen?), so you can edit the VM’s XML file directly using:

virsh edit <vmname>

Just add an extra bhyve arg line at the end of the XML file, e.g.:

  <bhyve:commandline>
    <bhyve:arg value='-s 29,fbuf,tcp=192.168.0.66:5900,w=1600,h=900'/>
    <bhyve:arg value='-s 28,virtio-rnd'/>
  </bhyve:commandline>
</domain>

Get the vmname CORE has allocated from:

virsh list --all

You must start the VM from the CLI, not the WebUI, otherwise the XML edit is lost:

virsh start ...
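Once it’s running, a quick way to confirm the extra slot really made it onto the bhyve command line (simple, untested check on the CORE host):

ps auxww | grep virtio-rnd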

Virtio device as seen in a Linux VM:

root@bunsenvm:~# lsmod | grep rng
virtio_rng             16384  0
virtio                 20480  4 virtio_rng,virtio_pci,virtio_blk,virtio_net
virtio_ring            45056  4 virtio_rng,virtio_pci,virtio_blk,virtio_net
root@bunsenvm:~#
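As a further cross-check you can ask the kernel’s hw_random framework which source it picked up - with the device attached it should typically report something like virtio_rng.0 (standard sysfs paths; exact output will vary):

cat /sys/devices/virtual/misc/hw_random/rng_available
cat /sys/devices/virtual/misc/hw_random/rng_current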

That’s what I had tried first.

This is what I did not know. Both the UI and midclt call vm.restart <id> overwrite the XML file. I wasn’t aware you could bypass the middleware entirely.

Thanks, but no use trying in my environment - I never experienced that problem.

The original finding that led me to the claim that “running out of entropy” is probably the cause of the issue was in this article by Yin Jun Phua.

Here’s an older article explaining randomness inside of VMs in more general terms.

FWIW, that article perfectly describes the issues I had with bhyve/Linux thread stalls.

If the freeze reports were confined to Linux VMs and started around 2 years ago, I wonder if they coincided with the kernel’s overhaul of entropy, /dev/random etc. at that time.

At that time the value of both /proc/sys/kernel/random/entropy_avail and /proc/sys/kernel/random/poolsize dropped from 4096 to 256, prompting discussions such as here and here.
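Those files can still be read today; on a post-rework kernel both report 256 (illustrative - exact behaviour depends on the kernel version):

cat /proc/sys/kernel/random/poolsize        # 256 on recent kernels, 4096 before the rework
cat /proc/sys/kernel/random/entropy_avail   # typically pinned at 256 once the pool is initialised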

@pmh I am happy to try this on my Proxmox backup VM and see if the random crashes stop.

That being said, this crashing has been nothing but beneficial for me: it slowly drove me insane, so I’ve since learnt Proxmox, set up my own little server, and now have a whole heap of things I would never run under bhyve again due to that reliability problem.

However, identifying whether this is the cause would be great testing and fun. Let me know what to do and I’ll disable my new daily reboot script (thanks to you) here:

Does your VM have a VNC device?

I do believe so, yes; it lets me view the console if SSH is down, correct?

Then try the method outlined by @Krisbee please:

Just to be clear, the “virsh edit” method makes a temporary change for testing bhyve args and will not survive WebUI VM actions or system reboots. The middleware vm.py code dynamically builds the VM bhyve command every time a VM is started via the WebUI.

5 years I’ve been having the problem.
