Anybody still experiencing freezing Linux VMs on CORE?

Hi all,

on the old forum we had quite a few users whose Linux VMs occasionally froze for no apparent reason. Looks like some good soul in the small bhyve community I am a part of found the root cause and a solution.

Linux uses the RDRAND instruction to generate randomness, and it doesn’t like it if the virtual CPU just pauses for a moment because the host has run out of entropy.

Whether this ever occurs depends on the CPU architecture and generation, and that’s why I, among others, could never reproduce the problem.

The solution is to add the virtio-rnd device to the VM. This is currently not supported by either middlewared or the UI, but if someone still experiencing the problem is willing to try, I have a gross hack that would add the device to the bhyve command line (my Python is not good enough to do it properly), but if successful we could file a ticket in JIRA.

Kind regards,
Patrick

Moved on to SCALE :confused:

I’ve been running a FreePBX VM on one of my CORE servers for a few years now and haven’t really ever seen a freeze.

My VMs only get light use, so I’ve not encountered freezes. But if there’s a simple way to generate the problem, I’m willing to test the change with bhyve running as a nested VM on a virtual instance of current CORE or BETA.

Someone posted this on the net last October: Bhyve CPU Lockup Issue | Thoughts | Yin Jun, Phua -- Assistant Professor at Tokyo Tech

If your host CPU supports the RDRAND instruction, you won’t experience the problem. That’s why it only hit some people. If the CPU doesn’t, the instruction is emulated in bhyve, and if the host runs out of entropy the virtual CPU might be paused for a couple of milliseconds. Linux as a guest doesn’t like that and locks up.

That’s how I understood the problem; I might not have got all the details correct, but I should have got the essential mechanism: the host pausing to collect entropy, and the guest CPU pausing in turn.

That’s why we need someone from the original crowd who can reproduce the issue.

The solution is to hack the middleware to add -s 28,virtio-rnd to the bhyve command line and I think that’s really all.

I did not understand the code well enough to correctly add another device. Any takers?

For a quick test - if we find a volunteer - I would hack the VNC configuration section in the middleware to just add the parameter mentioned above in front of the ones for VNC.
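For illustration, the resulting bhyve invocation would then look roughly like this. Everything apart from the -s 28,virtio-rnd slot is made up for the example - disk, network, VNC details and slot numbers differ per VM:

bhyve -c 2 -m 2G -A -H -w \
  -s 0,hostbridge \
  -s 3,virtio-blk,/dev/zvol/tank/vm/linuxvm-disk0 \
  -s 4,virtio-net,tap0 \
  -s 28,virtio-rnd \
  -s 29,fbuf,tcp=192.168.0.66:5900,w=1600,h=900 \
  -s 31,lpc -l com1,stdio \
  linuxvm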

As a home user, entropy in VMs is not something I’ve given much thought to. On Linux, when using libvirtd via virt-manager, all Linux VMs are created with a “RNG /dev/urandom” device which is non-blocking. Other OS VMs, such as FreeBSD or Windows, are not created with this device. When creating a TrueNAS VM, part of the first post-install boot sequence is to generate/seed entropy. I haven’t checked to see what happens on Proxmox.
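If anyone wants to see that device for themselves on a Linux/libvirt host, something like this works - a rough sketch, &lt;vmname&gt; and the exact XML will vary:

virsh dumpxml <vmname> | grep -A 3 "<rng"

Typical output for a virt-manager-created Linux guest looks like:

  <rng model='virtio'>
    <backend model='random'>/dev/urandom</backend>
  </rng>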

RRAND is Broadwell Intel onwards and equivalent AMD CPUs, so in theory the problem could occur on my kit, but I don’t know how you’d force it to happen.

According to the wiki, I think the instruction is RDRAND, and the Broadwell CPUs added RDSEED.

But the bit I found interesting was where it talked about Spectre-type mitigations which serialized multiple accesses to the instruction.

That would cause issues in VMs.
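If you want to check what your own kit actually reports, something along these lines should show it (untested here; the FreeBSD boot log path is just the usual default):

grep -oE 'rdrand|rdseed' /proc/cpuinfo | sort -u      # inside a Linux guest or any Linux box
grep -E 'RDRAND|RDSEED' /var/run/dmesg.boot           # on the FreeBSD/CORE host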

Yes, I muddled up RDRAND and RDSEED as various sources on the net do.
I’ve still not come across a simple test for “entropy starvation” in VMs.
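The closest thing I can think of is watching the kernel’s own estimate while the guest is busy - a rough, untested sketch, and only really meaningful on kernels from before the 5.18 random rework, where the number still moves around:

watch -n 1 cat /proc/sys/kernel/random/entropy_avail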

I did suffer from the “stall” on Broadwell CPUs, so that might check out.

You don’t have to hack the middleware vm.py file to test the addition of a virtio-rnd device to a VM created via the WebUI.

FreeNAS CORE is using libvirtd (when did that happen?), so you can edit the VM’s XML file directly using:

virsh edit <vmname>

Just add an extra bhyve arg line at the end of the XML file, e.g.:

  <bhyve:commandline>
    <bhyve:arg value='-s 29,fbuf,tcp=192.168.0.66:5900,w=1600,h=900'/>
    <bhyve:arg value='-s 28,virtio-rnd'/>
  </bhyve:commandline>
</domain>

Get the vmname CORE has allocated from:

virsh list --all

You must start the VM from the CLI, not the WebUI, otherwise the XML edit is lost:

virsh start ...
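Once it’s running, a quick way to confirm the extra slot really made it onto the bhyve command line (simple, untested check on the CORE host):

ps auxww | grep virtio-rnd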

Virtio device as seen in a Linux VM:

root@bunsenvm:~# lsmod | grep rng
virtio_rng             16384  0
virtio                 20480  4 virtio_rng,virtio_pci,virtio_blk,virtio_net
virtio_ring            45056  4 virtio_rng,virtio_pci,virtio_blk,virtio_net
root@bunsenvm:~#
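As a further cross-check you can ask the kernel’s hw_random framework which source it picked up - with the device attached it should typically report something like virtio_rng.0 (standard sysfs paths; exact output will vary):

cat /sys/devices/virtual/misc/hw_random/rng_available
cat /sys/devices/virtual/misc/hw_random/rng_current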

That’s what I had tried first.

This is what I did not know. Both the UI and midclt call vm.restart <id> overwrite the XML file. I wasn’t aware you could bypass the middleware entirely.

Thanks, but no use trying in my environment - I never experienced that problem.

The original finding that led me to the claim that “running out of entropy” is probably the cause of the issue was in this article by Yin Jun Phua.

Here’s an older article explaining randomness inside of VMs in more general terms.

FWIW, that article perfectly describes the issues I had with bhyve/Linux thread stalls.

If the freeze reports were confined to Linux VMs and started around 2 years ago, I wonder if they coincided with the kernel’s overhaul of entropy, /dev/random etc. at that time.

At that time the value of both /proc/sys/kernel/random/entropy_avail and /proc/sys/kernel/random/poolsize dropped from 4096 to 256, prompting discussions such as here and here.
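Those files can still be read today; on a post-rework kernel both report 256 (illustrative - exact behaviour depends on the kernel version):

cat /proc/sys/kernel/random/poolsize        # 256 on recent kernels, 4096 before the rework
cat /proc/sys/kernel/random/entropy_avail   # typically pinned at 256 once the pool is initialised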

@pmh I am happy to try this on my Proxmox backup VM and see if the random crashes stop.

That being said, this crashing has been nothing but beneficial for me: it slowly drove me insane, so I’ve since learnt Proxmox, set up my own little server, and now have a whole heap of things I would never run under bhyve again due to that reliability problem.

However, identifying whether this is the cause would be great testing and fun. Let me know what to do and I’ll disable my new daily reboot script (thanks to you) here:

Does your VM have a VNC device?

I do believe so, yes; it lets me view the console if SSH is down, correct?

Then try the method outlined by @Krisbee please:

Just to be clear, the “virsh edit” method makes a temporary change for testing bhyve args and will not survive WebUI VM actions or system reboots. The middleware vm.py code dynamically builds the VM bhyve command every time a VM is started via the WebUI.

5 years I’ve been having the problem.
