On the old forum we had quite a few users whose Linux VMs occasionally froze for no apparent reason. It looks like some good soul in the small bhyve community I am a part of found the root cause and a solution.
Linux uses the RDRAND instruction to generate randomness and it doesn’t like it if the CPU just pauses for a moment, because it has run out of entropy.
Whether this ever occurs depends on the CPU architecture and generation, which is why I, among others, could never reproduce the problem.
The solution is to add the virtio-rnd device to the VM. This is currently not supported by either middlewared or the UI. If someone still experiencing the problem is willing to try, I have a gross hack that would add the device to the bhyve command line (my Python is not good enough to do it properly). If it is successful, we could file a ticket in JIRA.
My VMs only get light use, so I’ve not encountered freezes. But if there’s a simple way to trigger the problem, I’m willing to test the change with bhyve running as a nested VM on a virtual instance of current CORE or BETA.
If your host CPU supports the RDRAND instruction, you won’t experience the problem. That’s why it only hit some people. If the CPU doesn’t, the instruction is emulated in bhyve, and if the host runs out of entropy the virtual CPU might be paused for a couple of milliseconds. Linux as a guest doesn’t like that and locks up.
That’s how I understood the problem. I might not have got all the details correct, but I should have got the essential mechanism: the host pauses to collect entropy, and the guest CPU pauses in turn.
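For anyone who wants to see what their Linux guest is being told about the CPU, here is a minimal sketch that looks for the rdrand flag in /proc/cpuinfo. The function names are made up for illustration. Note that this only shows what the virtual CPU advertises to the guest; it cannot tell whether the instruction runs natively on the host or is emulated, so treat it as a first data point, not a diagnosis.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect CPU feature flags from /proc/cpuinfo-style text (Linux, x86)."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Each logical CPU has its own "flags" line; merge them all.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_rdrand(cpuinfo_text: str) -> bool:
    """True if the 'rdrand' flag is advertised in the given cpuinfo text."""
    return "rdrand" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("rdrand advertised:", has_rdrand(path.read_text()))
    else:
        print("no /proc/cpuinfo here (probably not a Linux system)")
```

Run it inside the guest. On the FreeBSD host there is normally no /proc/cpuinfo, so the CPU features would have to be checked some other way there.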
That’s why we need someone from the original crowd who can reproduce the issue.
The solution is to hack the middleware to add -s 28,virtio-rnd to the bhyve command line and I think that’s really all.
I did not understand the code well enough to add another device correctly. Any takers?
For a quick test - if we find a volunteer, I would hack the VNC configuration section in the middleware to just add the parameter mentioned above in front of the ones for VNC.
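To make the idea concrete, here is a rough sketch of that kind of hack on a plain list of bhyve arguments. It is not the middleware's actual code: the function name add_virtio_rnd and the example argument list are hypothetical, and it assumes the VNC device shows up as an fbuf entry in a "-s slot,emulation" pair and that the VM name is the last argument.

```python
def add_virtio_rnd(args: list[str], slot: int = 28) -> list[str]:
    """Return a copy of a bhyve argument list with '-s <slot>,virtio-rnd' added.

    The new device goes in front of the first framebuffer (VNC) device if one
    is found, otherwise just before the final argument (assumed: the VM name).
    """
    insert_at = None
    for i in range(len(args) - 1):
        if args[i] != "-s":
            continue
        fields = args[i + 1].split(",")
        emulation = fields[1] if len(fields) > 1 else ""
        if emulation == "virtio-rnd":
            return list(args)  # already present, nothing to do
        # Slot spec is 'pcislot[:function]' or 'bus:pcislot:function'.
        parts = fields[0].split(":")
        pcislot = parts[1] if len(parts) == 3 else parts[0]
        if pcislot == str(slot):
            raise ValueError(f"PCI slot {slot} is already used by {args[i + 1]!r}")
        if insert_at is None and emulation == "fbuf":
            insert_at = i
    if insert_at is None:
        insert_at = max(len(args) - 1, 0)
    return args[:insert_at] + ["-s", f"{slot},virtio-rnd"] + args[insert_at:]


if __name__ == "__main__":
    # Hypothetical command line, only for illustration.
    example = ["bhyve", "-c", "2", "-m", "2048", "-s", "0,hostbridge",
               "-s", "29,fbuf,tcp=0.0.0.0:5900", "-s", "31,lpc", "myvm"]
    print(" ".join(add_virtio_rnd(example)))
```

The slot check is there because 28 is just the number suggested above; if a VM already has something in that slot, another free slot would have to be picked by hand.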
As a home user, entropy in VMs is not something I’ve given much thought to. In Linux when using libvirtd via virt-manager, all Linux VMs are created with a “RNG /dev/urandom” device which is non-blocking. Other OS VMs, such as FreeBSD or Windows, are not created with this device. When creating a TrueNAS VM part of the first post-install boot sequence is to generate/speed entropy. I haven’t checked to see what happens on Proxmox.
RDRAND is Broadwell Intel onwards and equivalent AMD CPUs, so in theory the problem could occur on my kit, but I don’t know how you’d force it to happen.
This is what I did not know. Both the UI and “midclt call vm.restart <id>” overwrite the XML file. I wasn’t aware you could bypass the middleware entirely.
Thanks, but no use trying in my environment - I never experienced that problem.
If the freeze reports were confined to Linux VMs and started around 2 years ago, I wonder if they coincided with the kernel overhaul of entropy, /dev/random etc. at that time.
At that time the values of both /proc/sys/kernel/random/entropy_avail and /proc/sys/kernel/random/poolsize dropped from 4096 to 256, prompting discussions such as here and here.
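If you want to check which numbers your own Linux guest reports, a small sketch like the following reads both files. The function name read_entropy and the base parameter are only there for illustration (the parameter makes it easy to point the function at a copy of the files).

```python
from pathlib import Path

# Location of the kernel's random-subsystem sysctl files on Linux.
RANDOM_DIR = Path("/proc/sys/kernel/random")


def read_entropy(base: Path = RANDOM_DIR) -> dict[str, int]:
    """Read entropy_avail and poolsize (both reported in bits) as integers."""
    return {
        name: int((base / name).read_text().strip())
        for name in ("entropy_avail", "poolsize")
    }


if __name__ == "__main__":
    if RANDOM_DIR.is_dir():
        for name, value in read_entropy().items():
            print(f"{name}: {value}")
    else:
        print("no /proc/sys/kernel/random here (probably not a Linux system)")
```

Comparing the output from an older and a newer guest kernel should show whether a given VM is on the 4096 or the 256 side of that change.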
@pmh I am happy to try this on my Proxmox backup VM and see if the random crashes stop.
That being said, this crashing has been nothing but beneficial for me. Since it slowly drove me insane, I’ve learnt Proxmox, set up my own little server and have a whole heap of things. I would never run under bhyve again due to that reliability problem.
However, identifying whether this is the cause would be great testing and fun. Let me know what to do and I’ll disable my new daily reboot script (thanks to you) here:
Just to be clear, the “virsh edit” method makes a temporary change for testing bhyve args and will not survive WebUI VM actions or system reboots. The middleware vm.py code dynamically builds the VM bhyve command every time a VM is started via the WebUI.