Snapshots of running VMs' ZVOLs on SCALE

Ezaum · February 12, 2025, 11:56am

There are several discussions regarding this, but I was unable to find a definite answer.

On a system running SCALE, is it necessary to take any additional steps to ensure ZVOL snapshots are consistent, besides using the jobs we can schedule in the TrueNAS web interface?

The VMs' disks are ZVOLs on a ZFS tank. Each ZVOL has daily scheduled snapshots and replication to another pool on the same system AND to another server.

All my Windows VMs have the VirtIO drivers installed, and also the guest agent.

I’ve seen information telling me to suspend the VMs before taking snapshots, and also information telling me that if the VMs have the guest agent installed I wouldn’t need to. But I don’t know if that applies only to QEMU disk “files” on some filesystem or also to direct ZVOLs, which is my case.

Thanks in advance.

pmh · February 12, 2025, 12:39pm

To my knowledge TN does not do anything to “quiesce” the virtual disks when taking snapshots of VMs hosted on TN itself. There is a mechanism when serving iSCSI storage to VMware, IIRC.

Taking a snapshot without any special mechanism to enforce consistency is equivalent to pulling the plug of a physical machine. In most cases that won’t do more harm than losing just the open transactions still in the cache - the cache of the OS running on that hypothetical machine, i.e. the guest in case of a VM.

Ezaum · February 12, 2025, 1:32pm

Yes, I know, that’s why I’m asking. If I need to ensure consistency and cannot rely on the UI tools for that specific use case, I need to know and take proper measures.

So, if I need to ensure consistency of the snapshots of the ZVOLs can I use only the tools provided in the UI?

Ezaum · February 12, 2025, 1:41pm

There’s a VM property that’s not exposed in the web interface but that you can access from the vm service, called “suspend_on_snapshot”.

It’s always false, and I don’t know whether changing it to true makes the out-of-the-box TrueNAS snapshot scripts do anything extra to ensure consistent snapshots of the VMs’ ZVOLs.
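For anyone who wants to check that flag on their own system, here is a minimal sketch. It assumes the middleware client `midclt` is available on the host and that `midclt call vm.query` prints a JSON list of VM objects; the helper names `parse_vm_flags` and `query_vm_flags` are made up for illustration. It only reads the property and says nothing about what the snapshot jobs do with it.

```python
import json
import subprocess


def parse_vm_flags(vm_query_json: str) -> dict:
    """Map each VM name to its suspend_on_snapshot value.

    Assumes the input is a JSON list of VM objects with a "name" key;
    a missing "suspend_on_snapshot" key is treated as False.
    """
    vms = json.loads(vm_query_json)
    return {vm["name"]: bool(vm.get("suspend_on_snapshot", False)) for vm in vms}


def query_vm_flags() -> dict:
    """Run `midclt call vm.query` on the host and parse its output."""
    result = subprocess.run(
        ["midclt", "call", "vm.query"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_vm_flags(result.stdout)
```

Calling `query_vm_flags()` in a root shell on the TrueNAS host should return something like a name-to-boolean dictionary, one entry per VM.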

pmh · February 12, 2025, 1:45pm

ZFS doesn’t know that the zvol is the backing store to a VM’s virtual “disk”. To my knowledge there is no mechanism in TN guaranteeing consistency.

Only option: shut down the VM, take snapshot, boot VM.

I’d love to be corrected, even if it’s only on the road map for a future release.

ESXi can do it, as (if I am not mistaken) can Proxmox. But neither of them just takes a filesystem snapshot on the block layer without any connection to the state of the VM.

Ezaum · February 12, 2025, 2:47pm

Yeah, it would be nice if the scripts generated by TrueNAS sent the commands to quiesce the VMs (not shut them down, but integrate with Windows VSS or the QEMU guest agent) and handled this gracefully.

Maybe someone has scripts ready for this?

How do I properly quiesce the VMs before the snapshot, without doing a shutdown?

garyez_28558 · February 12, 2025, 3:17pm

Perhaps this will help:

Ezaum · February 12, 2025, 10:21pm

Ok, so I’m getting that the official answer is no, if you use virtual machines you cannot trust the UI tools to have consistent snapshots.

I will come up with some scripts; there’s a way to do it without shutting down the VMs.

neofusion · February 13, 2025, 11:05pm

For comparison, here’s the section related to modes of backup of VMs directly from the Proxmox manual:

Backup modes for VMs:

stop mode

This mode provides the highest consistency of the backup, at the cost of a short downtime in the VM operation. It works by executing an orderly shutdown of the VM, and then runs a background QEMU process to backup the VM data. After the backup is started, the VM goes to full operation mode if it was previously running. Consistency is guaranteed by using the live backup feature.

suspend mode

This mode is provided for compatibility reason, and suspends the VM before calling the snapshot mode. Since suspending the VM results in a longer downtime and does not necessarily improve the data consistency, the use of the snapshot mode is recommended instead.

snapshot mode

This mode provides the lowest operation downtime, at the cost of a small inconsistency risk. It works by performing a Proxmox VE live backup, in which data blocks are copied while the VM is running. If the guest agent is enabled (agent: 1) and running, it calls guest-fsfreeze-freeze and guest-fsfreeze-thaw to improve consistency.

Implementing that guest-fsfreeze-freeze/guest-fsfreeze-thaw mechanic would be welcome.
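Until something like that exists in the UI, the same freeze/snapshot/thaw sequence can be scripted by hand. The following is only a sketch under several assumptions: the guest agent is installed and responding, the host exposes the VM through libvirt so that `virsh domfsfreeze` and `virsh domfsthaw` work, and the domain name, ZVOL path and snapshot name are placeholders you would replace. TrueNAS may need a non-default libvirt connection URI, so the `virsh` prefix is left as a parameter. The function name `snapshot_with_freeze` is hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def snapshot_with_freeze(
    domain: str,
    zvol: str,
    snap_name: str,
    run: Callable[[Sequence[str]], None] = run_checked,
    virsh: Sequence[str] = ("virsh",),
) -> List[List[str]]:
    """Freeze guest filesystems, snapshot the ZVOL, then always thaw.

    `virsh` can carry extra arguments, e.g. ("virsh", "-c", "<libvirt URI>").
    Returns the three commands in the order they were attempted.
    """
    freeze = [*virsh, "domfsfreeze", domain]
    thaw = [*virsh, "domfsthaw", domain]
    snap = ["zfs", "snapshot", f"{zvol}@{snap_name}"]

    # If the freeze itself fails, the exception propagates and no snapshot is taken.
    run(freeze)
    try:
        run(snap)
    finally:
        # Thaw even if the snapshot failed, so the guest is never left frozen.
        run(thaw)
    return [freeze, snap, thaw]
```

A call might look like `snapshot_with_freeze("1_win11", "tank/vms/win11", "manual-2025-02-14")`, where all three values are made-up examples. Because a ZFS snapshot is near-instant, the guest should only stay frozen for a moment; the `finally` block is there so a failed snapshot cannot leave the guest's filesystems frozen.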

Ezaum · February 14, 2025, 12:58pm

Yes!!! That last mode is what I was looking for when I started this thread. I’m new to this forum and don’t know how those requests are handled.

What should we do so it gets added to a possible backlog?

neofusion · February 14, 2025, 2:56pm

You can post a new topic as a Feature request.

That lets you describe your reasoning and lets other users vote for the request, if they agree.
Don’t forget to vote for it yourself.