There are several discussions regarding this, but I was unable to find a definite answer.
On a system running SCALE, is it necessary to take any additional steps to ensure ZVOL snapshots are consistent, beyond the jobs we can schedule through the TrueNAS web interface?
The VMs’ disks are ZVOLs on a ZFS pool. Each ZVOL has a daily scheduled snapshot and replication to another pool on the same system AND to another server.
All my Windows VMs have the VirtIO drivers installed, and also the guest agent.
I’ve seen information telling me to suspend the VMs before taking snapshots, and also information telling me that I wouldn’t need to if the VMs have the guest agent installed, but I don’t know if that applies only to QEMU disk “files” on some filesystem or also to direct ZVOLs, which is my case.
To my knowledge TN does not do anything to “quiesce” the virtual disks when taking snapshots of VMs hosted on TN itself. There is a mechanism when serving iSCSI storage to VMware, IIRC.
Taking a snapshot without any special mechanism to enforce consistency is equivalent to pulling the plug of a physical machine. In most cases that won’t do more harm than losing just the open transactions still in the cache - the cache of the OS running on that hypothetical machine, i.e. the guest in case of a VM.
Yes, I know, that’s why I’m asking. If I need to ensure consistency and cannot rely on the UI tools for that specific use case, I need to know that and take proper measures.
So, if I need to ensure consistency of the snapshots of the ZVOLs can I use only the tools provided in the UI?
There’s a VM property that’s not exposed in the web interface but that you can access from the vm service, called “suspend_on_snapshot”.
It’s always false, and I don’t know whether changing it to true will make the out-of-the-box TrueNAS scripts take anything into account to ensure properly consistent snapshots of the VMs’ ZVOLs.
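For anyone who wants to check that property on their own box, here is a minimal sketch. It assumes `midclt call vm.query` prints the VM list as JSON and that each entry carries a `suspend_on_snapshot` key; the helper names are my own, and this only reads the value, it says nothing about what the middleware does with it.

```python
import json
import subprocess


def vm_snapshot_flags(vm_query_json: str) -> dict:
    """Map VM name -> suspend_on_snapshot value (None if the key is absent)."""
    vms = json.loads(vm_query_json)
    return {vm.get("name"): vm.get("suspend_on_snapshot") for vm in vms}


def query_vms() -> str:
    """Ask the middleware for all VM definitions (run on the TrueNAS host)."""
    result = subprocess.run(
        ["midclt", "call", "vm.query"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On the host, `vm_snapshot_flags(query_vms())` should give one entry per VM. Parsing is kept separate from the `midclt` call so it can be tried against saved output first.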
ZFS doesn’t know that the zvol is the backing store to a VM’s virtual “disk”. To my knowledge there is no mechanism in TN guaranteeing consistency.
Only option: shut down the VM, take snapshot, boot VM.
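A rough sketch of that stop / snapshot / start sequence, in case it helps. How the VM is stopped and started is left to the caller (for example through the middleware’s `vm.stop` and `vm.start` calls); `cold_snapshot` and `zfs_snapshot` are hypothetical helper names, and the only real command used is `zfs snapshot`.

```python
import subprocess
from typing import Callable, Sequence


def zfs_snapshot(zvol: str, label: str) -> None:
    """Take a ZFS snapshot named <zvol>@<label>."""
    subprocess.run(["zfs", "snapshot", f"{zvol}@{label}"], check=True)


def cold_snapshot(
    stop_vm: Callable[[], None],
    start_vm: Callable[[], None],
    zvols: Sequence[str],
    label: str,
    snapshot: Callable[[str, str], None] = zfs_snapshot,
) -> None:
    """Stop the VM, snapshot each zvol, then start the VM again.

    stop_vm must block until the guest is really powered off,
    otherwise the snapshot is no better than a live one.
    """
    stop_vm()
    try:
        for zvol in zvols:
            snapshot(zvol, label)
    finally:
        # Bring the VM back even if a snapshot failed.
        start_vm()
```

The `finally` is the important part: a failed snapshot should not leave the VM powered off overnight.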
I’d love to be corrected, even if it’s only on the road map for a future release.
ESXi can do it, as can Proxmox (if I am not mistaken). But neither of them simply takes a filesystem snapshot at the block layer without any connection to the state of the VM.
Yeah, it would be nice if the scripts generated by TrueNAS sent the commands to quiesce the VMs (not a shutdown, but integration with Windows VSS or the QEMU guest agent) and handled this gracefully.
Maybe someone has scripts ready for this?
How do I properly quiesce the VMs before the snapshot, without doing a shutdown?
stop mode
This mode provides the highest consistency of the backup, at the cost of a short downtime in the VM operation. It works by executing an orderly shutdown of the VM, and then runs a background QEMU process to backup the VM data. After the backup is started, the VM goes to full operation mode if it was previously running. Consistency is guaranteed by using the live backup feature.
suspend mode
This mode is provided for compatibility reason, and suspends the VM before calling the snapshot mode. Since suspending the VM results in a longer downtime and does not necessarily improve the data consistency, the use of the snapshot mode is recommended instead.
snapshot mode
This mode provides the lowest operation downtime, at the cost of a small inconsistency risk. It works by performing a Proxmox VE live backup, in which data blocks are copied while the VM is running. If the guest agent is enabled (agent: 1) and running, it calls guest-fsfreeze-freeze and guest-fsfreeze-thaw to improve consistency.
Implementing that guest-fsfreeze-freeze/guest-fsfreeze-thaw mechanism would be welcome.
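Until something like that is built in, the freeze / snapshot / thaw sequence could be scripted by hand. This is only a sketch under several assumptions: that the VMs are reachable through libvirt with `virsh`, that the socket path below matches your release (please verify it), that the guest agent is running in the VM, and that you know the libvirt domain name. `frozen_snapshot` is a made-up helper name; `virsh domfsfreeze`, `virsh domfsthaw` and `zfs snapshot` are the real commands it wraps.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: libvirt socket path on TrueNAS SCALE; verify on your release.
LIBVIRT_URI = "qemu+unix:///system?socket=/run/truenas_libvirt/libvirt-sock"


def run(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def frozen_snapshot(
    domain: str,
    zvols: Sequence[str],
    label: str,
    runner: Callable[[Sequence[str]], None] = run,
    uri: str = LIBVIRT_URI,
) -> None:
    """Freeze guest filesystems, snapshot all zvols at once, then thaw."""
    virsh = ["virsh", "-c", uri]
    # One zfs invocation with several names snapshots them together.
    snapshot_cmd = ["zfs", "snapshot"] + [f"{z}@{label}" for z in zvols]

    # Asks the guest agent to flush and freeze (guest-fsfreeze-freeze).
    runner(virsh + ["domfsfreeze", domain])
    try:
        runner(snapshot_cmd)
    finally:
        # Always thaw, even if the snapshot failed, so the guest is not left frozen.
        runner(virsh + ["domfsthaw", domain])
```

A call would look something like `frozen_snapshot("1_win11", ["tank/vm/win11"], "daily")`, where both the domain name and the zvol path are placeholders. The freeze only lasts as long as the `zfs snapshot` command, which is normally near-instant.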