We currently have a cluster of 4 ESXi hosts with production VMs running off vSAN. There are also 30+ unimportant/testing VMs currently running off a datastore that's a TrueNAS NFS share (4 vdevs of 2-way mirrors across 8x SAS disks).
Since ESXi issues synchronous writes over NFS, we're thinking about carving out 16 GB from the vSAN to present as a virtual disk for TrueNAS to use as a SLOG device. The vSAN uses Intel P4510 SSDs for the capacity tier and Intel P5800X Optane for the cache tier, distributed across all ESXi hosts.
Has anyone used a virtual disk as a SLOG device before, and what considerations should I think about before doing it?
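For reference, whatever the disk's provenance, attaching it on the TrueNAS side is just a log-vdev add. A minimal sketch, assuming the pool is named `tank` and the 16 GB vSAN-backed disk enumerates as `/dev/da8` (both names are placeholders for illustration):

```shell
# Attach the device as a SLOG (a "log" vdev). Pool name and device
# path are assumptions; substitute your own.
zpool add tank log /dev/da8

# Verify: the pool status should now show a "logs" section
# containing da8.
zpool status tank
```

Note that a SLOG only absorbs synchronous writes (the ZIL); async writes never touch it, so this only helps the sync-heavy NFS workload described above.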
Provided via vSAN's newish iSCSI target feature? I presume your TrueNAS is a standalone box outside of the VMware environment?
I’m almost certain nobody’s tried this around here but that won’t stop a half dozen people from telling you it will certainly eat your data, curve your spine, and deliver peace without honour to your kin.
You’re adding a layer of abstraction and overhead to something that is supposed to have the lowest possible latency… So, with respect to performance it’s a guaranteed loss versus a physical SLOG, and with respect to data safety you’re in uncharted territory.
Your data, your choice.
But for latency and performance, I would think SSD vSAN storage backed by a 40GbE fiber network would still be faster and lower latency than a ZIL on spinning disks?
You test and report… It’s not obvious to me that virtual remote storage behind an aggregated QSFP+ link (so potentially just a single 10G lane in use) will beat local storage on latency.
Oh, and my implicit comparison was with a P5800X, or even a P4510, as a local SLOG. Your above post is the first mention of the ZIL living on the spinning disks.
@koifish59 the key challenge you’re going to have here is getting said virtual disk mapped to the physical TrueNAS machine. You’d have to set up some mechanism (a hardware iSCSI initiator?) that makes the virtual SLOG device present early enough in the boot process for it to be there when the pool is imported.
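To make the boot-ordering problem concrete: with a software initiator (open-iscsi on a Linux-based system) you'd have to log in to the vSAN iSCSI target and mark the session for automatic startup so the LUN exists before `zpool import` runs. A rough sketch; the portal address and target IQN below are placeholders, and TrueNAS's own middleware may not expose this cleanly:

```shell
# Discover targets on the vSAN iSCSI service (portal IP is a placeholder)
iscsiadm -m discovery -t sendtargets -p 10.0.0.10:3260

# Log in to the target (IQN is a placeholder)
iscsiadm -m node -T iqn.1998-01.com.vmware:vsan-target -p 10.0.0.10:3260 --login

# Mark the node for automatic login at boot, so the LUN is present
# before the pool import step
iscsiadm -m node -T iqn.1998-01.com.vmware:vsan-target -p 10.0.0.10:3260 \
    -o update -n node.startup -v automatic
```

Even with `node.startup = automatic`, you'd still need to verify that the iSCSI service actually starts before pool import in TrueNAS's boot sequence; if it doesn't, the pool will import with the log vdev missing.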
In addition you’re also paying a latency penalty of traversing the local (TrueNAS server) iSCSI stack, the network, and the vSAN storage layers for every single write to TrueNAS. While it’s likely to be faster than SLOG on spinning disks, that’s not exactly a high bar to set.
Given the context of them being “non-important or testing VMs” - is setting sync=disabled on that dataset a plausible option? Run periodic (hourly?) snapshots, perhaps with the VMware Integration option to get crash-consistent snaps at the VMware level, and enjoy the performance boost that way.
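That suggestion boils down to two commands (dataset name `tank/vmware-nfs` is an assumption; in practice you'd schedule the snapshot via a Periodic Snapshot Task in the TrueNAS UI rather than cron):

```shell
# Stop honoring sync write requests on the test-VM dataset entirely --
# no SLOG needed. On power loss you lose only the last few seconds of
# in-flight writes, which is acceptable for disposable VMs.
zfs set sync=disabled tank/vmware-nfs

# Rolling hourly snapshot so you can roll back if a test VM is trashed
zfs snapshot tank/vmware-nfs@hourly-$(date +%Y%m%d%H)
```

This gets you most of the speed a SLOG would, without the virtual-disk dependency loop back into the vSAN.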
To go further … if these are testing VMs, why put that workload on the Very Important San at all? It’s got better things to do, surely.
If these aren’t testing VMs, and they’re actually doing Very Important Work, then why aren’t they getting full service from the SAN in the first place, why are they like this at all?