Virtual Disks for SLOG device advice

We currently have a cluster of 4 ESXi hosts with production VMs running off a vSAN. There are also 30+ non-important or testing VMs that are currently running off a datastore that’s a TrueNAS NFS share (4 vdevs of 2-way mirrors using 8x SAS disks).

Since ESXi issues synchronous writes to NFS datastores, we’re thinking about carving out 16 GB from the vSAN to add as a virtual disk for TrueNAS to use as a SLOG device. The vSAN uses Intel P4510 SSDs for the storage tier and Intel P5800X Optane for cache, distributed across all ESXi hosts.

Has a virtual disk for SLOG device been done before, and what are some considerations I should think about before doing it?
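For context, the ZFS side of what’s being proposed is only a couple of commands once the virtual disk is visible inside TrueNAS. Pool and device names below are hypothetical:

```shell
# "tank" and /dev/da8 are placeholder names for the pool and the
# vSAN-backed virtual disk as it appears inside TrueNAS.
zpool add tank log /dev/da8     # attach it as a dedicated SLOG vdev

# SLOG vdevs can be removed, so the experiment is reversible:
zpool remove tank /dev/da8
zpool status tank               # confirm the log vdev is gone
```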

Provided via vSAN’s new-ish iSCSI target feature? I presume your TrueNAS is a standalone box outside of the VMware environment?

I’m almost certain nobody’s tried this around here but that won’t stop a half dozen people from telling you it will certainly eat your data, curve your spine, and deliver peace without honour to your kin.

Whether the TrueNAS is physical or virtual - I’d say this is brave.

No-one I know has even tried this

Welp, there’s a first for everything. I’ll set this up next week and will report back.

In theory it should work and I can’t think of any negatives unless I’ve missed something.

You’re adding a layer of abstraction and overhead to something that is supposed to have the lowest possible latency… So, with respect to performance it’s a guaranteed loss over a physical SLOG, and with respect to data safety you’re in uncharted territory.
Your data, your choice.

Either way you look at it - it’s a virtual disk. And we warn about virtual disks being a recipe for disaster when paired with ZFS.

I agree with the extra layer of abstraction.

But for latency and performance, I would think SSD vSAN storage backed by a 40 GbE fiber network will still be faster and lower latency than the ZIL on spinning disks?

You test and report… It’s not obvious to me that virtual remote storage behind an aggregated QSFP+ link (so potentially just a single SFP+ link in use) will beat local storage on latency.

Oh, and my implicit comparison was with a P5800X, or even a P4510, as local SLOG. Your above post is the first mention of the ZIL.

Oh, I forgot to add that there are no PCI-E or M.2 slots left. Otherwise this wouldn’t be a problem.

Hence the only option for synchronous writes is either the same spinning disks in the pool, or a remote virtual disk from vSAN.

Cool, just turn off sync.
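"Turn off sync" means disabling synchronous write semantics on the dataset behind the NFS share - a sketch with a hypothetical dataset name. The trade-off: on power loss, up to roughly one transaction group (~5 s by default) of acknowledged-but-unflushed writes can disappear, which may be acceptable for throwaway VMs:

```shell
# Hypothetical dataset backing the NFS datastore.
zfs set sync=disabled tank/nfs-vms   # ack writes from RAM; ZIL/SLOG bypassed
zfs get sync tank/nfs-vms            # verify the setting took
# Revert later with: zfs set sync=standard tank/nfs-vms
```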

Any free SAS/SATA ports for data centre SSDs with PLP? Anything going through the network stack suffers millisecond delays.

This is the way - much better than messing around with a virtual SLOG

It’s not that bad. I can push about 17k Q1T1 IOPS through an iSCSI connection. A 1 ms round trip would put this somewhere under 1k.
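The arithmetic behind those numbers: at queue depth 1, IOPS is just the reciprocal of per-operation round-trip latency, since each op must complete before the next is issued. A quick sanity check using the figures above:

```shell
# Q1T1 IOPS = 1 / per-op latency.
awk 'BEGIN {
  printf "17000 IOPS -> %.0f us per op\n", 1e6 / 17000   # ~59 us round trip
  printf "1 ms RTT   -> %.0f IOPS\n",      1 / 0.001     # network-latency floor
}'
```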

100% in uncharted waters, here be dragons.

@koifish59 the key challenge you’re going to have here would be getting said virtual disk to be mapped to the physical TrueNAS machine. You’d have to set up some manner of method (hardware iSCSI initiator?) that would let the virtual-SLOG device be present early enough in the boot process to allow it to be there for importing the pool.

In addition you’re also paying a latency penalty of traversing the local (TrueNAS server) iSCSI stack, the network, and the vSAN storage layers for every single write to TrueNAS. While it’s likely to be faster than SLOG on spinning disks, that’s not exactly a high bar to set.

Given the context of them being “non-important or testing VMs” - is this a plausible option? Run periodic (hourly?) snapshots perhaps, with the VMware Integration option to have crash-consistent snaps at the VMware level, and then enjoy the boost in performance this way.
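The snapshot side of that suggestion is cheap even without the TrueNAS UI’s Periodic Snapshot Tasks - dataset name hypothetical:

```shell
# Recursive snapshot of the VM dataset; crash-consistent at the ZFS level
# (the VMware integration adds guest-level quiescing on top of this).
zfs snapshot -r tank/nfs-vms@auto-$(date +%Y%m%d-%H%M)

# Review what exists before pruning old snapshots:
zfs list -t snapshot -r tank/nfs-vms
```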

Yeah, my suggestion was based on that context.

To go further … if these are testing VMs, why put that workload on the Very Important SAN at all? It’s got better things to do, surely.

If these aren’t testing VMs, and they’re actually doing Very Important Work, then why aren’t they getting full service from the SAN in the first place, why are they like this at all?