We have conducted some testing between TrueNAS and XCP, using an NFS share for the VHDs. The original intention was to test the speed of our SLOGs, but we decided to run a variety of other tests to see the differences (a representative fio invocation is sketched after the list):
On TrueNAS host (within /mnt/poolname)
Write speed without SLOG
Write speed with SLOG
Write speed with sync
Write speed without sync (async)
Write speed without compression
On XCP (host itself, not a VM):
Write speed with SLOG
Write speeds with sync disabled
On a VM within XCP
Write speed without SLOG
Write speed with SLOG
Write speed with XCP Tools installed
Increased the resources of the VM to see if this made any difference
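For reference, here is a minimal sketch of the kind of fio invocation behind these write tests; the filename, size, and queue depth are placeholders, not our exact parameters:

```
# Sketch of a 4k random-write test; point --filename at the pool
# path on TrueNAS, or at the NFS mount on the XCP host / VM.
fio --name=sync-write-test \
    --filename=/mnt/poolname/fio-testfile \
    --rw=randwrite \
    --bs=4k \
    --size=4G \
    --iodepth=16 \
    --numjobs=1 \
    --direct=1 \
    --sync=1 \
    --group_reporting
```

Dropping --sync=1 gives the async variant, and --numjobs=10 covers the concurrency comparison mentioned further down.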
The hardware in this case:
TrueNAS Server:
Dell R630
2 CPUs
64GB of RAM
2x 400GB SSDs for the pool
2x Intel Optane DC P4801X 375GB
Network:
10Gb fibre connection between hosts (LACP enabled on Arista switches)
XCP Host:
Dell R720
The results have been a bit mixed. Overall we can see a performance improvement with the Optanes, but there is a considerable drop when performing the same test over NFS. We initially thought it could be throttling of the VM itself, but we performed a direct NFS mount on the host and the results are approximately the same.
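For anyone wanting to reproduce the direct-to-host test, the mount was along these lines (the server IP and export path below are placeholders):

```
# Hypothetical direct NFS mount on the XCP host for testing;
# replace the IP and export path with your own.
mkdir -p /mnt/nfs-test
mount -t nfs -o vers=3 192.168.10.10:/mnt/poolname/xcp /mnt/nfs-test
# ...then rerun the same fio command against /mnt/nfs-test
```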
We’ve tested everything we could think of that might impact the speed. I fully understand that direct-to-host will always be the fastest test; I did not, however, anticipate such a performance drop over NFS (even with sync disabled, i.e. async). This goes much further than whether a SLOG is needed or not; it is rather a question of whether the results are in line with what we can expect (we have now done a single benchmark, with no other environments to compare against).
I also understand that this test is not a true reflection of what to expect when running VMs over NFS, as it is mostly a random write test, and we can see that the speed remains the same whether we run 1 or 10 concurrent jobs. It is simply a synthetic benchmark, not a real-world workload.
Would love to get some feedback and maybe suggestions. Ultimately the road for us is NFS with sync enabled (we simply tested without sync to see what to expect).
Maybe I have not had my morning coffee yet, but at first glance the setup looks a little bit confusing.
What are you trying to benchmark?
Sync or async?
There could be multiple things at play here, like NFSv4, or how the NFS mount options influence sync behaviour. But in general, yes, NFS will impact your performance, especially for small random writes. Maybe you could get better results with iSCSI.
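To illustrate what I mean about the mount influencing things: the client-side mount options change the write semantics independently of the server-side ZFS sync property. Illustrative only; the server and export path are placeholders:

```
# Client buffers writes and flushes them later (async mount):
mount -t nfs -o vers=4,async server:/mnt/pool/export /mnt/test
# Client requests stable writes for every operation (sync mount):
mount -t nfs -o vers=4,sync server:/mnt/pool/export /mnt/test
```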
To rule out benchmark errors, I would set the sync settings on TrueNAS itself instead of in fio (and the NFS mount).
Just run your test once with sync disabled and once with sync forced.
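Something like this on the TrueNAS shell (the dataset name is a placeholder):

```
# Never honour sync requests (fastest, least safe):
zfs set sync=disabled poolname/vms
# Force every write through the ZIL/SLOG (safest, slowest):
zfs set sync=always poolname/vms
# Verify the current setting:
zfs get sync poolname/vms
```

That way fio and the NFS client can stay identical between runs, and only the server-side behaviour changes.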
We use XCP to host VMs. The VMs are currently set up on local storage, but this obviously means there is a major single point of failure, so we are implementing some TrueNAS servers (which also replicate to secondary servers for further redundancy). This allows us to create a pool of XCP servers and have a bit of redundancy in place (separating storage from hypervisor).
The benchmarks were initially meant to test the impact / improvements we might see with a separate Optane SLOG. The underlying storage on the primary TrueNAS is SSD, while on the replication TrueNAS servers it would be ordinary spinning drives; we did not include any testing on those, as our primary tests were aimed at the performance impact of SSD with Optane.
As it would be seemingly pointless to test the speed of the SLOG only at the TrueNAS level, the additional benchmarks were to see what we would get running the same fio command (some runs with sync, some without) against the NFS mount point, i.e. the performance we could expect from hosting the VMs there, hence the numerous tests.
The ultimate conclusion from all of the above over the last couple of days is that NFS is slower (which we expected, just not by this much).
We initially did not want to use iSCSI, simply because, at least from an XCP perspective, we would not be able to use thin provisioning.
This has somewhat changed, however, as we set up an iSCSI share this morning (creating a file-based iSCSI extent located on the same data pool that has the SLOG attached to it) and saw roughly a 40% improvement of iSCSI over NFS.
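For anyone curious: a file extent is essentially just a (sparse) file sitting on the dataset, which TrueNAS then exposes as an iSCSI LUN. We created ours through the TrueNAS UI, but conceptually it amounts to something like this (path and size are placeholders):

```
# Conceptual sketch only; we used the TrueNAS web UI for this.
# The extent file lives on the same SLOG-backed pool as the NFS data.
truncate -s 500G /mnt/poolname/iscsi/xcp-extent.img
```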
So the idea now is to utilize both NFS and iSCSI so that we can have the best of both worlds: larger disks (data disks) over NFS so that we can utilize thin provisioning, and additional iSCSI disks for higher-demand workloads like databases etc.
The setup between TrueNAS iSCSI and XCP has been extremely easy compared to something like an EqualLogic; most of it worked first time around.
While that is true, real HA is very expensive! And by real, I mean “everything redundant”, not “some stuff redundant”. These two docs explain it pretty well.
My concern is that by making an error in the benchmarks, you could be coming to the wrong conclusions.
We have a saying in German, “Wer misst, misst Mist”, which translates roughly to “who measures, measures crap”.
By setting sync or async on the TrueNAS side, you can rule out any misconfiguration on the NFS or fio side.
I have never used XCP, but maybe VHD vs. RAW is also something you could look into. Maybe also Ceph or Gluster as storage.
Unfortunately I can’t help you further, since I moved away from a Hyper-V failover cluster to a not-so-redundant Proxmox setup.
Good luck with your project, and please report back your findings.