Fastest RAID-Z3 configuration for 10-Gigabit network?

Greetings all,

I just installed TrueNAS Scale 24.10.0.2 and set up a RAID-Z3 Samba share for the Windows PCs on my home network:

The server and my main PC both have 10-Gbit network cards installed, and they are connected to 10-Gbit ports on my router.

I’d like to know whether I can make the initial write rate of about 1 GB/s last longer if I rearrange or upgrade the SSDs.

My hardware:

The main system is a Supermicro X10SRH-CF motherboard with an Intel Xeon E5-1620 v4 (3.5 GHz) and 64GB ECC RAM.

The pool is eight 4TB 7200RPM SAS drives in RAID-Z3.

The pool’s cache is a Samsung SSD 990 PRO 1TB with Heatsink. (PCI-e)

The pool’s LOG is Samsung SSD 990 PRO 4TB with Heatsink. (PCI-e)

And the pool’s Metadata drive is a Samsung_SSD_870_QVO_1TB (SATA)

Thanks for reading!

Too large! The usual advice is 5×RAM < L2ARC < 10×RAM.

Very wrong choice! No PLP and ridiculously large: SLOG is NOT a write cache and ZFS would never keep more than 10 seconds (2 txg) worth of transactions.

:scream: If this is a single drive you’ve put your whole pool at risk. With a raidz3 you’d need a 4-way mirror to match redundancy.
Remove the L2ARC, which is largely useless if there’s a special vdev, and get three more 1 TB NVMe drives to make a 4-way mirror as special vdev.


I suppose that answers my question about swapping the roles of the two NVMe drives.

Should I shrink the partition size for the L2ARC, or install a 512GB SSD for that?

I assume PLP is Power Loss Protection? I have the system on a battery backup in case the power goes out.

Would you advise using the 4TB drive for Metadata until I can get more SSDs for that purpose?

If you keep the special (metadata) vdev—and you cannot remove it without destroying your pool—you most likely do not need L2ARC at all.
What’s the output of arc_summary?
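
If you want a quick look yourself, it can be run from the TrueNAS shell; the grep below is just a convenience, and the exact label wording varies a little between OpenZFS versions:

```sh
# Full report (long); pipe through less to browse it
arc_summary | less

# Quick look at the cache hit ratios for ARC and L2ARC
arc_summary | grep -i "hit ratio"
```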

This is not an adequate substitute for PLP. What if someone trips on the power cable, or something goes wrong inside the NAS itself?
Do you have sync writes? If not, you do not need a SLOG at all. (And if you’re serving iSCSI, raidz3 is not a good layout.)
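
If you’re unsure, two quick checks from the shell will tell you; `tank` below is a placeholder for your pool name:

```sh
# Show the sync property (standard / always / disabled) for every dataset
zfs get -r sync tank

# Watch per-vdev activity during a transfer; little or no activity on the
# log device suggests nothing is issuing sync writes, so the SLOG is idle
zpool iostat -v tank 5
```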

I advise mirroring the metadata vdev as soon as possible (lose it = lose the whole pool). So, yes, throw in the current L2ARC and SLOG for this purpose if that’s all you have at hand.
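
For reference, this is roughly what reusing those drives looks like from the CLI; pool and device names are placeholders, and the same steps can be done from the pool’s device management screen in the UI:

```sh
# Log and cache vdevs can be removed at any time without data loss
zpool remove tank <slog-device>
zpool remove tank <l2arc-device>

# Attach a freed drive to the existing single-disk special vdev to turn it
# into a mirror; repeat with more drives for a 3- or 4-way mirror
zpool attach tank <current-special-disk> <freed-nvme-device>
```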


@etorix seems to have covered all the bases that I would have covered, and I agree with a lot of what he has said…

L2ARC is probably not going to benefit you much - and if you need more ARC hits you would be better off adding memory.

SLOG is only for synchronous writes - and Windows SMB is always asynchronous and won’t use SLOG.

If you are going to run VMs or apps, then set up an SSD/NVMe pool for those, ideally mirrored, but you can always replicate it to HDD as a backup if not. Because VMs use zVols and will almost certainly create synchronous writes, an SLOG might be useful for these, but generally only if the VM data is too big for SSD and is on HDD (because SLOG on SSD/NVMe is much faster than HDD) OR if data is on SSD and the writes to the SSD/NVMe are so heavy that you will benefit from the ZIL being on a separate drive to split the ZIL and data I/Os.

In my personal opinion a metadata vDev is typically not going to give you that much benefit unless you have a particularly unusual workload that does a lot of access to a multitude of small files - but it does introduce another layer of complexity and points of failure to your pool. The ARC will typically hold a lot of the metadata you might want, and if you find that you are getting a lot of metadata ARC misses, you can preload the ARC with metadata by running a cron job that scans all directories every so often. Or if you have a particularly hot set of data, put it on an SSD/NVMe pool.
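
A minimal sketch of that kind of cron job, assuming the pool is mounted at /mnt/tank (adjust the path); walking the tree and stat-ing every entry is enough to pull the metadata into ARC:

```sh
#!/bin/sh
# warm-metadata.sh - stat every file and directory so their metadata ends up
# cached in ARC. Schedule it as a TrueNAS cron job, e.g. shortly after boot
# or once an hour.
find /mnt/tank -ls > /dev/null 2>&1
```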

The difficulty is that I don’t think you can remove a metadata special vDev once it is added to a RAIDZ pool. And if you are going to keep it then it needs to be a 2x mirror at a minimum because if the drive fails you will lose your pool. From a technical perspective it doesn’t have to be a 3x or 4x mirror to match the RAIDZ3, but as Etorix says if you want Z3 protection on the data then you should really match it with a 4x mirror on the metadata vDev.

So, since this is a new install (and so hopefully not too much data on it yet) my advice would be to take any data off the RAIDZ3 pool and recreate it without special metadata, SLOG or L2ARC vDevs. (You can remove the SLOG and L2ARC vDevs if you want to use them as a temporary home for the data whilst you recreate the pool.)
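
If it helps, the data can be moved off with ZFS replication rather than file copies; a rough sketch, where `tank` is the existing pool and `temp` is whatever temporary pool you build from the freed-up SSDs (a one-off Replication Task in the UI can do the same thing):

```sh
# Snapshot everything recursively, then replicate it to the temporary pool
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs recv temp/tank-copy

# After destroying and recreating the HDD pool, send it back the same way
```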

(As an aside, there is nothing wrong with a RAIDZ3 on 8x4TB drives if you really feel that 3x redundancy is necessary for the risk profile you want, but most people seem to feel that RAIDZ2 is sufficiently low risk for a vDev that has a medium number of drives (in the middle of the range 3-12) and where the drives are relatively small (4TB rather than 18TB) and so a resilver isn’t going to take too long.)

See the resource section re: sVDEVs and learn what they do and don’t help with. As @etorix mentioned, if you lose your sVDEV, you lose the pool, so the sVDEV should be as redundant as the rest of the pool (my sVDEV is a quad-mirror for a Z3 pool).

To get a consistent 1GB/s file transfer with standard 7200RPM HDDs likely requires four Z3 VDEVs in your pool. That’s usually a lot of HDDs.

NAS pools / datasets meant for fast I/O are frequently set up as a bunch of parallel mirrors. You lose 50%+ of pool capacity to parity but gain IOPS.

Perhaps it would help if I specified what my server is used for:

Its sole purpose is to store system images and other data backups on my personal home network, and is turned off when not needed.

I want it to be fast and redundant network storage, and I want the storage to be as fast as possible.

My server’s case only has space for eight 3.5" drives, but I can add a few more PCI-e SSDs and up to 8 more SATA SSDs if necessary.

I’d also like to upgrade to eight 8TB drives (or larger) in the future, would that require more RAM?

If you want 1GB/s transfer speeds out of an eight-drive pool, I wonder whether SATA SSDs could deliver that kind of speed. If you go with an all-NVMe SSD array, that seems possible but very pricey.

Given your use case, I’d be content with just a single VDEV of inexpensive HDDs swallowing ISOs at 100-250MB/s.

I agree that an NVME SSD array would probably be fast enough, but way too pricey for me.

I had thought that using SSDs to complement the array would speed it up a bit, but I’ll try using the HDDs on their own and see how they perform.

You could use the 4TB SSD as an ingress drive, push backups to it. And then a job to move data to the HDD pool.

They can complement nicely via a sVDEV for small files and fast metadata read/writes. But as others have mentioned, you need to have a lot of redundancy in a sVDEV for your pool to have good reliability.

As for fast writes, I suggest making sure the files you’re writing do not consist of bazillions of little files. That kills throughput. Big archives are the friends of high speed throughput to HDDs as it minimizes overhead.

Also, avoid SMR HDDs at all costs.

Certainly. I’m not aware of any automated solutions or apps to do this sort of movement, however. Are there solutions that do not require bash expertise?

Copying from an SSD in my main PC to the 4TB SSD in the server is sustaining an average of 700 MB/s; going directly to the pool was about 300 MB/s.

I can copy from the 4TB SSD to the pool from Windows, but will that use my Windows machine as a go-between for the data?

Unless you come up with a script or find an App to do this, I’d suspect you’d be copying it from the SSD pool to the HDD pool via your computer. So a read and a write via the network. I’d avoid that unless you can find a reliable way to get the data off the SSD pool and into the HDD pool at the NAS level - can be done via CLI, for example, with a mv command.
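
For example, from the NAS shell, where the paths are placeholders for wherever your SSD and HDD datasets are mounted:

```sh
# Move everything from the SSD ingress share into the backup dataset on the
# HDD pool. Across pools this is a copy-then-delete, so let it finish before
# shutting the NAS down.
mv /mnt/fast/ingress/* /mnt/tank/backups/
```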

However, you’d have to balance the simplicity of using the SSD pool as an ingress cache (plus the CLI to mv stuff to the slower HDD pool) against a couple of caveats: 1) you have to be very observant about naming stuff well to avoid accidental over-writes / misnaming, and 2) you must not shut the NAS down (as you expressed a desire to) while the thing is flushing the SSD pool to the HDD pool.

One GUI option could perhaps be using rsync to back up a particular folder on the SSD pool on a one-way basis. Can the built-in rsync be set up as a one-way flush - i.e. ignore everything at the destination except where the source has a new file and the destination doesn’t? Could be worth exploring. Or find out whether Syncthing or any of the other app-based backup / synchronization solutions allow this sort of flush and, once complete, shut the NAS down.
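
From the shell, something along these lines would behave as a one-way flush (the paths are placeholders); the built-in Rsync Task should be able to do much the same if given the extra flags:

```sh
# Copy files that are new on the SSD pool to the HDD pool, delete each source
# file once it has transferred, and leave anything already present at the
# destination alone.
rsync -a --ignore-existing --remove-source-files \
    /mnt/fast/ingress/ /mnt/tank/backups/

# Tidy up the empty directory tree left behind on the source
find /mnt/fast/ingress/ -mindepth 1 -type d -empty -delete
```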

I am uncertain what exactly you mean by this. What is your “main PC” and why would you copy via Windows?

The whole benchmarking of network writes is complex.

You need to understand how ZFS writes work, and to understand the role of memory in storing data before writing to disk, and especially whether your writes are synchronous or asynchronous.

The first question you need to ask is whether you are using synchronous or asynchronous writes. If you are backing up from Windows over SMB, the writes are definitely asynchronous. If you are using any other network protocol, the chances are they are synchronous by default. That said:

The initial ~1 GB/s burst you describe suggests that these are asynch writes, which don’t get written to disk immediately but instead get stored in memory; when the memory write buffer is full, the network has to wait for some data to be written to disk and the network speed then slows down to disk speeds. I suspect that you wouldn’t achieve these network speeds for a single stream of synch writes.

But synch writes are not only slower because every packet has to be written to disk before the next one is sent; they are also slower because they do substantially more writes:

  1. For every packet, synchronous writes create a ZIL write.
  2. Then both synchronous and asynchronous writes group together data held in memory and do batched writes to disk.
  3. The ZIL writes that have now been written to disk then need to be removed from the ZIL - which is another write.

So if you have (say) 10 network packets grouped to write to disk, that might be one write-to-disk for asynchronous traffic and 12 writes-to-disk for synchronous traffic.

And this may explain why you are getting faster writes to disk from Windows than from your “main PC” which I am assuming is a Mac or Linux (and which probably does synch writes by default). IMO backups probably do not need every packet confirmed written to disk like e.g. a transactional database would require, and so you should consider turning these off for the backups.

There are various ways to turn off synch writes, but the easiest would probably be to do so on the dataset you are writing your backups to - and this is probably worth an experiment to see if it helps.
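
It can be toggled per dataset, either in the dataset’s advanced options in the UI or from the shell (the dataset name below is a placeholder):

```sh
# Check the current setting
zfs get sync tank/backups

# Disable sync writes for this dataset only; set it back to 'standard' to
# restore the default behaviour
zfs set sync=disabled tank/backups
```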


I’ve removed the SATA SSDs and am now booting from the 1TB NVMe, and I set up the 4TB NVMe as its own network share.

I also set up a network monitor on my main Windows PC and found that it’s not being used as a go-between when copying between the two TrueNAS shares.

I’m getting good speeds when I transfer between SSDs on the network, so I can live with writes from the 4TB share to the pool being 200-300 MB/s since they don’t hog bandwidth.

I’d still like to increase the RAID-Z3 write speed (if possible), but that can wait until I get larger drives and more memory.