Need advice on optimal 10-drive ZFS pool layout for mixed media & personal data

I hope you analyzed the metadata in advance and know what your sVDEV small-file / metadata needs most likely are now and will be in the future. I’d go for a 3-way mirror, but to each his/her/their own.

By default, the sVDEV will be the source for all metadata that the ARC (and an optionally fitted L2ARC) can hold. If the sVDEV is already “fast” by virtue of being on an SSD, what is there to gain by caching metadata on the L2ARC too?

Moreover, the L2ARC only comes into play for metadata / files after an ARC miss, after which the system decides to add the relevant info to the L2ARC, etc. In my limited experience, it took about three passes / misses for the system to add all relevant metadata to an L2ARC (and that was on a system w/o a sVDEV).

In a system with a sVDEV, the only function the L2ARC might still perform is as a cache for files that are frequently read but not modified. That use case may exist, but in my limited setting I found zero benefit from an L2ARC that wasn’t caching metadata on a system with a sVDEV.
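
For anyone following along, the per-dataset knob for this is the secondarycache property; a minimal sketch, with hypothetical dataset names:

# secondarycache=all (the default) lets the L2ARC cache both data and metadata;
# with a sVDEV already serving metadata from SSD, some people restrict it per dataset.
zfs set secondarycache=all tank/media          # still cache data blocks in L2ARC
zfs set secondarycache=metadata tank/other     # or: metadata only / none
zfs get secondarycache tank/media              # verify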

1 Like

Really comes down to use case. For SOHO, I totally agree. For large VM host / database servers with a bazillion writes and high uptime requirements, I can see the justification to mirror.

Over in the old forum, one of the SLOGs had over 96PB written to it. I also seem to recall a system with Optane, dedup, and a large number of VMs just absolutely crush it re: use case.

1 Like

There is actually some (minor) benefit, as backups could be interrupted with the lid closed.


Also, I personally set sync=disabled for my TM dataset. If the backup gets corrupted, there are snapshots. Not a very robust solution, but OK for home use.
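
In ZFS terms, that setup is roughly the following (pool/dataset names are placeholders):

zfs set sync=disabled tank/timemachine        # TM writes get treated as async
zfs snapshot tank/timemachine@pre-backup      # snapshots are the fallback if a backup corrupts
zfs list -t snapshot tank/timemachine         # confirm the rollback points exist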

Not sure about the speed boost, though. I didn’t take measurements, but if there is a speed-up, it’s less than 2×.

Based on what I’ve seen, a lot depends on the busyness of the computer backing itself up, the speed of the connection, and so on.

For example, there is a noticeable impact if I run a regular TM backup over the 10GbE connection vs. 1GbE or WiFi. That suggests link speed is the biggest factor for my use case; it would likely be different if the server / CPU were saturated with work.

Sync surely plays a role. While my NAS has a SLOG (Optane), what likely really helps is that the server isn’t otherwise busy with work, so the pool can lay down that data pretty quickly.

FWIW, the difference can be two orders of magnitude.

macOS uses sync writes for SMB.

1 Like

Without the SLOG, the bottleneck would be your sync write speed (for an HDD RAIDZ pool, think sub-10 MB/s).

With a good SLOG (such as a nice old Optane drive), the bottleneck becomes the network.

I think the reason the backups are slow when the client is busy is that the backup process is deprioritized.

While I’ve only disabled sync writes for the dataset (instead of adding Optane as a SLOG), I have strong doubts about that “two orders of magnitude” boost.

Mine didn’t even reach 2×. Subjectively, it went down from 60-70 min to 40-45 min – but I didn’t (and don’t really know how to) run scientifically fair tests. And now I get occasional 100% CPU diskimagesiod spikes (not on every backup).

From 2-3 MB/s to 200-300 MB/s.

1 Like

How exactly did you measure this? The speed of (my) TM backups is not (and never was) constant.

I have a similar use case; my systems are in my signature. I use the SSDs for the metadata vdev and RAIDZ2 for my data. Since most of my files are large video files, I have everything set to a 1M block size and 8K for the metadata. That’s it… pretty simple. The backup machine is just that… for backups… nothing more.
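
In property terms, that is roughly the following (dataset names are placeholders, and I’m assuming the 8K refers to the special_small_blocks cutoff):

zfs set recordsize=1M tank/media              # large records suit big video files
zfs set special_small_blocks=8K tank/media    # blocks <= 8K land on the SSD special vdev
zfs get recordsize,special_small_blocks tank/media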

BLUF: For now, I’m keeping both the mirrored NVMe SLOG and the L2ARC because they align with my Time Machine sync write needs and frequently-read large file access. I’ll monitor real-world results, and if the data shows they’re not helping, I’m open to removing them.

Thanks for all the follow-ups — appreciate the mix of perspectives here. I’ll try to address a few of the points raised.

On SLOG and async vs. sync writes (@Sara)
I agree: async writes never touch the SLOG. My reason for setting sync=always on the Time Machine dataset is that macOS uses sync writes over SMB for many operations, and for backups, I prefer the durability guarantee even at the cost of some throughput. Without a SLOG, sync writes go to spinning rust in the RAIDZ2 vdev, the slow path. With a fast mirrored NVMe SLOG, I’m aiming to shift that bottleneck closer to the network link speed. I’m not expecting async performance here, just better sync performance where it matters.
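
Concretely, the dataset setting and the check I use to confirm the SLOG is actually being hit look something like this (pool/dataset names are placeholders):

zfs set sync=always tank/timemachine          # force sync semantics for TM writes
zpool iostat -v tank 5                        # watch per-vdev activity during a backup;
                                              # the log vdev should show the write traffic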

On SLOG mirroring (@Sara)
I understand the argument for a single SLOG device, but for me the mirrored SLOG is about peace of mind. If the pool crashes and the single SLOG also fails in that window, I’d rather not lose in-flight sync writes. It’s low probability, but my NVMe slots are already populated, so I’m fine paying the small “opportunity cost” in capacity.
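
For reference, this is roughly the command (or GUI equivalent) involved, with placeholder device paths; a log vdev can also be removed later without harming the pool:

zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank                             # the pair appears under “logs”
# zpool remove tank mirror-2                  # to back it out later; use the vdev name shown by zpool status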

On L2ARC with a special vdev (@Sara & @Constantin)
Point taken that with a fast mirrored special vdev holding metadata and small files, L2ARC may offer minimal benefit for metadata access. However, in my own usage I’ve seen the L2ARC hold ~700 GB even with the special vdev active, which tells me it’s caching more than just metadata — likely frequently accessed larger files that don’t live on the special vdev. I plan to keep it for now, monitor ARC/L2ARC hit rates, and remove it if the benefit turns out to be negligible.
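
The monitoring I have in mind is nothing fancier than this (a sketch; exact arcstat field names may vary by version):

arc_summary | grep -iA 15 l2arc               # L2ARC size, hit ratio, header overhead
arcstat -f time,read,hit%,l2read,l2hit% 5     # live ARC vs. L2ARC hit rates every 5 s
# zpool remove tank <cache-device>            # cache devices can be dropped later if the numbers disappoint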

On special vdev redundancy (@Constantin)
I did look at my current metadata usage, and the 2-way mirror gives me headroom for growth. I appreciate the 3-way mirror suggestion; if I see sustained growth or increased risk tolerance needs, I’ll revisit.
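
For headroom checks, a quick look at per-vdev allocation is enough (pool name is a placeholder):

zpool list -v tank                            # the special mirror shows its own SIZE / ALLOC / FREE line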

On SLOG impact for Time Machine (@Stux & @swc-phil)
@Stux — thanks for confirming what I’ve read and seen elsewhere: in the right conditions, the jump from HDD-bound sync writes to NVMe-accelerated sync writes can be dramatic.
@swc-phil — I think the big difference in results comes down to hardware and layout. If the base pool can already do decent sync speeds, the uplift will be smaller. In my case, the pool is 10 × 18 TB HDDs in RAIDZ2, so raw sync write speed without a SLOG is in the ~10 MB/s range. I’m expecting improvement into the hundreds of MB/s with the SLOG, but I’ll measure and report back so it’s data, not just theory.
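
For the measurements, my plan is a simple before/after with fio rather than anything rigorous; a sketch with placeholder paths:

fio --name=tm-sync-test --directory=/mnt/tank/timemachine \
    --rw=write --bs=1M --size=4G --ioengine=psync --sync=1 \
    --numjobs=1 --group_reporting
# run once without the SLOG attached and once with it, and compare MB/s

Plus a real Time Machine run over the same 10GbE link, since synthetic numbers only tell part of the story.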

Thanks again for the constructive debate — even where we disagree, it’s helped me sanity-check the design.

3 Likes

Looking forward to seeing the outcome, if you can share the performance for your various use cases.

Just a FYI on macOS : you can change a few SMB client configs that can help with Time Machine backup performance:

  1. Disable packet signing (if you completely trust the SMB client and server) by adding the following lines to your /etc/nsmb.conf file:

[default]
signing_required=no

  2. Disable delayed TCP acknowledgements by running the following command as root via your terminal:

sysctl -w net.inet.tcp.delayed_ack=0

I previously used Time Machine pointing to a QNAP all-flash NAS, over SMB, and with 10GbE on all devices. After these changes, I recall seeing drastic performance improvements over SMB. I don’t have metrics available, but I’ve saved these in my “must do on all Macs” since then.

1 Like

By default it saves everything (within the set L2ARC feed rate) that has been evicted from the ARC. If your L2ARC and sVDEV have the same speed, you should set l2arc_exclude_special=1.
Also, you can consider tuning l2arc_mfuonly to store only frequently accessed blocks (if that is what you expect from the L2ARC).
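
On Linux-based systems (e.g. TrueNAS SCALE) both are zfs module parameters; a sketch of setting them at runtime (persistence across reboots depends on the platform):

echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special   # don't also feed sVDEV-resident blocks into L2ARC
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly           # only cache blocks evicted from the MFU list
cat /sys/module/zfs/parameters/l2arc_mfuonly                # verify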

You guys still haven’t answered how exactly you came up with these miserable numbers. Did you just multiply the suggested single-drive IOPS (100) by the recordsize (128K)? Curious, because I’m considering running the same tests.
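
If it is just that multiplication (my assumption, since nobody has spelled it out), the arithmetic would be roughly ~100 random-write IOPS per drive × 128 KiB recordsize ≈ 12.5 MiB/s, treating a RAIDZ vdev as having single-drive IOPS for small sync writes – which at least lands in the same ballpark as the sub-10 MB/s figure quoted above.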

Also, AFAIK, sync writes can actually slow down async writes and vice versa (I’ve heard something like “async writes become virtually treated as sync in a mixed stream of writes”; can’t find the exact GitHub issue ATM). Thus, an (async) write-heavy pool can have poor sync write performance. This is another reason I’m asking.

1 Like

The video is not wrong; the question is whether you need that extra speed for anything – for example, it won’t affect streaming for Plex, as previously stated.

Where an SSD/NVMe metadata vdev can help is with things like browsing folders over SMB that contain tens of thousands of files.

The question is how often you need that level of snappiness, and whether you are willing to invest in protecting the metadata vdev to the same level as the spinning disks it is servicing.

For example, if you have a RAIDZ2 HDD vdev with a RAIDZ1 or two-way-mirrored metadata vdev, your pool as a whole can only cope with the failure of one disk if that disk is in the metadata vdev… undoing the intent of RAIDZ2. This is both a philosophical and a practical question about risk / benefit, based on your approach to risk.

Hope that helps. I had the same mental-model challenges when I started; ZFS isn’t intuitive, as it’s so different (and cool) from what came before.

For the record, I do what I shouldn’t: I have a RAIDZ2 with a mirrored metadata vdev – but for me my NAS is still in test mode, not production… I have run it like this for ~4 months of testing… my plan is to make the metadata vdev a 3-way mirror, or to recreate the pool with a RAIDZ2 metadata vdev… maybe :wink:

Also, I am still trying to get my head around how things change when the primary access is native and local to the box vs. via SMB, and how the overall performance of the machine plays into it.

For example, I saw large gains in SMB reads and writes when using an NVMe-based set of special vdevs, and was very surprised when I didn’t see them on a more powerful server – but I think that’s because on the more powerful server I could saturate the 10GbE link with no pressure on the vdev or the ARC (it had 128GB of RAM vs. the smaller machine’s 64GB), whereas 10GbE access on the smaller machine pressured the rest of the system – either PCIe bandwidth, memory, or CPU – just not sure which.

Another example: the assumptions and recommendations that are true for local access to the disks’ files are not the same as when those files will primarily be accessed via SMB. My noob-ish take is that one needs to think about ZFS in terms of the access method, not just “how fast can I run a benchmark locally on my NAS”…

I do plan to redo these tests when I have the disks and the time (all my disks got moved from the small system to the bigger system when I needed a much better PCIe subsystem – I went from a ZimaCube Pro to an EPYC 9115-based system, lol).

I feel I have learnt a lot, but have soooo much more to (re)learn about how ZFS works :slight_smile: It’s fun.

But the AFR for the SSDs in a sVDEV and for the HDDs is probably very different. I think that @Sara made a good point:

Purely anecdata, but amongst my 2 x 6 Z2 array I have 5 drives with over 72,000 power-on hours and 3 with over 91,000, and they’re only NAS-grade drives, not enterprise. That’s 8-10+ years of trouble-free use. They are all HGST drives, which probably explains it. RAID isn’t a backup, so, as always, act accordingly with precious data.

Onto the OP’s question and the subsequent responses: I haven’t found any real issues with keeping the whole thing dead simple and limiting points of failure and unnecessary complexity.
I use 2 x 6 Z2 as I get higher IOPS with 2 VDEVs, and the performance is enough for me. There’s also a spreadsheet kicking around somewhere that details the inefficiency of certain setups (block padding?) at certain drive counts. A 6-drive Z2, i.e. 4 “data drives” for want of a better term, is the most efficient; a 10-drive Z2 or 11-drive Z3, i.e. 8 “data drives”, is pretty good. Someone more knowledgeable will need to fill you in on rebuild times with these setups, and hence whether 11 Z3 might be preferable to 10 Z2, etc.

For anything needing higher performance I use a separate SSD pool - containers with databases, VMs etc. Nextcloud config etc is on the SSD pool and the data on the HDD pool.
The SSD pool is snapshotted and replicated to a dataset on the HDD pool.
The whole system is replicated across to a secondary TrueNAS server (pull replication), but that’s just a luxury I have from gathering electronic cruft over the years and never retiring anything until it breaks or is simply incapable of doing anything useful.
Any “can’t possibly lose” data is backed up to Backblaze B2.

With 128GB RAM you shouldn’t have any issues. I didn’t with 64GB but moved to 128GB so my new CPU could run more VMs/containers without impacting ARC.

I think most people think they need way more performance than they really do.

I’m starting to think that the only reason the sVDEV is so unpopular (amongst users with a so-called separate SSD pool for apps and VMs) is that the ix-apps dataset cannot be effectively moved to the sVDEV.

Because with an HDD pool + sVDEV layout you get (almost) all the pros/features of the HDD pool + SSD pool layout, and a metadata performance boost as a bonus.

Perhaps, I could even create a feature request for making ix-apps dataset management less restrictive.

But I don’t use apps in TrueNAS, so that’s…


Ok, but are these frequently accessed larger files really faster from the NVMe than from the RAIDZ? For me, the 10Gbit NIC is the bottleneck for large files read from my RAIDZ2.

Real world testing :slight_smile:

This IMHO has nothing to do with sync or async itself. It is more that sync writes are probably small, random writes rather than a sequential write of a 100GB file. So yeah, if your disks are busy writing 4K random, async write performance will suffer.
IMHO this mostly comes from the old days, when VMs were not on flash.

I think you are not getting my argument, but at the same time make a great example that speaks for my argument :slight_smile: So if you don’t mind, I use your array as an example.

The problem is not how long drives last; it is how likely they are to fail at the same time. Why? If I have trashy SSDs, it does not matter if one fails after 6 months. I replace it, resilver, and get on with my day.
Your RAID array on the other hand consists of extremely old HDDs, all from the same brand.
Going by the bathtub curve, the failure rate grows with age.
So there will be a point in time, when they will fail closer to each other.
Whether that point comes 3 years in because they are NAS HDDs, or 10 years in because they are enterprise HDDs, does not matter.

That is why, IMHO, your pool is way more at risk than my made-up mirror with two brands.
For my pool to fail, two drives would have to fail at the same time by pure accident, despite being totally different drives with different hardware.
For your pool to fail, only 3 of the 6 drives in one RAIDZ2 vdev need to fail at the same time.

That is another point I try to get across, in a home lab scenario you are less likely to replace drives just because of age. You use them until failure. This makes the risk of simultaneous failure go up with time.

What speaks against my theory is that you might be able to detect aging drives sooner with ZFS than with a traditional RAID. A drive might not fail under the pressure of a resilver, but could show checksum errors before that. Since I have never had an aging old ZFS pool, I unfortunately can’t tell.

Ok. Just describe it. Like, what was the macOS machine? How was it connected? Was the backup hourly, daily, or weekly? What is this speed – average, max, min? Did you just see it in Activity Monitor, or was it zpool iostat on the TrueNAS side? What was the pool layout? What was the SLOG model? Was there any other activity on the NAS during the backup?

Etc., and so on.

No, I saw a discussion on GitHub. It was specific to (Open)ZFS.

FWIW, I’ve seen some (Klara Systems) articles state that unusually high util or wait on one drive in iostat can indicate pre-failure. One, two.
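
The quick check, as I understand those articles, is something along these lines (intervals are arbitrary):

iostat -x 5                                   # per-device %util and await; one drive far above its siblings is suspect
zpool iostat -v -l tank 5                     # per-disk latency columns from ZFS's own point of view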