ZFSisms that are not true, or no longer true

There have been many changes to ZFS over the decades. And yes, it is now 19 years since ZFS was released by Sun Microsystems. (ZFS was in development for several years before then.)

Some ZFS information has been distorted or changed over the years, or was never a hard and fast rule in the first place. Here is some of the more common misinformation:


Long ago it was suggested to have 1GByte of memory per 1TByte of ZFS disk storage. But this was never a hard and fast rule, and it does not apply to casual users, nor to applications where data is read once and not again for a long time, like media files.
Ram to HDD TB size question | TrueNAS Community
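If you want to see what your own workload actually needs, the ARC size and its current ceiling can be read straight from the kstats. A minimal sketch, assuming OpenZFS on Linux (e.g. TrueNAS SCALE), where the kstats live under /proc/spl/kstat/zfs:

# Print the current ARC size ("size") and its target maximum ("c_max"), in GiB.
awk '$1 == "size" || $1 == "c_max" { printf "%-6s %.1f GiB\n", $1, $3 / 2^30 }' \
    /proc/spl/kstat/zfs/arcstats

If the ARC sits well below its ceiling even after the system has been busy for a while, adding more RAM purely for the sake of the old 1GB-per-1TB ratio is unlikely to help.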


One old rule concerned ZFS RAID-Zx vDev widths, claiming there were “optimal” widths per parity level (1-3 disks’ worth of parity). However, with data compression, that “rule” is not applicable today. Here is the reference:
How I Learned to Stop Worrying and Love RAIDZ


In the past, you could not expand a RAID-Zx vDev. Today you can, though with some caveats (see the sketch after this list):

  • Pre-existing data keeps its old data to parity ratio. A simple copy (perhaps through a re-balance script) will solve that.
  • Free space reporting seems to reflect the old data to parity ratio.
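
A rough sketch of what expansion plus a manual re-balance can look like, assuming OpenZFS 2.3 or newer, with the pool, vDev, dataset and disk names below purely as placeholders:

# Grow an existing RAID-Z2 vDev by one disk (OpenZFS 2.3+), then watch progress.
zpool attach tank raidz2-0 /dev/disk/by-id/NEW_DISK
zpool status tank

# Records written before the expansion keep their old data-to-parity ratio
# until they are rewritten. Copying a dataset (here via send/receive) rewrites
# its records at the new width; re-balance scripts automate the same idea.
zfs snapshot -r tank/media@rebalance
zfs send -R tank/media@rebalance | zfs receive tank/media_rebalanced

Once the copy is verified, the old dataset can be destroyed and the new one renamed into its place.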

For quite some time ZFS has supported a full hybrid HDD/SSD pool using Special Allocation vDev(s). It is possible to select what is stored on the Special vDevs: small files, metadata only, De-Dup tables, etc.
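
As a sketch of the commands involved (pool, device and dataset names are placeholders; note the warning further down that a Special vDev is critical to the pool, hence the mirror):

# Add a mirrored Special Allocation vDev; it will hold pool metadata (and,
# optionally, small file records and de-dup tables).
zpool add tank special mirror /dev/disk/by-id/SSD_A /dev/disk/by-id/SSD_B

# Per dataset, records at or below this size also land on the Special vDev.
# The default of 0 means metadata only.
zfs set special_small_blocks=32K tank/documents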


It is possible to remove top level Mirror or Stripe data vDevs from a pool. However, if the pool contains a RAID-Zx vDev, that removal is not possible.
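
For instance (a sketch with placeholder names), evacuating and removing a top-level mirror looks like this; the same command simply errors out on a pool that contains any RAID-Zx data vDev:

# Copy the vDev's data onto the remaining vDevs and remove it from the pool.
zpool remove tank mirror-1
zpool status tank    # shows the evacuation / removal progress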


Redundancy exists at the vDev level, not the pool level. One non-redundant data vDev puts the whole pool at risk.
 
Note that SLOG / LOG, L2ARC and Hot Spare(s) are not data vDevs, so losing one does not put the whole pool at risk. However, a Special Allocation vDev IS critical to the pool: lose it and you lose the pool.


The ZFS Scrub of Death without ECC memory IS real! However, unless you have really bad non-ECC memory, you are so unlikely to experience problems that you can stop worrying about it. And if you insist on worrying about the ZFS Scrub of Death, get ECC memory and quit worrying so much. (You’ll live longer too.)
Here are references:
Will ZFS and non-ECC RAM kill your data? – JRS Systems: the blog
ECC memory vs Non-ECC memory - Poll! - #65 by DAVe3283


Recent changes to ZFS’ L2ARC handling mean that larger L2ARC devices and smaller memory sizes are now usable. It is still recommended to add memory rather than L2ARC where possible, as memory will always be faster. Users must test for viability themselves.

The old rule about not adding L2ARC with less than 64GBs of memory is no longer true. However, low memory servers (8GBs without Apps / VMs, or 16GBs with some Apps / VMs) still may not be suitable for adding L2ARC.

Further, monitoring your ARC hit rate and size is still important: if your ARC is small but the hit rate is already very high, an L2ARC won’t help.
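
A quick way to eyeball that, sketched for OpenZFS on Linux (arc_summary ships with the OpenZFS tools on most platforms; the one-liner just computes hits / (hits + misses) from the same kstats the tool reads):

# Human-readable summary, including ARC / L2ARC sizes and hit ratios.
arc_summary

# Or pull the cumulative ARC hit rate straight from the kstats.
awk '$1 == "hits" { h = $3 } $1 == "misses" { m = $3 }
     END { printf "ARC hit rate: %.1f%%\n", 100 * h / (h + m) }' \
    /proc/spl/kstat/zfs/arcstats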

  • The old “rule” that L2ARC should be about 5 times RAM, and no more than 10 times RAM, is no longer applicable.
  • The individual record overhead for L2ARC in RAM was reduced from 180 bytes to 96 bytes.
    ZFS Cache / L2ARC adding as reduced size | TrueNAS Community
  • More recent versions of ZFS have improved the eviction logic from ARC to L2ARC.
  • Persistent L2ARC improves its usability, even for lower memory servers.
  • Keeping records compressed in L2ARC has improved how much can be stored.

On the other hand, the old rule of not adding an L2ARC unless you have a known use case still applies. For example, a media server that serves up a media file once and not again for a long time would not receive any real benefit from L2ARC. However, a weekly backup server might benefit from an L2ARC.



If you have any to add, reply with the information and I will review. If applicable, then I will update the first post.

I am thinking about adding an L2ARC entry. Some people think that a low-RAM NAS (8GB or 16GB RAM) with a very large L2ARC device (like 1TB) is the way to go. I need to find some clear references.

With the new compressed L2ARC and its reduced memory footprint, the rule gets really messy to describe.

Anyone have references?


I haven’t been able to find any refs, but I can provide an anecdote.

I can definitively say that a 128GB system with spinning disks is unusable as a hypervisor (20s to log in to a VM), but if you add a 1TB L2ARC (actually a stripe of 512GB SATA SSDs), everything is fine.

The latest T3 Tech Talk just covered an ARC & L2ARC question. The question asked was similar.

“Adding L2ARC on any system <64GiB RAM will be counterproductive as the (large) L2ARC memory overhead would have been better used for ARC.”

I have yet to see anyone actually demonstrating this in practice through actual data and real-world examples. In addition, the L2ARC header (the part for each cached record that must be stored in RAM) was reduced from 180 bytes to 70 bytes a number of years ago. For a humongous 1TiB L2ARC with the default 128KiB recordsize, this works out to 640MiB of RAM. Or, if you’d like, 256GiB L2ARC with average size 32KiB records. Which could be a pretty appetising addition to all manner of configurations <64GiB. Lastly, more recent ZFS versions have improved the eviction logic from ARC to L2ARC, combined with more sensible default values for modern hardware.
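
For anyone who wants to redo that arithmetic with their own numbers, the estimate is just the L2ARC size divided by the average record size, multiplied by the header bytes. A throwaway sketch, with the header size left as a variable since sources quote 70, 80 or 96 bytes depending on version:

# Rough RAM cost of L2ARC headers: (device size / average record size) * bytes per header.
L2SIZE=$((1024 ** 4))        # 1 TiB of L2ARC
RECSIZE=$((128 * 1024))      # 128 KiB average record size
HDR=96                       # bytes of RAM per cached record (70/80/96 depending on ZFS version)
echo "$(( L2SIZE / RECSIZE * HDR / 1024 / 1024 )) MiB of RAM for L2ARC headers"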


@rungekutta - Yes, those numbers do show that L2ARC is now more usable with less than 64GBs of memory. See the addition I’ve made to the main post, and let me know if it can be improved.

If people can find the web references to the L2ARC sub-items, please let me know. I will add the links to the first post. (I am searching myself…)

probably gonna get massive flaming for this but…

the whole “don’t use RAID5/6/RAIDZ because It will take forever to rebuild, more disks will fail, your cat will explode and your car tire will fall off, etc etc… mirrors are so much better and more reliable, etc etc”

All it takes is a second of thought to realize that you are limited by how long it takes to read/write an entire drive anyway. And when replacing a failed device, mirrors actually hurt reliability in this setting, since a mirror will have zero ability to survive additional corruption/failure once you are replacing a failed device, while a RAIDZ2 (or RAIDZ3) VDEV can still heal in the event of multiple and varying levels of data corruption/failure. (We won’t consider 3+ way mirrors as they have even worse efficiency.)

Maybe this was true back before sequential scanning resilver/scrub was a thing, but definitely not anymore…

And while not specifically ZFS related, the above copypasta sometimes gets combined with the fear that hard drives are getting bigger faster than their speed is increasing, and that this too will cause your cats to explode and the rest of your disks to fail. This fear seems to have gotten more popular after Linus had some silly rant. As long as you aren’t Linus and do regular scrubs, you should be confident that your ZFS pool is in good enough physical condition that disks won’t magically fail during a rebuild.

For the record, mirrors do have legitimate benefits: less slack space from tiny records (i.e. ZVOLs), higher IOPS, easy adding and removing of VDEVs, attaching/splitting of mirrors, etc.

In the previously mentioned TrueNAS Tech Talk video it was said that the overhead is “as low as 96 bytes per record”, so a bit higher than the 70 bytes you mention. No idea, though, which number is more accurate…


Found a recent text reference in the old forums; I will add that link.


If we are changing the ‘rules’, we need to update the reference posts and articles.

I point to these a lot.

BASICS

iX Systems pool layout whitepaper

Special VDEV (sVDEV) Planning, Sizing, and Considerations

@SmallBarky - Yes, those probably need some updating. 2 are iX docs and the 3rd is @Constantin.

My “goal” for this Resource was to have bite sized information packets. The L2ARC entry is longer, but it needed to describe why things have changed.

There is some slightly conflicting information out there as to whether the overhead is 70, 80 or 96 bytes. But in any case, actual usage can be seen with:

cat /proc/spl/kstat/zfs/arcstats | grep l2_hdr

In my case it’s ~700MiB for 256GiB l2arc which is occasionally useful in front of raidz2 spinning disks. As it happens, I’ve got 64GiB RAM but I would have taken that trade off all the way down to 16GiB. YMMV.

Just out of curiosity: What percentage of read requests are served from your arc cache and what percentage are served from the l2arc?

I don’t entirely agree with this.

If you use a 3-way mirror, you can also survive two failing drives.
But yes, a two-way mirror is similar to RAIDZ1 in that you can only survive 1 failed drive.
But I get that having RAIDZ2 is easier than having 3-way mirrors in most setups.

The second thing that often gets overlooked in all these “RAID reliability calculators” is that drives fail in a bathtub curve (some at the beginning, almost none in the middle, and then more after some time, like 3+ years) and that you can buy drives from different vendors. Different vendors will probably have a different failure curve, so the probability of them failing at the same time is lower. If you have a bad batch of drives, with mirrors it is way easier to avoid a pool loss than with RAIDZ2.

I think that most home users are looking at the situation they have at hand right now instead of 5y down the line.
Imagine you can get good deals from two vendors. WD and Seagate.
With mirrors, you buy 3 drives each and put them into mirrors. With RAIDZ2 you do the same and put them into one pool.
Now we jump 7 years into the future. The drives are very old and the WD ones are leaking helium.

Now one WD fails.
You replace that drive in your RAIDZ2.
This puts stress on your other drives.
The other two WD drives also fail during the rebuild.
Your pool is gone.
With mirror, this is not a problem.

There is some GitHub writing about that topic (I was too lazy to check the math).

Depends. My NAS serves a Proxmox cluster with VM storage over NFS, iSCSI for a couple of Windows PCs, and SMB for backups and stuff. Most of the time it doesn’t have to work too hard, and ARC hit rates are very good (high 90s %). Then occasionally it is asked to do something a bit more unexpected - many VMs rebooting, large I/O over iSCSI, etc. - in which case ARC hit rates can drop down to 10-20ish % and L2ARC hit rates are instead in the 25-50% range (typically). It makes a very meaningful difference in perceived performance in those cases.


Well that isn’t really a problem, because you do have a backup. Because. RAID is NOT a backup. :wink:

Seriously though, your example is a bit flawed and also incomplete. If three drives in a mirror fail, your pool is also gone. Now you might say that the chance of three drives in the same mirror failing is small. That’s true, but you also have to consider that you have to spend much more money for that “benefit”.

If you need the equivalent of 3 data drives in storage space, you either buy 9 drives for a 3-way mirror (3 VDEVs) or 5 drives for a raidz2 (single VDEV). With the money you’ve saved on the raidz2 configuration (ignoring the compounded interest), you can easily buy 5 new drives after year 5, because drives of that particular capacity will also be cheaper by then. So 7 years into the future your 5 new raidz2 drives are only 2 years old and happily chugging along, while your 3-way mirror pool is slowly falling apart and requires extra investment to start replacing failing drives. So the problem was the mirror. It was too expensive…

I’m not saying that my example is more realistic than yours, but it goes to show that one has to be careful with those kinds of arguments. I suppose that 3-way mirrors are a valid use-case under certain conditions, but I seriously doubt that this is the best configuration for most users.

Regarding ”scrub of death” - I think it’s time to bury that once and for all, as it’s been fairly thoroughly debunked (e.g. Will ZFS and non-ECC RAM kill your data? – JRS Systems: the blog) by now. Sure, really broken hardware can break your data, but that has nothing to do with scrub per se and in fact in that case you’re probably just as (or more) likely to not have written it correctly in the first place. And all considered, ZFS will still improve your chances vs many other (most?) filesystems under the same circumstances.

The original idea of ”scrub of death” was that ZFS without ECC was actively dangerous and you’re better off looking for another fs in that case. Which is patently false.

Agreed. There are risks involved in running ZFS without ECC, but they’re no greater than for any other filesystem. The “scrub of death” requires a scenario that’s astronomically improbable (and if you’re really concerned about it, you can decrease the probability by orders of magnitude by selecting a different hash algorithm for your data).
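
For reference, switching the checksum is a one-line property change; here is a sketch with a placeholder dataset name. It only affects newly written data, and which algorithms (sha256, sha512, skein, blake3, …) are available depends on your OpenZFS version:

# Use a cryptographic checksum instead of the default fletcher4 for this dataset.
zfs set checksum=sha256 tank/important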

l2arc_mfuonly=2 doesn’t get nearly enough attention.

Combine with l2arc_noprefetch=0, raised l2arc_write_max, and raised l2arc_headroom for better feeding. More cream rises to the top each day I run like this.
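
For anyone curious how those get applied, they are ordinary OpenZFS module parameters on Linux. A sketch, run as root, with the values mirroring the description above rather than being recommendations (they also reset on reboot unless made persistent through modprobe options or your platform’s tunables):

# L2ARC tunables live under /sys/module/zfs/parameters on Linux.
echo 2        > /sys/module/zfs/parameters/l2arc_mfuonly      # favour MFU content (see the module docs for the exact meaning of 2)
echo 0        > /sys/module/zfs/parameters/l2arc_noprefetch   # allow prefetched/streaming buffers into L2ARC
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max    # raise the per-interval feed limit (64 MiB here)
echo 8        > /sys/module/zfs/parameters/l2arc_headroom     # scan further ahead of the ARC tail when feeding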

For over a year it’s been living on a crappy QLC flash device that my barber, mechanic, and brother-in-law swore up and down would die very quickly when pressed into L2ARC duty. There’s a second L2ARC device in the form of a 2nd partition (gasp!) on a small Optane device where the first partition is used for SLOG [clutches pearls].

This is a 6.34T pool with 2.48T allocated. The system has 96GB of RAM but I wouldn’t hesitate to run this with 12-16GB. I’ve booted with 8GB on several occasions to run tests and benchmarks. Nothing fell apart. My data didn’t get eaten. The machine did not emit smoke. If anything it seemed to more readily push MFU into L2ARC.

Here I drop the ARC cache then boot a frequently-used VM. 32 seconds to the logon prompt (it’s a domain controller and all of these boot slowly in my lab for some reason).

I powered off the VM, offlined both L2ARC devices, dropped the ARC cache, then powered up said VM again. Nearly three minutes to the logon prompt.


Note 1.8GB of glorious metadata sitting in there with zero special vdev brain damage.

Yesterday the numbers were even better but I had to reboot it. Reddit clowns will wear out a keyboard in a single day giving us all the reasons why this setup cannot work. Perhaps they were right in 2012.


@SmallBarky @Arwen (or anyone else in here), we’d love to get community input on any recommendations in the ZFS Primer | TrueNAS Documentation Hub that feel outdated or lacking in nuance. You can use the “Feedback” button on the right side of that page to submit an issue report for us to take a look at.

(The whitepaper is maintained by a different team, but we can pass on any suggestions for that doc as well)