ZFSisms that are not true, or no longer true

I submitted the question on L2ARC for the T3 Tech Talk, and it seemed Kris and Chris were happy with the current documentation recommendation on when to add L2ARC.
I had also inquired about changing the minimum RAM values for the SCALE docs, and they felt the current values were good.

I personally try to point to iX Systems and the documentation as a source of truth.

1 Like

I always tell people to max out their RAM before fooling around with L2ARC, regardless of how much they currently have. Especially since most people run prosumer boards anyway, which only have 2-4 RAM slots; it’s not like a server board where they’d need to buy 8+ RAM modules.

The cost is fairly trivial, and it opens up a lot of possibilities for them to run more VMs and apps as well as boosting ARC performance. The choice is rather a no-brainer for me, at least for non-server boards.

1 Like

it’s funny… I’m still using some 7-year-old WD helium drives (now in a secondary server since they are only 8TB) and they aren’t all magically leaking and failing. The rest of the drives are slightly newer, 12TB+.

Technically true, but this “bathtub curve” is a huge average representing what you can expect when running a datacenter fleet of hundreds, possibly even thousands, of disks.
Expecting multiple drives in a vdev (let’s say 2-16) to fail all at the same time is statistically much more unlikely, even when they are the same vendor, model, etc., but we still account for it with RAIDZ2/3 and 3-way mirrors.


It’s just that with mirrors you are either:
using a (bunch of) 2-way mirrors, which put your pool in jeopardy while degraded with no ability to heal any corruption the vdev may encounter, and which have worse data efficiency than most RAIDZ setups for cold storage; or

using a (bunch of) 3-way (or wider) mirrors, which do have the ability to heal corruption while degraded but have even less data efficiency, so using these for anything but the most critical of data doesn’t seem practical.


However, regarding stress and load: if you do regular scrubs like you should, a scrub is the same stress as a resilver (heck, it’s nearly all the same code too). A RAIDZ2+ with a failed device still has the ability to self-heal corruption as it finds it; a failed 2-way mirror does not.

You can of course have a disk fail during a rebuild, I’m not denying that, but with RAIDZ2/3 it will just fault out and the resilver (should) continue to completion; with a 2-way mirror, however, you just pray the one disk you are stressing isn’t going to fail.
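For reference, and purely as a sketch (the pool name here is hypothetical; on TrueNAS you’d normally schedule scrubs from the UI), kicking off and checking a scrub by hand looks like:

sudo zpool scrub tank      # reads and verifies every allocated block, repairing from redundancy where it can
sudo zpool status -v tank  # shows scrub/resilver progress and any repaired or errored blocks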


This is not a good comparison. You’re assuming that the chance of another drive failing during a mirror resilver is the same as during a RAIDZ resilver, which is woefully NOT true; it is significantly lower, because mirrors are much more time- and load-efficient while resilvering.

A mirror resilver puts far less load on the pool because the only drive being read is the sibling drive, so the impact on the performance of the pool as a whole is fairly minimal.

In a RAIDZ vdev, a block must be read from each of the surviving drives in the vdev. If you have a RAIDZ2 with 6 drives, for example, you’re going to put 5x more IO load on the pool than a simple mirror. Compounding the issue is the fact that resilvering a RAIDZ vdev will take orders of magnitude longer (think days or even weeks depending on how big your drives are) than resilvering a mirror, making your window of time for another failure while resilvering significantly larger, and that’s assuming you’re not using the pool at all! If you’re using the pool while resilvering, your resilver will take even longer!

2 Likes

I believe it was 88 before the pL2ARC merge, and it’s 96 bytes now.

root@core01[~]# vmstat -z | grep arc_buf_hdr_t_l2only
arc_buf_hdr_t_l2only:     96,      0,       0,       0,       0,   0,   0,   0
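For anyone wanting to check the same thing on SCALE, the equivalent kmem cache should exist on the Linux side too, though depending on the build and slab merging it may show up in /proc/spl/kmem/slab or /proc/slabinfo (or not by name at all) - a quick look would be something like:

# look for the L2ARC-only header cache and its object size on Linux (OpenZFS/SPL)
grep arc_buf_hdr_t_l2only /proc/spl/kmem/slab /proc/slabinfo 2>/dev/null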

Thanks for submitting those questions :slight_smile: I could probably do an hour-long talk on this stuff, maybe that needs to just go into a longer video outside of the normal T3 cadence.

I assume you’re referring to this guidance in the docs re: deploying L2ARC

As Kris mentioned as well, we have to make any broad recommendations or system defaults somewhat generalized; trying to nuance them into “you can do X and Y if you’ve met prerequisites Q and P” gets unwieldy quickly. But we’re happy to continue expanding on tweaking scenarios, and you’ve seen many a post from me here and on the old forums with goodies behind a [HERE BE DRAGONS] spoiler tag. :wink:

l2arc_mfuonly=2 is indeed a new option and I think I poked at it in another thread - I should perhaps create a longer, L2ARC-specific resource thread to discuss it. There are also plans to make changes to the overall feeding and behavior of L2ARC generally - its current behavior as a ring buffer is still sub-optimal vs. primary ARC and leads people to expect certain benefits from it that it can’t quite deliver. But tweaking the existing engine with the l2arc_ tunable family can still offer great benefits.
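Purely as a sketch of where those knobs live (exact values and semantics can shift between OpenZFS releases, and on TrueNAS you’d want to persist them via the UI or an init script rather than by hand):

# SCALE (Linux): the l2arc_* tunables are OpenZFS module parameters
cat /sys/module/zfs/parameters/l2arc_mfuonly
echo 1 | sudo tee /sys/module/zfs/parameters/l2arc_mfuonly   # 0 = default, 1 = MFU only; newer builds also accept 2

# CORE (FreeBSD): the same knob is exposed as a sysctl
sysctl vfs.zfs.l2arc.mfuonly
sysctl vfs.zfs.l2arc.mfuonly=1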

But this whole thread is an excellent place to gather, summarize, and discuss knowledge of OpenZFS and TrueNAS that’s constantly evolving.

2 Likes

I’m sorry… what???

No, they do not take that long. In my experience a resilver generally takes about as long as reading the data allocated to a given disk, which would also be true for a mirror (it’s done in parallel; about a day for a modern drive if it’s full, maybe two or three if the pool is massively wide and it has to scan through trillions of tiny records, but your mirror pools also do this scanning). I say this as someone who has resilvered my entire personal server’s pool out one disk at a time to convert it to 4Kn (at the time 48 disks wide, in 8-wide RAIDZ2 vdevs), and as someone who has also replaced disks to upgrade capacity. This is blatantly wrong, yet you call it a fact? Sequential scrub/resilver has been available in OpenZFS for 8 years now, which of course benefits all types of vdevs.

Are there any statistics on this? Genuinely curious, but the chance of a disk failing at the same time as another, with regular scrubs (which quite literally load the disk the same as a resilver), exactly during the time that you are rebuilding, still seems rather unlikely… and even if it does happen, it still leaves a RAIDZ2/3 functional. As it is, any type of RAID is about managing risk.


The original Scrub of Death idea was born out of a misunderstanding of how ZFS actually works. For the described scenario to happen, random bit flips need to happen in such a way that they create a hash collision - roughly 1 in 2^256, or about 1 in 10^77, with the default SHA-256 as you say @dan. Obviously in that case there are many other bad RAM failure scenarios that are far more likely to occur and cause damage in other ways. Or, for that matter, a catastrophic asteroid event - roughly a 1 in 300,000 chance every year (the one 65 million years ago was particularly bad…) - which translates to roughly 10^72 times more likely. In other words @Arwen, “Scrub of Death” as coined is not a thing… you are being far too kind in your description above. :wink:

1 Like

The values vary depending on where I look in the documentation or hardware guides (from the old forum).

On guidance on L2ARC I usually reference the docs section on Memory Sizing. SCALE Hardware Guide | TrueNAS Documentation Hub
That is where I got the impression it was eating more ARC/RAM than it currently does.
“Add approximately 1 GB of RAM (conservative estimate) for every 50 GB of L2ARC in your pool. Attaching an L2ARC drive to a pool uses some RAM, too. ZFS needs metadata in ARC to know what data is in L2ARC.”

and further down:

"The most important quality to look for in an L2ARC device is random read performance. The device must support more IOPS than the primary storage media it caches. For example, using a single SSD as an L2ARC is ineffective in front of a pool of 40 SSDs, as the 40 SSDs can handle far more IOPS than the single L2ARC drive. As for capacity, 5x to 20x more than the RAM size is a good guideline. High-end TrueNAS systems can have NVMe-based L2ARC in double-digit terabyte sizes.

Remember that for every data block in the L2ARC, the primary ARC needs an 88-byte entry. Poorly-designed systems can cause an unexpected fill-up in the ARC and reduce performance. For example, a 480 GB L2ARC filled with 4KiB blocks needs more than 10GiB of metadata storage in the primary ARC."

Those numbers add up but are of course exceptionally conservative (4k record size).
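For anyone wanting to sanity-check that example from the docs, the rough arithmetic works out to:

480 GB ÷ 4 KiB ≈ 1.2 × 10^8 cached blocks
1.2 × 10^8 blocks × 88-96 bytes per header ≈ 10-11 GB of primary ARC consumed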

8GB is a good recommended minimum.

The true minimum is actually about 4GB, but you really can’t do much with it.

With 8GB you can enable most TrueNAS functionality, even some small apps or a small VM.

More is better of course, and in fact required if you want to run lots of apps or VMs.

1 Like

The default is fletcher4 unless dedup is enabled for the dataset, in which case it is indeed SHA256. But of course you can set it to SHA256, SHA512, or others if you like.

https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Checksums.html
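If you want to check or change it yourself (the dataset name below is just a placeholder), it’s an ordinary ZFS property:

zfs get checksum tank/mydataset          # show the current algorithm and where it’s inherited from
zfs set checksum=sha256 tank/mydataset   # newly written blocks use SHA-256; existing blocks keep their old checksums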

The baiting first line was there to capture people’s attention, and then tell them not to worry about it. I did add the reference to the other article on the subject, as I had forgotten about it.

Can the discussion of L2ARC move to another thread?
And the disk failure rate?

This is a Resource intended to correct common misconceptions about certain ZFS features or abilities.

Nothing is “allocated to a given disk” in RAIDZ. Parity data is striped throughout the whole vdev so you have to read every drive to resilver a block.

You’re telling me that resilvers on 6x 8-wide RAIDZ2 vdevs take mere hours? How big are those drives exactly? In your example, whenever you’re replacing one drive, you have to read from 7 other drives in that vdev. In a simple mirror, you’re just reading from 1 other drive, that’s it. You’re not loading 6 additional drives, your performance is minimally impacted, and you’re also not doing any parity calculations.

Admittedly, I have no statistics, but what I’ve said is true: the load is NOT the same. Again, going back to your example, you’d be loading 7 other drives at the same time instead of just one other. That’s 6 more of your drives that could be failing while they’re all taking on extra load.

In a mirror setup with the same drive count (8), you’d just be loading 1 other drive, one time each, until you’ve replaced all 8 drives; they’re loaded separately instead of all at the same time, and there are no parity calculations.

I’m not saying that the resilver load is more than a scrub; what I’m saying is you’re loading more drives for more time, and you have to repeat this resilver process 8 times if you’re upgrading an 8-drive RAIDZ2 vdev. So each drive is essentially loaded 7 times before that vdev is fully upgraded.

The time taken to resilver a RAIDZ disk depends on how full the vdev is.

1 Like

I mean… isn’t that true no matter what type of vdev you have?

The L2ARC feature is easily in the top 5 ZFS misconceptions.

2 Likes

Yeah, maybe I am misreading the post, but it seems to me that the last point of the OP specifically talks about L2ARC or at least mentions the acronym no less than 11 times!

12-20TB disks… any sensible RAID5/6 implementation does reads/writes to all disks simultaneously. This is one reason why you shouldn’t use USB JBODs or SATA port multipliers, which suck immensely at simultaneous I/O. And as long as you aren’t bottlenecked (which isn’t generally the case when using real HBAs…), the rebuild time is equivalent to how long it takes to read a single disk’s worth of data, or less if the vdev isn’t full (minus the scanning phase, which can vary depending on what you store). Normal pool activity also doesn’t really slow down a ZFS pool that is scrubbing/resilvering much… remember that ZFS syncs transactions in bigger chunks.

I’ll assume that this is in reference to upgrading an entire VDEV with bigger disks to increase capacity?

With any ZFS vdev, if you initiate multiple replacements, ZFS will finish the first while the rest of the replacements sit as “awaiting resilver”. If you want, after you initiate a bunch of replacements simultaneously, you can run a sudo zpool resilver, which will make it resilver all devices at once (but the data written to the first replacement up to that point is still valid and won’t be re-written). You could absolutely do an online replacement of an entire RAIDZ vdev (or more) if you have a spare JBOD - I’ve done it - it only reads the disks once, and better yet, the whole new capacity is available immediately if you have autoexpand on (rough command sketch below).
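A rough sketch of that workflow, with purely hypothetical pool and disk names:

sudo zpool set autoexpand=on tank   # let the pool grow once the whole vdev has larger disks
sudo zpool replace tank sda sdq     # first replacement starts resilvering immediately
sudo zpool replace tank sdb sdr     # further replacements queue as "awaiting resilver"
sudo zpool replace tank sdc sds
sudo zpool resilver tank            # optional: restart the scan so all queued replacements resilver in one pass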

So… we can agree that the possibility of disk failure may be more significant in a RAIDZ vdev, but… is it worth worrying about? Is it worth worrying about so much that building a massive pool of mirrors makes sense over RAIDZ2+? That is the real question here. Personally, I’ve never seen this scenario of multiple disk failures, but that’s not denying that on an unmaintained system with no regular scrubs it could happen (it happened to Linus, after all…). If your disks are so toast that at any given scrub they will die (which is what is being suggested), then you definitely have bigger problems to worry about.

I’m also of the opinion that load cycles and physical movement (offline or online) are much more stressful to spinning disks than normal use.

Good point, I stand corrected. I don’t think it’s fundamentally different, though. This paper lays it out: https://people.freebsd.org/~asomers/fletcher.pdf (1 in 10^68?). Consider as well that dedup cannot accept the infinitesimal risk that any two blocks in the entire pool have the same checksum, i.e. it requires, with a huge degree of confidence, that all blocks that are not the same have unique checksums in the DDT. The “Scrub of Death” idea requires that two specific blocks end up with the same erroneous checksum due to bit flipping. And as described in my external link further up, even in that case it would only lead to a checksum error on the pool - not erroneous overwrites.

It’s a non-issue, even theoretically.

1 Like

I guess my point is, if the point of the thread is to correct common misinformation, then starting a section with “The ZFS Scrub of Death without ECC memory IS real!” is kind of missing the point. :wink: Anyway, you asked for feedback, so there it is.