A discussion on SLOG/ZIL devices

Okay, so, let’s get it out of the way.

Most home users don’t have a use for a SLOG at all. Your SMB-file-sharing, Linux-ISO-seeding, video streaming workload does not need and will not benefit from a SLOG.

For those of you who do - read on.

Data loss with a single SLOG requires two conditions to overlap.

  1. Failure of the SLOG device (including “failure to read a block” not just “terminal failure”)
  2. System halt

Assume condition #1 (the SLOG failure) is a given, because it’s the point of comparison between “mirrored SLOG” and “non-mirrored SLOG”. The cause of the system halt in condition #2 can be any of the following, among others:

  1. External power loss
  2. Internal power loss (single PSU, power distribution board for multi-PSU units)
  3. Critical component of ZFS pool fails (overheated HBA, backplane or midplane failure, permanent damage to internal data cabling from pinch/crush during maintenance)
  4. Kernel panic or other software-based system halt

SLOG is designed to save your bacon from all of those scenarios by being the safe replay location. Having two mirrored devices reduces that already small risk to an even lower level.
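To put rough numbers on that overlap, here is a back-of-the-envelope sketch in Python. The probability is a made-up placeholder, not a measured failure rate, and the model assumes the two SLOG devices fail independently of each other (a shared HBA or backplane fault could break that assumption).

```python
# Hypothetical value for illustration only: the chance that one SLOG device
# cannot return a needed log block when the pool replays the ZIL after a halt.
P_SLOG_UNREADABLE = 1e-3

def loss_chance_per_halt(p_device: float, mirror_width: int = 1) -> float:
    """Chance that every device in the SLOG vdev fails to return a needed block.

    Assumes the devices fail independently of one another.
    """
    return p_device ** mirror_width

single = loss_chance_per_halt(P_SLOG_UNREADABLE, mirror_width=1)
mirrored = loss_chance_per_halt(P_SLOG_UNREADABLE, mirror_width=2)

print(f"single SLOG:   {single:.0e} per unclean halt")
print(f"mirrored SLOG: {mirrored:.0e} per unclean halt")
```

Whatever the real per-device number is, the mirror squares it, which is why an already small risk drops by orders of magnitude.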

I haven’t seen a simultaneous failure, but I have had a SLOG device save my in-flight data when an HBA decided to release its Magic Blue Smoke on a live unit. I shut the system down and swapped the HBA, the in-flight DB transactions were committed on pool import/ZIL replay, and the VMs carried on once rebooted.

SLOG + sync writes already put your data safety into the very high percentages - probably “four nines” - but that’s not enough for businesses. That’s where mirrored SLOG comes in, and it’s quite achievable for the average home user these days with a pair of the Optane 16/32G sticks that were mentioned by @Constantin - they’re good for roughly 100MB/s and 200MB/s of sync-writes for the 16G/32G size respectively.
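As a rough sanity check on why sticks that small are enough: the SLOG only has to hold sync writes that haven’t yet been committed to the pool in a transaction group. A quick sketch, assuming the OpenZFS default `zfs_txg_timeout` of 5 seconds and, as a deliberately pessimistic assumption of mine, three transaction groups’ worth of data outstanding at once:

```python
TXG_TIMEOUT_S = 5       # OpenZFS default zfs_txg_timeout, in seconds
TXGS_OUTSTANDING = 3    # pessimistic assumption: open + quiescing + syncing

def slog_working_set_mb(sync_mb_per_s: float) -> float:
    """Upper-bound estimate of uncommitted sync data the SLOG must hold."""
    return sync_mb_per_s * TXG_TIMEOUT_S * TXGS_OUTSTANDING

# Sync-write rates quoted above for the 16G and 32G Optane sticks.
for label, rate in (("16G", 100), ("32G", 200)):
    in_flight = slog_working_set_mb(rate)
    print(f"Optane {label}: ~{in_flight:.0f} MB in flight at {rate} MB/s")
```

Even with that padding, the working set is a small fraction of either stick’s capacity. What matters for a SLOG is latency and endurance, not size.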

Now, that’s sorted.

“But what about an all-flash pool?” you might ask.

Well, it gets a bit more complicated. Now we’re asking your pool vdevs to handle the “immediate cache flush to non-volatile NAND” demand of sync-writes. On SSDs without power-loss-protection (PLP) capacitors to protect DRAM, this can be the difference between “hundreds of megabytes per second” and “single-digit megabytes per second”.

This is also exacerbated by the newer SSD technologies like TLC and QLC. While some of these have “pseudo-SLC” behavior to help regain some of the performance lost, this both burns through drive endurance and causes write amplification, because the NAND is written twice (once into the pSLC buffer, and then again when folding the data into TLC/QLC blocks). It’s better for both endurance and sustained performance to have a purpose-driven SLOG absorb those writes - especially when you might be overwriting the same data in-place again before the txg cutoff! - and let ZFS aggregate the transactions in RAM (with the SLOG as the power-failure protection) in order to lay down larger records/blocks to the underlying NAND.
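The “overwriting the same data before the txg cutoff” point can be shown with a toy count. This is a simplified model of my own, not how ZFS accounts for I/O: it ignores block sizes, metadata, large writes that take a different logging path, and the pSLC folding inside the drive. The block numbers are invented.

```python
# Each entry is the (hypothetical) block number hit by one small sync write
# arriving within a single transaction group.
sync_writes = [7, 7, 3, 7, 3, 9, 7, 3]

def pool_block_writes(writes, slog_present: bool) -> int:
    """Count block writes landing on the pool's own NAND for one txg.

    Toy model: every sync write is logged once (to the SLOG if present,
    otherwise to the ZIL on the pool itself), and at txg commit only the
    final version of each distinct block is written to the pool.
    """
    log_writes_on_pool = 0 if slog_present else len(writes)
    txg_writes = len(set(writes))
    return log_writes_on_pool + txg_writes

without_slog = pool_block_writes(sync_writes, slog_present=False)
with_slog = pool_block_writes(sync_writes, slog_present=True)

print(f"pool block writes without SLOG: {without_slog}")
print(f"pool block writes with SLOG:    {with_slog}")
```

In this model the pool’s flash sees only the three distinct blocks once the SLOG takes the log traffic, instead of every individual overwrite plus the final commit.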

I’m fortunate enough to have a good chunk of PLP-enabled Intel DC SSDs, and can slam them with a pretty heavy workload - but even those will falter under the demands of small block writes and benefit from having something like an Optane or RMS-200 in front of them.

And once you get into the ludicrous levels of performance that our Enterprise clients demand, now we’re into distributed logs across multiple high-endurance NVMe drives, supercapacitor-backed NVDIMMs, and other bespoke integrations.

Missed this one initially. This is another excellent reason to have two SLOG devices - redundancy against a bad block or NAND page. If you have two SLOGs in a mirror, and one of them burns out a block of NAND - the bad data gets caught by checksumming, and the other one can step up and replace it.

A theoretical copies=2 workflow implemented inside the ZIL could guard against this - but at the cost of halving your sync-write throughput (!) and effective drive endurance.
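The checksum-and-fall-back behavior of a mirrored SLOG can be sketched like this. It’s an illustration only: the function names are hypothetical, and SHA-256 is a stand-in checksum chosen for convenience, not a claim about what ZFS uses for log blocks.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in checksum for illustration; not a claim about ZFS internals.
    return hashlib.sha256(data).digest()

def read_log_block(copies, expected):
    """Return the first copy whose checksum matches, or None if all are bad."""
    for data in copies:
        if checksum(data) == expected:
            return data
    return None

record = b"in-flight DB transaction"
expected = checksum(record)
corrupted = b"in-flight DB transacti\x00n"  # simulates a burnt-out NAND block

# Single SLOG holding only the damaged copy: nothing valid to replay.
print(read_log_block([corrupted], expected))

# Mirrored SLOG: the second device still holds a good copy.
print(read_log_block([corrupted, record], expected))
```

A copies=2 scheme on one device would look the same from the reader’s side, but both copies would have to be written to the same drive, which is where the halved throughput and endurance come from.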
