A discussion on SLOG/ZIL devices

Let's have a discussion on whether redundancy on a SLOG device is necessary in a non-mission-critical environment (but where resilience is still desired).

I have read a ton of papers and opinions on this subject, and the conflicting viewpoints are numerous. I will give you my current opinion, which is against much of what I have read, but which I think is backed up by the technical papers I have looked at.

Having a redundant mirror on the SLOG vdev is not nearly as 'necessary' as many claim, because it would take several simultaneous failures of different and diverse hardware before there would be even the possibility of data loss. My understanding is that the SLOG a.) is only read if there is a power loss or catastrophic failure of the machine, and b.) the ZFS intent log data is also held in memory, so persistence is the SLOG's only purpose (i.e. when the machine fails suddenly, the SLOG is read on the next import).

So, looking at the above, you wouldn't just need a SLOG device failure to lose data that is in flight; you'd need a loss of the machine as well, because ZFS will revert to using the disks if the SLOG were to fail. Since both events have to coincide, the chance of data loss is roughly the product of two already small probabilities. If it requires two different pieces of hardware to fail in an exact order to cause data loss, I am pretty fine with the infinitesimal risk of that ever occurring. Has anyone actually had this happen? It has to be out at five or six nines of rarity, and maybe even further. Realistically it only happens when the SLOG itself takes down the entire system as it fails, and then only if there was sync data in flight at that exact moment, a coincidence so exceedingly rare we can just about ignore it. That is just about the only sequence of events that will ever cause data loss here.

Do I have something wrong in the above logic? Let's discuss.

3 Likes

I share your opinion 100%.

I agree - and I have run single SLOGs on two different pools for some time now.

I'd re-word that to be "because ZFS will revert to using the in-pool ZIL (on the data vdevs) if the SLOG were to fail."

Sort of. An OS crash can also cause the SLOG to be used.

An OS crash would not allow the in-RAM copy of the data to be flushed to the data vdevs. That means a SLOG entry that was fully written and acknowledged to the client software, but not yet flushed to the data vdevs, will be lost if the SLOG fails after an OS crash.


As originally written, failure of a SLOG that held data/metadata between export (or forced export during a power loss) and a normal import would be fatal to the pool. Only later did it become possible to import a pool without its SLOG, using the -m option. (You'd lose any un-flushed SLOG entries, possibly causing application corruption.)
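For reference, a rough sketch of that -m recovery path on a modern OpenZFS; the pool name "tank" is just a placeholder:

```
# A normal import will refuse if the log device is missing:
zpool import tank

# -m = import anyway with a missing log device, discarding any
# un-replayed ZIL entries that only existed on the lost SLOG:
zpool import -m tank

# Then drop the dead log device from the pool configuration:
zpool remove tank <old-log-device>
```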

Part of the reason those lost SLOG entries matter so much is that synchronous writes from databases or other specialized software may end up corrupt (as far as the software is concerned) if the write is not handled properly, the same way ZFS could have its pool corrupted by a hardware RAID controller that reorders (elevator) writes during a power loss.


Edit:

Thinking about this more, perhaps we need a "copies=2" option for the times when a single (or striped) SLOG device is used. This obviously would not help with device failure, but it would help with bad blocks.

4 Likes

Given how unlikely a SLOG failure is, provided proper drive selection criteria etc. were followed, I find this debate a bit academic. For SOHO use, a simple Optane drive will likely suffice. For power users with a ton of sync writes, more bespoke gear can be cost-justified. But it comes down to the use case, and hopefully the power users know what they're doing.

(I still remember that epic Radian drive, or whatever it was, after several PB of VM / dedup (?) writes in the "show me your SLOG" thread by @Honeybadger on the old forum.)

Simple systems that see the occasional sync write, as when Time Machine or whatever does its sync thing, will likely be perfectly happy with a 16/32GB Optane stick that you can get for practically nothing now. As more and more folk switch to flash for storage, the raison d'être for SLOGs will largely wane as well.

2 Likes

I have seen discussion among some of the ZFS developers that even with very fast pool storage a SLOG may be beneficial due to path lengths in the code. In other words, even if the pool and SLOG devices have identical performance, the code path that writes to the SLOG is shorter than the code path that writes to the in-pool ZIL. So there may be a benefit to a dedicated, fast SLOG device even over a fast in-pool ZIL.

I have attempted to verify this with performance testing, but my results have been inconclusive.
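For anyone who wants to poke at this themselves, something along these lines (the dataset path and sizes are just placeholders, not my exact test) will hammer the sync-write path:

```
# Small synchronous writes against a dataset on the pool under test.
# fsync after every write forces each I/O through the ZIL (the SLOG if
# present, otherwise the in-pool ZIL) before fio counts it as done.
fio --name=zil-sync-test --directory=/mnt/tank/fio-test \
    --ioengine=psync --rw=randwrite --bs=8k --size=2G \
    --numjobs=1 --iodepth=1 --fsync=1 \
    --runtime=60 --time_based

# Run once with the SLOG attached and once after removing it
# ('zpool remove tank <log-device>'), then compare IOPS and latency.
```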

4 Likes

I have no doubt that the right SLOG device can still have a place, even in an all-flash system, for the reasons you mentioned. However, the quantum leap in performance over an HDD ZIL isn't there unless your flash pool is incredibly slow. Instead, you may see a small (and perhaps very valuable!) benefit.

Sort of like L2ARC or a sVDEV in an all-flash pool: incremental performance gains may still be possible through the use of one or the other, but they are unlikely to be world-changing, versus making BIG differences for otherwise all-HDD pools with the right use cases.

I agree that the biggest benefit of any of the auxiliary vdevs (SLOG, L2ARC, sVDEV, etc.) is for HDD-based pools, where you can leverage a small number of very fast devices to make an HDD pool seem faster than it is. When you are trying to get the last few percent of performance out of a system that is already pretty fast, every advantage helps. Yes, this is more common in the Enterprise space than the Community, but there are hobbyists pushing the limits as well.

In my testing, for some workloads, a pool of 10 x 2-way HDD mirrors outperforms some SSDs. I'm not rich enough to afford NVMe modules (yet). My production pool has mirrors plus aux vdevs of the right type. How much difference it all makes, I'm not sure, but it is a fun exercise for me, and I know I am getting the best performance from the parts I have on hand.

I find differences in SSD performance across makes / models and workloads fascinating.

Side note: I inherited a small pile of 200 GB IBM SAS SSDs. In testing they were pretty mediocre in terms of sequential write/read and random write, but they beat a bunch of other SSDs on small-to-medium random reads, so I used 3 of them (in a 3-way mirror) to add a sVDEV to my primary pool. How much that improved my performance I did not measure, but they only cost me 3 disk slots (and I have 3 more as spares). Note that I am not using them for small files, just for metadata.
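In case it helps anyone, this is roughly what that looks like; the pool name and device paths below are placeholders for your own:

```
# Add the three SSDs as a 3-way mirrored special (metadata) vdev.
# Note: a special vdev cannot be removed again if the pool contains
# raidz top-level vdevs, so double-check before committing.
zpool add tank special mirror /dev/sdx /dev/sdy /dev/sdz

# special_small_blocks=0 (the default) keeps small file data off the
# special vdev, so it ends up holding metadata only, as described above.
zfs get special_small_blocks tank
```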

1 Like

The Radian RMS-200 that was an SLOG device?

Data Units Written:                 1,622,149,164,390 [830 PB]

I’ve had a lot on deck recently, but I did see the ping, so I’ll be back with more later.

Short version is “there is often still a justified need for an SLOG in an all-flash pool, but most users don’t need an SLOG at all, let alone with an all-flash pool”

1 Like

Okay, so, let’s get it out of the way.

Most home users don’t have a use for a SLOG at all. Your SMB-file-sharing, Linux-ISO-seeding, video streaming workload does not need and will not benefit from a SLOG.

For those of you who do - read on.

Data loss with a single SLOG requires two conditions to overlap.

  1. Failure of the SLOG device (including “failure to read a block” not just “terminal failure”)
  2. System halt

Emphasis added here. Scenario #1 is taken as a given, because it's the point of comparison between "mirrored SLOG" and "non-mirrored SLOG" - the cause of the system's demise can be any of, but is not limited to:

  1. External power loss
  2. Internal power loss (single PSU, power distribution board for multi-PSU units)
  3. Critical component of ZFS pool fails (overheated HBA, backplane or midplane failure, permanent damage to internal data cabling from pinch/crush during maintenance)
  4. Kernel panic or other software-based system halt

SLOG is designed to save your bacon from all of those scenarios by being the safe replay location. Having two mirrored devices reduces that already small risk to an even lower level.

Simultaneous failure, no - but I have had a SLOG device save my in-flight data when an HBA decided to release its Magic Blue Smoke on a live unit. Shut the system down, swapped the HBA, and the in-flight DB transactions were committed on pool import/ZIL replay; the VMs carried on once rebooted.

SLOG + sync writes already puts your data safety into the very high percentages - probably "four nines" - but that's not enough for businesses. That's where a mirrored SLOG comes in, and it's quite achievable for the average home user these days with a pair of the Optane 16/32G sticks that @Constantin mentioned - they're good for roughly 100MB/s and 200MB/s of sync writes for the 16G and 32G sizes respectively.
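Adding that pair is a one-liner; a sketch only, with the pool name and NVMe device paths as placeholders for your own:

```
# Attach the pair of Optane sticks as a mirrored log (SLOG) vdev:
zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1

# Confirm they show up under the "logs" section:
zpool status tank
```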

Now, that’s sorted.

“But what about an all-flash pool?” you might ask.

Well, it gets a bit more complicated. Now we're asking your pool vdevs to handle the "immediate cache flush to non-volatile NAND" demand of sync writes. On SSDs without power-loss-protection capacitors guarding their DRAM, this can be the difference between "hundreds of megabytes per second" and "single-digit speeds".

This is also exacerbated by the newer, denser NAND types like TLC and QLC. While some of these drives have "pseudo-SLC" behavior to help claw back some of the lost performance, it burns through drive endurance and causes write amplification, since the NAND is written twice (once into the pSLC buffer, and again when the data is folded into TLC/QLC blocks). It's better for both endurance and sustained performance to have a purpose-built SLOG absorb those writes - especially when you might be overwriting the same data in place again before the txg cutoff! - and let ZFS aggregate the transactions in RAM (with the SLOG as the power-failure protection) so it can lay down larger records/blocks to the underlying NAND.

I’m fortunate enough to have a good chunk of PLP-enabled Intel DC SSDs, and can slam them with a pretty heavy workload - but even those will falter under the demands of small block writes and benefit from having something like an Optane or RMS-200 in front of them.

And once you get into the ludicrous levels of performance that our Enterprise clients demand, now we’re into distributed logs across multiple high-endurance NVMe drives, supercapacitor-backed NVDIMMs, and other bespoke integrations.

Missed this one initially. This is another excellent reason to have two SLOG devices - redundancy against a bad block or NAND page. If you have two SLOGs in a mirror and one of them burns out a block of NAND, the bad data gets caught by checksumming and the other device can step in with a good copy.

A theoretical copies=2 workflow implemented inside the ZIL could guard against this - but at the cost of halving your sync-write throughput (!) and effective drive endurance.

5 Likes

Maybe I'm missing something, but how is this "achievable" by home users? Sure, these little Optane devices are cheap, but they're also NVMe, and typical motherboards do not have a bunch of open NVMe connectors to plug them into. One of them is probably already being used for the system's main drive, perhaps both if you're using a mirror, so where are you supposed to plug these ZIL drives in?

The more important question is: does a SLOG add any risk of data loss beyond that of a pool without one? As I understand it, if #2 happens without a SLOG, you're going to lose data anyway.

So really, mirroring your SLOG does not lower the risk of data loss "closer" to that of a non-SLOG pool. Even an SSD that is about to die, used as a SLOG, is no more "dangerous" to the pool or to your data than a pool without a SLOG at all.


Blocked. :cross_mark:

Or you can connect your boot drive to USB (or even to an internal USB header) via a USB-to-M.2 adapter.

Why is that?

I'm not talking about the boot drive, I'm talking about the main drive. I don't know about your system, but in mine I have an NVMe drive for things like Docker apps, databases, Immich thumbnails, etc., because it's orders of magnitude faster than spinning rust. The HDDs are used for bulk storage where performance doesn't matter too much. I'm not about to run a database on spinning rust.

1 Like

If you have a 4x4 bifurcating slot, you can put 4 M.2 cards on a carrier.

But the reality is a home user probably does not have data so critical that they can’t afford to lose a transaction in the rare scenario that a drive fails during the delay between an unexpected halt and a reboot.

If you're already fine with the risks of async writes and not using a SLOG, then there's no need for a SLOG to be mirrored.

Using a (single-drive) SLOG does not add any risk of data loss.

I thought you were talking about sync writes. And AIUI, sync writes to a (redundant) pool without a SLOG should be "safer" than sync writes to a pool with a non-redundant SLOG.

You would lose the SLOG, but your writes will fall back to the in-pool ZIL and/or just go straight to the pool.

The data is still in RAM, even if a SLOG fails.

If your system also dies at the same time, you would have lost the data anyway, even if your pool didn't have a SLOG.
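For completeness, a sketch of the cleanup a failed single SLOG needs while the system stays up (pool name is a placeholder):

```
# A failed log device shows up as FAULTED/UNAVAIL under the "logs"
# section, while the pool itself stays ONLINE and keeps servicing
# sync writes via the in-pool ZIL:
zpool status tank

# Clean up by removing (or replacing) the dead log device:
zpool remove tank <failed-log-device>
```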

A bifurcating riser - or a single onboard M.2 slot and a passive M.2-to-PCIe adapter if you don't have one.

You'll lose async data with or without a SLOG, as it never lands there - but if you're running sync without a SLOG, you're using the in-pool ZIL and you won't lose that data. Your sync-write performance may be rather poor, though, unless it's an all-flash pool.

Unrelated. sync is what gives you the safety, at the cost of performance. A SLOG gives you some of that speed back by offloading the ZIL from the presumably slower in-pool devices, but it becomes a single point of failure … unless you mirror it.
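For context, the knob in question is the per-dataset sync property; a quick sketch with placeholder dataset names:

```
# See how a dataset currently handles sync requests
# (standard = honour the application's fsync/O_SYNC calls):
zfs get sync tank/vmstore

# Treat every write as synchronous - safest, slowest without a SLOG:
zfs set sync=always tank/vmstore

# Ignore sync requests entirely - fast, but an unexpected halt can lose
# writes the application believes were committed:
zfs set sync=disabled tank/scratch
```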

:malding:

2 Likes

Nah, it's not just about data loss but about data inconsistency in the client app (or whatever "consistency" means in this context).

AIUI, there won’t be inconsistency in the no-slog scenario because zil wasn’t written (and thus the sync write call wasn’t “released”) or it can be restored in the case of failure (because pool has redundancy). In the case of non-redundant slog, the call was “released” (Ok! We have your order for selling over9000 cryptocoins. Our storage subsystem reports that the order has been placed. Moreover, our app already called our mail-api to send you the hitting-Q1-revenue-target-thank-you-letter. Our Q1 revenue target.) and then failure (of slog and system) happened with no ability to restore the write (and the selling order).