Understanding the SLOG function with needed recommendations

mauzilla · April 16, 2024, 10:16am

So we’re getting some Dell R630 servers in which we plan to run 2 pools, each made of 4x 4TB drives in striped mirrors (“RAID10”), so 8TB usable per pool and 2x 8TB in total. These are used as VM storage over NFS (NFS is the only thing we know, but XO gives us the option of using iSCSI as well). All storage is SSD.

The servers are set up with dual power supplies (separate power sources), dual CPUs / memory, a redundant 10Gbit network between the devices, and backup generator electricity. More importantly, we run a replication task from the above servers to secondary “failover” TrueNAS systems on an hourly basis.

We don’t have dedicated SLOGs yet, but we are considering getting Radian or Optane devices from eBay.

So my understanding here is with how data is written:

ASYNC write requests wait for acknowledgement from memory, not from persistent disks. This is the fastest possible speed we will get, but the risk is that a sudden power loss, kernel panic etc. will lead to data loss (and possibly corruption of VHD files), as what the other end expected to be written to disk has not been written.
SYNC writes wait until storage has acknowledged the write. This is the slowest path but gives the strongest guarantee against data corruption during the unexpected.
The middle ground is to have a SLOG (which I believe is the default in TrueNAS), which allows the data to be written to a SLOG (which by default is part of the pool). In this default scenario performance is somewhat degraded, as the disks are used for both the SLOG and permanent storage. The advantage, however, is that after an unexpected power loss the ZIL can be read back from the SLOG to complete the last transactions that were previously considered written, and it’s high fives all around.
The purpose of a dedicated SLOG (not part of the pool) is thus to speed up this process. We want a SLOG that is as close as possible to RAM speed, so that it can record transactions and the confirmation of a completed write is sent back over NFS sooner. Although the data has not been written to the pool yet (it’s in memory and in the SLOG), the confirmation can be sent to the other end because the SLOG provides a “persisted guarantee”, and this may improve performance (provided the SLOG is faster than the SSDs). Lastly, the SLOG is never used as the source when writing to the disks (data is still written from memory); the SLOG is only a “backup” of data still to be written, for use after an unexpected power loss. So the SLOG is simply a temporary space that offloads the intent log for in-memory transactions to persistent storage, allowing an “earlier” guarantee for SYNC writes.
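
To make the write path described above concrete, here is a toy Python model. It is only a sketch of the idea, not ZFS code: the class `ToyPool` and its methods are hypothetical names, and real ZFS behaviour (transaction group timing, record sizes, on-disk layout) is not modelled.

```python
class ToyPool:
    """Toy model of the sync/async write path; not real ZFS code."""

    def __init__(self):
        self.ram = {}    # dirty data waiting for the next transaction group
        self.log = []    # intent log (ZIL), on the pool or on a dedicated SLOG
        self.disk = {}   # permanent pool storage

    def write(self, key, value, sync):
        self.ram[key] = value
        if sync:
            # A sync write is acknowledged only once it is also in the log.
            self.log.append((key, value))
        return "ack"

    def commit_txg(self):
        # Data reaches the pool from RAM, never from the log.
        self.disk.update(self.ram)
        self.ram.clear()
        self.log.clear()  # logged entries are no longer needed

    def crash_and_import(self):
        # RAM contents are lost; the log is read back only in this case.
        self.ram.clear()
        for key, value in self.log:
            self.disk[key] = value
        self.log.clear()


pool = ToyPool()
pool.write("a", 1, sync=True)    # logged, survives the crash
pool.write("b", 2, sync=False)   # only in RAM, lost in the crash
pool.crash_and_import()
print(pool.disk)
```

In this model only the sync write is recovered after the crash, and during normal operation the log is written but never read.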

So here are the questions:

Is my understanding above correct?
It seems that over the years the opinion on having a mirrored SLOG has changed. I understand the logic behind mirroring it, as this gives us somewhat of a guarantee that even in the unexpected (with a further unexpected SLOG failure during the same period), the SLOG will still be available. Without the SLOG (so a SLOG failure during an unexpected failure), TrueNAS will not be able to mount the pool, and we’ll need to boot into an emergency mode and run some commands to bring the pool back up, and then start assessing the damage. It does however seem that the opinion has shifted with better storage options such as the Radian or Optane drives. It seems that the Radian devices offer power loss protection, as do the Optane devices. With these devices in place, is it REALLY necessary to have mirrored SLOGs? My biggest concern is that the R630 only has 3 PCIe slots, and if we have 2 pools / vdevs, we will need 2, and we want to add an additional 10Gbit SFP+ card for network redundancy. I understand it’s a matter of weighing risk, but I’m curious to know whether SLOGs in 2024 are still mirrored with better hardware in place.
Assume I know nothing about Optane or Radian devices (as that would be the safest assumption). From what I can gather, the Radian devices offer almost unlimited TBW / endurance, whereas Optanes have much greater endurance than standard enterprise SSDs but still have a “limit”. What is currently considered the better option between, say, a Radian RMS-200 and an Optane 900P/905P? I cannot really seem to find RMS-300s on eBay, only a couple of RMS-200s. As the R630 is DDR4, my logical opinion is that RMS-200s will by default be slower, simply because of the memory speed difference. I assume that the Optane 900P/905P will be faster in this regard, but obviously has the pitfall of “eventually” reaching the TBW limit. What is considered the industry standard with regard to these?
If we go the Radian route (most seem to be 8GB units), will this suffice on a 10Gbit network with 4x 4TB drives in a pool? If my math is correct, the maximum traffic we will reach over a 10Gbit network will be around 1.25GB per second. If the persistent storage has a hard limit of around 500MB/s (SSD), then without a SLOG in place we’re looking at around 3 seconds to have an entire 1.25GB written (best case scenario). With a Radian RMS-200 and its 8GB storage cap, we should still be within the limit (considering best case scenarios). I know my math here is probably not a real world scenario, so I’m hoping someone can give me a better explanation.
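
The back-of-the-envelope numbers in the last question can be checked with a few lines of Python. This is only a sketch using the figures from the post (a 10 Gbit/s link and an assumed 500 MB/s pool write limit); the variable names are made up for illustration and protocol overhead is ignored.

```python
LINK_GBIT_PER_S = 10     # network speed quoted in the post
POOL_MB_PER_S = 500      # the post's assumed SSD write limit

# 1 Gbit/s = 1000 Mbit/s, and 8 bits make one byte.
link_mb_per_s = LINK_GBIT_PER_S * 1000 / 8

# Time for the pool to absorb one second of line-rate traffic.
drain_seconds = link_mb_per_s / POOL_MB_PER_S

print(f"link: {link_mb_per_s:.0f} MB/s, drain time: {drain_seconds:.1f} s")
```

Under these assumptions one second of line-rate traffic is 1250 MB and takes 2.5 s to write, in line with the “around 3 seconds” estimate.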

Davvo · April 16, 2024, 11:15am

Mostly so. The major misunderstanding is that a SLOG does not hold any real data, but a log of what has been safely written and what has not (yet): sync writes tell the SLOG “Hey, I am done!” for every write.

etorix · April 16, 2024, 11:34am

Mostly, yes.
Sync + SLOG is still closer to sync than to async in terms of performance.
The ZIL is built in: it is a stripe of little areas on each drive of the pool. A SLOG is a dedicated device serving the ZIL’s purpose; it is optional, and only makes sense if the SLOG is faster than the pool. If your pool is made of data centre-grade SSDs with PLP, even Optane may not make sense as a SLOG over the built-in ZIL.
Whether to mirror the SLOG depends on how paranoid you are, really. You correctly understand that a SLOG is only ever read in the event of an unclean shutdown, so the scenario for data loss is an unclean shutdown or crash AND the SLOG not coming back up on reboot.
You need a SLOG per pool (not per vdev), but if you do have two pools, and want a mirrored SLOG for each, it would be possible (though not supported by the GUI) to partition two Optane drives and stripe a partition from each to make two mirrored SLOGs. Optane has enough IOPS to tolerate double duty.
I suppose the “industry standard” would be Optane DC P4800X/4801X/5800X over the consumer variant 900p/905p. And a Radian RMS should be faster than Optane, by virtue of being genuine RAM, backed by battery.
Pool capacity is irrelevant. ZFS will cache at most two “transaction groups”; default txg is 5 s, so that is 10 seconds worth of transactions. 10 s * 10 Gb/s = 100 Gbits = 12.5 GBytes (but more like around 10 GB because not every bit will make a data byte)
With a fast SSD pool behind it, an 8 GB RMS-200 is possibly enough, but you’d want a 16 GB RMS-300 to be sure that SLOG capacity is never going to be a throttle.
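
The sizing rule of thumb above (two transaction groups of 5 s each at full line rate) can be written as a small Python sketch. The function name `slog_gb_needed` and its parameters are hypothetical, and the result is an upper bound that ignores protocol overhead and any pool-side throttling.

```python
def slog_gb_needed(link_gbit_per_s, txg_seconds=5, txgs_held=2):
    """Worst-case sync data in flight, in GB, at full line rate."""
    gb_per_s = link_gbit_per_s / 8          # 8 bits per byte
    return gb_per_s * txg_seconds * txgs_held


needed = slog_gb_needed(10)                 # 10 Gbit/s link from the thread
print(f"worst case: {needed} GB")

# Capacities as quoted in the thread.
for name, capacity_gb in (("RMS-200", 8), ("RMS-300", 16)):
    verdict = "covers" if capacity_gb >= needed else "falls short of"
    print(f"{name} ({capacity_gb} GB) {verdict} the worst case")
```

With these inputs the worst case is 12.5 GB, which is why 8 GB is only “possibly enough” while 16 GB leaves headroom.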

DigitalMinimalist · April 16, 2024, 12:13pm

Get one or two new Intel Optane P1600X 58GB drives for $40 each from Newegg or Amazon for SLOG…

My personal conclusion after many threads regarding SLOG and L2ARC:

Use a SLOG, ideally Optane, if you use NFS, iSCSI, or another protocol that relies on sync writes. If you only use SMB: probably not.

L2ARC: maximize your RAM first, and only consider it if your ARC has a low hit rate.