Let us take the SLOG discussions to a new thread, out of the “L2ARC” discussions.
Seems like a “SLOG tuning guide and common misconceptions” Resource is needed.
Is it still though?
Even a modern prosumer SSD like a Samsung 990 pro has 1 DWPD for almost two years.
Combine that with a 2TB model that has 1200TBW and you get
1200 TBW / 0.675 TB per day = about 1778 days, or roughly 4.9 years of L2ARC filling at 100% before you reach TBW.
First, the 0.32 DWPD that Samsung quotes for the 990 Pro is for sequential writes. Enterprise drives that quote both sequential and 4K random write endurance usually show 3-10x the endurance for sequential. Even at the low end, that turns the 990 Pro into a 0.1 DWPD drive over 5 years.
Second, Samsung likely means it when they say 0.32 DWPD over a 5 year period. If you accelerate the writes to more than that, you void the warranty. So at 1 DWPD, where the drive lasts about two years, you’d be paying $80/year ($160 for a 2TB right now) for the drive. You can buy a brand new 1.92TB Samsung PM983 (enterprise M.2) for $135. That’s less than the price of the consumer model and far less per year ($27/year with a 5-year lifespan). Yes, the enterprise drive is only PCIe 3.0, but that doesn’t matter much.
If you can get a more reliable drive for less money, why not do it?
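The arithmetic in the post above can be reproduced with a small sketch. The function names `dwpd` and `cost_per_year` are made up for illustration, and the inputs are simply the figures quoted in this thread (1200 TBW, 2TB, $160, $135), not fresh prices or measurements.

```shell
#!/bin/sh
# Hypothetical helpers; all figures come from the posts above.

# dwpd TBW CAPACITY_TB YEARS -> drive writes per day over that period
dwpd() {
    awk -v tbw="$1" -v cap="$2" -v years="$3" \
        'BEGIN { printf "%.3f\n", tbw / (cap * 365 * years) }'
}

# cost_per_year PRICE YEARS -> purchase price spread over the expected lifespan
cost_per_year() {
    awk -v price="$1" -v years="$2" 'BEGIN { printf "%.2f\n", price / years }'
}

dwpd 1200 2 5          # 990 Pro 2TB, rated TBW spread over 5 years
dwpd 1200 2 2          # the same TBW used up in 2 years
cost_per_year 160 2    # 990 Pro worn out in 2 years
cost_per_year 135 5    # PM983 lasting 5 years
```

The first call prints roughly 0.33, which lines up with the approximately 0.32 DWPD quoted for the 990 Pro.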
If. For me a PM983 is almost three times the price of a 990 Pro.
I still think that the TBW of consumer drives is high enough for most workloads, since you are probably not writing to the drive all the time, and since most SSDs survive way longer than their rated TBW.
Even more so if you monitor the write load:
Code for TN CORE - adapt paths and device names for TN CE:
#! /bin/sh
# Print the SMART wear level of every SSD as "<metric> <value> <timestamp>".
PREFIX='servers.'
SMARTCTL='/usr/local/sbin/smartctl -x'
time=$(/bin/date +%s)
hostname=$(/bin/hostname | /usr/bin/tr '.' '_')
# SATA/SAS disks (ada*, da*) and NVMe devices (nvme*)
drives=$(/bin/ls /dev | /usr/bin/egrep '^(a?da|nvme)[0-9]+$')
for drive in ${drives}
do
    case ${drive} in
    nvme*)
        wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used:/ { printf "%d", $3 }')
        ;;
    da*|ada*)
        wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used Endurance Indicator/ { printf "%d", $4 }')
        ;;
    esac
    # catch the case that $drive is not an SSD ...
    if [ "x${wear}" != 'x' ]
    then
        echo "${PREFIX}${hostname}.diskwear.${drive}.wear-percent ${wear} ${time}"
    fi
done
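The script prints one line per SSD in the form `name value timestamp`, which is the shape Graphite's plaintext protocol expects. As a hedged sketch of how the same output could also drive a simple alert, the hypothetical `wear_alert` filter below reads such lines on stdin and reports any drive at or above a threshold. The function name, the default of 80 percent and the sample lines are assumptions for illustration and not part of the original script.

```shell
#!/bin/sh
# wear_alert [THRESHOLD]: read "metric value timestamp" lines on stdin and
# report every wear-percent metric at or above THRESHOLD (default 80).
wear_alert() {
    awk -v limit="${1:-80}" \
        '$1 ~ /\.wear-percent$/ && $2 + 0 >= limit + 0 { printf "WARN %s %d%%\n", $1, $2 }'
}

# Example with made-up lines in the format the script above emits.
printf '%s\n' \
    'servers.nas.diskwear.nvme0.wear-percent 3 1700000000' \
    'servers.nas.diskwear.ada1.wear-percent 85 1700000000' | wear_alert 80
```

With the sample input only the `ada1` line is reported.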
I realize that I’m many months late to this discussion, but I think it’s interesting and further discussion can help others understand things better.
I believe that your disconnect comes from thinking that the costs (not dollar) involved are the same with RAM vs SSD/nvme when they are vastly different.
Even on an older server that is using DDR3, the read and write performance of RAM is much better than even a PCIe Gen 5 nvme. The metrics that matter are throughput (GB/sec), latency (ns/ms), IOPS, and device wear (yes, even DRAM wears; it’s just so slow that it is measured in decades).
For those reasons a cache in RAM is always better than a cache on SSD/nvme.
Then there are huge differences in SSD/nvme performance (I will simply say SSD for both unless I specifically mean nvme).
When people buy consumer SSD’s, they are often looking at TB/dollar as a primary consideration (and the marketing speeds), but do not look at the details of the drive.
As you use up more capacity on the drive, there are fewer cells available to be used as Pseudo-SLC cache. This explains why a super fast Gen5 nvme may start copying data at a very high speed, but then slow down to < 500MB/sec for the rest.
Also, any data written to a Pseudo-SLC cache must be re-written to another portion of the drive using its native TLC/QLC cells. This background data transfer can further limit the drive’s write performance for a large copy. Some drives may slow to <300MB/sec because of this.
Some drives offer a host-based memory cache (Host Memory Buffer, HMB). This is where a small amount of system RAM is used as a metadata cache. This amount is usually only 64MB. It can make a significant difference compared to no cache.
Better drives have a small DRAM cache. In general, this tends to be about 1GB per TB of drive capacity. Most of this capacity is used for the metadata cache and can usually fit all or most of the metadata. What is left over can be used as a read and/or write cache for data. This part of the cache is small enough that it is generally only useful for small writes, but the drive is able to report back that those small writes are complete almost instantaneously. This can make a difference in perceived responsiveness, especially when the drive is moving data from the SLC cache to TLC/QLC.
Any SSD without a cache (either host or DRAM) must instead use a portion of the flash as a cache for the metadata. Not only is this slow, but it causes more write wear.
The metadata for an SSD is extremely important. Where the controller places the data on the flash is completely different to where the OS/filesystem thinks that the data is. While the OS/filesystem may think that a file is in a single continuous location, in reality it may be in the SLC cache, and then a few seconds later it is on the TLC/QLC flash, which may be split between multiple flash chips.
Commercial SSD’s generally use SLC and sometimes MLC flash. They generally have DRAM caches. Using SLC flash means that they can maintain higher write speeds than consumer drives. They also have much higher write endurance.
Now we can understand that memory has much, much better write endurance than even commercial SSD’s. The difference is several orders of magnitude. This is an incredibly important consideration when discussing ARC vs L2ARC.
The ARC uses very fast (throughput, IOPS, and latency) memory, and that memory can ignore write wear. The ARC has two types of cache that are important when comparing it to L2ARC: MRU (Most Recently Used) and MFU (Most Frequently Used). The ARC happens to cache data as MRU when it enters the cache, and then “upgrades” it to MFU once that data is requested again. How the ARC does that is not important for this discussion.
What is important is that by definition, the ARC caches MRU data.
MRU data has a much higher probability of not being needed again (at least within a relatively short period of time) than MFU. A log file is a good example. Most likely as the computer writes the log file it will not read that data within a short period of time. Another example is someone watching a video. That data will get stored in the MRU cache, but it is highly unlikely that it will be needed again in a short period of time.
If you decide that you will log into a terminal (locally or via ssh) and need to “cat” a file, then the ARC will cache the data for the program “cat” and the data being “cat’d” into its MRU cache. If you then “cat” a different file, the data for “cat” would get moved into the MFU cache, but the file being “cat’d” would go into the MRU cache.
That should be simple enough to understand the basics. Of course the ARC is pretty intelligent and does much more in the background.
For best performance, you should generally only use an L2ARC for MFU data. Like I said before, MRU data is less likely to be reused (in a specific amount of time) than MFU data. MRU data that is evicted from the ARC is even less likely to be used again. If, for no other reason than write wear considerations, you had to choose what data to write to an SSD, it should always be ARC MFU evictions.
An ARC eviction is simply data in a cache that the ARC decides is less valuable than newer data. Newer data can either be a new MRU entry or an MFU that gets used more often.
So SSD write wearing is the #1 reason to only cache MFU evictions in the L2ARC, but it is not the only reason.
The bus used by the SSD, which can be SATA or PCIe, has limits. Obviously SATA generally has much lower limits than PCIe. SATA III is 6Gbps or about 600MB/sec not including overhead. This is why most SATA SSD’s have read speeds that top out at about 540MB/sec. While a Gen 5 x4 nvme has a bus speed of 16GB/sec, the flash chips are not that fast. While a good drive with a good controller and SLC flash chips might see ~15GB/sec for sequential writes, that is not the kind of write that the L2ARC will see. It generally sees random writes, as the data written to the L2ARC is ARC evictions.
A good commercial Gen 5 x4 nvme with SLC may get about 5GB/sec for random writes. Most consumer drives can only sustain that speed for a short time, and if the controller is copying data out of the Pseudo-SLC cache while more write data is incoming, the apparent write speed will be much slower.
It is not uncommon for a SATA drive with a DRAM cache and a decent controller with a good amount of Pseudo-SLC cache to perform nearly as fast as a consumer Gen 5 x4 nvme for large random writes.
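For reference, the bus ceilings mentioned above follow from the link rate and the line encoding. The sketch below assumes the usual figures (SATA III at 6 Gbit/s with 8b/10b encoding, PCIe 3.0 and 5.0 at 8 and 32 GT/s per lane with 128b/130b encoding); `link_mb_per_s` is a name I made up. These are ceilings after encoding only, so protocol overhead pushes real numbers lower, which fits the ~540MB/sec seen on SATA drives.

```shell
#!/bin/sh
# link_mb_per_s GIGATRANSFERS_PER_LANE LANES PAYLOAD_BITS TOTAL_BITS
# Prints the usable link bandwidth in MB/sec after line encoding.
link_mb_per_s() {
    awk -v gt="$1" -v lanes="$2" -v payload="$3" -v total="$4" \
        'BEGIN { printf "%.0f\n", gt * 1000 * lanes * payload / total / 8 }'
}

link_mb_per_s 6 1 8 10        # SATA III
link_mb_per_s 8 4 128 130     # PCIe 3.0 x4
link_mb_per_s 32 4 128 130    # PCIe 5.0 x4
```

This prints 600, 3938 and 15754 MB/sec respectively, so the "16GB/sec" for Gen 5 x4 is the rounded-up bus limit, not something the flash behind it will deliver for random writes.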
Having the L2ARC store both MRU and MFU ARC evictions means more random writes, which reduces the write performance of the SSD for the sake of data that is unlikely to be needed in a reasonable amount of time.
In a server environment, I believe that this concept is even more important. When using a consumer SSD for zfs cache (i.e. L2ARC, slog, special vdev) this is even more important.
The most efficient way to reduce the size of the L2ARC while maintaining its efficiency is to try to only use it for data that is likely to be reused. The data that is likely to be re-used is NOT the ARC MRU evictions, but rather the ARC MFU evictions.
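As far as I know, OpenZFS 2.0 and later expose a tunable for exactly this, `l2arc_mfuonly`. The sketch below only prints the command for a platform instead of running it. The locations used (`/sys/module/zfs/parameters/l2arc_mfuonly` on Linux-based systems, the `vfs.zfs.l2arc.mfuonly` sysctl on FreeBSD-based TN CORE) are my assumption of where the tunable lives and should be checked against your release; `mfuonly_cmd` is a hypothetical name.

```shell
#!/bin/sh
# mfuonly_cmd [OS]: print (not run) the command that would restrict L2ARC
# feeding to MFU evictions. OS defaults to the output of `uname -s`.
# ASSUMPTION: the tunable locations below; verify them before use.
mfuonly_cmd() {
    os=${1:-$(uname -s)}
    case "$os" in
        Linux)   echo "echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly" ;;
        FreeBSD) echo "sysctl vfs.zfs.l2arc.mfuonly=1" ;;
        *)       echo "unsupported platform: $os" >&2; return 1 ;;
    esac
}

mfuonly_cmd Linux
mfuonly_cmd FreeBSD
```

A value set this way would presumably not survive a reboot, so on TrueNAS it would have to go into whatever tunable or init-script mechanism your version offers.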
Any data that the L2ARC cache evicts without it ever being read is data that increased the SSD write wear for no purpose. The bandwidth to the device, the Pseudo-SLC cache, and the internal bandwidth to the flash chips that those writes used up could have been used for other purposes.
The L2ARC also uses compression. This is even more beneficial because the cpu time to compress an uncompressed block is much less than the time to transfer that data to/from the L2ARC.
I do not believe that compression is an issue when discussing L2ARC MFU/MRU caching, but wanted to bring it up. I believe that the average compression ratio is about 2:1.
While I agree that a single SSD should never be used for L2ARC, SLOG and/or Special at the same time as that can create serious write contention issues on an SSD, especially a consumer SSD; I feel that using a single SSD for multiple L2ARC caches may be beneficial.
You must configure each ZFS pool with its own L2ARC. Depending on your use-case, it may make sense to configure multiple L2ARC’s on a single SSD. To do so you must allocate a partition on the SSD to a ZFS pool for use as L2ARC.
For consumer SSD’s, you should already be creating a partition on the SSD in order to ensure that a decent sized Pseudo-SLC cache is guaranteed. What a “decent sized” Pseudo-SLC cache is will vary greatly by drive type (and its features) and by how many writes will happen on that drive (another reason to only use the L2ARC for MFU data). A consumer drive that has a 10GB SLC cache and has DRAM might only need 5-10% of the drive capacity reserved for the Pseudo-SLC cache. A drive without a real SLC cache (most consumer SSD’s do not have a true SLC cache) and no DRAM may need 10-20% reserve, or more.
If you have 3 ZFS pools that you wish to have an L2ARC cache and can only use 1 or 2 SSD’s it may be worthwhile to configure those pools with smaller L2ARC’s. Using a larger SSD you can then configure the size of each partition to match the size of L2ARC you want to allocate to each pool. Larger SSD’s tend to perform better than smaller SSD’s.
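Once the partitions exist, attaching one to each pool is a `zpool add <pool> cache <device>` per pool. The sketch below is a dry run that only prints those commands; the pool names and partition paths are placeholders I made up, and `plan_l2arc` is a hypothetical helper.

```shell
#!/bin/sh
# plan_l2arc "POOL:PARTITION" ...: print one `zpool add` command per pair.
# Nothing is executed; pool and partition names below are placeholders.
plan_l2arc() {
    for pair in "$@"; do
        pool=${pair%%:*}    # text before the first colon
        part=${pair#*:}     # text after the first colon
        echo "zpool add ${pool} cache ${part}"
    done
}

plan_l2arc \
    tank1:/dev/gpt/l2arc-tank1 \
    tank2:/dev/gpt/l2arc-tank2 \
    tank3:/dev/gpt/l2arc-tank3
```

A cache device can be taken out again with `zpool remove`, so trying a smaller partition first is cheap.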
Smaller SSD’s (128GB/256GB) tend to use fewer flash channels than larger SSD’s. Larger SSD’s tend to have better controllers than smaller SSD’s (comparing 128GB/256GB to 1TB and larger; 512GB drives are sometimes built like small drives and sometimes like large drives).
SSD’s with DRAM tend to use 1GB per TB of size. So a 1TB drive would have 1GB of DRAM while a 4TB drive would have 4GB. While it doesn’t make much sense to spend the extra money for 4GB of RAM, sometimes you can find previous generation (or 2 generations old) 2TB SSD on sale for close to the price of a current generation 1TB SSD.
The 2TB SSD will generally have twice the write endurance of a 1TB drive, and it’s likely that for heavily random operations a Gen 3 nvme will perform very close to a Gen 5 nvme. It is extremely unlikely that an L2ARC is going to be streaming sequential data large enough to make Gen 5 noticeably faster.
Even if you do not have enough RAM in the server to support the 2TB SSD, you can simply partition it like you would have partitioned a 1TB SSD, gaining you a larger DRAM cache and a larger guaranteed Pseudo-SLC cache.
===================
I know that a commercial server should never use consumer SSD’s, let alone put multiple L2ARC caches on a single SSD. Using the same theory, a commercial server should not even use a L2ARC. Simply replace your server with a newer server CPU that can have up to 6TB of RAM, each. For that matter, simply only use SSD’s for the storage pools (using commercial grade SSD’s).
That is not reality. Reality is that small business, churches, small non-profits, etc need file servers. They don’t have the budget for a $10+k cpu, let alone the motherboard, RAM, HBA’s, and drives.
Their file server may be on a motherboard that only supports 8GB or 16GB of RAM, but they need more performance.
There are home users that want a bit more performance from their server. Both of these groups are likely using consumer CPU’s, not Intel Xeons or AMD Threadrippers/EPYC CPU’s. It is said that you should ALWAYS use ECC RAM for a file server, no matter the filesystem. How many file servers are in use without ECC RAM?
Unfortunately when they research (if they research) about SSDs almost everything that they read is about a desktop computer, and more than likely a desktop computer from a gaming perspective. They likely still are hearing that any SSD is better than a HDD and then see the marketing for peak read/write performance for nvme’s and are “wowed”.
They face similar issues if they look into buying/building a new file server. I’ve seen new motherboards with up to 6 M.2 nvme slots using a consumer CPU. They likely don’t understand that an Intel consumer CPU only supports 20 PCIe lanes, 4 of which are routed to the chipset. Most of these motherboards route 8 lanes to the x16 slot and x4 to 2 of the M.2 nvme slots. All other PCIe slots are routed through the chipset which then are shared over the x4 link between the CPU and chipset. This includes the other 4 M.2 nvme slots, any other PCIe slots, and many other IO devices.
AMD consumer CPU’s are not much better, they have 24 PCIe lanes on the CPU of which x4 goes to the chipset.
These are the type of users that can see a performance gain using consumer SSD’s for L2ARC cache.
Using a file server that has 3 ZFS pools, it may be that one pool could make use of 200GB, and the other 2 pools may only need 100GB each.
They may use a single 512GB SSD with 3 partitions to accomplish this, which would leave about 100GB unallocated and reserved for the Pseudo-SLC cache.
An upgrade to that could be using 2 SSD’s. Each partition on the SSD’s could be 1/2 the size or 100GB, 50GB, 50GB. If using two 512GB SSD’s that would leave about 300GB reserved for the Pseudo-SLC cache.
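The leftover space in both layouts is simple subtraction. A quick sketch, with sizes in nominal GB as in the example above; `unallocated` is a made-up helper name.

```shell
#!/bin/sh
# unallocated SSD_GB PARTITION_GB...: print the GB left unpartitioned,
# i.e. the space kept free for the Pseudo-SLC cache.
unallocated() {
    total=$1
    shift
    used=0
    for part in "$@"; do
        used=$((used + part))
    done
    echo $((total - used))
}

unallocated 512 200 100 100    # one 512GB SSD carrying all three L2ARCs
unallocated 512 100 50 50      # each of two 512GB SSDs carrying half
```

This prints 112 and 312, so the "about 300GB" in the two-SSD layout is per SSD.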
If they used two larger 1TB SSD’s, then in the future if they were able to upgrade the system RAM they could then add more partitions on the SSD’s and assign them to the pools to increase the L2ARC size.
Adding additional devices to a pool’s L2ARC means that the read/write load is split between them. Continuing to add more devices brings smaller and smaller improvements in performance. Two devices increase peak performance by 100% over one, but a third only adds another 50% on top of two (three times the performance of a single SSD).
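The diminishing return is just 1/(N-1). A small sketch, assuming ideal scaling where N devices give N times the peak of one (real hardware will do worse); `marginal_gain` is a hypothetical name.

```shell
#!/bin/sh
# marginal_gain N: percent gain in ideal peak throughput from adding the
# Nth cache device to a set of N-1 (assumes perfect load splitting).
marginal_gain() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", 100 / (n - 1) }'
}

for n in 2 3 4; do
    echo "${n} devices: ${n}x peak, +$(marginal_gain "$n")% from the last one added"
done
```

That gives +100%, +50% and +33% for the second, third and fourth device.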
I have learned the hard way over many years to only create disk mirrors. If the data is more important I will use a 3 disk mirror. If I create multiple pools, I will try to use the same disks for each pool, and have a hot spare. That way if a single disk fails, I can simply attach the hot spare to the correct pool and detach the bad disk.
I do this with consumer hardware even though I’ve used RAID5 for decades on commercial level hardware.
Using the 3 pool example, it is often likely that only 1 or 2 pools are being used much at one time. This means that multiple L2ARCs on a single (or preferably two) devices can work.
The reason to never combine a slog and/or special vdev with any other use on the same partition is that they are both very heavy write biased and failure of the device means failure of the pool, which is why it is recommended that these devices use at least the same redundancy as the pool.
Having multiple partitions of those types of devices can cause write contention on an SSD, especially a consumer SSD.
An L2ARC cache should stabilize to very few writes over time, if properly configured. At that point it is almost 100% reads. Storing MRU evictions means that the L2ARC cache never stabilizes, as there is almost always new MRU data being cached by the ARC and then being discarded in favor of MFU or newer MRU data. This means that the ARC will evict the discarded MRU data and send it to the L2ARC. The new MRU data will eventually require that the L2ARC evicts other data in order to fit the new data.
The only way to combat that is to use a much larger L2ARC which will require more memory from the ARC which will negatively affect the performance of the ARC, including evicting more data to the L2ARC.
arc_summary provides data on how much L2ARC it thinks you may need.
arc_summary |grep -i l2
That data is only useful after several days of normal usage. You will likely see that “L2 eligible MRU evictions” is much higher than “L2 eligible MFU evictions”. The reason is that MRU evicts data much more often because it is less likely to be needed.
If you create an L2ARC based on “L2 eligible evictions:” you will be basing the size mostly on MRU evictions. Sizing the L2ARC based on MFU plus growth and disabling MRU on the L2ARC means a much smaller L2ARC which needs much less ARC memory and much less write wearing on the SSD(s).
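To put a number on the MFU share, the counters behind those arc_summary lines can be read directly. This sketch assumes the kstat names `evict_l2_eligible_mfu` and `evict_l2_eligible_mru` and the three-column `name type data` layout of `/proc/spl/kstat/zfs/arcstats` on Linux-based systems; on TN CORE the same counters should be reachable through sysctl instead. `mfu_share` is a made-up helper and the sample values are invented.

```shell
#!/bin/sh
# mfu_share [FILE]: print L2-eligible MFU and MRU eviction totals and the
# MFU share. FILE defaults to the Linux arcstats kstat (ASSUMPTION).
mfu_share() {
    awk '
        $1 == "evict_l2_eligible_mfu" { mfu = $3 }
        $1 == "evict_l2_eligible_mru" { mru = $3 }
        END {
            if (mfu + mru == 0) { print "no L2-eligible evictions recorded"; exit }
            printf "mfu=%.1fGiB mru=%.1fGiB mfu_share=%.0f%%\n",
                   mfu / 2^30, mru / 2^30, 100 * mfu / (mfu + mru)
        }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Example against an invented sample file: 20GiB MFU and 80GiB MRU evictions.
sample=$(mktemp)
printf '%s\n' \
    'evict_l2_eligible_mfu 4 21474836480' \
    'evict_l2_eligible_mru 4 85899345920' > "$sample"
mfu_share "$sample"
rm -f "$sample"
```

With the sample values only 20% of the eligible evictions are MFU, which is the part an MFU-only L2ARC would have to hold.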
You may find that a 25-100GB MFU L2ARC is plenty to provide a significant performance increase. Some pools might benefit from even a 10GB L2ARC (on a RAM deficient server).
I have seen way too many small file servers where someone tried to “fix” a RAID5/6 (RAIDz/RAIDz2) and destroyed the pool. Furthermore, most small file servers see much more read than write, which is perfect for RAID1.
I also can only see very few use cases for a small file server to have a single pool with multiple raidz vdevs. The idea that a single vdev failure can take out a single huge pool is scary.
I prefer using a couple of pools with 2 drive mirrors plus a third pool if needed with a 3 drive mirror plus a hot spare drive that can be attached to any pool. That is a total of 8 drives maximum.
On one of my home servers using proxmox I have a TrueNAS vm that has 8 passthrough SATA ports. I have 3 pools of two mirrored 4TB NAS drives. Each pool can provide >200MB/sec of read speed.
I then have another VM that has one iSCSI LUN connected to each of those pools. Those iSCSI LUN’s have a target file of 2.2GB. Those target files see a compression ratio of 1.2:1. On the VM, I used mdadm to create a RAID0 from those 3 iSCSI drives. The theoretical maximum transfer rate is 200MB/sec * 1.2 * 3 = 720MB/sec; I have seen a maximum of over 650MB/sec.
This VM is for gaming and I’m still waiting for more data to determine if using 2 SATA SSD’s for L2ARC caching would be useful.