NVMe Async Buffering - is there a cache for that?

I’ve read hundreds of posts regarding the types of caches, whether you need them, whether they will benefit your setup, and what they can and cannot do for you. I’ve read probably 5x as many responses saying “no, you don’t need that” and “my god don’t enable metadata cache and cause a single point of failure”. What I’ve not been able to find is a factual guide that describes which stats/values need to be evaluated to determine whether a cache will provide any benefit, and what commands and/or tests provide that information. Does such a guide exist, and if so, where can one find it?

Secondly, in particular using platter drives, my understanding is that one should get roughly [# vdevs] x [individual drive throughput] read and write speed. Is that statement accurate, and if not, how and why? By far the most confusing feature is SLOG - I have read it stated many times that although the SLOG is a write cache, it provides no benefit unless you are using synchronous writes. This is very confusing - with NVMe drives that have >50x the write speed (network limitations of course being a bottleneck), it would seem that a 4TB NVMe would be able to buffer writes at full network speed until up to 4TB had accumulated, after which write speeds would slow to the maximum allowed by the platter drives as the buffer clears. So why does a SLOG provide write speed benefits only when using synchronous writes? Is there a cache type/configuration that allows asynchronous writes to be buffered until the NVMe cache is full, whereupon data is transferred after the fact from the NVMe buffer to the slower platter drive storage dataset? My personal observations indicate that transferring to an NVMe dataset completes at the maximum speed of the network (assuming the NVMe is not full). A 10Gbps LAN, with a transfer speed of 1.25 GB/s, is faster than platter drive write speed.

I don’t mean to ask the same questions that have been asked hundreds of times before, but I’m still missing something. Any info is much appreciated. Thanks!

I am sorry to have to tell you that almost everything you have written here is wrong. Performance is complicated, and unfortunately you have either misunderstood how ZFS works, or you understand it but have generalised a performance fix for a specific use case into a general one.

  1. When you are talking about “speed” you need to differentiate between IOPS and throughput. IOPS is important when you have virtual disks/zVols/iSCSI or database files which do very frequent small 4KB random reads and writes - but when you are doing these, avoiding read and write amplification is even more important. Throughput (and NOT IOPS) is what you need to be concerned about for normal sequential reads and writes to files larger than 128KB, and throughput depends on the number of non-redundant drives and not on the number of vDevs. (See the fio sketch after this list for how to measure each.)

  2. There are only two types of actual cache - ARC which ZFS does as standard in main memory, and L2ARC. No other type of vDev is a cache. Metadata vDevs are not cache, SLOG is not a write cache.
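
If you want to see the difference between the two kinds of “speed” for yourself, fio is the usual tool. A minimal sketch, assuming a scratch dataset mounted at /mnt/tank/test (a hypothetical path - use your own pool, and bear in mind ARC will flatter read numbers):

```
# Sequential throughput - what matters for large file copies:
fio --name=seq-write --directory=/mnt/tank/test --rw=write --bs=1M --size=10G \
    --ioengine=posixaio --iodepth=4 --numjobs=1 --end_fsync=1 --group_reporting

# Random 4K write IOPS - what matters for zVols/iSCSI/databases:
fio --name=rand-write --directory=/mnt/tank/test --rw=randwrite --bs=4k --size=4G \
    --ioengine=posixaio --iodepth=16 --numjobs=4 --end_fsync=1 --group_reporting
```

The first reports MB/s, the second IOPS; comparing the two against your pool layout tells you which one you are actually short of.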

Synchronous Writes

Synchronous writes are a specific solution to a specific use case - either to confirm to an app that transactional data has absolutely been committed to non-volatile storage OR to ensure integrity of virtual block devices (virtual disks) where ZFS has no knowledge of the internals and cannot guarantee the integrity itself e.g. in the event of a power outage.

Writes are either asynchronous (where they are stored in memory for ~5 seconds and written out as a group, and where the write is immediately acknowledged so that the app can continue before the write has been committed to disk) or synchronous (where the same occurs BUT ALSO a write is made to a special area of your pool called a Zero Intent Log (ZIL), and where the app cannot continue until that has been done). As you might imagine, synchronous writes are slower because the app has to wait, but when the pool is HDD the writes are hideously slow because they involve a seek to the ZIL area. So you don’t do synchronous writes unless you absolutely need to, and they are generally only needed for virtual disks/zVols/iSCSI and database files doing 4KB reads and writes - and if you are doing these then you want them on mirrors (both to avoid amplification and to get IOPS). If the data is too big for SSDs then you need to redirect the HDD ZIL writes to a specialised SSD vDev called an SLOG, or ideally store the data on SSDs (for more IOPS, and usually avoiding the need for a separate SLOG).

But if you are doing sequential reads and writes to data files you do not need synchronous writes, and ideally you would use normal ZFS files for these and not store them on a zVol/virtual disk because of the sequential write overheads and because you will also benefit from sequential pre-fetch.
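
If you want to see which of your datasets are doing synchronous writes, and what a SLOG would actually be absorbing, a rough sketch (tank/vmstore is a hypothetical dataset name, and the /dev/disk/by-id paths are placeholders):

```
# How a dataset treats sync requests: standard (honour them), always, or disabled
zfs get sync tank/vmstore

# Add a mirrored SLOG so ZIL writes land on NVMe instead of seeking on the HDDs
zpool add tank log mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B

# Watch per-vdev activity; the log vdev only sees writes while sync I/O is happening
zpool iostat -v tank 1
```

If `zpool iostat -v` never shows activity on the log vdev, your workload is effectively all asynchronous and the SLOG is doing nothing for you.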

A few more misunderstandings to correct:

  • You talk about NVMe buffering network writes - this happens in memory not on the NVMe drives.
  • You ask why SLOG provides write benefits only for synchronous writes - this is because it is NOT really providing a write benefit at all, but rather just reducing the bad performance impact of synchronous writes.
  • NVMes don’t have platters - and HDD write speeds are not platter write speeds either.

Metadata vDevs

Like SLOG, these transfer some I/Os and some data to separate devices.

The actual name of these is Special Allocation vDevs - when ZFS allocates space in a pool, they work by redirecting specific types of block allocations from the main pool vDevs to the Special vDev(s). This only happens when you write data, so adding a special vDev to an existing pool doesn’t in itself move any data to the special vDev. There are two types of allocation that can be redirected: metadata blocks, and records below a size defined per dataset (often called small files).
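
For concreteness, a sketch of how this is configured (tank and tank/mydata are hypothetical names; the special vDev must be at least as redundant as your data vDevs):

```
# Add a mirrored special vdev - it holds real pool data, not a copy of it
zpool add tank special mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B

# Additionally redirect records of 32K or smaller on one dataset to the special vdev
zfs set special_small_blocks=32K tank/mydata

# See how much has been allocated there - only blocks written after the change land on it
zpool list -v tank
```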

ZFS pools require each and every vDev holding data to be available - lose one vDev completely, and you lose the entire pool, and special metadata vDevs hold data and follow the same rule - which is why they need to be redundant.

So it is NOT a cache - it is just faster access to certain subsets of pool data. And if the specific data is already in ARC, there is no read speed benefit to a metadata vDev.
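
Which is why, before buying special vDev hardware, it is worth checking how well your metadata is already being served from ARC. A small sketch (commands available on Linux/OpenZFS systems):

```
# Whole-ARC summary, including metadata hit/miss statistics
arc_summary

# Raw counters - compare demand metadata hits vs misses
grep demand_metadata /proc/spl/kstat/zfs/arcstats
```

If demand metadata hits dwarf the misses, a special vDev will not noticeably speed up reads.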

TL;DR

Your best bet would be to describe your use case - what data will you be storing, what performance do you need - and let us advise you on how to achieve it.

6 Likes

@Protopia Can you point to any measurements that confirm this?

For a zpool built of 2-way mirror vdevs the number of non-redundant drives and the number of vdevs is the same.

For RAIDzN the ratio between number of non-redundant drives and vdevs varies.

The measurements that I have done and seen indicate that for RAIDzN vdevs the sequential write throughput is somewhere between [number of vdevs] and [number of non-redundant drives] times a single drive’s throughput.

During a resilver (yes, a very special case), I have measured the write speed at one drive’s worth. I measure this whenever I have to replace a drive in a RAIDzN and it has been consistent over the years.

I think it’s more of a way of saying that it’s “more complicated than a single easy metric will account for.”

For example, assuming twelve disks, you might see different results between a 2x6wZ2 and a single 12wZ2 - the additional 2 “data disks” in the 12wZ2 could potentially result in higher peak sequential throughput but at the cost of worse performance under any kind of mixed IO. Fewer VDEVs, but more disks.

It bears testing, certainly. :slight_smile:

1 Like

Here is a pdf that describes approximate performance values.

1 Like

@swc-phil - Thanks! This is what I was looking for!

You talk about NVMe buffering network writes - this happens in memory not on the NVMe drives.

@protopia This was the crux of the post - is there a way to “buffer network writes to NVMe as a secondary cache to memory?” For the sake of the question, and to avoid getting hung up on all the possible implications of IOPS, an example use case would be very large forensic drive images - multiple TB - in a single file or a small number of files. L2ARC except for writes instead of reads.

The “write cache” is your RAM. But ZFS will want to write to disk, for safety, and will not keep an indefinite amount of pending writes in cache. (There are tunables… but messing with these is not a good idea.) So no, you cannot keep terabytes of pending writes.

No - async writes are already cached in memory and then written to disk later. Put in more memory and you can cache more data.

There are tuneables that limit how much dirty data can be held this way (by default zfs_dirty_data_max is 10% of RAM, capped at 4GB) - partly so that pending writes don’t flush the most used items that are already in ARC - but you can override this.

BUT…

An NVMe pool should have write throughput that is more than any network speed - so data should not be held in memory for very long (5-10s) before it is written to disk.

If you are looking for a means to try to keep the image in ARC (or L2ARC) when it has only been written once, then you will need to look at the tuneables list to see if there is anything that might help. BUT if you decide to start tweaking these be prepared for unexpected side effects.

Kind of for sync writes. That is what a SLOG is for. Think about the filesystem like a database.

If you lose power before the disks can write out what was in flight, async writes are lost because memory is volatile, but sync writes will still be stored on the SLOG when power is restored. That’s why a SLOG has to be so fast… you are intentionally introducing a bottleneck. A write basically has to be committed to the SLOG at the same time as it sits in main system memory. No matter how fast NVMe is, it’s slower than RAM.

But that cost comes with a benefit. Given that sudden power cut failure mode, ZFS will basically “Replay” these “logs” on the next import, ensuring those TXGs all get committed by writing them out to disk. You are preserving data that would otherwise have been lost.

I think Jim Salter has the best explanation I have found, here. But TL;DR, the graphic there should help you visualize the write path in ZFS.

https://jrs-s.net/2019/05/02/zfs-sync-async-zil-slog/

In that use case, if the writes are async, the introduction of a SLOG would not help.

2 Likes

The key takeaway is that all writes are cached in memory, but sync writes are also written to the non-volatile SLOG.

I’ve written a detailed resource with some examples of this.

With regards to the tunables, I still have yet to write the resource on that.

Extensive, extreme, or uninformed messing with the tunables is a bad idea. Going from 4GB to 4TB on the zfs_dirty_data_max is likely to result in Novel And Unexpected Behavior - but slightly adjusting it with the knowledge that it will come at the cost of potential read performance (by evicting existing ARC to make room for inbound) may be worthwhile, if you have something that would fit inside a slightly larger burst-write window.
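
If you do go down that road, a minimal sketch of checking and modestly raising it on Linux/OpenZFS (on TrueNAS you would normally set this through the tunables UI rather than echoing into sysfs):

```
# Current dirty-data ceiling, in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# Raise it - note that values above zfs_dirty_data_max_max (default ~4 GiB)
# are clamped unless that module-load-time parameter is raised as well
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
```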

But if it’s larger than that, like a terabyte-sized disk image coming off of FTK, make the NVMe drive a separate pool instead - and if you need to back up the contents to disk, do that with ZFS replication, file-level with an app like Syncthing, or a scripted server-side copy upon completion.
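
A sketch of the replication route, with hypothetical pool names (fastpool for the NVMe ingest pool, hddpool for the spinners):

```
# Snapshot the ingest dataset and replicate it server-side to the HDD pool
SNAP="fastpool/ingest@$(date +%Y%m%d-%H%M)"
zfs snapshot "$SNAP"
zfs send "$SNAP" | zfs recv -u hddpool/archive/ingest

# Subsequent runs send only the delta with an incremental stream (zfs send -i old new)
```

In TrueNAS itself you would normally wire this up as a periodic snapshot task plus a local replication task rather than scripting it by hand.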

4 Likes

Just correcting the record…

ZIL = ZFS Intent Log

Yes indeed. I haven’t a clue where I got that idea from.

After twelve seconds of sustained writes, the amount of outstanding dirty data hits the 60% limit to start throttling, and your network speed drops. Maybe it’s 990MB/s at first. But you’ll see it slow down, down, down, and then equalize at a number roughly equal to the 800MB/s your disks are capable of.

That’s what happens when your disks are shiny, clean, and pristine. What happens a few years down the road, if you’ve got free space fragmentation and those drives are having to seek all over? They’re not going to deliver 100MB/s each - you’ll be lucky to get 25MB/s.
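
You can watch this happen in real time while a big copy is running; a quick sketch (“tank” being a hypothetical pool name):

```
# Per-vdev bandwidth every second - watch the write column settle at what the vdevs can sustain
zpool iostat -v tank 1

# The FRAG column gives a rough idea of free-space fragmentation on each vdev
zpool list -v tank
```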

After doing quite a bit more reading I now have a pretty solid understanding as to why the filesystem does not, and should not, offer a feature to buffer very large writes on the order of multiple TB. I appreciate the thoughtful responses. @honeybadger - I had actually read your “OpenZFS Write Throttle” post before. Good stuff, much appreciated.

I still think that using NVMe as a secondary write cache is a feasible feature that could be reasonably implemented on a dedicated higher-speed vdev (preferably with redundancy) - it would be a nice feature to have built into a NAS operating system. I had already implemented an NVMe dataset that I use for caching fast writes before I made this post, but it still seems to slow down after a certain period of time. There isn’t any point in upgrading memory for very large files, at least that I can tell, unless you are able to increase it above the size of the files you are transferring. That being said, my caching solution is a bit half-assed and leaves a lot to be desired. I’ll also throw my hat in the ring for a dedicated tool for analysing where your specific bottlenecks are based on your use case, whether it’s network, disk, CPU or cache.

Thanks again.

I think you are looking for some sort of tiered storage implementation in the ZFS filesystem itself, not just the special metadata vdev? That can be something asked for in a feature request, but it’s not something possible at that layer today.

If I understand what you are doing in your testing right now, I personally have a suggested approach you can experiment with. I’ve not done this exact thing in practice personally, but it should work. Having a dedicated ingest pool of NVMe drives synced over to another pool of spinners with Syncthing can probably solve your write problems.

You’d have to have some good hardware, and you’d have to make the Syncthing app or VM pretty beefy though - more cores and RAM. There’s also potential nuance with network, permissions, concurrency and locking, especially in production. Some Windows and other legacy applications do weird things, as an example.

Then you’d have to implement some sort of routine or process flow to archive stuff off. A simple shell script cron’d at midnight to mv /mnt/nvmepool/fast/20250530/* /mnt/hddpool/fast/20250530/ or something similar may be enough.

All entirely theoretical, but this way the “Hot Data” dataset can be in both places at the same time, similar to a tiered storage solution. The “Cold Data” dataset only needs to be on another pool of any kind. New writes come in, they are written to NVMe, and the client is happy and fast. Syncthing will start sending the data over to a second Syncthing instance on the other NAS, or maybe even on the same NAS.

But what’s cool is that whether this is two NASes or one, the client only ever needs to know one server endpoint.

If creating an external SMB share, enter the hostname or IP address of the system hosting the SMB share and the name of the share on that system. Enter it as EXTERNAL:ip address\sharename in Path, then change Name to EXTERNAL with no special characters.

So you’d have \\SERVER\hot_data and \\SERVER\cold_data, and that’s about it. You have an xx-hour read/write fast tier and an xxx-day slow tier. A client would never mount \\SLOWSERVER\hot_data unless you explicitly told them to, and worst case, they have slower performance because they didn’t listen?

In any case, not an official iX recommendation or anything… Just thinking out loud for you here.

Metadata is not the problem, it’s the content. Those are good ideas - what I ended up using was rclone copying to a fast NVMe vdev over a network share, and then a cron job that runs on the NAS to move files to the correct dataset. I added checks to avoid copying files that are still in transfer.
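
For anyone trying the same thing, a rough sketch of that kind of mover script (paths are hypothetical; the mtime check is a crude stand-in for “is anything still writing to this file?”):

```
#!/usr/bin/env bash
# Drain the NVMe landing dataset into the HDD dataset, skipping anything
# modified in the last 10 minutes (likely still being transferred).
SRC=/mnt/fastpool/ingest
DST=/mnt/hddpool/archive

find "$SRC" -type f -mmin +10 -print0 | while IFS= read -r -d '' f; do
    rel="${f#$SRC/}"
    mkdir -p "$DST/$(dirname "$rel")"
    mv "$f" "$DST/$rel"
done
```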

I gather that metadata/IOPS and smaller frequently accessed files are the problem most are dealing with, which makes sense.

1 Like

Right, but I think my point is only that there was a similar need that was fulfilled when the metadata special VDEV was created. A change of that magnitude would require a lot of work in OpenZFS. But that’s what the Feature Requests section of the forum is for.

But I’m here to bounce ideas off of if your solution isn’t making you happy.

I got you, and I appreciate it. I was looking to see if there was a built-in feature of some kind that did this sort of thing.

The problem with a write cache is that it would either

  1. Need to be outside the TXG mechanism which ensures data integrity; or

  2. Be inside the TXG mechanism as some sort of tiered storage, in which case moving data from that tier to another tier would break snapshots.

I am NOT a ZFS architectural expert, but my guess is that this would simply be VERY difficult to achieve within the ZFS core transactional architecture.