Is there already a script out there for using special vdevs as write caches?

I pretty much have my build figured out by now. My bottleneck is most likely going to be my network. I might end up bonding two or three 100 GbE ports eventually to relieve that bottleneck and allow one machine to copy 1 TB to my ZFS system within 3 minutes. I’m guessing that if I do that, I’ll need a few special vdevs as a write cache before propagating the data down to the much larger HDD pool.
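For rough numbers, here is a quick back-of-the-envelope check of that target (my own arithmetic, assuming decimal units and ignoring protocol overhead):

```python
# Sanity check: what does "1 TB in 3 minutes" actually require on the wire?
TB = 1e12                      # 1 terabyte (decimal), in bytes
target_seconds = 3 * 60

required_Bps = TB / target_seconds       # bytes per second needed end to end
required_Gbps = required_Bps * 8 / 1e9   # equivalent line rate in Gbit/s

print(f"1 TB in 3 min needs ~{required_Bps / 1e9:.1f} GB/s (~{required_Gbps:.0f} Gbit/s)")
print(f"one 100 GbE port: ~{100 / 8:.1f} GB/s theoretical, before overhead")
print(f"one  40 GbE port: ~{40 / 8:.1f} GB/s theoretical, before overhead")
```

On paper a single 100 GbE link already has headroom for that target, so the pool itself is the more likely limit.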

Asking Grok about this, it recommended I build the special vdevs and a script according to these specifications.

Write Cache Strategy with Special Vdevs:

  • Write Cache Using Special Vdevs: You’re on the right track thinking about using SSDs or NVMe drives as a sort of write cache. ZFS does not natively support using special vdevs as a write cache in the way you describe, where data is automatically moved to HDDs after being written.

    • ZFS Cache Devices: ZFS does support cache devices (L2ARC), but these are for read caching, not write caching.

    • Manual Write Cache: To achieve what you’re describing, you would indeed need to write a script or use some form of automation to move data from SSDs to HDDs. Here’s a conceptual approach:

      • Write data initially to SSDs configured as special vdevs.

      • Use a script (a cron job or a custom daemon, for example) to periodically move data from the SSDs to the HDDs based on certain conditions (e.g., when write activity is below a threshold or during off-peak times); see the sketch after this list.

      • You would need to ensure data integrity and consistency, possibly taking ZFS snapshots before and after moving the data to guard against data loss.
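For illustration only, here is a minimal sketch of the kind of mover script that answer describes, but written against a separate SSD landing pool rather than special vdevs (which, as pointed out further down, are not a write cache). All pool and dataset names (fast, tank, fast/incoming, tank/archive) and the threshold are hypothetical; it leans on zfs send/receive so that a failed copy never removes anything from the fast pool:

```python
#!/usr/bin/env python3
"""Sketch of a cron-driven "drain the fast pool" job. All names are hypothetical."""
import subprocess
from datetime import datetime, timezone

FAST_POOL = "fast"               # hypothetical SSD landing pool
FAST_DATASET = "fast/incoming"   # dataset that absorbs incoming writes
SLOW_DATASET = "tank/archive"    # destination on the big HDD pool
USED_THRESHOLD = 0.70            # start draining at ~70% pool usage


def zfs(*args: str) -> str:
    """Run a zfs command, return stdout, raise on any non-zero exit."""
    result = subprocess.run(["zfs", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


def fast_pool_usage() -> float:
    """Fraction of the fast pool in use, from its used/available properties."""
    used = int(zfs("get", "-Hp", "-o", "value", "used", FAST_POOL))
    avail = int(zfs("get", "-Hp", "-o", "value", "available", FAST_POOL))
    return used / (used + avail)


def drain() -> None:
    """Snapshot the landing dataset and replicate the snapshot to the HDD pool."""
    stamp = datetime.now(timezone.utc).strftime("drain-%Y%m%d-%H%M%S")
    snapshot = f"{FAST_DATASET}@{stamp}"
    zfs("snapshot", snapshot)

    # Full send into a fresh child dataset; a real version would switch to
    # incremental sends (zfs send -i) after the first run.
    send = subprocess.Popen(["zfs", "send", snapshot], stdout=subprocess.PIPE)
    subprocess.run(["zfs", "receive", f"{SLOW_DATASET}/{stamp}"],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed; nothing was removed from the fast pool")

    # Only after a clean receive is it safe to free space on the fast pool.
    # What exactly to delete depends entirely on the workload, so it is left out.


if __name__ == "__main__":
    if fast_pool_usage() > USED_THRESHOLD:
        drain()
```

Run from cron every few minutes, the threshold check and the “what is safe to delete afterwards” step are where most of the real complexity would live.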

Is there already a project out there for this purpose? It sounds like writing my own script would not be too complicated. I’d just have multiple special vdevs configured to absorb all writes, right?* Then, when the specials get past a threshold and/or no writes are coming in, they move their data down to the RAIDZ or dRAID beneath them.

So, before that script moves the data I take a ZFS snapshot, and after it’s done I take another snapshot and compare the two with zfs diff to make sure everything is the same. If there are any differences, I just re-copy the blocks the diff flags to the same location?

*I’ll most likely add the normal stuff for improving write performance first.

These are not temporary write locations; any data written to them stays on them until the special vdev is full. Then writes spill over to the regular data vdevs.

Now, if you want a separate pool of SSDs or NVMe devices that you write to first and later copy to the main storage pool, yes, that can work.

ZFS is not the be-all and end-all for NAS, nor is it designed for ultimate speed. Data integrity was one priority, as was performing maintenance tasks online (like scrubs, which are similar to older file system checks, FSCKs). These may slow things down for some people.

I saw a different NAS solution (not using ZFS) that had some pretty extreme speeds, even for writes. Don’t remember the name.

1 Like

AI hallucinating nonsense, as usual.

A special vdev is part of your pool and anything but a write cache.

You could create a separate pool built from SSDs and then copy the data over to a (supposedly) larger HDD pool during idle times.

There is, to my knowledge, no project trying to create an HSM (hierarchical storage management) system based on ZFS or TrueNAS.

If that is a hard requirement (1 TB in 3 minutes), and assuming that speed can be met by an SSD pool and the network (I have not done the math), I’d suggest going all in and calculating how much an all-SSD system would cost.

Possibly iX have something attractive that would fit your requirements.

4 Likes

It’s not really a hard requirement; it’s just the fastest speed at which my current drives could theoretically read 1 TB. The drives are a few years old. I think it might be possible now; a few years back the Chia people were talking about SSDs becoming super cheap by around now, and it wouldn’t surprise me if today’s high-end drives could pull it off.

Maybe a PM1743 could pull it off, definitely in a striped set. But I’ll probably mirror four of them. Is it possible to build a monstrosity that has the benefits of both striped and mirrored disks using 8 drives? I’ll probably wait two to five years before trying this. My thinking is that if the disk is large enough it shouldn’t create a bottleneck for typical workloads.

I’m pretty sure that if two 100 GbE ports are bonded, they should theoretically support those speeds over the network if reads and writes are coming across them.

I’ll probably end up going with that. I’ll have to find out how well I can configure the ARC to not prefer the SSD write-cache pool, depending on the dataset. I probably only need 10 TB max for it, and probably far less. But before building that pool, I’m guessing it would be recommended to first add SLOGs and special vdevs to the main pool? That main pool is still going to be the bottleneck.

Probably no and no. Are these SMB shares? Then you definitely do not need an SLOG.

The special vdev will speed up metadata lookup but not bulk data write operations. Probably just add more memory 🙂

Sometimes people assume bonded network ports increase single-client performance. Under almost all conditions, that is not true.

Bonded network ports (aka LACP, MS Windows NIC teaming, etc.) allow MULTIPLE clients to use different ports, which improves the overall network bandwidth a server (or NAS) can use for network traffic. Usually it takes 5 or more clients before it is reasonable to use LACP.

The one exception to using LACP with few clients, or a single client, is high availability: if one link goes down completely, in theory the other will take over the entire network load. Thus, no outage.

Now other technologies exist that support multiple network paths, for a single client:

  • Samba Multi-Channel (I don’t know much about this…)
  • iSCSI (with multipath I/O)

But in both cases it is not using “bonded” network ports.

They’re most likely going to be NFS. I’m building the HDD array and then benchmarking with various programs to find out if anything would help improve performance, mostly using Borg, Chia, repo caches, and distcc to stress-test the system. Then I’ll go through and add a SLOG, special vdev, metadata, etc., until I have an HSM that can store a large number of files and read and write fast.

I’m pretty sure Borg does mostly sync writes since it’s a backup program. If I end up noticing I need a SLOG, would I want four 350 GB or 750 GB P4800Xs overprovisioned to about 16 GB?
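For rough sizing, a SLOG only has to hold the sync writes that have not yet been flushed to the main pool. Assuming the default transaction group timeout of about 5 seconds and budgeting for roughly two in-flight transaction groups (both are my assumptions, not measurements), the numbers look like this:

```python
# Rule-of-thumb SLOG sizing: incoming sync-write rate x a few seconds of
# not-yet-flushed data. TXG_SECONDS and TXG_BUDGET are assumptions.
TXG_SECONDS = 5   # default OpenZFS txg timeout, in seconds
TXG_BUDGET = 2    # allow for ~2 in-flight transaction groups

def slog_bytes_needed(sync_write_Bps: float) -> float:
    return sync_write_Bps * TXG_SECONDS * TXG_BUDGET

for label, gbit in [("10 GbE", 10), ("40 GbE", 40), ("2x40 GbE", 80)]:
    rate = gbit * 1e9 / 8   # line rate in bytes per second
    print(f"{label:>8}: ~{slog_bytes_needed(rate) / 1e9:.0f} GB of SLOG actually in use")
```

By that rule of thumb, 16 GB of overprovisioned space roughly matches a 10 GbE sync-write stream; if the writes really arrive at 40 GbE line rate, the same arithmetic lands closer to 50 GB.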

I thought it could, depending on how the network is set up and the data transmission protocol used. But Samba Multi-Channel or iSCSI would probably make more sense.

I have two 40 GbE ports on my Mikrotik that I’m not using, so I might as well try to saturate those first before getting a 100 GbE connection. It’s probably a write speed I won’t always have, but if I’m cloning a machine onto new hardware I’ll take advantage of it by moving cables around. It would be prohibitively expensive to buy all the network equipment for every one of my machines to have a 100 GbE connection.

If I wanted to implement that, wouldn’t writing from the SSDs to the HDDs be all sequential writes? It might not be the most optimal, but wouldn’t it be something like: organize all the data on the SSDs so it’s sequential, take a ZFS snapshot, write sequentially to the HDD pool, take another snapshot, and run a zfs diff. If the snapshots do not match, move the data from the last known good snapshot to the addresses on the HDDs that do not match.
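One caveat with the zfs diff step: as far as I know, zfs diff compares two snapshots of the same dataset (or a snapshot against its live filesystem), not data sitting in two different pools, so it can report what changed on the source between snapshots but cannot verify that the copy on the HDD pool matches. A minimal example of what it actually outputs, with hypothetical names:

```python
# zfs diff works within one dataset's snapshot history: it lists paths that
# changed between two snapshots of the same dataset. Names are hypothetical.
import subprocess

result = subprocess.run(
    ["zfs", "diff", "fast/incoming@before", "fast/incoming@after"],
    check=True, capture_output=True, text=True,
)
# Each line is "<change>\t<path>", where <change> is one of
# M (modified), + (created), - (removed), R (renamed).
for line in result.stdout.splitlines():
    print(line)
```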

I think I already have most of the business logic from writing a char* manipulation class years ago for a transpiler. Isn’t it just move(data_to_copy, start_block, end_block - start_block +/- 1) or so? I probably just need to add the ZFS-specific parts related to the data structures used. But I think I already have most of the business logic for correcting errors. The code I have is able to move forward and back through C strings to replace data from an XML file with other char*s, producing C++ classes that compile in Unreal. Wouldn’t that basically be what I need to correct errors when moving between pools?

Is this a good resource for the specifics about the data structure I’ll need to fix copy errors?

@zara Sorry, this is way beyond what I am capable of. If I were to implement anything in this direction, I would write scripts in shell or Python that use ZFS send and receive as the backend to actually move the data.

That guarantees the copy operation did not introduce any errors: if a ZFS send/receive pipeline succeeds, the data is guaranteed to be identical; otherwise the operation fails with an error code.
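Concretely, the core of such a script is small. Here is a sketch of one incremental replication step, with hypothetical dataset and snapshot names; any non-zero exit code aborts before anything is pruned from the fast pool:

```python
# Incremental zfs send/receive as the backend for moving data between pools.
# Dataset and snapshot names are hypothetical.
import subprocess
import sys

SRC = "fast/incoming"
DST = "tank/archive"
PREV_SNAP = f"{SRC}@drain-previous"   # last snapshot already present on the HDD pool
NEW_SNAP = f"{SRC}@drain-current"     # snapshot to replicate now

send = subprocess.Popen(["zfs", "send", "-i", PREV_SNAP, NEW_SNAP],
                        stdout=subprocess.PIPE)
recv = subprocess.run(["zfs", "receive", DST], stdin=send.stdout)
send.stdout.close()

# Either side failing means the destination was not updated, so nothing on the
# source may be deleted in that case.
if send.wait() != 0 or recv.returncode != 0:
    sys.exit("replication failed, keeping everything on the fast pool")
print("replication succeeded, safe to prune the fast pool")
```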

But I would never consider managing business-critical data with a solution I wrote myself; I would go and buy an all-SSD system instead.

Alright, so it’s already doing what I was thinking about doing in C. Most of this data I’m throwing away anyway. Hopefully I can figure out if it works.

I would use SSDs for the entire system, but that’s way too expensive. I think it’s easily $50k for a much smaller RAIDZ3, while a cache system could be built for $15k max. I just have to figure out the intricacies of how Borg would interact with it.

Yeah, Borg calls fsync a lot. I think I’ll need a SLOG. I might add ZFS support to Borg too; I think that should allow Borg to serve as an HSM backup solution on ZFS. The functionality needed might already be there. It just needs send and receive support in a Python script.

1 Like