[SOLVED] Planning a SCALE deployment, looking for sanity check on Vdev Structure

Hello! I'm hoping for a sanity check on a new SCALE deployment. I've done some research and believe I have a good plan for my use case, but I'd like second opinions in case there's something I haven't considered.

My application:

I'm a 3D artist and want a fast NAS to store my project files and rendered frames. I write large files infrequently (every 10 minutes to an hour) that vary between 300 MB and 10 GB depending on the stage of the project. When rendering, I output small 8-40 MB files at a rate of 1-3 a minute.

I may have up to 20,000 of them for a single animation, and they need to be accessed as quickly as possible when being previewed, composited and rendered into a video master; typically after this they're deleted. It's not unusual for this to be 1 TB of space or more for a single anim.

There are three machines on the network: my workstation and two rendering/simulation machines, all networked together via 10 GbE. I'm probably going to want to duplex both the NAS and my main workstation so pulling the rendered frames down is even faster.

There's the possibility of adding more machines in the future, and I'm also planning to open the NAS up to some VPN connections to work more effectively with my team.

I'm currently using Unraid and I'm moving to TrueNAS as my storage is becoming very fragmented and overly complicated in its current configuration. I have an NVMe pool for the stuff I need fast, a spinning-rust pool for all my media and such (running a home Plex server), and a second NVMe pool shared via a Windows machine as a render-to location.

FUSE is also causing writes from programs to slow to an actual crawl, forcing me to use disk shares for an NVMe pool for the stuff I need fast. With all said and done, it's becoming clear that Unraid was not the correct choice for me.

I would describe myself as interested, but not -that- interested, in computer hardware. I hate command-line stuff and want a UI. I like the apps that Unraid has and the option of spinning up a game server every now and then, so SCALE seemed the preferable option over Core.

I'm storing a total of about 8 TB of data, around 2 TB of which is actually important; the rest is media. During rendering I use up to 1-2 TB to store the rendered frames, which is recovered fairly quickly.

The Hardware

This is a mishmash of what I’ve had as I’ve scaled up my NAS needs over the years and new parts I’ve bought for this deployment.

9 HDDs:
4x 18 TB
5x 4 TB

2x 2 TB SATA SSDs

4x Gen 4 NVMe SSDs:
2x 2 TB
2x 4 TB

EPYC 7282 on a Supermicro H12SSL-i with 256 GB ECC RAM

My Plan / Assumptions

I plan to put the spinning rust into mirrored pairs, which should give me the best balance of redundancy, speed and expandability. It's also part of my ingest plan, as I'm using two of the 18 TB drives as temporary storage whilst I cannibalise the Unraid server, which currently uses 2x 18 TB drives and the 5x 4 TB.

This leaves me with a spare 4 TB drive which, honestly, I originally had as a spare, so I'd expect it to find that role again (I think there's a hot-spare vdev type I could maybe throw it into?).

I plan to put the 2x 4 TB NVMe drives into a mirror and use them as a metadata/special vdev, with the small-file cutoff set so files under 64-128 MB are stored there, so that my rendered frames will always prefer that location.

This is where my plan gets fuzzy: of the two remaining vdev options, I don't see a huge need for a SLOG, given that I don't have a tremendously heavy write demand and I do have a substantial amount of RAM.

The NAS is on a UPS, and the other machines writing to it (which are not on the UPS) usually keep local backups of files before they're written to the NAS. So I believe I can get away with async writes, since write latency is the main bottleneck I'm running up against with my current deployment.

However, given the hardware I have available, it could be logical to split the remaining 2x 2 TB NVMe drives, using one as a SLOG and one as an L2ARC. My understanding is that these shouldn't require redundancy, I already have the drives, and I'm not concerned about the RAM overhead that comes with an L2ARC, so I may as well, right?

The alternative is to forgo a SLOG entirely and simply stripe the 2x 2 TB drives together as a larger L2ARC. This seems like overkill though, again given my limited write demand and the RAM available.

I don't personally see room for the 2x 2 TB SATA SSDs in this deployment anymore, as they're not really fast enough to serve in any of the special vdevs, and their capacity is a bit too far off from the rest to sit in the main pool.

They're only going to gather dust otherwise, though, so maybe I should just throw them in the main pool?

So yeah, that's it. Just looking for a sanity check on my plan, and some input on how people might approach the deployment given the hardware I have available.

Thanks for reading!

For better safety at the same efficiency, I’d put the HDDs as two 4-wide [raidz2] vdevs: 4*18 TB + 4*4 TB. Slightly less flexible than mirrors, but your upgrade path is to replace these 4 TB anyway.

SLOG is not a write cache. As you don't do sync writes, and don't need them (in case of a crash, you render anew, right?), no SLOG.
L2ARC, with 256 GB of RAM, is unlikely to be of use.

That leaves a nice striped mirror NVMe pool (2*4 TB + 2*2 TB) for everything fast. If you can do all your rendering in these 6 TB, you don’t need any special vdev for the slow HDD storage.

The SATA SSDs can serve an app/VM pool. (Alternatively, get a third SSD and make a 3-way mirror as special vdev for the raidz2 pool.)
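If it helps to visualise it, here's a rough CLI sketch of that layout. Device and pool names are placeholders, and on SCALE you'd build this through the UI rather than the shell; this is just to show the vdev structure:

```
# "Slow" pool: two 4-wide raidz2 vdevs (4x 18 TB + 4x 4 TB)
zpool create tank \
    raidz2 sda sdb sdc sdd \
    raidz2 sde sdf sdg sdh

# "Fast" pool: striped NVMe mirrors (2x 4 TB + 2x 2 TB)
zpool create fast \
    mirror nvme0n1 nvme1n1 \
    mirror nvme2n1 nvme3n1

# App/VM pool on the two SATA SSDs
zpool create apps mirror sdi sdj

# Alternative from the parenthesis above: a third SSD and a 3-way mirror
# special vdev on the raidz2 pool instead of the app pool
# zpool add tank special mirror sdi sdj sdk
```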

[Edited for clarity. And thanks to @Davvo for fixing a mistake.]

1 Like

I see 2 issues here.

One is that the data I need to save is currently stored on the 18 TB disks. I bought the two extra so I could use them as temporary storage for when I load the NAS back up, as, as far as I know, there isn't going to be a clean migration route for my current data.

So that makes the 4-disk vdevs kind of impossible for me to set up without buying a bunch of additional hard drives. Though as I write this, it occurs to me I can probably pull the 4x 4 TB drives and use those. (You are suggesting a 1-disk-parity vdev, yeah?)

Second, I don't love the idea of a separate fast pool; I kind of love the idea of the metadata and small files being on the NVMe without me having to consider where to store them.

It also means I end up with another fragmented storage solution, and I kind of admire the simplicity of a single volume.

Actually, upon further consideration, I think you're onto something with the fast pool. Though I still don't love the idea, the reality is that if my projects got much larger than the space I have, I'd be more inclined to simply buy more drives to expand the fast storage.

Would you clarify what you mean about the 3-way mirror for the raidz2 pool? That kind of went over my head, honestly.

I may be wrong, but I believe the small-file cutoff is actually based on block size, not file size.

I.e., if your block size is 1 MiB and the cutoff is set to 1 MiB, then all files will be stored on the special vdev, as the maximum block size used by any file would be 1 MiB.
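If I'm reading the `special_small_blocks` property right, a sketch of how it's set (dataset names are just examples), keeping in mind that as far as I know the cutoff maxes out far below 64-128 MB, so it can't be used as a plain per-file-size filter at that scale:

```
# Hypothetical dataset for the rendered frames
zfs set recordsize=1M tank/frames
zfs set special_small_blocks=128K tank/frames   # blocks of 128K or smaller land on the special vdev

# Setting the cutoff equal to the recordsize would send effectively every
# block of this dataset to the special vdev (the "store everything" case above):
# zfs set special_small_blocks=1M tank/frames
```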

Seems counter intuitive, I could be wrong too, but I do remember this being cited as a use case.

It was definitely along the lines of ‘you can set it to store everything under a certain file size’

No, I'm suggesting a raidz2 vdev with double parity: raidz1 with 18 TB drives puts too much data at risk in case of a drive failure. So the suggestion is double parity, so that you still have parity after losing one drive, nothing more.
And then a 3-way mirror as special vdev (if you go this way) to have a similar level of resiliency on the special vdev.

Having one single pool is nice, but it’s not going to play nice and fast whenever spinning drives get involved. So my suggestion, assuming that your working set while rendering fits entirely in the NVMe drives (“1 TB or more” of base data and 1-2 TB of temporary files while rendering) is to do it all on the “fast” pool, and then store the final result in the “slow” pool.
Tiered storage, managed by user.
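A sketch of what "managed by user" can look like in practice, with hypothetical dataset names, when a project wraps and you want it off the fast pool:

```
# Snapshot the finished project on the fast pool and copy it to the slow pool
# (assuming tank/archive already exists)
zfs snapshot fast/projects/spring_ad@done
zfs send fast/projects/spring_ad@done | zfs recv tank/archive/spring_ad

# Once verified, reclaim the NVMe space
zfs destroy -r fast/projects/spring_ad
```

A plain file move over SMB works just as well; send/recv simply keeps dataset properties and snapshots intact and is easy to script.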

If you have “about 8 TB” of data, you can (a rough CLI sketch follows the list):

  1. Make a mirror pool with the two new 18 TB drives, and transfer everything onto it. (Alternatively, all the SSDs together could almost do it.)
  2. Delete the Unraid array.
  3. Make a 5-wide raidz1 pool with the five 4 TB drives and replicate from the 2*18 pool.
  4. Destroy the mirror pool to create a 4-wide raidz2 with all four 18 TB.
  5. Replicate one last time from the 5-wide raidz1 to the 4-wide raidz2.
  6. Optionally, add a second 4-wide raidz2 vdev with the 4 TB drives.
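Roughly, in CLI terms (device and dataset names are placeholders, and replication would normally be set up through the SCALE UI, which uses zfs send/recv underneath):

```
# Step 1: temporary mirror pool on the two new 18 TB drives, copy the ~8 TB onto it
zpool create temp mirror sda sdb
# Step 2 (retiring the Unraid array) happens outside ZFS

# Step 3: 5-wide raidz1 pool on the five 4 TB drives, then replicate to it
zpool create small raidz1 sdc sdd sde sdf sdg
zfs snapshot -r temp@migrate1
zfs send -R temp@migrate1 | zfs recv -F small/staging

# Step 4: destroy the temporary mirror and build the final 4-wide raidz2
zpool destroy temp
zpool create tank raidz2 sda sdb sdh sdi

# Step 5: replicate one last time into the final pool
zfs snapshot -r small/staging@migrate2
zfs send -R small/staging@migrate2 | zfs recv -F tank/data

# Step 6 (optional): free the 4 TB drives and add them as a second raidz2 vdev
zpool destroy small
zpool add tank raidz2 sdc sdd sde sdf
```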

I’m unsure whether ZFS would accept the last 4 TB drive as a spare which could only be used for one vdev but not the other. Probably not. So keep it as cold spare.

1 Like

You know what, you're absolutely right; this is the better strategy. Thank you very much for providing the step-by-step instructions for the transition too, that is very helpful!

I have one final question for you:

What makes the raidz2 solution superior to simply a pair of mirrors? If I'm recalling correctly, it's because parity is stored, in effect, in part across the entire array, and thus this allows more disks to fail than a simple mirror?

Thanks again.

Happy to be useful. :slightly_smiling_face:
(At steps 3. and 5., make sure to create another pool and not add another vdev to the existing pool: this would be a mistake with no easy escape!)

A striped mirror can lose one drive in each vdev; two failing drives in the same 2-way mirror vdev kill the pool. Full stop.
Raidz2 can lose any two drives; striped raidz2 can lose more than two, as long as there are no more than two failing drives in the same vdev.
Additionally, there's the issue of UREs (Unrecoverable Read Errors): occasionally, a drive cannot read a given sector. RAID as well as ZFS would then restore the data from parity… except if the array is degraded and has no redundancy left. With traditional RAID, this could kill the array; with ZFS, you cannot retrieve the affected file, which is less severe but still annoying. The URE rate is usually given as “less than 1 in N bits”, with typical values for N being 1E14 or 1E15 for HDDs (SSDs: 1E17). The former value amounts to about 12 terabytes… So, if you believe, at least a little, in manufacturers' spec sheets, you should be worried about resilvering that much data without any redundancy left. (In practice, assume that you're losing one level of redundancy to the risk of UREs; then you want double redundancy to be fully protected against the loss of one drive.)
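For reference, the arithmetic behind that 12 TB figure:

$$\frac{10^{14}\ \text{bits}}{8\ \text{bits/byte}} = 1.25\times 10^{13}\ \text{bytes} \approx 12.5\ \text{TB}$$

So resilvering a vdev of 18 TB drives with no redundancy left means, taking the 1E14 spec at face value, statistically expecting at least one URE along the way.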

At 50% space efficiency, striped 2-way mirrors have more IOPS than 4-wide raidz2, but raidz2 is more resilient to multiple drive failures as well as to single drive failure + URE.
Mirrors are more flexible (can add or remove, evolve to 3-way… and can split), and the pool can grow by two drives at a time.
Raidz2 is quite inflexible, and the pool has to grow by four disks at a time. But it is safer.
Your data, your call.

2 Likes

Roger that, All points made clear.

Thanks again for your assistance.

Of course, since RaidZ2 is less flexible, you could consider acquiring a few more disks and starting out with a 5-wide or 6-wide vdev, where you see a much larger benefit from RaidZ2. E.g., with 6-wide you have 4 data disks and 2 parity, so circa 66% storage efficiency instead of 50% with 4-wide.

I.e., this would double your rusty pool size from 36 TB to 72 TB.
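Spelled out, with usable capacity being (width minus parity) times drive size:

$$(4-2)\times 18\ \text{TB} = 36\ \text{TB}\ (50\%) \qquad (6-2)\times 18\ \text{TB} = 72\ \text{TB}\ (\approx 67\%)$$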

2 Likes

Mind you, RaidZ Expansion is coming :slight_smile:

72TB on 6 disks does sound very tasty…

But it will be a long time before I have need of such an amount of space. If expansion is coming that seems like something I can deal with in the future. Thank you for bringing it to my attention though, that is definitely worth consideration.

1 Like

You could create a RAIDZ2 VDEV in a degraded state without two drives, migrate the files, then add the drives… but it’s both risky and a PIA.
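For the curious, the usual trick (illustration only, not a recommendation) is to stand in sparse files for the missing drives and immediately offline them; the width, sizes and device names below are placeholders:

```
# Sparse placeholder "drives" the size of the missing disks
truncate -s 18T /root/fake1 /root/fake2

# Build the raidz2 vdev with the available real drives plus the placeholders
zpool create tank raidz2 sda sdb /root/fake1 /root/fake2

# Offline the placeholders so nothing is written to them;
# the pool now runs degraded with no redundancy left
zpool offline tank /root/fake1
zpool offline tank /root/fake2

# ...migrate the data, then swap the real drives in for the placeholders
zpool replace tank /root/fake1 sdc
zpool replace tank /root/fake2 sdd
```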

Suggested readings: iX's ZFS Pool Layout White Paper and Assessing the Potential for Data Loss.

Re: RAIDZ2 expansion, be mindful it will have its own cost as well. It's no magic wand. Do note the caveats.

1 Like

Yeah, honestly, after saying this I did give it a little more thought and decided to pick up a couple more drives to complete the 6-disk vdev straight away.

The better efficiency, and the better numbers for when I end up pulling data from the slow storage, attracted me.

I even snagged another 4 TB drive to complete another vdev from those, though I'm not sure how wise that is, running two vdevs of such vastly different capacities, especially given that 4 TB drives are very cost-inefficient to replace new, and when the pool already has 72 TB I'm not sure an additional 16 TB will make the difference.

I also bought an additional 2 TB SATA SSD to complete the 3-way mirror special vdev; I figure with such a large-capacity pool I'm going to want it.

I am going to cannibalise some higher-capacity NVMe drives for the fast storage to increase its space, so I can use that for my original ingest and avoid transferring data back and forth over a few days.

Is it preferred in this community to mark a thread as solved when it’s solved? I noticed someone marked a post of mine as a solution already.

2 Likes

Just an annoyance.

Just do note that once added, you cannot remove it without destroying the pool.

Generally, I would say yes. We moved forum recently, and some things are not as established as they should be.

ZFS is smart enough to distribute the data between the two vdevs in such a way as to take advantage of all the disks, correct? So given it cost me £40 to get another 4 TB drive, it kind of seems worth it to add those, if only for the performance enhancement and to avoid contributing e-waste.

Understood. As mentioned, those drives would only gather dust in a drawer if not deployed, so that makes sense.

Kinda. ZFS spreads new writes across vdevs roughly in proportion to their free space, so a much smaller vdev contributes relatively little.

If you're going for capacity with the 6-wide raidz2, then you probably don't want to add a second vdev with the 4 TB drives, as you could not remove it, and the 18 TB vdev is already well above the capacity you need.

1 Like