Mixed SSD/HDD setup

Say I set up a system with five 26TB hard drives and three to six 8TB NVMe SSDs, how would one approach optimization?
In an ideal world, I’d create a single RAIDZ1 pool, and ZFS would create hot and cold zones, keeping hot files on SSD and cold files on HDD. Obviously, it doesn’t work like this.
A second-best solution would be if BSD had a facility for union-mounting two file systems, maintaining identical tree structures and moving files between them based on usage statistics. I’m not aware of such a solution, either.
I’m aware of the fusion pool approach, but that seems more oriented towards a mirrored pair of fast, smallish SSDs for metadata storage, not towards SSDs making up a third of total capacity…

There’s of course the option of having two separate pools/file systems and simply storing different kinds of things on each, but that carves things in stone, and may result in one part filling up while the other still has ample space.

Thoughts?

My first thought is that you mentioned BSD, which indicates TrueNAS CORE. Shoot for SCALE, which is Linux-based and the platform you would end up on anyway, since CORE development is basically done.

Second thought, five 26TB drives in a RAIDZ1 configuration is not smart if you place any valuable data on that pool. Replacing a drive takes a long time because of the resilvering process. If the data is not important, then RAIDZ1 is fine.

Third, use case is very important in decisions like these.

Fourth thought, if you need speedy access to data, a pool of the SSDs would work, however going back to the Third thought, the use case is needed.

Fifth thought (and final), what hardware and what are your specifications that you need to meet?

There is not one single answer to these kinds of questions, it is more involved than that.


Seems like you want auto-tiering/tiered storage. TrueNAS/ZFS doesn’t have this. I’ve heard that Ceph has something like this (for the record – I don’t have any experience with Ceph). Perhaps an OpenStack storage solution could also have this.

In a ZFS pool with a special vdev, you can create datasets that will be stored entirely on the special vdev. Thus, some kind of tier separation is possible within a single pool.

Thanks for the thoughts…
…the HW will be this:

with a Sabrent PCIe x4 add-on card that fits four NVMe SSDs & 96GB ECC RAM

I’d love to use RAIDZ2, but with only 5 drive bays the percentage of total storage given over to redundancy is a bit excessive.
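For concreteness, the overhead works out as follows — a rough sketch that ignores ZFS padding, metadata and reserved space, so these are upper bounds:

```python
# Rough usable-capacity comparison for a 5-bay pool of 26 TB drives.
DRIVES = 5
SIZE_TB = 26

def usable_tb(n_drives: int, parity: int, size_tb: float) -> float:
    """Raw usable space of an n-wide RAIDZ vdev with the given parity level."""
    return (n_drives - parity) * size_tb

raidz1 = usable_tb(DRIVES, 1, SIZE_TB)  # 104 TB usable, 20% spent on parity
raidz2 = usable_tb(DRIVES, 2, SIZE_TB)  # 78 TB usable, 40% spent on parity
print(raidz1, raidz2)
```

So the step from RAIDZ1 to RAIDZ2 costs 26TB of usable space at this width — the overhead percentage shrinks as the vdev gets wider.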

If I should ever use a TB/USB4 drive extension to add more drives, moving up to RAIDz2 would be pretty much the first thing I’d do.

I assume the implicit assumption is: if one drive of a batch fails, the chance that a second one will fail soon after is high. That presumes failures at the tail ends of a typical drive’s lifetime (burn-in or wear-out failures), or bad parts affecting all drives of a particular manufacturing batch, rather than a random failure of a single drive.

With an MTBF of 2.5 million hours, one would think that the other four drives would last long enough to rebuild a single drive…:thinking:
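The naive arithmetic behind that intuition looks like this — assuming independent drives with a constant (exponential) failure rate, which is exactly what a spec-sheet MTBF invites you to assume, and exactly what resilver stress, bad batches and wear-out violate (the 150-hour rebuild window is my own assumption):

```python
import math

MTBF_HOURS = 2.5e6    # manufacturer figure from the post
REBUILD_HOURS = 150   # assumed ~6-day resilver window (hypothetical)
SURVIVORS = 4         # remaining drives in a 5-wide RAIDZ1

# Probability that at least one of the survivors fails during the rebuild,
# under the naive independent/constant-rate model.
p_second_failure = 1 - math.exp(-SURVIVORS * REBUILD_HOURS / MTBF_HOURS)
print(f"{p_second_failure:.5f}")  # ~0.00024, i.e. about 0.02%
```

The tiny number is the point: on paper the risk looks negligible, which is why MTBF alone is a poor guide to rebuild risk.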

Or is there something else I’m missing here, like that such a rebuild takes a lot longer than I might think?

That is neat. 5 bays is not much, though. Tbh, it looks more like an all-in-one soho server. And IMO, truenas is not perfect for being the only server.

Afaik, usb connection for data drives is highly discouraged by TrueNAS/ZFS.

I’ve heard that manufacturers calculate their MTBF in a very… intricate manner. You’d do better to look at Backblaze’s AFR figures.

MTBF is a statistical value calculated over a huge number of devices. It should only be used as a relative measure of long-term reliability.

The resilver process puts a very large load on your drives, possibly the largest they will ever see. So just when you are most vulnerable (a single failed drive in a RAIDZ1), you are putting the heaviest load on the remaining drives.

There is a concept of MTTDL (mean time to data loss) that attempts to take into account MTBF (mean time before failure), MTTR (mean time to repair), and pool topology. While some have criticized the MTTDL analysis, I have found Richard Elling’s writeup instructive: ZFS data protection comparison. Once again, this is for relative comparisons, not absolute estimates of lifespans.
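A minimal sketch of the classic simplified MTTDL formulas, for relative comparison only — they ignore unrecoverable read errors entirely, and the 150-hour MTTR is an assumed resilver window, not a measured one:

```python
MTTF = 2.5e6   # hours per drive (spec-sheet MTBF, used as MTTF here)
MTTR = 150.0   # hours to resilver -- an assumed value
N = 5          # vdev width

# Classic simplified MTTDL formulas for single and double parity.
mttdl_raidz1 = MTTF**2 / (N * (N - 1) * MTTR)
mttdl_raidz2 = MTTF**3 / (N * (N - 1) * (N - 2) * MTTR**2)
print(mttdl_raidz2 / mttdl_raidz1)  # raidz2 wins by a factor MTTF/((N-2)*MTTR)
```

Note the ratio: adding one parity level multiplies the mean time to data loss by roughly MTTF/MTTR-sized factors, which is why the relative comparison is so lopsided even when the absolute numbers mean little.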

Any RAIDz vdev will resilver at the speed of ONE DRIVE, and the resilver operation is NOT sequential, but not strictly random either, so performance lands in the middle. If one of your 26TB drives fails, and we assume the pool is 75% full, you will need to write 26TB * 0.75 == 19.5TB. Let’s further assume an average block size of 32KB. That is 654,311,424 write operations. If your drive is very good you might get 250-300 IOPS. Let’s assume the best case, 300 IOPS. That means it will take roughly 2,181,038 seconds, or 25 days, to complete the resilver operation. There were lots of assumptions in my calculations, so let’s further assume I am off by a cumulative factor of 4; that is still between 6 and 7 days to resilver. Are you willing to risk your data for that long with NO REDUNDANCY while you push your drives the hardest?
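The arithmetic above as a quick sketch (TiB convention; the 32 KiB average block size and 300 IOPS are the same assumptions as in the post):

```python
# Reproducing the resilver estimate above (best case).
TIB = 2**40
data_bytes = 26 * TIB * 0.75   # pool 75% full: 19.5 TiB to rewrite
block = 32 * 1024              # assumed 32 KiB average block size
iops = 300                     # optimistic HDD random-write IOPS

writes = data_bytes / block    # 654,311,424 write operations
seconds = writes / iops        # ~2,181,038 s
print(seconds / 86400)         # ~25.2 days
```

Even divided by the generous fudge factor of 4, that leaves the pool without redundancy for the better part of a week.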

If you use a RAIDz2 you lose capacity and performance, but you gain redundancy. How important is your data?


That depends on how important your data is.

I personally would not make that assumption these days. It was true a decade ago; you “could” get a bad batch of drives.

You are “hoping” for the best case situation, not planning on the worst case situation. That would be like me playing the lottery, buying one ticket, then going out and buying a new car, expecting the ticket to win. Wouldn’t I be shocked if it wasn’t the winning ticket.

You can really only plan on one thing regarding longevity: a drive will fail when it is good and ready. It could be 2 months or 7+ years. I had four drives, all purchased at the same time from the same vendor. Three of the drives are still going 7 years later; one drive started failing around the 6-year point (8 months, 15 days ago).

A RESILVER is abusive on the drives, so you should not take it lightly.

A rather low end platform for your amount of storage and apparent performance expectations.

Thunderbolt is not supported.

Statistics, how they work and what MTBF really means.

Best to start anew from your use case—hoping you haven’t bought anything yet.

And IMO, truenas is not perfect for being the only server.

Not sure I understand what you’re trying to say here.

Afaik, usb connection for data drives is highly discouraged by TrueNAS/ZFS.

I understand this has to do with bad USB controller chips and the USB protocol. But USB4 MUST be Thunderbolt-compatible, and a Thunderbolt drive enclosure is in effect serialized PCIe. OWC, for example, has multi-bay TB3 drive enclosures. It’s that sort of thing I have in mind for potential future expansion, not regular USB enclosures.

OK, thanks for the perspective.
Still, given the old adage “ZFS is no replacement for backup”, then I’d say an array that’s supposed to run years on end, where the alternative is regular hard drives, the risk minimization is still massive.
First, chances of a drive failure are fairly low to begin with in the context of enterprise class (WD Gold/WD Red Pro/Seagate Ironwolf Pro) drives, because chances are they drives are upgraded to larger capacities long before they reach their statistical EOL.
So if there’s a drive failure at all, after initial burn-in, it should be a fluke, with the drives being able to handle a few days/or even a couple of weeks of rebuilding, especially since this is essentially a personal device, i.e. not a lot of concurrent access.

And then, there are still backups, if there’s an actual second drive failure in that window, which while considerably less convenient, still preserve the data.

If a second drive goes down during the rebuild, that’s an indication of the batch having issues, so at that point it would require replacing all drives anyway, meaning a restore from backup is less of a hassle than trying to replace one drive after another, praying that they hold up.

I understand that if this were an enterprise setup, where things absolutely must remain online at all times, there would be other considerations. Here it’s a personal setup, and ZFS is mostly there to avoid having to restore from backup when a drive fails—a convenience factor.

A rather low end platform for your amount of storage and apparent performance expectations.

Not exactly sure what’s low end. Compared to a TrueNAS mini, this is essentially a high-end system, in all respects. Same 5 drive bays, a much faster CPU, a lot more ECC RAM, and instead of two small 2.5" SATA drive bays, 3-7 NVMe slots.

Also, what performance expectations? I have not made any specific claims to desired performance. What I was curious about was if I can avoid the inconvenience of having two different file systems (one fast with NVMe and one slower with HDDs), by having a single file system with hot and cold zones. The answer is, as things are currently, no.

So that means two file systems, hosting different types of data (documents on the fast SSD, media archive on the slow drives). This is less convenient because it requires manual management and offers less flexibility in how to use the total available space, but it’ll do.

Thunderbolt is not supported.

Meaning? TB should be transparent. The only issue I can see are potentially wiggly cables, but that’s just as much of a problem e.g. with eSATA; and thankfully, there are solutions to lock even the TB cables in place.

Statistics, how they work and what MTBF really means.
Best start anew from your use case—hoping you haven’t bough anything yet.

You seem to make the assumption that ZFS replaces backups.

Should a second drive failure occur during rebuild, all it means is the inconvenience of having to restore from backup. But it also likely means replacing all drives anyway: if two drives fail in quick succession, that points to more than coincidence, so I might as well replace the rest.

Your CPU is way too powerful (aka too expensive) for the sole NAS purpose. Unless you are planning to use very heavy compression.

Thus, it fits more for the role of an all-in-one server. If that’s so, (IMO) you shouldn’t consider truenas as the OS/platform for your single all-in-one home server. Unless you like to suffer.

So, IMO you should reconsider your HW choice or your OS choice.

At about x1 each… And I suspect an Aquantia NIC. It’s too low-end for what it pretends to host, and the lack of detailed technical specifications on the manufacturer’s site is a big flashing red warning in my book. The TrueNAS Mini does not pretend to host an NVMe pool.

Those may be implied by someone who plans to set up an NVMe pool for capacity storage… Indeed, an explanation of your use case and requirements would help steer the discussion.

Meaning that Thunderbolt is not officially supported. iX does not build for Thunderbolt and does not test Thunderbolt functionality.

You ABSOLUTELY do NOT want a RAIDZ1 pool with 26TB drives; that’s not data-safe.
I would personally say add at least 1 drive and make it a RAIDZ2.

Matter of fact, that’s possible using the ZFS small-blocks setting (special_small_blocks): you can select, per dataset and/or per written block size, whether you want data stored on SSD.

Combined with enough RAM (for ARC) and an L2ARC drive (the ARC overhead of L2ARC has been lower for a few years now), you would also have frequently read (and/or recently read) data on SSD.

That’s a term iX made up out of thin air, so I would suggest not using it when talking about ZFS; otherwise non-TrueNAS ZFS users won’t even understand what you mean.

For metadata, yes, big SSDs are useless.
However, for small blocks you can make it as crazy as you want and store whole datasets on them. For example, I personally force storage of DBs onto the metadata/small-block SSDs.


In short, I would create 2 metadata/small-block mirrors (2×2 of the 8TB NVMe drives), and I would add 1 faster, smaller SSD (like a 480GB Optane, as those are dirt cheap) as L2ARC.

Then combine it with a 6x 26TB RAIDZ2 or 3 mirrors.
(Personally I’d go for 3 mirrors, due to faster rebuilds, more data safety, more IOPS from the drives, and the ability to remove vdevs, but that’s personal preference.)

To be honest, for small-block storage or IOPS-heavy datasets, an x1 PCIe link with SSDs that have high random small-block read/write performance (enterprise, obviously) isn’t really an issue.

One shouldn’t be using SSDs for sequential read/writes anyway imho.

Restoring 100TB of data will be hell.
I also have no clue how much money you have to throw away, but if you are willing to pay for 80TB (80% utilisation) of cloud backup, you could just as well start building a second system just for backups, lol.

Also: you NEVER want to replace all drives at once, or even use drives from the same batch; preferably mix different batches, vendors, and sub-brands of drives.

You’re also confusing “total drive failure” with data loss: you will very likely have some data loss on rebuild with this setup, without a second drive failing. Drives just have a small chance of a mistaken write or read, and with drives this big in a RAIDZ1 that would permanently destroy data.

Without the drive failing.
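To put a rough number on that risk: assuming the spec-sheet unrecoverable-read-error rate of 1 per 10^15 bits (a typical enterprise figure; consumer drives are often quoted at 1 per 10^14, and real-world rates may be better than spec) and a 75%-full pool, the chance of hitting at least one URE while reading the four surviving drives during a RAIDZ1 rebuild is roughly:

```python
import math

# Chance of at least one unrecoverable read error (URE) during rebuild,
# modelling UREs as independent per-bit events at the spec-sheet rate.
URE_PER_BIT = 1e-15                   # assumed enterprise spec
PER_DRIVE_BYTES = 19.5 * 2**40        # each drive 75% full (26 TiB * 0.75)
bits_read = 4 * PER_DRIVE_BYTES * 8   # all data on the 4 surviving drives

p_ure = 1 - math.exp(-URE_PER_BIT * bits_read)
print(f"{p_ure:.2f}")  # roughly a coin flip at these drive sizes
```

This is a crude per-bit model, but it shows why a single unlucky read, not a second dead drive, is the dominant failure mode for wide-and-huge RAIDZ1.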

5 drive bays is a weird setup anyway.


What I find most troublesome is the way you’re arguing with long-term experts here, as if you know a lot, yet you seem to make so many basic mistakes in storage setup and ZFS.

It sounds like you don’t really want advice, but just want people to applaud and tell you to go ahead.


I have found most of the OPs responses to be reasoned and thoughtful. They explained why they disagreed for their use case. Everyone’s use case is different, everyone’s tolerance for risk is different, if the OP understands the risk and that risk is acceptable for them, that is sound decision making.

What the OP is proposing is not what I do, but they seem to understand the risks and possible fallout from the design they are proposing. If that is acceptable to them, then it may be the right decision for them at this time. I have made many storage decisions over the years based on conditions at the time, I later changed my mind as conditions changed, that does not make my earlier decisions wrong.

What is weird about 5-bays?

5-bays may be uncommon, but a 5-wide RAIDz2 is one of the configurations I keep coming back to for good reason. For example, in an 8-slot server, 2 x OS mirror + 5-wide RAIDz2 + hot spare as one example.

If you “know so well” what you want, why make a thread in the first place? If it just ends up with you shutting down everyone who advises against your ideas?

Thing is, he doesn’t; even his ZFS basics aren’t remotely up to date. That’s what makes it weird.

Well, not weird per se; it’s more the argument of being locked into RAIDZ1 by having just 5 bays.

With 6 bays you’ve the flexibility to choose either-or without significant efficiency loss.

Excluding the OS, that’s 6 bays.
Precisely why I advocate for 6 bays: it gives you the option to do things like 5-wide + hot spare, 6-wide, or just 5-wide.

Care to elaborate on this? TrueNAS has been “advertised” as being able to function as a host for virtual machines and dockerized apps.
Anything specific you care to share?

“Being able to function as” and “being good at” are different. First of all, TrueNAS is a… drum roll… NAS. I don’t have much experience with hypervisors except proxmox. But I do believe that any hypervisor (OS) would beat TrueNAS in terms of hosting VMs.


Are you referring to the previously discussed time it takes to rebuild the array in case of a drive failure, or something else? If so, what?

Do you perchance know where such a setup is described in more detail? Maybe I was searching for the wrong things, but whatever descriptions I found seemed to be about metadata…

6 isn’t an option, as the unit can only hold five drives—unless I were to bundle a bunch of SSDs into a device acting as a sixth drive, so to speak, but that would be kind of a waste of performance…
If I had six bays, that’s what I would do, but alas…

That said, I’m a bit surprised at your three-mirrors choice, given that you’re otherwise so focused on data safety. If I were willing to give up half the capacity for redundancy, why not RAIDZ3? Then ANY three drives can fail. With three mirrors, I’d still be stuck with single-drive-failure safety, unless multi-drive failures carefully pick separate mirror pairs.
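A quick combinatorial check of that trade-off, assuming the suggested 6-drive, 3×2-mirror layout and two simultaneous failures (RAIDZ2 would survive any two, RAIDZ3 any three):

```python
from math import comb

# 3 two-way mirrors = 6 drives. With two simultaneous failures, the pool
# dies only if both failures land in the same mirror.
pairs_total = comb(6, 2)   # 15 equally likely failure pairs
pairs_fatal = 3            # one fatal pair per mirror
p_pool_loss = pairs_fatal / pairs_total
print(p_pool_loss)  # 0.2 -> 80% of double failures are survivable
```

So mirrors aren’t guaranteed to survive two failures the way RAIDZ2 is, but they do survive most double failures, and resilvering a mirror is a fast sequential copy from one drive rather than a reconstruction across the whole vdev.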

That was my thought too. Unless the data is used by a virtual machine on the server itself, the bottleneck is the LAN: maybe one machine connected by TB networking (if TrueNAS supports that) or by a 10Gb Ethernet port, while everything else goes over WiFi; and really performance-critical data is local to the machines anyway, so the NAS hosts backups, redundantly synced data, maybe a Nextcloud instance, a media server. The key thing about SSDs is lower latency—quicker file-system ops like mv/rm/ls for more frequently accessed, smaller files, or database stores (Nextcloud). So I’m not concerned about an x1 link, as any array is more than capable of saturating the network.

Didn’t say it will be fun, but unless something goes seriously wrong, this also shouldn’t be a regular activity. If I have to deal with it once every 5-10 years (if that often), I’ll deal with it. Small documents will be synced/mirrored to two different cloud services, which aren’t fast but are affordable. A restore might take a few weeks (unless I finally get fibre here sometime soon), but the data remains accessible directly from the cloud in the meantime. The rest is a media archive, which will be copied to spare drives, so I don’t need tens of terabytes of online backups.

Hm, exactly like TrueNAS Mini

One doesn’t learn by uncritically accepting suggestions, one learns by questioning and reasoning.

I originally had ONE question relating to mixed use of SSD and HDD, the other discussion developed from responses, but wasn’t my original question.

Anyway, thanks for the input, there’s some food for thought there…