Hi forum!
I wrote something I would like to discuss and get your opinions on.
Here is a rough draft:
Most of us have used the great RAID calculators from WintelGuy: RAID Reliability Calculator - WintelGuy.com
But the reliability calculator is missing one thing from the equation: it models drive failures purely from MTBF. That isn't wrong per se, but it leaves out some real-world variables. We will get into that later.
For now, let us look at the calculated probability of data loss. I tried to fill it with values that represent a typical home lab:
- Mission time: 10 years
- Drive capacity: 16 TB
- Sequential speed: 160 MB/s
- MTBF: "conservative real world"
- Time to replace: 48 h
- Rebuild rate: 80% for mirror, 50% for RAIDZ2
- Pool width: 8 drives
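As a quick sanity check on what those parameters imply, here is a rough back-of-the-envelope sketch of the resulting window of vulnerability (replacement time plus resilver time). This is my own simplification, not necessarily how the calculator combines these inputs:

```python
# Rough rebuild-window estimate from the parameters above.
# Assumption: resilver time = capacity / (sequential speed * rebuild rate),
# plus the time it takes to physically replace the drive.

CAPACITY_TB = 16
SPEED_MB_S = 160
REPLACE_H = 48

for name, rebuild_rate in [("mirror", 0.80), ("RAIDZ2", 0.50)]:
    rebuild_s = CAPACITY_TB * 1e12 / (SPEED_MB_S * 1e6 * rebuild_rate)
    window_h = REPLACE_H + rebuild_s / 3600
    print(f"{name}: ~{rebuild_s / 3600:.1f} h resilver, "
          f"~{window_h:.1f} h total window of vulnerability")
```

So under these assumptions, a failed mirror drive leaves you exposed for roughly 83 hours and a failed RAIDZ2 drive for roughly 104 hours.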
Now let’s compare the calculated risk of data loss for an 8-wide RAIDZ2 and four striped mirrors.
| Mission time | mirror | RAIDZ2 |
|---|---|---|
| 1y | 0.0086704138805684 | 0.0000406224288297 |
| 5y | 0.0426067985030293 | 0.0002030956430019 |
| 10y | 0.0833982577273807 | 0.0004061500381636 |
According to this calculator, over 10 years we have a roughly 205× higher chance of losing data with the mirrors than with RAIDZ2. And some of you will argue that you feel more comfortable with RAIDZ2 anyway, since any two drives can fail, while for the mirrors two drives in the same vdev can bring down your pool.
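For anyone who wants to play with the math themselves, here is a minimal sketch of the classic MTTDL-style model such calculators are built on (independent drives, constant failure rate). It will not reproduce WintelGuy's exact numbers, since the real calculator models more than this (e.g., unrecoverable read errors during rebuild, if I remember the inputs correctly), and the 1,000,000 h MTBF is a placeholder I picked, not the "conservative real world" preset:

```python
import math

MTBF_H = 1_000_000   # placeholder MTBF in hours, NOT the calculator's preset
HOURS_PER_YEAR = 8766

def mttdl_mirror_pool(n_pairs, mttr_h):
    # Classic MTTDL for a 2-way mirror: both drives of one pair must die
    # within the repair window. A pool of n pairs fails n times as often.
    mttdl_pair = MTBF_H ** 2 / (2 * mttr_h)
    return mttdl_pair / n_pairs

def mttdl_raidz2(n_drives, mttr_h):
    # Triple-failure model for a double-parity vdev: the pool dies when a
    # third drive fails while two are already rebuilding.
    return MTBF_H ** 3 / (n_drives * (n_drives - 1) * (n_drives - 2) * mttr_h ** 2)

def p_loss(mttdl_h, years):
    # Probability of at least one data-loss event over the mission time,
    # assuming loss events arrive at a constant rate of 1 / MTTDL.
    return 1 - math.exp(-years * HOURS_PER_YEAR / mttdl_h)

for years in (1, 5, 10):
    print(f"{years:2d}y: "
          f"mirror {p_loss(mttdl_mirror_pool(4, 82.7), years):.2e}, "
          f"RAIDZ2 {p_loss(mttdl_raidz2(8, 103.6), years):.2e}")
```

With my placeholder MTBF the absolute probabilities come out lower than the table above, but the mirror/RAIDZ2 ratio lands in the same ballpark, which shows where that 205× comes from.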
But the calculator has one big flaw: it derives the chance of a drive failing purely from MTBF, and that is not close to the real world for two reasons.
1: Not all drives are created equal. You could have a bad batch, whether because of hardware problems or firmware issues. And even without a bad batch, two different drive vendors will not have the same MTBF, but the calculator assumes every drive does.
2: A drive's annual failure rate is not static; it most likely follows the bathtub curve. In the beginning the failure rate is slightly elevated (infant mortality), then it drops, and after roughly five years it climbs drastically due to wearout. The calculator assumes a constant rate, as the sketch below illustrates.
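If you want to see what point 2 means in numbers, the usual way to model a non-constant failure rate is a Weibull hazard function: shape k = 1 is the constant rate the calculator assumes, k > 1 models wearout. A small sketch, with shape and scale values made up purely for illustration:

```python
# Hazard rate of a Weibull distribution: h(t) = (k/lam) * (t/lam)**(k-1).
# k = 1 gives the constant rate a static MTBF model assumes,
# k > 1 gives a failure rate that climbs with age (wearout).

def weibull_hazard(t_years, shape_k, scale_lam_years):
    return (shape_k / scale_lam_years) * (t_years / scale_lam_years) ** (shape_k - 1)

for year in (1, 3, 5, 8, 10):
    constant = weibull_hazard(year, shape_k=1.0, scale_lam_years=10.0)
    wearout = weibull_hazard(year, shape_k=3.0, scale_lam_years=10.0)
    print(f"year {year:2d}: constant-rate model {constant:.3f}/y, "
          f"wearout model {wearout:.3f}/y")
```

Run it and the constant-rate model sits at 0.100/y forever, while the wearout model climbs from 0.003/y in year 1 to 0.300/y in year 10.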
With these two points in mind, let's revisit our scenario. We buy new drives for our new NAS: four Seagate drives and four WD drives. For the mirror layout, each vdev pairs one Seagate with one WD; for RAIDZ2 it is still a single 8-wide vdev. Now suppose the Seagate batch has a horrible firmware bug and all four of them die after one year, within days of each other or during a resilver. For RAIDZ2 the pool is lost, since we lost four drives. For the mirrors, everything is fine: every vdev still has its WD drive.
In another scenario, the Seagate drives are fine, but the WDs develop issues. Eight years in, they are extremely old: maybe the helium escapes, maybe the read heads or the motor wear out. Since you don't have hot-swap bays, replacing the first failed drive means shutting down the system and putting an extra spin-up cycle on the survivors, and the resilver puts additional load on them too. For whatever reason, three of the WDs fail. For RAIDZ2 the pool is lost, since we lost three drives. For the mirrors, everything is fine.
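Both scenarios boil down to correlated failures, and you can explore that mechanism with a crude Monte Carlo simulation. Everything below is invented for illustration (the 2% bad-batch probability especially); the point is only the asymmetry between the layouts:

```python
import random

# Crude Monte Carlo: 8 drives, two batches of 4 (e.g. Seagate + WD).
# With probability P_BAD_BATCH, a batch is defective and all of its drives
# fail within the same short window, faster than you can resilver.
# Mirror layout: 4 vdevs, each pairing one drive from each batch.
# RAIDZ2 layout: one 8-wide vdev that dies on the 3rd concurrent failure.

P_BAD_BATCH = 0.02   # invented probability that a given batch is defective
TRIALS = 1_000_000

mirror_lost = raidz2_lost = 0
for _ in range(TRIALS):
    batch_a_bad = random.random() < P_BAD_BATCH
    batch_b_bad = random.random() < P_BAD_BATCH
    # RAIDZ2: 4 near-simultaneous failures from either bad batch sink the vdev.
    if batch_a_bad or batch_b_bad:
        raidz2_lost += 1
    # Mirror: each vdev keeps one healthy drive unless BOTH batches are bad.
    if batch_a_bad and batch_b_bad:
        mirror_lost += 1

print(f"P(pool loss | batch defects): mirror {mirror_lost / TRIALS:.5f}, "
      f"RAIDZ2 {raidz2_lost / TRIALS:.5f}")
```

With these made-up numbers, RAIDZ2 loses the pool whenever either batch is bad (about 4% of trials), while the mirrors only lose it when both batches are bad (about 0.04%).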
I know I don't have hard real-world numbers to put up against the calculator. The point is to show that the calculator has blind spots and cannot model all the variables of a real-world system.