RAID Reliability Calculators blindspot

Sara · June 3, 2024, 10:13pm

Hi jro, thanks for taking the time.

I used your awesome calc before! Thanks for that!

So I even skimmed the attached PDF, and that also seems to clearly point to the bathtub curve.
They also have not found a strong correlation between temps/usage and AFR. My guess (not said in this study) is that age is the more important factor.
Batches have different AFRs, but within a batch, drives tend to fail close to each other. This further gives me the feeling that this is something that gets overlooked.

I played with your calculator. Maybe I am too tired from work, but here is an idea.
AFR calculates over a whole year (duh!).
So even if we assume that all drives will fail in a certain year and set a failure rate of 100%, it is still spread out over a whole year. So for 8 drives that would be on average 45 days apart.
That seems pretty generous for a batch that is breaking down.
So how about we set resilver time from 48h to 960h?
That is a 20 times increase. Does that give us the same number as 100% of the drives failing within 18.25 days (365 / 20)? Will have to check if the math works out on that tomorrow.
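A rough way to sanity-check that idea, as a sketch only: assume a simple linear model in which the chance of another drive failing during a resilver is proportional to AFR times the resilver window divided by the period the failures are spread over. This is an assumption for illustration, not how the calculator works internally, and the function names are made up.

```python
HOURS_PER_YEAR = 365 * 24

def mean_gap_days(n_drives, period_days=365.0):
    """Average spacing between failures if all drives fail evenly over the period."""
    return period_days / n_drives

def overlap_chance(afr, window_hours, period_hours=HOURS_PER_YEAR):
    """Linear approximation (assumption): chance that one other drive
    fails inside the resilver window."""
    return afr * window_hours / period_hours

# 8 drives all failing within one year
gap = mean_gap_days(8)

# Option 1: keep the year, stretch the resilver from 48 h to 960 h
stretched = overlap_chance(1.0, 960)

# Option 2: keep the 48 h resilver, squeeze the failures into 365 / 20 days
compressed = overlap_chance(1.0, 48, period_hours=(365 / 20) * 24)

print(f"mean gap between failures: {gap:.1f} days")
print(f"stretched resilver: {stretched:.4f}, compressed period: {compressed:.4f}")
```

Under that linear assumption the two options give the same number, so stretching the resilver time looks like a fair stand-in for a batch dying within about 18 days. A model that compounds probabilities instead of scaling them linearly may not match exactly.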

Because in that case, mirror is safer than RAIDZ2 according to your calculator. And we haven’t even looked into the batch problem.

The batch problem:
Assuming this:
Seagates have an AFR of 5% in year 5 and 100% in year 6.
WDs have an AFR of 100% in year 5.
We don’t know these numbers beforehand, because our time machine is broken.

Assuming we only got Seagates:
We have a 66% chance of zero failures, but it could also be one or all of them. Let's assume we are “lucky” and one drive fails in this year. Why this is lucky, we will see later.
Anyway, we replace that drive in year 5 and thus have a fresh disk in the pool. If we assume a 0% AFR (just for simplicity) for the first year of that drive, the pool's average AFR is down to 87.5% for year 6.
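For anyone who wants to check the 66% and 87.5% figures, here is a minimal sketch. It assumes 8 drives that fail independently of each other, which is exactly the assumption the batch problem calls into question, so treat it as arithmetic only. The function names are made up.

```python
def p_no_failures(afr, n_drives):
    """Chance that none of n independent drives fails within the year."""
    return (1.0 - afr) ** n_drives

def average_afr(per_drive_afrs):
    """Plain average of the per-drive AFRs in the pool."""
    return sum(per_drive_afrs) / len(per_drive_afrs)

# Year 5: eight Seagates at 5% AFR each
p_zero_year5 = p_no_failures(0.05, 8)

# Year 6: one fresh drive at 0% (simplification), seven old drives at 100%
afr_year6 = average_afr([0.0] + [1.0] * 7)

print(f"chance of zero failures in year 5: {p_zero_year5:.1%}")
print(f"average AFR in year 6: {afr_year6:.1%}")
```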

mirror 3.628%
RAIDZ2 3.414%

Here, RAIDZ2 is more secure.

Assuming we only got WDs:
We have a 100% AFR in year 5.

mirror 4.718%
RAIDZ2 4.826%

Here, mirror is more secure. But for both pool configs, the Seagate example above is more secure, thanks to Seagate spreading the risk out better: one drive fails a year earlier, which brings the AFR down a little. That is why we were “lucky” that one Seagate drive failed the year before.

Assuming we got two mixed batches and we are looking at year 5:
Our combined or mixed AFR would be 52.5%? Again have to check the math on that tomorrow
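A quick check of that average, as a sketch assuming the pool is split into four Seagates at 5% and four WDs at 100%. The variable names are made up.

```python
# Year 5 AFR per drive: four Seagates at 5%, four WDs at 100%
seagates = [0.05] * 4
wds = [1.00] * 4
pool = seagates + wds

# Plain average across all eight drives
mixed_afr = sum(pool) / len(pool)

# Expected number of failed drives in year 5
expected_failures = sum(pool)

print(f"mixed AFR: {mixed_afr:.1%}")
print(f"expected failures in year 5: {expected_failures:.1f} drives")
```

So 52.5% holds as a plain average, with the caveat that no drive in the pool actually has that AFR.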

mirror 1.318%
RAIDZ2 0.857%

Two things I notice here.
A: Since the drive failures are more spread out over the years, the chance of pool failure goes down significantly. That alone should be a huge factor in your resiliency calculations. But we almost never talk about that.

B: It might seem like RAIDZ2 is better at first glance, since the percentage is lower, but that is not really true. We used 52.5%, but that is just the average failure rate and not what really happens. We have four drives with an AFR of 100% and four drives with 5%.
Let's assume the worst-case scenario: the four WDs fail on the first day of the year, which would be very unlucky. But it could happen; imagine a firmware bug or some mechanical problem. We would lose all of our data in a RAIDZ2. Not so for mirror (assuming each mirror pairs one WD with one Seagate). The data is still there, but the four remaining drives are basically a 4-drive stripe with a 5% AFR each until the resilver is done.

Assuming resilver takes 48h, we have eight days with 4 disks at risk, six days with 3 disks at risk, four days with 2 disks at risk and two days with one disk at risk.

19% risk for 8 days = 19/365 * 8 = 0.416% AFR
15% risk for 6 days = 15/365 * 6 = 0.247% AFR
10% risk for 4 days = 10/365 * 4 = 0.110% AFR
5% risk for 2 days = 5/365 * 2 = 0.027% AFR
Total = 0.800%

Even though this is the absolute worst case for the mirror scenario, it is still lower than the 0.857% of RAIDZ2.
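The tally above can be reproduced with a few lines. This is only a sketch of the same back-of-the-envelope method (annual risk scaled by the number of days exposed), using the risk and day figures exactly as listed; the variable names are made up.

```python
DAYS_PER_YEAR = 365

# (annual risk in percent, days exposed), as listed above
windows = [(19, 8), (15, 6), (10, 4), (5, 2)]

# Scale each annual risk down to the number of days it applies
contributions = [risk * days / DAYS_PER_YEAR for risk, days in windows]
total_pct = sum(contributions)

RAIDZ2_PCT = 0.857  # calculator result for the mixed pool, from above

for (risk, days), part in zip(windows, contributions):
    print(f"{risk}% risk for {days} days -> {part:.3f}%")
print(f"total: {total_pct:.3f}% (RAIDZ2: {RAIDZ2_PCT}%)")
```

Without rounding the individual lines, the sum comes out at roughly 0.8%, which is still below the RAIDZ2 figure.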

If we compare a “single batch” system with a “two batch” system, the advantage of the two batch system is that you don’t have these extremes when it comes to AFR. That is thanks to drive failures being more spread out.