Simple. Get used HGST He-series helium-filled drives at whatever capacity you want with the warranty you want. They use less power than comparable drives thanks to the helium and run really nicely. The only downside is that no one makes them at lower capacities anymore; these are drives destined for data centers, where lower heat/power is a real plus.
No one makes 5,400 RPM 3.5" drives at this point. Even the “5,400 RPM class” drives WD was selling for a while were actually gelded 7,200 RPM drives. Just say no to artificial brakes being installed in your NAS drives. This isn’t DEC circa 198x anymore.
I would also take anything Backblaze publishes with a grain of salt. You are not operating thousands of 45-drive Storinators in a 17°C environment. The temperatures your drives will experience will be very different based on location, active cooling, etc. Even within the Storinator pods, some drives at the front will be “cold” while the ones in the rear will be closer to “hot”.
What I focused on in my “Ghetto but Awesome” rig is ensuring every drive gets lots of airflow combined with a fan script. That means leaving adequate room between drives, powerful fans, and a decent ambient environment.
Lastly, we have no idea to what extent the work inside the pods at Backblaze is even remotely close to the duty cycle in your NAS. They have a very funky system to ensure parity that IIRC is very efficient but likely optimized around WORM.
So, I do not run a dedicated AC unit to keep my NAS at 15°C, but at the same time I run a Z3 since $76 drives are cheap and AC is not (at least around here). So far, the drives have been happy with me.
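As for the fan script mentioned above, here's a rough sketch of the shape of it. Everything in it is an assumption you'd adapt to your own build: it supposes smartctl is installed, the drives sit at /dev/sda through /dev/sdd, and the BMC accepts the Supermicro X10/X11-style raw IPMI fan command (other boards use different bytes or none at all).

```python
#!/usr/bin/env python3
"""Minimal fan-curve sketch: poll drive temps via SMART, set chassis fan duty.

Assumptions: smartctl installed, drives at /dev/sda../dev/sdd, and a BMC that
accepts the Supermicro X10/X11-style raw fan command. Adapt for your hardware.
"""
import subprocess
import time

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # hypothetical layout

def drive_temp(dev):
    """Return the Temperature_Celsius raw value from SMART, or None."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Temperature_Celsius" in line:
            return int(line.split()[9])        # RAW_VALUE column
    return None

def set_fan_duty(percent):
    """Set fan zone 0 duty cycle (%) via a Supermicro-style raw IPMI command."""
    duty = max(25, min(100, int(percent)))
    subprocess.run(["ipmitool", "raw", "0x30", "0x70", "0x66",
                    "0x01", "0x00", str(duty)])

while True:
    temps = [t for t in (drive_temp(d) for d in DRIVES) if t is not None]
    hottest = max(temps, default=30)
    # Crude curve: 25% duty up to 35C, ramping to 100% by 50C.
    set_fan_duty(25 + max(0, hottest - 35) * 5)
    time.sleep(60)
```

Some drives report temperature under attribute 190 (Airflow_Temperature_Cel) instead, so check your own smartctl output before trusting the parsing.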
I’ll just throw my anecdotal experience out here: for the last 10+ years I’ve been buying used/recertified enterprise drives from folks like goharddrive and serverpartsdeals. This helps me avoid garbage consumer disks while also staying away from the left edge of the bathtub curve. My NAS (in a Supermicro 2U case) lives in an outside closet, the same closet where the water heater sits. It’s in California, and drive temperature as reported by TrueNAS varies between 20°C and 40°C daily.
Anecdotally, I had one disk develop one bad sector during this decade (and GHD replaced it, no questions asked). Before that I had many more failures (compared to one) with new drives, usually within the first few months. (See the bathtub curve reference above.)
So my advice would be: don’t overthink it. Don’t read reviews and Backblaze reports. Don’t pay for new drives. Drives are consumables. They are commodities. They are expected to fail. Get the cheapest ones, use raidz1 and replication, run scrubs, and replace those that start failing.
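If you go that route, the "run scrubs and replace what starts failing" part boils down to a tiny periodic check. A rough sketch, with the device list and the alerting mechanism entirely made up for illustration:

```python
#!/usr/bin/env python3
"""Rough health-check sketch: pool status plus one SMART counter worth watching.

Device names are placeholders; wire the alerts into cron/email as you prefer.
"""
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # hypothetical raidz1 members

def pool_healthy():
    # `zpool status -x` prints "all pools are healthy" when nothing is wrong.
    out = subprocess.run(["zpool", "status", "-x"],
                         capture_output=True, text=True).stdout
    return "all pools are healthy" in out

def reallocated_sectors(dev):
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[9])        # RAW_VALUE column
    return 0

if not pool_healthy():
    print("ALERT: zpool status reports a problem")
for dev in DRIVES:
    if reallocated_sectors(dev) > 0:
        print(f"ALERT: {dev} is reallocating sectors; line up a replacement")
```

The scheduled scrubs themselves are just a cron entry or the built-in scrub task in TrueNAS, so a script like this only needs to read the results.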
No. Not that. They found no correlation between drive temperature in a certain range and failure rate. And this temperature range is probably not what we see in home use. Their sample is also very different from home use.
I feel very similarly. I may look at Backblaze data for obvious high failure rates, but lasting longer than the warranty, eh. Just my opinion.
So many people, when they build a system, don’t actually plan for the drives being consumables. All drives will wear out eventually. It isn’t like the motherboard, where the warranty may be short but historically the board can last well beyond 10+ years. I have some old stuff that still works.
@chrisolson91 You asked for some recommendations and I think you have some and a bit more as well.
WD Gold Enterprise drives are the way to go in my experience. Never had one die on me. And I just let a couple of 10-year-old Gold drives go that were still working perfectly. Just felt like it was time, but who knows. They are rated for, like, 25 years. And, yup, they are 7200 RPM drives.
I can prove I sold my Synology NAS with two 3TB drives. Not sure when they last made a 3TB Gold drive, but I suspect a while back. I can’t prove the SMART test I ran indicated the drives had 10 years of use. Will attempt to upload
This is the conclusion I’ve come to of late; nice summary. Consider used enterprise drives (if you can provide the cooling and deal with potentially louder drives), and I believe all the SAS drives I’ve looked at of late run at 7200 RPM. I’m a decade late to that revolution, it seems. I’m also gravitating towards the opinion that drives should be viewed as very expendable; the last set of new WD Reds I bought started generating SMART errors, so I’m already on the next round, perhaps within the last 10 years. And they’re so much more expensive than used enterprise drives, which are rated for far longer lifecycles.
But with a used-drive strategy I was thinking of going to Z2, possibly with offsite replication (vs. limited online backup). No life-critical data, but a 2-drive failure would be a significant loss. Now you’ve got me questioning whether I should stick with Z1 and just do offsite backup with used enterprise drives.
It sounds like just 1 SMART error in 10 years on a drive that was purchased recently enough to get a replacement? And no complete drive failures in a decade? Sure, it’s an anecdote, but your post does have me questioning my path forward as I build a new rackmount system.
I guess it is worth noting that you’re running a 2U Supermicro, and the vast majority of folks probably don’t have that kind of airflow/cooling infrastructure to keep the temps even where you have them in that environment.
On the fault tolerance level: it’s rare that a drive just disappears from the bus; the most common failure is a disk developing bad sectors. In this case, ZFS has a massive advantage over conventional RAID: during disk replacement (with zpool replace), if you keep the disk being replaced physically present, ZFS resilvers data onto the new disk from all remaining healthy disks, including the disk being replaced. Thus during the resilvering process the vdev maintains its fault tolerance level. If that hypothetical second disk fails during the resilver by developing bad sectors somewhere else, nothing bad happens; the resilver will complete.
So for conventional RAID, folks pick raid6 to have at least single-disk fault tolerance for the duration of the repair, in case that secondary failure strikes. Raidz1 maintains its fault tolerance level throughout the repair, so there is no need to pay in performance and storage for a vanishingly small improvement in uptime.
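To make the mechanics concrete: the in-place replacement described above is just `zpool replace` with the old disk left attached until the resilver finishes. A minimal sketch, with the pool and device names entirely made up:

```python
import subprocess

# Replace a failing member while leaving it physically connected, so the
# resilver can still read from it and the vdev keeps its fault tolerance.
# "tank" and the by-id names are hypothetical placeholders.
subprocess.run(["zpool", "replace", "tank",
                "ata-OLD_DISK_SERIAL", "ata-NEW_DISK_SERIAL"], check=True)

# Watch resilver progress; only detach/pull the old disk once it completes.
print(subprocess.run(["zpool", "status", "tank"],
                     capture_output=True, text=True).stdout)
```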
Another thought on this hypothetical secondary failure. I keep reading about it everywhere, with exaggerated risk assessments, but it does not make any sense to worry about: if it were a problem, disks would be failing after every other scrub. This does not happen. And if you calculate the probability of it happening, the probability of two uncorrelated events both occurring within a week of a third uncorrelated event is comparable to a lightning strike taking down both the power supply and the motherboard.
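A back-of-the-envelope version of that claim, with made-up but plausible numbers (a pessimistic 5% annual failure rate per disk and a 24-hour resilver):

```python
# Rough odds of a *complete* second-disk failure landing inside the resilver
# window. The AFR and resilver time are assumptions, not measurements.
afr = 0.05                 # 5%/year, pessimistic for an aging disk
resilver_hours = 24
remaining_disks = 3        # e.g. a 4-wide raidz1 minus the failed disk

p_one = 1 - (1 - afr) ** (resilver_hours / (24 * 365))
p_any = 1 - (1 - p_one) ** remaining_disks
print(f"{p_any:.5f}")      # ~0.00042, i.e. roughly 1 in 2400 per rebuild
```

And a disk that merely develops more bad sectors during the resilver is the case already covered above when the old disk stays attached.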
But what about correlated failures? Well, that’s a problem. But if two disks are involved in correlated failure, who’s to say it will be limited to just two? For these cases there is offsite backup and replication.
Yes, it was 4 (or 3) years after purchase. I did not have to replace it, but I wanted to “test” their 5-year warranty. I bought that disk through eBay, and evidently they cannot refund after 1 year on eBay, so they sent me cash via PayPal instead and I just ordered a replacement from them beforehand. Zero friction.
The drives sold to me by iXsystems 4 years ago in my MiniXL+ are WDC_WD140EFGX-68B0GN0, which are 14TB, 7200 RPM. So there is your answer about 7200 RPM drives as far as iXsystems sees it.
That is the first time I’ve ever heard that, impressive. My car warranty is significantly less, and my roof on my house would be lucky to last 20 years.
From the WD Gold drive fact sheet: “With a five-year limited warranty supporting up to 2.5M hours MTBF, WD Gold® hard drives deliver enhanced levels of dependability and durability.” I believe 2.5 million hours mean time between failure represents 285 years. That seems ludicrous. Maybe I’m interpreting it wrong. Or my math is wrong.
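For what it's worth, the arithmetic itself checks out; the question is what the number means for a single drive:

```python
mtbf_hours = 2.5e6
hours_per_year = 24 * 365            # 8760
print(mtbf_hours / hours_per_year)   # ~285.4 years
```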
There is a study by Google on drives running at higher temperatures. As most would guess, failure rate increases above a certain temperature, which seems to be ~47-48°C.
As for 5400 RPM drives availability, I’m using 14 TB WD shucked drives that are allegedly 5400 RPM (SMART says so). What I know for sure is that they run cooler and quieter than a 6 TB Seagate 7200 RPM that I also have. They are also a bit slower on sequential R/W.
I recall a Microsoft study revealed something similar about higher running temps. It seems like you need to look at a realistic range, covering up to 60°C, to better understand the relationship between heat, wear, and drive life.
This, 100%
“5400 RPM Class” is just WD speak for firmware with performance limiting, mostly in their externals (I suspect for thermal reasons) but they also sell them this way in REDs as you’ve seen…
The easiest way to find the truth is with a spectrogram app like Spectroid: hold your phone against the drive and look for the peak (at 90 Hz for 5400 RPM, 120 Hz for 7200 RPM).
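Those peak frequencies are just the spindle speed divided by 60 (revolutions per minute to revolutions per second):

```python
for rpm in (5400, 7200):
    print(rpm, "RPM ->", rpm / 60, "Hz")   # 90.0 Hz and 120.0 Hz
```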
Newly made, you will only find 5400 RPM in 2.5" drives these days, and they are all now SMR, unfortunately, with mostly the same braindead and flawed firmware that drive-managed SMR disks have always had (WDs are worse because they try to be “smarter” and cache the SMR storage with CMR zones more aggressively, re-writing a lot and hitting the performance disaster faster and harder).
Yeah, that’s definitely not the right way to think about risk tolerance. It’s a statistical model that suggests that if you’re running a million drives, one would fail every 2.5 hours. But it’s virtually meaningless in terms of risk assessment. The consumer WD Red Plus drives have a 1 million hr MTBF rating, but I’ve had significant SMART errors in well under 60K hrs with relatively light use.
There’s a reason you can find tons and tons of 40K-hour used enterprise drives online. After 4-5 years the failure rate goes from the 1%/yr range to 5-10%/yr, and that’s for ultra-durable drives too. It’s just too much risk for a situation that demands minimal maintenance and 100% performance and uptime. Now, as @saspus pointed out, that’s not the whole story either in terms of NAS risk assessment. For one, even raidz1 would require 2 complete failures within a short period of time for pool destruction. Secondly, that’s under heavy data-center IOPS conditions. Third, there’s a big difference between a complete disk failure and replacing a drive as soon as you start seeing SMART errors. So the actual risk is a lot smaller, but it’s also not going to last 25-200 years either. And, at the end of the day, no one has ever argued that raidz is a replacement for (offsite) backup, and if you’re doing that, that’s part of the risk assessment too.
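One way to make an MTBF figure slightly less meaningless is to convert it to an annualized failure rate, using the constant-failure-rate (exponential) model that usually sits behind these ratings; a quick sketch:

```python
import math

def afr(mtbf_hours, hours_per_year=8760):
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1 - math.exp(-hours_per_year / mtbf_hours)

print(f"{afr(1.0e6):.2%}")   # ~0.87%/yr for a 1M-hour rating (WD Red Plus class)
print(f"{afr(2.5e6):.2%}")   # ~0.35%/yr for a 2.5M-hour rating (WD Gold class)
```

Compare those sub-1% model numbers with the 5-10%/yr observed once drives hit 4-5 years and you can see the rating describes a young fleet under nominal conditions, not any individual drive's lifespan.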
It is how a lot of companies rate products. They take, say, 250 hard drives and run them for XX hours until one fails. So if 250 drives run for 365 days and then one fails, the math is: 250 × 365 × 24 = 2,190,000 hours MTBF (2.19 million). That was just a rough example, but it gives you the general idea.
I’m fairly certain that they use failure data they have accumulated over the years for similar drives (same drive motor, same head electronics, etc.) rather than actually running a large batch of drives, since lengthy tests cost money to run.
Do not base a decision on MTBF; use reviews and the warranty period to help judge the longevity of a drive. I personally assume the drive is good only for the warranty period, as a minimum value. It can fail sooner, but overall most drives last much longer than the warranty period; I just do not plan beyond the warranty time.
My HGST drives are 4 years and 7 months out of warranty and are still running fine. I’m lucky that 3 of 4 drives have lasted this long, but they will eventually fail.