Yeah it was a bad example. Was just trying to drive home the fact that it’s not just another extra memory die…it’s a different chip
Edit also:
Your picture is RDIMM vs LRDIMM, which is a different thing altogether lol
Yes. But it shows the register chip AND the extra DRAM chips.
UDIMMs are actually strictly better for a given capacity, since they have lower latency, all else being equal. ECC is exactly the same with UDIMMs or RDIMMs (or LRDIMMs).
Of course, there’s a crapton of used RDIMM stock, so pricing is a lot more favorable on those.
I learn something new every day. Thanks for the education.
I used to come down on the side of ECC being optional, but had a RAM stick fail in my parents’ server that would flip a certain bit so consistently that it not only caused scrubs to report errors, it actually caused ZFS to recalculate the parity data, incorrectly, and write it back to disk.
This caused permanent corruption of dozens of files before I realized what was happening, even with a RAIDZ2. Luckily I had backups of most of it, and only ended up losing 1 file that had not been backed up yet.
Just finished building them a new server last week, where I gladly paid the premium for ECC RAM. Not going to repeat that disaster if I can help it.
I’m with team ECC. My main server has it, my backup server doesn’t. If scrubs start to show errors I will have to compare the data between the servers before just running a repair I guess. Even then memory errors might falsify the results.
WOW! You’ve actually observed the mythical Scrub-of-Death in the wild
Thanks for reporting.
(To Moderators: Is there a way to mark the above post #65 as “Solution”?)
The system was 10+ years old (i7-4790) with DDR3 running in XMP mode the whole time. The higher RAM voltages over the last decade could well be the cause of the failure. When I ran MemTest it would find the bit being flipped very quickly. It would seemingly always flip from a 0 to a 1, like that bit was stuck on. Probably the most definitive memory failure I’ve ever seen.
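For anyone wondering why a stuck-on bit shows up so quickly in a memory test, here is a toy sketch. It is not how MemTest works internally; the `StuckBitMemory` class, the address and the bit position are all made up for illustration. It only shows that writing zeros and reading them back exposes a stuck-at-1 bit immediately, while an all-ones pattern hides it.

```python
class StuckBitMemory:
    """Toy byte-addressable memory with one bit permanently stuck at 1 (hypothetical model)."""

    def __init__(self, size, stuck_addr, stuck_bit):
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr
        self.stuck_mask = 1 << stuck_bit

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.stuck_addr:
            value |= self.stuck_mask  # the faulty cell always reads this bit as 1
        return value


def pattern_test(mem, size, pattern):
    """Write `pattern` to every cell, read it back, return (addr, wrote, read) mismatches."""
    for addr in range(size):
        mem.write(addr, pattern)
    errors = []
    for addr in range(size):
        got = mem.read(addr)
        if got != pattern:
            errors.append((addr, pattern, got))
    return errors


SIZE = 1024
mem = StuckBitMemory(SIZE, stuck_addr=300, stuck_bit=5)  # made-up fault location

zeros_errors = pattern_test(mem, SIZE, 0x00)  # a 0 is written, a 1 comes back
ones_errors = pattern_test(mem, SIZE, 0xFF)   # the stuck bit agrees with the pattern

print("all-zeros pattern:", zeros_errors)
print("all-ones pattern: ", ones_errors)
```

The same fault is invisible to any data that happens to have a 1 in that position, which is why a machine like this can appear healthy for a while and then corrupt specific blocks over and over.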
With enough parity (e.g. RAIDZ2) a true random bit flip in RAM shouldn’t cause unrecoverable corruption, but I don’t want to play the odds anymore after this experience.
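A toy sketch of the "one-off flip is recoverable" idea, assuming a per-block checksum plus a single XOR parity block. This is not ZFS or RAIDZ2 code, and it does not model where the flip happens; the block contents and helper names are invented. It only shows the general pattern: the checksum detects the bad block, parity rebuilds it, and the rebuilt block is verified again before being trusted.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    return hashlib.sha256(block).digest()


# Three data blocks plus one XOR parity block (toy single parity, not RAIDZ2).
data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
parity = xor_blocks(data)
sums = [checksum(b) for b in data]  # kept separately from the data blocks

# Flip one bit in block 1, as a one-off error would.
damaged = list(data)
damaged[1] = bytes([data[1][0] ^ 0x04]) + data[1][1:]

# Detection: the stored checksum no longer matches.
bad = [i for i, b in enumerate(damaged) if checksum(b) != sums[i]]

# Repair: rebuild the bad block from the other blocks plus parity, then re-verify.
for i in bad:
    survivors = [b for j, b in enumerate(damaged) if j != i]
    rebuilt = xor_blocks(survivors + [parity])
    if checksum(rebuilt) == sums[i]:
        damaged[i] = rebuilt

print("bad blocks:", bad)
print("recovered:", damaged == data)
```

Note that this only works while the machine doing the checking can be trusted. A bit that is stuck rather than randomly flipped can corrupt the checksum or the rebuild step itself, which is the failure described above.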
Drifting way off the core topic here, I agree with the suggestions above to buy used enterprise servers. The only challenges are the power consumption and noise. If you have cheap power (or solar!) the power consumption isn’t a deal breaker. Getting them quiet enough to be in the same room is a whole other story…
I liked that bit:
Backups are Backups, and raidz is about redundancy and uptime.
ECC is the only way to go. Reliability is a required thing, and ECC is a key factor in achieving that reliability.
Whatever the application (home lab or enterprise), there is no reason not to use ECC RAM for valuable data, even though it is no replacement for a solid backup strategy. It’s one less factor contributing to data loss, and one less reason to dive into your backups to recover.
After all, prevention is always better than cure and ECC is a reasonable preventative measure.
It depends on whether my time is valuable to me.
Finding errors in consumer RAM is a pain. Is it crashing because of an app or driver, or bad memory? Let me run memtest86+ for 5 days or longer to get some reasonable facsimile of trust in my RAM.
Or pay a little bit extra for ECC RAM and know when it’s failing, and when it’s not.
I’ve only had mystery RAM errors three times in my life, but it’s been a pain in the ass to diagnose every time. ECC when I can. Laptops don’t do ECC, alas, and neither do RK3588 boards. But desktops and servers can.
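On the "know when it’s failing" point: on Linux, the kernel’s EDAC subsystem exposes corrected and uncorrected error counters under sysfs when an EDAC driver for your memory controller is loaded. A minimal sketch for reading them, assuming a Linux box with those files present (this does not apply as-is to FreeBSD-based systems, and the function name is my own):

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; absent on non-Linux or when no EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory under root."""
    counts = {}
    if not root.is_dir():
        return counts  # nothing to report: no EDAC data on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


counts = read_edac_counts()
if not counts:
    print("No EDAC counters found (no ECC, no EDAC driver, or not Linux).")
for name, c in counts.items():
    print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A corrected count that keeps climbing is the early warning you never get from non-ECC RAM. An empty result does not prove the RAM is healthy; it may just mean the counters are not exposed.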
Funny thing,
I’ve diagnosed RAM errors on non-ECC systems a few times.
I’ve experienced a random bit flip on ECC systems a few times.
I’ve experienced serious RAM failures on ECC systems a few times. (Blow out the sockets, etc.)
I’ve experienced absolutely tonnes of random computer crashes on non-ECC systems.
I’ve experienced absolutely tonnes of random computer crashes on non ECC systems.
Go figure.
Concur, but the worst thing that can happen is silently corrupted data that is then backed up, leaving you with zero good copies. I agree with @kris that this can be taken to great lengths (i.e. is the RAM on your NIC also ECC?) but given what a small delta ECC RAM made in my build vs. non-ECC RAM, I didn’t see the point of going non-ECC.
To me, it’s a bit like some car OEMs that give you a $700 option for rear seat airbags. If you’re a retiree and never expect to have anyone back there or plan never to get into a car accident, then the $700 option is likely unnecessary. On the other hand, how will you live with yourself if you walk away from a T-bone accident but your kid doesn’t? The car OEM knows this and value-prices accordingly. But if you’re already in for $$$ buying a car you might just say, “it’s expensive but it’s also worth it.”
Going to ECC likely also involves buying server-grade hardware. That’s a whole other ball of wax re: additional cost. But again, it’s likely worth it if the data is precious, because the hardware and drivers are more likely to be up to the task on a real server board than on an old gaming rig with a Realtek NIC, for example.
How did I ever survive riding and driving in cars during the 1950s-70s before seat belts were mandated?
I just had another unexplained loss of console on my 5-year-old ASRock DeskMini A300, which can’t support ECC memory. (It was screen locked… but nothing on the screen when I powered on the monitor… nothing restored it.)
This is now starting to happen several times a year… which, while annoying, is not bad. But I use that PC for access to work (because I work from home).
Now could it have been a bit flip in memory?
Perhaps.
Could it be the dust that has built up over 4-5 years since I last opened it?
Maybe.
Should I clean it out now?
Probably not, because I hope to buy an ASRock DeskMeet X600 soon, which does support ECC memory (if the CPU does too). And I can’t risk making it worse.
presented without comment
Coincidentally I have been through two driver-side t-bones in my life. One in a car without a side impact airbag, one with.
After the first accident, I didn’t walk straight for a few days. I also took out some of the hard plastic inside the car with my head.
Second accident, several years later, I had a bruise on my arm, a torn shirt, and I was fine. Both times I was wearing a seat belt.
Impacts were at similar speeds though dissimilar cars. Second car was smaller, should have sustained a worse impact.
I’ll never buy a car without a side impact airbag again, ditto for my passengers.
Oof, I was once in a relatively low-speed one and it was bad enough (parking garage, crazy woman would totally have killed any pedestrians who happened to be in her way). Not enough to deploy airbags (probably because the road surface was slippery and my car was pushed off to the side, reducing the initial impact shock), but enough to almost total it. A single random airbag deploying would have probably totaled the car at that point.
I’m shuddering just imagining being in two of these, but higher-energy.