ECC memory for home servers or not?

Arwen · June 21, 2024, 12:00pm

Per various discussions in other threads, let's discuss ECC versus Non-ECC memory for HOME style servers.

I am of 2 minds about this.

First, it appears I have not experienced any memory errors on my home computers in the last 25 years. These computers have never had anything over-clocked, not CPU, GPU or RAM. They range from smaller desktops to miniature PCs and laptops. I have owned real servers, for example a Sun X2200 M2 AMD dual Opteron, which did have ECC. But they too did not report any RAM errors, (ECC correctable errors).

Of course my sample size is so trivially small, it is worth discounting 100%.

It is also worth noting that vendors, like Intel, give us limited choices for low-end hardware that supports ECC, such as desktops, laptops or home servers. So it is not surprising that Non-ECC RAM is used most often in the home, and in some small businesses as well.

Now on the other side. I have experienced unexplained crashes and data corruption on disk, (aka NOT a bad disk block). This is when I used something other than OpenZFS.

Was that due to Non-ECC RAM bit flips?
Don’t know.

One thing that I’ve seen in the last 10 years here in the FreeNAS and now TrueNAS forums is unexplained ZFS pool corruption. We have several known reasons why a pool can get corrupt. Most appear to be related either to virtualizing the NAS or to using hardware RAID controllers.

Yet for those who are not virtualizing TrueNAS, not using hardware RAID disk controllers nor any of the other items, we never got a good explanation for the pool corruption. Could it have been memory errors working their way into the pool? Perhaps. No hard evidence that is the case.

One thing to note. People have assumed that an unexpected power loss could cause ZFS pool corruption. Nothing is farther from the truth! ZFS was SPECIFICALLY designed and TESTED to survive unexpected power losses and NOT lose any data. (Except data in flight.) Further, ZERO pool corruption is EXPECTED after unexpected power losses.

Of course hardware can fail during a graceless power loss. Like a disk or disk controller… Or maybe a regression was introduced to OpenZFS, breaking its programmatically sound recovery from a graceless power loss.

So, lots to think about.
Polite discussions please.

pmh · June 21, 2024, 12:04pm

A dated but IMHO still applicable analysis and take on the topic.

Arwen · June 21, 2024, 12:04pm

On a related subject, a few years back I started writing up specifications for an in-memory checksum scheme for R/O code and data. A pretty complete system, including shared libraries, kernel segments and user code. Having such checksums in memory would allow a “scrub” of R/O memory segments and, if a fault is found, disabling access to that MMU memory page. Then possibly fixing the memory page from the original binary and restoring access.

To be clear, the intent was for assisting in detection of malware modifying memory. Which also might modify a common binary on disk. My scheme would have checksums in the ELF AND then load those verified checksums in memory. Making it harder, (but not impossible), for malware to infect existing binaries.
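A rough sketch of that scrub idea, in user-space Python rather than kernel code. Everything named here is an assumption for illustration: the 4 KiB page size, the choice of SHA-256, and the hypothetical helpers `page_checksums` and `scrub`. The actual specification is not shown in this thread, and a real implementation would also have to revoke access to the MMU page before repairing it.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def page_checksums(image: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Checksum each page of a read-only image (stand-in for checksums kept in the ELF)."""
    return [
        hashlib.sha256(image[off:off + page_size]).digest()
        for off in range(0, len(image), page_size)
    ]


def scrub(memory: bytearray, expected: list[bytes], original: bytes,
          page_size: int = PAGE_SIZE) -> list[int]:
    """Verify every page and restore mismatching pages from the original image.

    Returns the indices of the pages that had to be repaired.
    """
    repaired = []
    for index, want in enumerate(expected):
        off = index * page_size
        if hashlib.sha256(memory[off:off + page_size]).digest() != want:
            # Hypothetical repair step: copy the page back from the trusted binary.
            memory[off:off + page_size] = original[off:off + page_size]
            repaired.append(index)
    return repaired


# Simulate a loaded read-only segment and a single bit flip in page 2.
original = bytes(range(256)) * 64      # 16 KiB standing in for the binary on disk
expected = page_checksums(original)    # four page checksums
memory = bytearray(original)           # the copy "loaded into RAM"
memory[2 * PAGE_SIZE + 10] ^= 0x04     # flip one bit

print(scrub(memory, expected, original))  # [2]
print(bytes(memory) == original)          # True
```

Note that the sketch only covers pages that never change after loading; writable data would need a different approach.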

Such a scheme could be used to assist in the detection of Non-ECC memory bit flips.

For example, not all code or R/O data is actually used. Or would cause a crash if used with a bit flip. I once had a Solaris 10 server that had a bad OS file. Even though the OS was Mirrored, I am guessing that another SA before me fixed a bad disk. But, at the time they did not run scrubs, so a file was bad on the “good” disk, (aka silent corruption). After changing the “bad” disk, they did not fix the failed file.

Now it turns out it was a foreign language file we never used. So no loss. Further, it was trivial to copy it from another working server. (Backups were of the bad file, and any good version of the language file had expired from backups by the time I got involved.)

Could this be of use to help Non-ECC RAM computers?
Maybe.

But it would be a major effort to make what would be a more secure OS, shared libraries and binaries.

somethingweird · June 21, 2024, 12:34pm

Personally - ECC memory for home servers if POSSIBLE - it's another layer of insurance against silently corrupted data. If I did 3-2-1 backups at home - I wouldn't care whether my NAS had ECC or non-ECC.

Stux · June 21, 2024, 1:04pm

Or have you?

That’s the issue.

I have experienced ECC errors. I have also experienced random reboots on PC hardware which could be explained by memory corruption.

Have you experienced random reboots?

I have also debugged crash dumps which just could not happen except for CPU bugs or memory corruption.

And it’s been CPU bugs some of the time. Or compiler bugs. They’re also fun.

But the point is, without ECC you never know if the RAM may have been the problem any time anything goes wrong. With ECC you generally can exclude RAM as the issue.

The once a decade 1 bit memory errors are cute. The ones that start happening a lot are just failures. Often cured with a reseat/cleaning.

But you wouldn’t know without the IPMI log.

Do you need ECC at home? I wouldn’t say you need it, but it does help exclude a class of issue, and with it you basically have zero chance of ZFS failure except for a ZFS bug or disk failure, and no chance of bitrot.

dan · June 21, 2024, 2:25pm

Is it needed? Certainly not in an absolute, existential sense. Plenty of home servers do just fine without it.

At the same time, if you go to Dell, HPE, Lenovo, etc., every server they’re going to sell you is going to include ECC RAM. If there weren’t a benefit, I’m sure they’d be glad to cut 10% off the BOM cost of their systems–but they don’t do that. Clearly the market sees enough of a benefit to ECC RAM that nobody sells anything else. If it’s important enough to be universal in that space, why shouldn’t we at least want it at home?

The obvious answer is that business data means money, and home data usually doesn’t. The counter to that is (1) that business data is also more likely to be backed up, and (2) some home data (family photos, home videos, etc.) just can’t be replaced.

Like Stux, I’ve experienced memory errors–caught by ECC, and logged via IPMI. I’ve probably experienced memory errors on other systems that didn’t use ECC, and didn’t know about it. I’ll use ECC in my servers when possible.

ericloewe · June 21, 2024, 2:57pm

One thing that bugs me - not to put words in your mouth, just something that came to my mind - is that some people act like ECC is a crazy box to tick, when it really isn’t in most cases.

Sure, ECC SO-DIMMs are a massive pain, but they’re rare outside of niche platforms. The only somewhat-mainstream platform that uses them is the Supermicro A1SAi/A1SRi line. ECC UDIMMs are easy to acquire, and ECC RDIMMs are ubiquitous and hard to miss.

The systems that would take ECC DIMMs are - and this might shock some readers - systems made for use as servers.

So, to end up without ECC, one needs to buy into a decidedly non-server system (this is worthy of its own discussion some other time) or cheap out and cut a few bucks off the price of memory by going with non-ECC.

So yeah, I find it really annoying when people complain endlessly about ECC. Shoestring budget, hardware on hand, don’t want to commit before trying things out and kicking the tires? You won’t find many people giving you a hard time. Advising others that ECC is unnecessary because $REASONS? Don’t be surprised to be challenged on that one.

dan · June 21, 2024, 3:24pm

It’s probably worth acknowledging that there’s been some FUD in the past on this subject (see, e.g., the infamous scrub of death). And memories are long, and some of us probably still tend to put a bit more emphasis on ECC than it really needs. But really, it’s a good feature to have regardless.

Constantin · June 21, 2024, 3:25pm

… especially in the context of spending megabucks on the motherboard, hard drives, NVME, and so on but then wanting to cheap out on memory.

If there is a budget constraint, buy older used equipment that is still performant enough and use the saved money for higher priorities.

Used SM boxes can be ridiculously inexpensive and feature everything but the hard drives. They’ll likely run for another decade with maybe the need for a new PSU, the odd fan, etc.

somethingweird · June 21, 2024, 4:16pm

No problem. Like you said, everyone has their reasons for ECC or NON-ECC… and is limited to what hardware is available @ home.

Fleshmauler · June 21, 2024, 7:31pm

My $0.02 is that it depends on your budget & risk tolerance. I had a long essay written, but then realized it simply circled around this 1 point.

I can’t find the actual chances of ECC doing corrections or non-ECC flipping a bit & corrupting data (which is something I’d have expected to be highly studied & easily quotable), but it helps me feel like my data is safer & that was worth the few $ for me.

etorix · June 21, 2024, 7:51pm

One of my TrueNAS systems had one a few months ago. No crash, no reported data corruption, just a notification in the GUI.
And, of course, a reminder that memory errors do happen.

My pragmatic answer to the thread question:
The essential functions of a NAS are storage and networking—so no compromises here.
If one wants to repurpose an old desktop as a NAS, and/or try TrueNAS, there should be no compromise on available NIC (no Realtek!) and no compromise on drive controllers (chipset and/or HBA, no fancy “PCIe SATA card”!). But non-ECC is acceptable.
If one is buying a motherboard/CPU to build a NAS (anything more substantial than a refurbished HBA and/or server NIC to “rectify” a desktop motherboard as above…), there is no reason not to go by the textbook. ECC all the way!

oxyde · June 21, 2024, 8:00pm

I admit that my personal experience with servers started only some months ago, when I decided to build my home NAS.
I started with, and I'm still using, a non-ECC system, not to save money on the RAM itself but due to the high price of a new mainboard (300~400€) and the difficulty of finding something used/refurbished in a short time.
When I find a good board that is not sold from who-knows-where on eBay, I will be happy to upgrade… But until that moment I pray and backup, backup and backup

volts · June 21, 2024, 8:03pm

Even regular DDR5 is using ECC invisibly/internally/on-chip.

A few years ago the Puget folks said they saw lower total failure rates on ECC parts. I wonder if this is one of those cases where buying ECC is also just buying a better-made part.

Anecdata: Apple is no longer selling any systems with ECC.

pmh · June 21, 2024, 8:03pm

I have one legacy server in the data centre that logs a couple of correctable ECC errors per day. It will be replaced soon enough but I am very grateful to have a system that (1) chugs along and (2) tells me there are memory/chipset/CPU problems.

It’s like ZFS or Apple’s latest creation, APFS. APFS checksums metadata but no user data.

If ZFS ever is unable to keep your data intact, it tells you which files exactly are damaged beyond repair.
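To illustrate the general idea of "heal what you can, name what you can't", here is a toy model in Python. It is not how ZFS is implemented; the two-way mirror, the block IDs, the file names and the helpers `digest` and `scrub` are all made up for the example. Each block has a stored checksum: if at least one mirror copy matches, the bad copy is rewritten from the good one, and if no copy matches, the owning file is reported as damaged.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def scrub(files: dict[str, list[str]], checksums: dict[str, str],
          mirrors: list[dict[str, bytes]]) -> tuple[list[str], list[str]]:
    """Return (healed block IDs, damaged file names) for a toy mirror."""
    healed, damaged = [], []
    for name, block_ids in files.items():
        for block_id in block_ids:
            # Look for any copy whose contents still match the stored checksum.
            good = next(
                (m[block_id] for m in mirrors
                 if digest(m[block_id]) == checksums[block_id]),
                None,
            )
            if good is None:
                # No valid copy left: report the file by name.
                if name not in damaged:
                    damaged.append(name)
                continue
            for m in mirrors:
                if m[block_id] != good:
                    m[block_id] = good  # rewrite the bad copy from the good one
                    healed.append(block_id)
    return healed, damaged


# Hypothetical data set: two files spread over three blocks.
blocks = {"a0": b"photo part 1", "a1": b"photo part 2", "b0": b"tax return"}
files = {"photo.jpg": ["a0", "a1"], "taxes.pdf": ["b0"]}
checksums = {block_id: digest(data) for block_id, data in blocks.items()}
mirrors = [dict(blocks), dict(blocks)]

mirrors[0]["a1"] = b"photo part 3"   # one bad copy: repairable
mirrors[0]["b0"] = b"tax retorn"     # both copies bad: not repairable
mirrors[1]["b0"] = b"tax returm"

healed, damaged = scrub(files, checksums, mirrors)
print(healed)   # ['a1']
print(damaged)  # ['taxes.pdf']
```

A filesystem that checksums only metadata has nothing to compare user data against, so it cannot produce the second list.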

Constantin · June 21, 2024, 8:04pm

Or put another way: we have all gone through the trials and tribulations associated with ZFS for a reason: we want to avoid bit rot in our data, especially the silent variety, since that kind of stuff will quietly spread into your backups, followed by files potentially becoming unrepairable unless you have hard copies (i.e. LTO, Blu-ray, stone tablets, whatever) that go back years.

A simple backup server already exists in the form of ReadyNAS, QNAP, Synology, etc. They serve a purpose, and they serve it well. But they won’t protect against bit rot, and they likely do not incorporate snapshots as well as ZFS does either.

Does ZFS protect against everything? Of course not, that’s why we also advocate for off-site, air-gapped backups, among other best practices. ZFS is but a foundational building stone in a larger system, if data integrity and longevity are your goals.

etorix · June 21, 2024, 8:18pm

That is an enhanced form of error correction, as already performed with previous versions, because DDR5 is pushing the envelope very far. This is NOT “ECC” as understood here; for what we discuss there are dedicated DDR5 ECC modules.
Regular DDR5 is not “more secure” than DDR4 due to its on-die ECC, nor is it a substitute for workstation/server ECC. DDR5 without on-die ECC simply would not work acceptably.

Prices in euros? Look in this thread.

kiriak · June 21, 2024, 8:27pm

I was wavering between an ECC and a non-ECC build for my home TN when this story came out:

Linus Torvalds’s faulty memory slows kernel development

It convinced me I should go the ECC way for my TN NAS, as it is the source of my precious data.

The problem, for home users, is the very limited choice of hardware supporting ECC, especially for small boxes.

Stux · June 21, 2024, 8:53pm

And since it’s Error Correcting Code, and can not only detect, but correct 1 bit errors, this is the expected result.
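For anyone curious how a code can correct a flipped bit rather than just detect it, here is a minimal Python sketch using the classic Hamming(7,4) code. It is only an illustration of the principle: ECC DIMMs typically use a wider code (commonly 8 check bits per 64 data bits) that also detects double-bit errors, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 unused
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4,5,6,7
    return code[1:]


def decode(bits: list[int]) -> tuple[int, int]:
    """Return (corrected nibble, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos        # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1        # the syndrome is the position of the flipped bit
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


word = encode(0b1011)
word[4] ^= 1                       # flip the bit at position 5 (list index 4)
print(decode(word))                # (11, 5): data recovered, error located
```

The non-zero syndrome is the part that ends up in a log: the data is still correct, but the system knows a correction took place and where.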

Stux · June 21, 2024, 8:59pm

Interesting it took Torvalds that long to learn that lesson!

It’s basically what I was referring to above (not Linus’ experience, mine)

I blame Intel for treating an essential reliability feature as a market segmentation point.