Of course.
I do not know what the error was, nor whether the same error in a non-ECC system would have caused a crash and/or data corruption. Possibly not.
All that I know is that a non-ECC system would not have reported the event, while the ECC system dutifully reported the non-event.
Meme time:
(Better to pump and have nothing come of it than to not pump and risk that something bad happens.)
If you are going to quote me, at least add the qualification I put in ;)
As for home, I hope my next desktop and media server can use ECC RAM. Those ASRock DeskMeet X600 small form factor AMD PCs look good, and ECC is clearly stated as being supported (if you select a CPU & RAM with ECC support).
On the subject of built-in ECC with DDR5, the correct abbreviation is OD-ECC: On-Die ECC.
Back to a comment about small desktop PCs with ECC. As I have written above, ASRock’s DeskMeet X600 supports DDR5 with real ECC.
The Intel version of DeskMeet does not support real ECC… probably because of the perception that there is a more limited selection of lower-end Intel CPUs with ECC support. That changed in the last few generations, though I am not following how many and what configurations are available.
Here’s the experience from another community that uses servers, albeit for a completely different use case (globally distributed databases).
They build almost exclusively with NUCs.
Not a month goes by without someone having an issue that smells like a memory problem - and then, if they take the advice given, they run memtest86+ in a continuous loop for 5 days to see whether it really is one.
With a large enough sample size, consumer RAM will fail, and the symptoms are likely to be mystifying to the user.
ECC just makes it so you know whether your RAM is going bad, and can get an alert email, rather than needing to guess.
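On Linux, the kernel’s EDAC subsystem is what makes that possible: corrected and uncorrected error counts show up in sysfs, and a few lines of script can turn them into alerts. A minimal sketch, assuming an EDAC driver is loaded for your memory controller; the print is a stand-in for whatever mail/alerting you already use:

```python
# Poll Linux EDAC counters for ECC memory errors (these sysfs paths are
# standard on systems with an EDAC driver loaded).
import glob

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
    with open(f"{mc}/ce_count") as f:
        ce = int(f.read())   # corrected errors: ECC fixed them silently
    with open(f"{mc}/ue_count") as f:
        ue = int(f.read())   # uncorrected errors: detected, not fixable
    if ce or ue:
        # Placeholder: swap in an email or push notification here.
        print(f"ALERT {mc}: {ce} corrected, {ue} uncorrected ECC errors")
```

A rising ce_count is exactly the early warning a non-ECC system never gives you.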
Having previously spent weeks tracking down mystery errors that turned out to be consumer RAM, I am happy to pay a little more.
Now to find a gaming board that does ECC for my home PC - alas, ASRock stopped supporting ECC in their DDR5 Ryzen gaming boards.
IMHO, for the average user who does not have multiple backups of their data but trusts their NAS to hold everything, ECC is a must.
Professionals in the field have stated multiple times that they want ECC, so ECC is a must there too.
What if you have multiple backups of your data? Well, then you have spent a lot of money on those backups… it doesn’t make sense to save a bit on the ECC hardware and keep an annoying weakness.
When can you not use ECC then? Maybe when you don’t care about your data… and that prompts the usual follow-up question “Why do you need ZFS then?”: it’s a debate more ancient than @joeschmuck, likely preceding the venerable jgreco… and it ends in the same old way.
My two cents… I’ve read about bit flips happening more often the closer you get to the Earth’s poles, due to solar radiation. I don’t know how factual that is, but I can see it being the case; I’m gullible at times. Regardless, there was a reason ECC was created. Could it have been over-thinking engineers at IBM working on a military contract that said “Zero Data Errors”, with this as the solution? I could see that for the Polaris fire control computers, which used ferrite core memory, very vulnerable. It’s very important the data is correct with a payload like that. For me it is worth the small extra price of ECC hardware to do our best to ensure data integrity, and I’m just a home system, not a corporation. Personally, I still value my financial documents and home photos being safe, not to mention that backup of my computer for easy restoration.
Fun story: A workstation at work developed a single-bit failure in a DIMM. We’re talking business desktop hardware, so no ECC.
Things kept failing, and the metaphorical straw was when a deliverable showed up at the customer with a bit flip. It was a .tgz, so naturally the whole thing was unusable.
When the issue was identified, I happened to be on-site, so I immediately pulled the workstation from service and ran memtest. I figured it would be a few hours before hitting something; it turned out to take only about 15 minutes to hit the bad bit.
Not fun story: Imagine this with your precious memories at home. Not fun.
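If you want to see why a single flipped bit kills a whole compressed archive, here’s a quick sketch (plain gzip rather than a full .tgz, but the failure mode is the same: deflate’s bit-level encoding plus the trailing CRC leaves no way to salvage the rest):

```python
# Flip one bit in a gzip stream and watch decompression fail outright.
import gzip

payload = b"deliverable contents " * 1000
blob = bytearray(gzip.compress(payload))
blob[len(blob) // 2] ^= 0x01          # a single bit flip mid-stream

try:
    gzip.decompress(bytes(blob))
except Exception as e:                # zlib.error or gzip.BadGzipFile
    print("archive unusable:", e)
```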
The silly thing about business PCs is that in some cases they are as critical as business servers. I mean, if everything goes through them, imagine the chaos in the financial or medical field if, (well, really when), a business PC has an unknown memory fault causing lost revenue. Like >$100,000 of revenue!
At work we are encouraged to save important files on the network shares, presumably a real server. Like NetApp or such, which I would not doubt uses ECC memory. But, until we “save” the file, we are risking data loss or corruption by using non-ECC memory. (It’s not that much of a risk, but hey, tell someone it can cost 1/10 of a million dollars and they perk right up!)
Side note: I’ve worked as a Unix SysAdmin at places where they occasionally had Unix desktops serving up network shares. An inventory of the data centers showed perhaps 5,000 to 10,000 servers, (it was a big company). But, due to these rogue servers, the real number was likely another 5,000 higher. Just because the work group could not get approval for a network share, (someone has to pay for it).
Imagine that today those “rogue Unix desktops serving up network shares” are now “rogue MS-Windows PCs serving up network shares”. And as has been noted above, (my post and @ericloewe’s), simple business desktops tend not to have ECC RAM. (We won’t mention backups, cleanup and security remediation; that would cost way too much. At least until it costs the company more because of the lack of…)
During the time I’ve monitored for RAM errors, in a very unscientific manner, I’ve only seen one on my personal servers in the last 20 years. Again, a small sample size and comparatively smaller data sets than others.
In my opinion, you’re more likely to experience data corruption or loss from unexpected power failures, software crashes or even just plain old human error (whether yours or from a third party), and not so much from bit flips.
I use ECC in my home server simply for peace of mind and also because there wasn’t a huge premium attached to it for the platform I chose. My offsite TrueNAS backup doesn’t use ECC RAM however, and I’m not losing sleep over that.
There’s always that one team in the whole company (in this case, it’s mine) that actually needs on-prem mass storage because we store multiple terabytes of photography and video footage, and might need to access files that are 5+ years old at a moment’s notice.
We got told everything was being offloaded to Sharepoint/OneDrive. Our department was the only exception because local storage for us (in combination with multiple redundant backup systems offsite) was cheaper, and much more efficient/performant.
Not to mention the god-awful performance all of these cloud solutions offer.
15 minutes to start up the index is probably not their fault.
Dropbox moaning that you shouldn’t use more than 500k files: probably also not.
But the 100 Mbit/s bandwidth their servers offer feels like we are back in the early 2000s. That’s about 12.5 MB/s, so pulling down a 100 GB project takes well over two hours.
Hah. Our CTO got some sort of award for removing all local storage and going cloud only. If only it worked as great as advertised… which is why we now have a Teams group just for one shared file, where we clear with each other before opening/modifying said file to ensure only one person is modifying it at a time.
Otherwise, we frequently get un-clearable merge errors that cannot be undone unless you roll back to an earlier version of the file and start over. Naturally, MS swears up and down that their amazing system should allow multiple users to modify said file at the same time, as long as it’s hosted on OneDrive. It usually works in Word… but the Excel version simply isn’t fully baked yet.
The latency involved with remote-hosted content is really disappointing, especially when one regularly works with large files and a slow pipe to same.
Yes, that happens with other file systems, including BTRFS.
However, ZFS was specifically designed and tested so that an unexpected power failure cannot leave the on-disk state inconsistent. That’s Copy On Write (COW) in action: live blocks are never overwritten in place, so the pool always steps atomically from one consistent state to the next.
Of course:
Hardware can and does fail during unexpected power failures.
Any data in flight can be lost, just as with any other file system.
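To illustrate the principle, here’s a minimal sketch of the copy-on-write idea at the file level. It’s hypothetical and greatly simplified; ZFS does this with block pointers and transaction groups rather than temp files, but the crash-safety argument is the same:

```python
# Copy-on-write style update: write the new data elsewhere, flush it to
# disk, then atomically flip the "pointer". A crash at any point leaves
# either the complete old contents or the complete new contents.
import os

def cow_update(path: str, new_data: bytes) -> None:
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(new_data)        # 1. write the new copy, not in place
        f.flush()
        os.fsync(f.fileno())     # 2. make sure it has really hit disk
    os.replace(tmp, path)        # 3. atomic rename = pointer flip

cow_update("important.dat", b"new consistent state")
```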
I have two AM4 DDR4 systems with ECC UDIMMs (one with TrueNAS, the other with Proxmox)… I suspect the bit error rate differs across RAM types as well. It’s IMO fairly high on my systems (errors every few weeks; 120 errors in 2 years). No real need to fix/replace anything though, because it’s still relatively rare and all correctable.
If this were happening without ECC though… I’d probably have a corrupted system.
I imagine the ECC error rate on higher-end systems would be lower due to fewer bus errors, thanks to lower transmission-line loading (RDIMMs and LRDIMMs).
Levels of risk … @home or not … just count the “disaster postings” …
0. no working backup (fundamental error … dom0, so to say)
1. user error (the nice expression for “dumb stupidity” … I can see it in the mirror)
2. power failure (including PSU, cabling etc.)
3. gross negligence (like never changing disks in a 24-wide Z2 vdev with 4 hot spares)
4. sub-standard implementation and/or hardware (like triple daisy-chained RGB-lit USB expanders with a bunch of SMR drives that got “just enough” uncorrectable sectors, with an NVMe-to-USB 3.2 special vdev for dedup)
5. tampering … just out of curiosity (the difference to point 1 is you knew better … “informed stupidity”)
I would say that if you get these things right - by sheer discipline, it’s not magic - then you can think about ECC. (The longer the system is up, the more important it could become.)