The usefulness of ECC (if we can't assess it's working)?

I am wondering what is the current stance on this years later?
the-usefulness-of-ecc-if-we-cant-assess-its-working.83580/

?

I found it. I thought you were trying to link to a thread on these forums. You meant to link to the old forums.

Here is the link.

I’m on a 5900X - so ECC is ‘supported’, but only with OS-level reporting, and UDIMMs not RDIMMs… As such it is hard to guarantee that everything is working, but over the ~3-5 years (I forget) I’ve had this system, I’ve seen about 5 corrected-ECC log entries - so at least, allegedly, it is doing more than nothing.
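For anyone else wanting to check this on Linux: the kernel’s EDAC subsystem exposes per-memory-controller corrected (`ce_count`) and uncorrected (`ue_count`) event counters in sysfs. A rough sketch for reading them - paths vary by kernel and platform, and are simply absent when ECC reporting isn’t active, so treat this as an illustration rather than a guaranteed recipe:

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)}, or {} if EDAC is absent."""
    counts = {}
    for mc in sorted(Path(root).glob("mc*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # controller without readable counters
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (ECC reporting inactive?)")
    for mc, (ce, ue) in counts.items():
        print(f"{mc}: corrected={ce} uncorrected={ue}")
```

A slowly climbing `ce_count` is exactly the “allegedly doing more than nothing” signal: errors are happening, and they are being corrected.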

This helps me sleep better. I’ve also noticed a much higher number of ‘help, I’ve lost my data’ threads where the OP’s system did not have ECC; so this further feeds my confirmation bias towards having some level of ECC.

5 Likes

I have been running TrueNAS systems for years and just yesterday I received a warning from one of my larger TrueNAS CORE machines about two correctable ECC events.

I really do not get what this discussion is about. You buy a server barebone, mainboard, or complete system, whatever fits your budget, that supports ECC. You buy ECC memory. You put the ECC memory in the server mainboard.

Congrats, you now have a system with ECC.

If we cannot trust vendor server specs anymore, we are doomed, anyway. I buy Supermicro. I can confirm by alerts I receive at the OS level that ECC is working.

Seriously, what is it you expect that my approach does not deliver?

Kind regards,
Patrick

5 Likes

One of the founders of ZFS once stated that ZFS doesn’t require ECC memory any more than other file systems. If you have sufficient funds, using ECC memory is clearly a better option, as it can at least help mitigate the very low risk of system failures. In any case, multiple data backups remain the most effective way to protect data security.

3 Likes

Those are indeed the key words - which in my very biased opinion translates to “use ECC on file systems with data you care about”.

3 Likes

Let me quibble a little on that one, it is not meant as a criticism.

One of the beautiful aspects of ZFS is that it “paranoidly” checksums just about everything. That in turn helps ensure that whatever is written to disk either matches its checksum or is marked as potentially bad.

But bit flips in RAM can alter a file before it hits the disk. If the alteration happens before the checksum is calculated, your NAS might write a corrupt file to disk whose checksum nonetheless matches. Every replication of that bad file then carries it further downstream, polluting your backups.

That is the reason to use ECC RAM, i.e. the realization that RAM can be affected by cosmic rays and similar events. Hence it’s a good idea to pay the minuscule upcharge and invest in “server grade” gear to get good power, ECC RAM, and similar niceties, to maximize the probability that nothing bad happens.
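The failure mode above is easy to demonstrate in a few lines. This is a toy illustration, not ZFS internals (the record layout and function names are made up): if a bit flip lands before the checksum is computed, the stored record verifies cleanly forever.

```python
import hashlib

def write_record(data: bytes) -> dict:
    """Checksum-then-store, as a content-checksumming filesystem would."""
    return {"data": data, "sum": hashlib.sha256(data).hexdigest()}

def verify(record: dict) -> bool:
    return hashlib.sha256(record["data"]).hexdigest() == record["sum"]

good = b"important document"
flipped = bytes([good[0] ^ 0x01]) + good[1:]  # single-bit flip in RAM

on_disk = write_record(flipped)   # corruption happened pre-checksum
print(verify(on_disk))            # True: the corrupt data "verifies" cleanly
print(on_disk["data"] == good)    # False: it is still the wrong data
```

No later scrub can flag such a record, and every replication carries it along - which is why the fix has to happen in RAM, i.e. ECC.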

Which also explains why most of us chose ZFS in the first place - instead of Bcache, BTRFS, or like file systems that may offer far faster performance but few to none of the belts and suspenders re: file integrity.

5 Likes

amen brother. i could not have stated it better.

and also the running gag of ‘just buy server grade stuff and stop wanting to know’

is…

…

hmm, I should refrain from words but you can fill in the blanks.

anything can stop working over time, including ECC detection and correction. I’d like to know when that is about to happen, but I will put my hopes in RISC-V. Hopefully someone will implement it there, or I will do it myself if I am not alone.

You DO NOT need it!

First, properly made cases are FARADAY CAGES, correct!?

That means that the chassis is grounded and NO EMF can get to the inside!

After that, you don’t need to know much more.

Yes, ECC memory would detect, correct and warn about bad memory, but non-ECC would either warn as well, or simply crash, and that’s where a good hardware person or a clone system comes in handy.

All modern operating systems have internal protection against memory issues, and ZFS has checksums as well, so worry more about a proper UPS and shutdown after power loss!

Just go see/read what the ECC “fanboys” say in their defense: NEVER anything specific, proven, or even a case where they can truly show “how a bit flips in RAM”.

That just shows a person who has no idea how modern operating systems have internal checksums and would detect the fault and crash the application…

But those ignorant, non-EE people would have zero comeback when you explain to them how a Faraday cage works…

hahah, that only protects against outside factors once the setup is in your control.

what about the time from assembly line to your hands?

i rest my case.

we either get the ability to assess it is working or it is just a religion.

Something that was not mentioned yet is troubleshooting.

I don’t know how realistic an uncorrectable bit flip is, so I am not sure if ECC gains you much when it comes to data security. Especially considering that you probably also work with said data on other systems that do not run ECC. A few years ago, almost all HP, Lenovo and Dell workstations for CAD applications came with ECC. Those days are unfortunately long gone, so I doubt that more than 1% use ECC on clients.

To get back to my point: on the rare occasion a PC or server starts to act up, you are always wondering if it is the PSU, the mainboard, the CPU, the RAM or the storage that is causing problems.
Having ECC memory spares you from running Memtest86 for 8 hours.
So yeah, it makes troubleshooting a little bit easier.

And since it does not really cost more, the question is why not?
Sure, the mainboards often cost more, but you not only gain ECC, you also gain a rock-solid Supermicro platform and IPMI.

1 Like

Allow me to quibble re: computer cases being effective Faraday cages. For certain wavelengths that may be true, but it will depend a lot on the construction of the case. I.e.:

  • Is it metal?
  • If it’s metal, has it been grounded properly?
  • Does it have fan inlets/outlets, plastic covers for hot-swap bays, glass windows to admire your build, etc. that are electromagnetically transparent?
  • Etc.

Most cases are not in fact perfect Faraday cages. Never mind all the connections going in and out of the case allowing the transmission of waves into the case.

Have you ever seen the beryllium copper seals on the hatches to the command center of a navy ship? They mean business for a reason. What frequencies can be kept out depends a lot on the hole size, intensity of the radiation, speed of the object, etc.

Cosmic rays need several inches of lead shielding to be kept out. Neutrinos caused by cosmic rays are detectable miles underground. Even the Cray-1, with its much older, larger modules, experienced bit flips.

Detecting bit flips came right out of the above issues. Why not take advantage of hardware that can detect bit flips and correct them if the marginal cost is close to zero? Just because a bean counter like Tim Cook decides for us that we are not to have ECC RAM in Macs doesn’t mean that it isn’t a good idea.

2 Likes

Anyone ever wonder why enterprise/datacentre grade equipment frequently features these? :wink:

3 Likes

I don’t have any data to offer, just my anecdotes.

I’ve been building PCs for over two decades, since I was 16. 15 years ago, I started building a “server” that would run 24/7 using spare parts from all the gaming rigs I’ve accumulated whenever I upgrade. There was always some issue: unexplained momentary timeouts/freezes/uncommanded reboots, you name it, I’ve seen it all, regardless of the brand of motherboard/RAM.

When I finally decided I’d about had it and built one with Supermicro and registered DIMMs, all of my unexplained random freezes/hangs disappeared. All of a sudden, I no longer have to spend countless hours chasing a wild goose. I no longer have to cross my fingers and pray that my server won’t freeze and become inaccessible while I’m on a trip for a few weeks, hundreds of miles away without physical access.

I can’t tell you 100% sure that it’s because of ECC, but it sure as hell makes me sleep better and saves me lots of hours of stress and time wasted. I would still buy non-ECC systems for my gaming machine that doesn’t need to be up 24/7, but for my servers, it’s non-negotiable. All this plus the fact that the machine is used for a decade or more just makes the little bit of extra cost for the components a no-brainer to me.

Maybe when I was young that mattered more, but these days my time is worth far more than a few hundred in savings. Actually, now that I think of it, I kinda saved money, because a lot of the stuff I buy ends up being used decommissioned enterprise gear anyway.

3 Likes

This is a misconception. Bit flips occur either in the data or in the metadata (checksum). The only consequence is a mismatch between the data and the checksum.

ZFS performs checksum checks on all data reads. If a bit flip affects only the original data or the redundant data, ZFS automatically fixes it and logs an error. If a bit flip affects both the original data and the redundant data, ZFS logs an uncorrectable error and does not corrupt subsequent data.

For most people in the community, the chances of an unexpected power outage, misconfiguration, hard drive failure, overheating, or even a cat chewing through a cable are greater than a bit flip. Until these issues are addressed through a healthy operating environment and professional maintenance, there’s no point in worrying about ECC memory.

My guess about ZFS’s storage process:

  • ZFS stores data in memory unencrypted and uncompressed, with a base size of blocksize. Data smaller than blocksize is automatically padded with zeros up to blocksize, and the checksum is calculated based on this data.
  • Data is encrypted and compressed when written to disk, and data is automatically distributed according to ashift. Therefore, the actual size of data stored on disk is typically less than the blocksize, and any zero-padding will obviously be compressed.
  • Data is not checksummed after writing until the next read or scrub.
  • After reading, the data is decompressed and decrypted, and a checksum is calculated in memory and compared with the checksum recorded in the metadata.
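The verify-on-read and self-heal behavior described in this post can be sketched as a toy model. This is conceptual only - the two-copy mirror and all names are my own simplification, not ZFS code: every read is checksummed, a mismatching copy is repaired from a good one and logged, and if no copy matches, an uncorrectable error is reported instead of returning bad data.

```python
import hashlib

def csum(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

class MirroredBlock:
    """A block stored as two copies with one expected checksum."""
    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = csum(data)
        self.errors = []

    def read(self) -> bytes:
        for i, copy in enumerate(self.copies):
            if csum(bytes(copy)) == self.expected:
                # heal any sibling copy that mismatches, using this good one
                for j, other in enumerate(self.copies):
                    if j != i and csum(bytes(other)) != self.expected:
                        self.copies[j][:] = copy
                        self.errors.append(f"corrected copy {j}")
                return bytes(copy)
        self.errors.append("uncorrectable")
        raise IOError("checksum error on all copies")

blk = MirroredBlock(b"payload")
blk.copies[0][0] ^= 0x01           # simulate a flipped bit on one disk
print(blk.read())                  # b'payload': the good copy wins
print(blk.errors)                  # ['corrected copy 0']: bad copy healed
```

Note what this model cannot do: if `data` was already corrupted in RAM before the checksum was computed, both copies verify happily - which is the ECC argument made earlier in the thread.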
1 Like

Your summary neglects to mention that data is also stored in the ARC for varying amounts of time. The ARC resides in RAM.

Faulty or compromised RAM can absolutely change data passing through it in ways that checksumming on data at rest does not catch.

3 Likes

How would you know when you don’t know what the OP is using the hardware for?

No? I guess it depends on how you (specifically) define “properly made cases”. Depending on what they are made of, they might reduce RF somewhat, but getting to the level of being an actual Faraday cage? Nope, nuh uh. For one, a Faraday cage consists of a mesh, and the hole size and spacing matter. There are computer cases that are Faraday cages, but not all computer cases are Faraday cages.

Faulty memory doesn’t necessarily lead to crashes. It all depends on what is passing through the faulty parts of the memory. It’s completely feasible that the OS manages to stay up while non-OS data is silently corrupted. Pool corruption is also a possibility; your “simple crash” could drop you back into a system where a pool is suddenly not accessible anymore.

Can you elaborate on what internal protection you’re thinking of here? ZFS checksumming will not help if data is corrupted before it’s put on disk and checksummed. You say “All modern operating systems”, so presumably this will be easy for you. Before you answer, your example of a program crashing because of a memory fault is not the only way memory failures can impact the bits stored on your system.

Instead of this partisan anti-ECC stance I would recommend a more pragmatic approach, but you do you.

How much sense ECC makes depends on what you use the system for and how you handle backups of any data touching the server. Consider the fault scenarios and adapt your server solution to what best fits your redundancy/data integrity need.

4 Likes

See Figure 1

Anything less would be just asking for trouble.

4 Likes

First off the bat. I want ECC and I (religiously) believe I need it.

I am the OP.
I had difficulties getting access back to my original account, so I made a new one to also state my feelings years later.

I was the one that actually made it possible to assess if ECC is working by means of physical probing.

@Mastakilla was the one that made it possible to assess if ECC is working by his relentless efforts of overclocking.

Now here comes the issue. This was all in the context of the AMD Ryzen Zen 2 era.
And both methods have serious drawbacks and might even no longer be viable as architecture keeps evolving.

So this is why I have this passionate stance on this all.

I just don’t like it when people either discard the usefulness of ECC or state ‘just buy server grade hardware and all is bliss’.

If we can’t know it works then the rest is religion and means nothing.

1 Like

I want ECC, even on my desktop, there is absolutely no reason it shouldn’t be standard aside from artificial market segmentation nonsense.

but I do not want to (and have refused to, when it’s a choice) pay the ridiculous premiums that ECC UDIMMs cost. Let me be perfectly clear: ECC UDIMMs just have 9 chips instead of 8 per rank… they shouldn’t cost more than 11% extra… so I don’t have ECC on my desktop.

ECC tends to just be a sticking point when someone is planning to build a home server and if someone asks me for recommendations… it’s always build-up, not build-down. You are always better off starting with some almost-free clunker system so you can get a feel for the experience and learn your hardware needs, before spending a good chunk of money on hardware.

1 Like