before I will die and leave this earth.
I state once again the title of my thread.
Take a step back. With the multiple posts one after the other and that last post, it's getting kind of concerning.
Might be true, so I will apologize.
Is my point getting across, though? Or, years later, is it again just being swept under the rug as something concerning?
I think it's possible to validate the effectiveness of ECC RAM by running something like memtest86+. With ECC, even when an error is reported (and corrected), memtest never shows any errors. Without ECC, memtest will show errors with bad memory. Essentially, on a server-grade system (like Supermicro + ECC memory), memtest should never show memory errors. If ECC cannot correct a memory error, the system should halt immediately.
i love that you try. i respect that.
but we can't run memtest on a running system, can we? If so, I stand corrected.
no really.
this is just silly dodging the point.
My stance is that if we don't know ECC is functional, we might as well assume it is not.
No, memtest cannot be run on a running system, so if a system is reporting ECC errors, it's recommended to migrate all workloads to another server and then run memtest on the affected server. Usually it doesn't take that long to reproduce an error (in IPMI) and then confirm the error doesn't come back after the memory module has been replaced. Sometimes we'll let the server run memtest for a week+ to confirm no more errors after replacing a faulty module.
In one case, I replaced a module 3 or 4 times in the same slot which had recurring errors in IPMI (like once every 2-3 months). Replacing the motherboard finally fixed that issue.
In every case that errors were seen in IPMI, they were corrected and the workloads never crashed.
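Side note for anyone who wants to watch those corrected errors from the OS rather than from IPMI: on Linux, the EDAC subsystem exposes per-memory-controller counters in sysfs (`ce_count` for corrected errors, `ue_count` for uncorrected ones). Below is a minimal sketch, assuming a Linux host with an EDAC driver loaded for the platform. The function name `read_edac_counts` is made up for illustration. If the directory is missing, that only means EDAC is not reporting, not that ECC is absent.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location (assumption: an EDAC driver is loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs.

    Hypothetical helper for illustration; returns an empty dict when no
    memory controllers are exposed under `root`.
    """
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that slowly climbs on one controller matches the "recurring errors every few months" pattern described above. It still says nothing about which component (module, slot, board) is at fault.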
I am no longer sure you and I are posting at cross purposes.
I think you and I believe the same thing.
But we both want to have the last word…
So I will bow down. You and I are more similar than we differ.
Actually, it is possible to run memtest on a running server. If the server is a hypervisor (like Proxmox), memtest can be run as a virtual machine. The same holds true: memtest should never report memory errors on a server-grade system with ECC memory (either running as a virtual machine or not).
please do know I love ECC
Everything has been said, just not by everyone yet.
@stilluncool You have not yet answered my multiple questions boiling down to:
In which way is ECC different from all the other error correcting going on everywhere in the system? Which makes you want to personally assess the former but still blindly (?) accept the latter?
E.g. the bits as stored on a magnetic platter are known to be full of errors. That's why a heck of a lot of error correction encoding and decoding is going on in any modern hard drive.
You are not questioning that the hard drive will almost always return the data you wrote and that for the few cases when it doesn't, ZFS is designed to catch that.
How is memory different?
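To make "error correction" concrete for readers following along, here is a toy sketch of the single-error-correct, double-error-detect (SECDED) idea that ECC memory is commonly described as using. It is a Hamming(7,4) code plus one overall parity bit, protecting 4 data bits with 4 check bits. Real ECC DIMMs typically protect 64-bit words with 8 check bits, and the function names `encode` and `decode` are made up for illustration, so treat this as a teaching model, not as how any particular memory controller is implemented.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) into an 8-bit SECDED codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None when uncorrectable."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the flipped position after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Odd number of flips: assume one, fix it (syndrome 0 means
        # the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Syndrome set but overall parity fine: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # one flipped bit
    print(decode(word))          # (11, 'corrected')
    word[2] ^= 1                 # a second flipped bit
    print(decode(word))          # (None, 'uncorrectable')
```

The two outcomes map onto what was described earlier in the thread: a single flipped bit is silently fixed and only shows up in a log, while a double flip is detected but cannot be repaired, which is the case where the system should halt.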
This is the reason why AMD doesn't officially claim ECC on their Ryzen stuff because they can't guarantee the motherboard will also have it, while Intel can.
How can Intel make this guarantee? Intel doesn't manufacture all motherboards that support Intel CPUs or chipsets, just like AMD. All either of them can do is support ECC in the CPU and chipset. Motherboard makers are free to implement it correctly, or to screw it up.
While I don't agree with his style, I see where @stilluncool is coming from.
@stilluncool and @pmh can be right at the same time.
I don't trust an Asus B850 motherboard to have working ECC (to be fair, neither AMD nor Asus is really claiming that it works, they just support it).
At the same time, I trust server brands with a good reputation like Supermicro, to not screw me over.
It could also be that Supermicro screws up ECC, even if they don't do it in bad faith.
We had things like these in the past.
I get the PSU comparison, but the problem with that comparison is that if the PSU is running at 88% instead of the promised 90%, that is not nearly as bad IMHO.
That's an interesting piece of hardware, thank you.
There is EMI originating from within the case. I get ECC correctable errors every year or so. It's clearly doing something, "cosmic rays" is probably BS, but high frequency electronics abutting other high frequency electronics is very real - All field wigglers are subject to other wiggling fields.
Because Intel declares their ECC support officially and enforces it on all their OEM partners. A big part of the reason is that they intentionally use it to segment the market and cater to the enterprise market, which makes them more money.
Similar to Intel, AMD also does NOT declare official support for ECC on their consumer parts. The only difference is that they don't actively disable it in the CPU, so consumers are free to use it in a YMMV manner. In other words, they're a bit more consumer-friendly. That being said, make no mistake: they also DO NOT guarantee ECC support for their consumer parts, but DO guarantee it for their enterprise SKUs.
At the end of the day, it's all about money. Locking ECC out to enterprise makes them more money. This is why no one questions ECC on Intel, and the vast majority of the questions surrounding ECC have always been in the "prosumer" space with AMD Ryzen consumer-level parts.
I feel so stupid at the moment: Passmark by now has a way to assess whether it is working.
Why have I not been informed?
Why have I not looked for this?
I will apologize to all for my clearly outdated stance.
I will have to do some relearning of current affairs.
bear with me.
but the core still remains.
And we should not forget that RAM errors don't always have obvious symptoms… so they can go undetected for a very long time!
For example, my old Windows system first just started to be slow, then apps crashed randomly, and finally I ran Memtest on it, and one of the sticks was faulty.
At first I just suspected that Windows was the culprit.
But I have no idea how long it was actually faulty, so I cannot even assess what extent of data loss or corruption I suffered.
My new system is a Chinese, recirculated, X99 MoBo, with 128GB DDR4 ECC RAM and a retired Xeon CPU.
I also migrated all my previous, Intel Atom based servers at home to the X79 and the X99 platforms during the last 3 years.
All systems run ECC RAM now.
Both DDR3 and DDR4 ECC RAM got really cheap on Aliexpress in the last 1-2 years. (Like, you can literally buy 32GB sticks for about 20 EUR each and funnily, it got even cheaper since Trump has introduced those tariffs…)
Those are for EMC, not against cosmic radiation.
And, most likely to protect the device from something coming from the outside.