I’ve used memtest to test consumer desktop RAM, but I have no idea whether you should do the same with ECC RAM. Would it just detect and correct its own errors?
What’s the best practice? Unfortunately, I don’t have a testbed to burn in the RAM, so making sure the RAM is good would delay my NAS migration.
Theoretically, Memtest86 - the commercial one, not the OSS Memtest86+ - can inject ECC errors on many relevant platforms. Unfortunately, most firmware shipped by vendors doesn’t have an option to enable this functionality, which is disabled by default for relatively obvious reasons.
Failing that, having a partly-defective DIMM that errors out consistently is awesome for testing ECC.
In any case, the IPMI log should show ECC events, which should also be visible to the host OS and be logged accordingly.
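For the host-OS side, here is a minimal sketch, assuming a Linux host with an EDAC driver loaded for the memory controller (other operating systems expose this differently, and the counters will simply be absent if no driver is bound). It reads the corrected and uncorrected error counters that EDAC publishes under sysfs. The function name `read_edac_counts` is just an illustrative helper, not part of any tool.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    Assumes the Linux EDAC layout (mc0, mc1, ... each with ce_count and
    ue_count). Returns an empty dict if the directory does not exist.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    results = read_edac_counts()
    if not results:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, (ce, ue) in results.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Running it before and after a memtest or stress run and comparing the numbers gives a rough cross-check against what the IPMI log reports.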
So to spin that back: run a normal memtest with ECC RAM, then look at the IPMI logs to see whether any events were logged.
Is that accurate?
Well, kinda. That doesn’t tell you much, overall - most DIMMs don’t have a particularly high likelihood of encountering an error during the relatively short period covered by the test.
Unless of course you get lucky and run into a random ECC error. Or a defective DIMM that wasn’t too expensive.
I’m fine with leaving the test running for a while if that would help.
By the same logic: if the test doesn’t highlight much because it’s too short, does that also mean file transfers are unlikely to be affected because they are even shorter? Genuinely asking.
I’ll do some more googling.
Not really.
A) You can end up with reproducible errors, like a stuck bit, at some point later in operation.
B) You’ll be doing years of file transfers. Each individual one may have a lower chance of hitting an error than a test run, but there are many more of them than there are tests.
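To put point B in numbers, here is a small sketch of the arithmetic, assuming each transfer is an independent event with the same small error probability. The probabilities and counts in the example are purely illustrative placeholders, not measured rates for any DIMM.

```python
def prob_at_least_one_error(p_per_event, n_events):
    """Probability of seeing at least one error across n independent events.

    Uses 1 - (1 - p)^n, which assumes every event has the same error
    probability p and that events do not influence each other.
    """
    if not 0.0 <= p_per_event <= 1.0:
        raise ValueError("p_per_event must be between 0 and 1")
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    return 1.0 - (1.0 - p_per_event) ** n_events


if __name__ == "__main__":
    # Illustrative numbers only: one long test run with a relatively high
    # per-run chance, versus many transfers with a tiny per-transfer chance.
    single_test = prob_at_least_one_error(1e-3, 1)
    many_transfers = prob_at_least_one_error(1e-6, 1_000_000)
    print(f"one test run:      {single_test:.4f}")
    print(f"a million transfers: {many_transfers:.4f}")
```

Even with a per-transfer probability a thousand times lower than the per-test one, the cumulative chance over enough transfers ends up far higher, which is the whole argument for ECC on a long-lived NAS.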
Overclocking the RAM can help trigger errors, so you can see whether the IPMI logs or the OS report successful ECC operations. On some consumer motherboards it is the only way to assess ECC capability in the absence of strong clues in the BIOS menu (I am thinking about AMD consumer boards here).
The usual way to do this is to raise the memory speed and/or lower the CAS values in the BIOS. Overly aggressive timings can prevent the system from booting, so it is advisable to proceed slowly and change values gradually until a few errors are logged.