Best practices on a non-ECC system

Hello there,

I’m scratching my head trying to find the best HW setup for my personal NAS.

Let’s say I cannot afford an ECC-capable build: what would be the best practices to reduce these silent errors?

I’ve read that running RAIDZ could help. Is that true?

I’m pretty sure this subject has already been covered, so please redirect me to the right thread if so.

Thanks :slight_smile:


It can help… but is a $250 embedded motherboard with 20 SATA ports, 10GbE SFP+ built in, etc. really too expensive? This board is likely overkill for most users but it illustrates a point. Add a bit of RAM and you have a well-qualified NAS that ticks every box, including ECC.


Very interesting board! I was thinking about going with Ryzen for my upgrade, but this one seems to be more capable.

Look harder into what ECC-capable builds cost…
DDR4 RDIMM is cheap. An ECC Ryzen build is not that much more expensive than a non-ECC one.

To be clear, while it is better to have ECC memory than Non-ECC memory, other factors do affect reliability.

For example, if you have to use Non-ECC memory and want better reliability:

  • Don’t over-clock memory
  • Keep heat down, air flow up, especially across the memory DIMMs
  • Use in a reduced dust environment

Now do those prevent bit flips?
No, they do not. But, they reduce the known causes of SOME bit flips.


I have used a non-ECC build, and to avoid problems I did the following:

  • 1 full memtest every month or two
  • a specific backup rotation for the most important files (family photos, documents): in short, only adding new files without touching anything else

This strategy seemed to work well in the short term, but in the long term, with the number of files growing, it became painful, boring and error-prone.
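The "only add new files, never touch anything else" rotation can be sketched roughly as below. This is a minimal illustration and not the poster's actual script: the function name `add_only_backup` and the directory layout are assumptions made up for the example.

```python
import shutil
from pathlib import Path


def add_only_backup(source: Path, backup: Path) -> list[Path]:
    """Copy files that are missing from `backup`; never overwrite or delete."""
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dst = backup / src.relative_to(source)
        if dst.exists():
            # An existing backup copy is left untouched, even if the source
            # has changed (or has been silently corrupted) since.
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(dst)
    return copied
```

The point of the design is that a file corrupted later on the NAS cannot propagate into the backup, because existing backup copies are never rewritten. The price is that legitimate edits are not picked up either, which is part of why this gets tedious as the collection grows.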

Like you, I was thinking “ECC support costs too much”… but in reality I just wasn’t looking at what the market offers. Just search for older components: I have 2 ECC builds based on 7th-gen Intel, and the offsite one cost me less than 100€ (with 2 small disks).


…I’ve used non-ECC for years w/o a problem but maybe it’s because all my NASs use RaidZ2?

Let me fix that for you:

…I’ve used non-ECC for years w/o a KNOWN problem but maybe it’s because all my NASs use RaidZ2?

Let us be clear: unless you verified the files AFTER you put them in the RAID-Zx / mirrored pool, by comparing them to the original source, the error could easily have been introduced BEFORE ZFS checksummed and added parity to one of the file’s blocks.
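Verifying the files against the original source could look something like the sketch below. It assumes the source directory and the NAS share are both reachable as local paths; the names `sha256_of` and `find_mismatches` are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root: Path, nas_root: Path) -> list[Path]:
    """Return relative paths whose NAS copy is missing or differs from the source."""
    mismatches = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = nas_root / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(rel)
    return mismatches
```

Keep in mind that a read-back done right after the copy may be served from cache rather than from disk, and that the source machine can flip bits too, so a clean result lowers the odds of silent corruption rather than proving its absence.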

Sequence of events:

  1. Source file from another computer
  2. Write file to NAS
  3. File blocks are in memory
  4. File blocks are prepared for writing into Pool’s vDev, with checksum and possibly RAID-Zx parity
  5. File is written to Pool’s vDev

The Non-ECC memory corruption can occur during step 3.

Worse, step 3 is actually multiple steps:
a. Network stack brings in IP packets
b. Network stack organizes the IP packets into a data stream for the application, (NFS, SMB, iSCSI, etc…)
c. The application, (NFS, SMB, iSCSI, etc…), compiles the file blocks into a block to be written to storage, (aka Pool).
d. ZFS may copy the file blocks into separate writes per disk in a vDev
e. ZFS may copy the file blocks into the ARC

So, as you can see, while ZFS protects data once it is in a pool with redundant vDevs, (aka RAID-Zx or Mirrors), it cannot protect against memory bit flips that happen before the data is checksummed.
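A toy model makes the timing argument concrete. Below, SHA-256 stands in for whatever checksum the pool uses, and `store_block` is a hypothetical stand-in for "checksum the block and write it". None of this is ZFS code.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with a single bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def store_block(block: bytes) -> tuple[bytes, str]:
    """Stand-in for a checksumming filesystem: keep the block and its checksum."""
    return block, hashlib.sha256(block).hexdigest()


def block_is_valid(block: bytes, checksum: str) -> bool:
    return hashlib.sha256(block).hexdigest() == checksum


original = b"family photo bytes"

# Case 1: the bit flips in RAM BEFORE the checksum is computed.
stored, checksum = store_block(flip_bit(original, 3))
flip_before_detected = not block_is_valid(stored, checksum)  # False

# Case 2: the bit flips AFTER the block was checksummed and stored.
stored, checksum = store_block(original)
flip_after_detected = not block_is_valid(flip_bit(stored, 3), checksum)  # True
```

In case 1 the checksum faithfully describes the already-corrupt block, so every later scrub reports it as healthy. Only case 2 is something the pool can detect and, with redundancy, repair.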

This does not even touch upon the source computer potentially corrupting the file while reading it into its own Non-ECC memory.

Some people think I am paranoid. But when computers really are out to get you, that’s not paranoia but taking justifiable precautions.


When ECC options cost as little extra as they do, skipping ECC is a bit like the folk who drive around without their seatbelt on. Statistically speaking, they might get away with it for a whole lifetime without serious injury.

But if they have a serious accident, wearing a seatbelt can frequently be the difference between walking away from the accident and being carried away in a body bag.

It’s no different with our CPUs. Many OEMs like Apple have simply de-prioritized ECC because non-ECC RAM is cheaper and the incidence of bit flips is low enough that issues can be blamed on other factors as well (such as the lack of comprehensive checksums in the current edition of the Apple file system).

But just like the students who discovered that their ‘free’ Google accounts can just vanish (and with them, everything the kids had worked on for years), serious data issues can crop up in unexpected places, just like car crashes. I have no realistic choice re: ECC on my laptop given my OS preferences, but once the data is on the server, it should stay good.


I should specify that I’m based in France, so the cost and availability of HW can vary greatly from what you’re used to.

I have a hard constraint: the motherboard must be mATX or ITX. It can’t be ATX.

When I look at the ideal config, the cost goes through the roof, even on the second-hand market. I think the guys on eBay know the value of what they are selling…

Identify a couple of boards of interest, then save your search on eBay. Or even search by the major server / motherboard manufacturers like Supermicro, ASRock Rack, Gigabyte, etc. Stuff comes up for sale more rarely in Europe, but it does come up for sale over there too.

I’m French, based in the Netherlands, so my hardware costs are pretty much the same as yours, @darkbouny.
How many drives? What network speed?
Just as @Constantin has his favourite X10SDV-2C-7TP4F (and if he can source these easily on his side of the Atlantic, he’s very right to push these for pure storage uses), I have two cheap ECC proposals for you that I’ve been repeating for some time:

Gigabyte MC12-LE0 µATX, 6 SATA + boot M.2; takes an ECC-capable AM4 Ryzen of your choice (PRO APU, like a 4650G, 5350G or 5650G, which you can find on eBay, for lowest idle power; just about any desktop CPU to use the x16 in x4x4x4x4 mode for some cool little NVMe pool); 10G NIC possible in the x4 (CPU!) slot

Gigabyte MJ11-EC1 mITX, 8 SATA (including 4 from an extra SFF-8654 4i breakout cable, easily available on eBay), 1 M.2, 1 GbE NIC… and you can’t do much about it; takes cheap RDIMM (64 GB DDR-2400 is 60€ here)
More tinkering (3D-printing your I/O shield), but you can’t beat the price!

Please don’t tell me these genuine server boards and their ECC RAM are too expensive.

Why do people ask this question so often? Are they hoping someone will tell them that Non-ECC is just as good as ECC? I even use ECC RAM in my main computer for the same reason. ECC vs. Non-ECC comes down to, in my opinion, the value of your data.

If you can live with corrupt data at, of course, the worst possible moment, then Non-ECC is fine. If you would kick yourself in the ass over and over again after losing that data, then use ECC. The same thing goes for drive redundancy.

As for a best practice, no one who uses Non-ECC will like my answer: Turn the computer off, unplug it, stick it into a corner until you can afford ECC RAM. Anything else comes with a risk, even though it may be a small risk.

If you owned 1000 bitcoins, would you be okay with the blockchain servers that validate your bitcoins using Non-ECC RAM? This may be a poor example, as bitcoin blockchains are better protected than that, but it is an example.


…OK, I guess the answer is NO! There is no way around it because, like it or not, everything that goes through a CPU, an HDD or… anything was first in memory.

Buy something older if cash poor and that’s that. Case closed.

“Buy older” is generally good advice because a NAS does not need the latest and most powerful CPU.

…unless you’d like to run VMs and a world of Kubernetes.
Also, the water cooler needs lights ( for performance )
…wonder if there is ECC ram with RGB… :thinking:


Best practice for a non-ECC system: run scrubs. Cross fingers.


I actually wrote up a system to checksum memory blocks, to try and detect corruption by malware. But in theory it could also detect “some” types of memory errors, like multi-bit errors on ECC RAM or any type of error on Non-ECC RAM.

Now this would not be “live” error detection on reads like ZFS. It would require a “scrub” program to be run regularly. Nor would it apply to all used memory. Only R/O blocks of code or data. And it would have to be integrated into the OS. Potentially even into the ELF files.

All in all, a major effort. However, one which might both improve security and reliability.
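The general idea (checksum read-only blocks once, then re-verify them with a periodic "scrub") can be shown with a user-space toy. This is not the system described above, which would live in the OS; the class name `ReadOnlyRegistry` and the choice of CRC-32 are assumptions for illustration only.

```python
import zlib


class ReadOnlyRegistry:
    """Tracks buffers that are supposed to be immutable, with a CRC-32 for each."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[bytearray, int]] = {}

    def register(self, name: str, buffer: bytearray) -> None:
        # Record the checksum at registration time; the buffer must not
        # legitimately change afterwards.
        self._entries[name] = (buffer, zlib.crc32(buffer))

    def scrub(self) -> list[str]:
        """Re-checksum every registered buffer and return the names that changed."""
        return [
            name
            for name, (buffer, crc) in self._entries.items()
            if zlib.crc32(buffer) != crc
        ]
```

As with a pool scrub, corruption is only noticed the next time `scrub()` runs, not at the moment the data is read, and a flip that happens before `register()` goes unnoticed.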


This old thread discusses the debug flag ZFS_DEBUG_MODIFY, which triggers checksumming in memory. I suppose one could enable it on non-ECC systems, but that only highlights that the best solution is to use ECC RAM in the first place.

I’ve recently built a new system with ECC memory on a very tight budget. This is what I got:

  • Motherboard Supermicro X9DRL-iF (~$75)
    – 10 SATA ports (only 2 are SATA III, but SATA II is enough for hard drives)
    – LGA 2011 socket, Intel C602 chipset (2012)
  • CPU Intel Xeon E5-2667v2 (8 cores, 16 threads | 3.3GHz w/ boost up to 4GHz) (~$24)
  • Memory 2×32 GB ECC LRDIMM DDR3 (Samsung m386b4g70dm0-cma3) (~$17.50 each, ~$35 for both)

The whole system is about $135

And no, I don’t live in the USA or something. I don’t even have eBay in my country :slight_smile:

I understand you can’t use ATX and therefore this exact motherboard is not suitable for you, but my point is you don’t have to break the bank to build an ECC system.
