I used to work somewhere where we built an embedded Linux system from scratch.
For a couple of weeks, we chased spurious errors all over the system. They seemed to accumulate, and correlate with high load, i.e. high temperature.
After a lot of lost time and many headaches, we found out that we had never done proper DRAM timing configuration. It was only slightly off, which resulted in a couple of memory bit errors per hour.
In retrospect, ECC and corresponding alarms would have shortened the debugging by man-weeks.
(But ECC was impossible in that project, yeah.)
For my TrueNAS at home, I bought a used server with ECC memory, then upgraded to a better (used) CPU and bought more RAM. All in all, roughly comparable in performance to a NUC for my small budget, and it should be far more reliable.
Sometimes a development platform different from the final one, with proper debugging tools like ECC memory, can seriously cut development time. In that case, you would know the software was good, so a failure after moving to the final hardware would point to a hardware problem or to the software's configuration of that hardware.
Early in my computer career, I developed embedded software (before Linux, and far smaller than Linux allows).
On one new design, my initial test software would not boot at power up. On reset, it was fine. After verifying the hardware configuration multiple times, we tried hardware troubleshooting. We could not find the problem. In the end, I tried changing the software.
Turns out the "new" static RAM chip needed more time to settle on power up. Since one of the first tasks was to call some hardware initialization routines, those would fail because the return address pushed on the stack was never actually written. Installing a 250 millisecond delay before using RAM worked fine. Since this embedded computer and its use would not be impacted by such a trivial change, no problem.
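Roughly the shape of that kind of fix, in case it helps anyone (a hedged sketch only: the names, loop count, and C framing are mine, and the real thing would normally sit in the reset-vector startup code, before anything touches the stack):

```c
/* Sketch, not the original code.  hw_init() and main_loop() are
 * hypothetical stand-ins.  The idea: spin for ~250 ms using only
 * registers, and only then make the first subroutine call, which
 * pushes a return address into the (now settled) SRAM. */
extern void hw_init(void);
extern void main_loop(void);

#define SETTLE_LOOPS 250000UL   /* tune so the loop takes ~250 ms at your CPU clock */

void start(void)                /* entered straight from the reset vector */
{
    /* Busy-wait; the nop keeps the compiler from deleting the empty loop,
       and with optimization on the counter stays in a register. */
    for (unsigned long i = 0; i < SETTLE_LOOPS; i++)
        __asm__ volatile ("nop");

    hw_init();                  /* first real call: the stack is now trustworthy */
    for (;;)
        main_loop();
}
```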
Intel has decreed that you must have workstation- or server-class processors to enjoy the reassurance of ECC. For most AMD processors, motherboard/chipset support is the limiting factor in taking advantage of ECC.
In a former life, I built nuclear power plant core monitoring system software and operator training simulator software. The minicomputers we used in the beginning and the Sun and SGI workstations we used toward the end all had ECC memory.
To know you are having memory problems, it helps to have parity or ECC memory. That's their job. Without parity or ECC, the symptoms will be something like an illegal instruction in OS code, or a non-present memory reference in OS code, or perhaps in a user-space process. So memory issues present as program problems rather than as hardware problems.
I have had machines from this era pull these errors. The Sun was interesting. It would tell you which stick to replace, giving the ID screened on the circuit board. No guessing. No need to run the diags to find the bad stick.
So, your file server runs continuously, right? This is where ECC memory pays off. Most memory errors are soft errors, which can only happen during operation. A cosmic ray, or an alpha particle from the decay of trace minerals in the IC encapsulation or ceramic cases, hits a memory cell just so and flips a bit. BNE is now BGT and a test misbehaves, in code that is trying to figure out how to change the content of your disk forevermore. So the machine writes some gibberish. Trumpian best words start appearing in your speech for the shareholders. ++ungood
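To make the BNE-to-BGT point concrete (purely illustrative, since the post doesn't name a CPU; I'm assuming the classic 68k short-branch encoding here, where BNE is opcode byte 0x66 and BGT is 0x6E):

```c
/* Illustration only, assuming a 68k-style Bcc encoding: BNE.S = 0x66,
 * BGT.S = 0x6E.  One flipped bit (mask 0x08) in the opcode byte turns
 * "branch if not equal" into "branch if greater than". */
#include <stdio.h>

int main(void)
{
    unsigned char bne_opcode = 0x66;              /* BNE.S on the 68k */
    unsigned char after_flip = bne_opcode ^ 0x08; /* one soft-error bit flip */

    printf("before: 0x%02X  after: 0x%02X\n", bne_opcode, after_flip);
    /* prints: before: 0x66  after: 0x6E -- i.e. BGT.S */
    return 0;
}
```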
In my systems here, one has run continuously for 7 years and the second for 4 years. They've not logged an ECC error. That may be an acceptable level of risk. The surface area of the memory is much smaller today, a much smaller target. But also a more fragile target, as it takes very little energy to flip one of today's small cells.
Sure, ECC costs a bit more, but you're only buying 2 or 4 sticks. The cost of the non-volatile storage owns the cost of the build if you are building from new rather than your parts bin. To my way of thinking, it is cheap peace of mind.
Even if you have 3-2-1 backups, if the source content is corrupted, you won't know until you try to use it…
So in the case of the main source being TrueNAS, with the data copied off it to the other two media… the source is corrupt because one decided to use desktop gear.
Thus, if TrueNAS is your source of said content, why not have ECC and be better off, versus wondering, or only finding out you have corrupt data in a few years when you try to restore those family pics you thought you lost…
That would lead you to using ECC RAM in your workstation as well. Because if corrupt data coming from your workstation is written to TrueNAS in the first place, the ECC memory on the server cannot fix it…
Why do you draw the boundary between server and workstation?
Some here have said you get more errors from things other than bit flips, such as power loss. The thing is, when a power loss has occurred, you know it. When a bit flip occurs, you may not know it for a very long time.
My company has also moved to OneDrive. I don't really like it, but someone pitched the idea and a large majority of the company shares are now on OneDrive.
We also have our own servers for classified programs. Errors there would be catastrophic.
I use ECC now for any new computer I purchase for myself. Is it required? Nope, but I feel better knowing that I'm not introducing an error myself.
I've been doing the same since 2012, despite the lack of choice in components and some overheating issues when mainboards aren't meant to go in tower chassis. The lack of decorative LEDs on components is an added benefit of avoiding mainstream mainboards.
The only problem is on the mobile side of things: mobile workstations with ECC RAM do exist, but they tend to be much beefier than the Ultrabook I tend to source for field work. Soldered RAM would also benefit from ECC, to future-proof the investment in hardware.
ECC should really be the default for all types of computers. All data buses and storage media have some form of ECC for data in flight or at rest. Why shouldn't the main memory bus have that as well?
Yup. For work, I carry around an HP ZBook 15 for real work (plus a 13" Elitebook to access company domain stuff). On paper it's not that heavy, but in practice it's disastrously heavy to carry around.
Very true: if the content is created on said workstation or home system, it is. For me, I could lose my desktop right now and lose nothing but any recently saved bookmarks, because all my data is created and saved directly to my TrueNAS via SMB shares for file stuff, and my VMs are running over an NFS share.
To clarify for others (I am sure Joe knows), ZFS was specifically designed for ZERO data loss on graceless / unexpected power loss.
Of course, any in-flight data can be lost, just like any other file system. And again of course, any graceless power loss can cause hardware to fail, which might lead to data loss.
But ZFS's COW (Copy On Write) architecture was specifically chosen and implemented so that power loss would not corrupt any existing data. The goal was to keep the file system always consistent and to avoid long (even many hours long) file system checks at boot because of an unclean unmount.
@Arwen
I understand that ZFS is designed for no data loss; however, my comment was more towards supporting ECC RAM, which should be used over non-ECC RAM on every computer, not just the server. I know the topic was about server ECC, so I got a little off topic, I guess. Now come on, you know I do that; this old man can ramble on and on. See