I used to work somewhere where we built an embedded Linux system from scratch.
For a couple of weeks, we chased spurious errors all over the system. They seemed to accumulate, and to correlate with high load (and thus high temperature).
After a lot of time lost and many headaches, we found out that we had missed proper DRAM timing configuration. It was only slightly off, which resulted in these memory bit errors, a couple per hour.
In retrospect, ECC and corresponding alarms would have shortened the debugging by man-weeks.
(But ECC was impossible in that project, yeah.)
For my TrueNAS at home, I bought a used server with ECC memory, then upgraded to a better (used) CPU and bought more RAM. All in all it's roughly comparable in performance to a NUC for my small budget, and it should be far more reliable.
Sometimes a development platform different from the final one, with proper debugging tools like ECC memory, can seriously reduce development time. In this case, you would know that the software was good, and that a failure after moving to the final hardware indicated a hardware problem or a software misconfiguration of the hardware.
Early in my computer career, I developed embedded software (before Linux, and far smaller than Linux allows).
On one new design, my initial test software would not boot at power-up; on reset, it was fine. After verifying the hardware configuration multiple times, we tried hardware troubleshooting. We could not find the problem. In the end, I tried changing the software.
Turns out the "new" static RAM chip needed more time to settle at power-up. Since one of the first tasks was to call some hardware initialization routines, those would fail because the return address pushed onto the stack was never actually written. Adding a 250 millisecond delay before using RAM worked fine. Since this embedded computer and its use would not be impacted by such a trivial change, no problem.
Intel has decreed that you must have workstation- or server-class processors to enjoy the reassurance of ECC. For most AMD processors, motherboard/chipset support is the limiting factor in taking advantage of ECC.
In a former life, I built nuclear power plant core monitoring system software and operator training simulator software. The minicomputers we used in the beginning and the Sun and SGI workstations we used toward the end all had ECC memory.
To know you are having memory problems, it helps to have parity or ECC memory; that's their job. Without parity or ECC, the symptoms will be something like an illegal instruction in OS code, or a non-present memory reference in OS code or perhaps in a user-space process. So memory issues present as program problems rather than as hardware problems.
I have had machines from this era throw these errors. The Sun was interesting: it would tell you which stick to replace, giving the ID silkscreened on the circuit board. No guessing, and no need to run the diags to find the bad stick.
So, your file server runs continuously, right? This is where ECC memory pays off. Most memory errors are soft errors that occur during operation: a cosmic ray, or an alpha particle from the decay of trace minerals in the IC encapsulation or ceramic case, hits a memory cell just so and flips a bit. BNE is now BGT and a test misbehaves, in code that is deciding how to change the contents of your disk forevermore. So the machine writes some gibberish, and Trumpian "best words" start appearing in your speech for the shareholders. ++ungood
In my systems here, one has run continuously for 7 years and the second for 4 years, and they've not logged an ECC error. That may be an acceptable level of risk. The surface area of memory is much smaller today, so it is a much smaller target, but also a more fragile one: it takes very little energy to flip one of today's small cells.
Sure, ECC costs a bit more, but you're only buying 2 or 4 sticks. The cost of the non-volatile storage dominates the cost of the build if you are building from new rather than from your parts bin. To my way of thinking, it is cheap peace of mind.
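For the curious, the single-bit correction being discussed here can be sketched in a few lines. This is a toy Hamming(7,4) code in Python; real ECC DIMMs use a wider SECDED code (typically 64 data bits plus 8 check bits), but the recovery from one flipped bit works the same way. The function names are mine, not from any ECC tooling.

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits, enough to
# locate and fix any single flipped bit in the 7-bit codeword.

def hamming74_encode(nibble):
    """Encode 4 data bits (list of 0/1) into a 7-bit codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Return the codeword with any single-bit error corrected."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1          # flip it back
    return c

word = hamming74_encode([1, 0, 1, 1])
flipped = list(word)
flipped[4] ^= 1                       # a cosmic ray flips one bit
assert hamming74_correct(flipped) == word
```

The point of the demo: the hardware can silently repair the flip and just bump a counter, which is exactly the "corrected error" count discussed further down the thread.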
Even if you have 3-2-1 backups, if the source content is corrupted, you won't know until you try to use it…
So in the case of the main source being TrueNAS, with the data copied off it to the other two media… the source is corrupt because one decided to use desktop gear.
Thus, if TrueNAS is your source of said content, why not have ECC and be better off, versus wondering, or only finding out you have corrupt data in a few years when you try to restore those family pics you thought you had lost…
That would lead you to using ECC RAM in your workstation as well, because if corrupt data coming from your workstation is written to TrueNAS in the first place, the ECC memory on the server cannot fix it…
Why do you draw the boundary between server and workstation?
Some here have said you get more errors from things other than bit flips, such as power loss. The thing is, when a power loss has occurred, you know it. When a bit flip occurs, you may not know it for a very long time.
My company has also moved to OneDrive. I don't really like it, but someone pitched the idea and a large majority of the company shares are now on OneDrive.
We also have our own servers for classified programs. Errors there would be catastrophic.
I use ECC now for any new computer I purchase for myself. Is it required? Nope, but I feel better knowing that I'm not introducing an error myself.
I've been doing the same since 2012, despite the lack of choice in components and some overheating issues when mainboards aren't meant to go in tower chassis. The lack of decorative LEDs on components is an added benefit of avoiding mainstream mainboards.
The only problem is on the mobile side of things: mobile workstations with ECC RAM do exist, but they tend to be much beefier than the ultrabooks I source for field work. Soldered RAM would also benefit from ECC, to future-proof the investment in hardware.
ECC should really be the default for all types of computers. All data buses and storage media have some form of ECC for data in flight or at rest. Why shouldn't the main memory bus have that as well?
Yup. For work, I carry around an HP ZBook 15 for real work (plus a 13" Elitebook to access company domain stuff). On paper it's not that heavy, but in practice it's disastrously heavy to carry around.
Very true: if it is made on said workstation or home system, it is. For me, I could lose my desktop right now and lose nothing but any recent bookmarks, because all my data is created and saved directly to my TrueNAS, via SMB shares for file stuff, and my VMs run over an NFS share.
To clarify for others, (I am sure Joe knows), ZFS was specifically designed for ZERO data loss on graceless / unexpected power loss.
Of course, any in-flight data can be lost, just like any other file system. And again of course, any graceless power loss can cause hardware to fail, which might lead to data loss.
But ZFS's COW (copy-on-write) architecture was specifically chosen and implemented so that power loss would not corrupt any existing data. The goal was to keep the file system always consistent and to avoid long (even many hours long) file system checks at boot because of an unclean unmount.
@Arwen
I understand that ZFS is designed for no data loss; however, my comment was more toward supporting ECC RAM, which should be used over non-ECC RAM on every computer, not just the server. I know the topic was about server ECC, so I got a little off topic, I guess. Now come on, you know I do that; this old man can ramble on and on.
Sorry for resurrecting an old thread, but I recently put together a new TrueNAS Scale system and am running it mostly as a media server and backup system. I managed to build a system with ECC memory from secondhand parts, and everything seems to work quite well. On the TN main page, it does show "(ECC)" in the memory tile (in addition to the BIOS reporting it, and some messages in the kernel logs).
My question is this: is there any other support in TN for ECC? For instance, if there are any corrected ECC errors, how do I know, other than by looking through the dmesg logs manually? Will there be an alert? I've looked all over but haven't been able to find an answer to this seemingly basic question. I've read elsewhere online about utilities like edac-utils, but that's not included in TN EE, and some people are saying it's now deprecated. Can someone please summarize the current state of ECC support in TN Scale? Thanks!
Thanks for the quick reply! No, my system does not support IPMI either. But I guess I was hoping there was more support in TN's UI than just having to manually check the syslog. It seems like it'd be nice for the "Memory" tile in the UI to show recent ECC errors, especially if they're uncorrected, since that can indicate a failing memory module that needs to be replaced soon. Do you think this is worth filing a feature request for, or am I making a mountain out of a molehill?
IMO if you have ECC errors often enough that you need a UI section to keep track of them, you need to investigate your RAM asap.
Otherwise, I sometimes remember to run that command after a month or two of uptime. I think I've seen ~4 corrected errors in my 3 years of operation.
Second reason why I don't see that request being worth your time: IX's source of income is enterprise customers, and those guys already get errors logged in IPMI, so they would never care about tracking in the GUI. If paying customers don't care, odds are slim that IX will spend money developing something free users can get by running a command manually.
In theory, with some google-fu, you could clean up my crappy command to actually only report errors, have a cron job run it every x hours/days/weeks/etc., and email you when you actually get an error.
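To make that concrete, here is a rough Python sketch of such a job, assuming the standard Linux EDAC sysfs layout (`/sys/devices/system/edac/mc/mc*/ce_count`). The state-file path and function names are placeholders of my own; wiring the printed line into an actual email is left to cron, which by default mails any stdout to the local admin.

```python
# Sketch: sum the kernel's EDAC corrected-error counters and report only
# when the total has grown since the last run. Intended to run from cron.
import glob

def total_corrected(pattern="/sys/devices/system/edac/mc/mc*/ce_count"):
    """Sum ce_count across all memory controllers (0 if EDAC is absent)."""
    total = 0
    for path in glob.glob(pattern):
        with open(path) as f:
            total += int(f.read().strip())
    return total

def check(pattern, state_file):
    """Print an alert line only when the corrected-error count increased."""
    current = total_corrected(pattern)
    try:
        with open(state_file) as f:
            previous = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        previous = 0          # first run, or unreadable state file
    if current > previous:
        print(f"ECC corrected errors rose from {previous} to {current}")
    with open(state_file, "w") as f:
        f.write(str(current))
```

Run hourly from a crontab entry and the script stays silent unless the count actually rises, so you only get mail when something changed, per the advice above about not needing a dashboard for events this rare.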
IMO if you have ECC errors often enough that you need a UI section to keep track of them, you need to investigate your RAM asap.
Again, I'm new to this; this is my first ECC system. But from my prior research, it seems one possible failure mode is that you see frequent ECC errors on a system, or maybe just one module, when you previously didn't see such errors. At first they're correctable, but they get worse and then you get uncorrectable errors, and of course that's very bad. So it seems you'd want a way for your system to let you quickly know that your formerly reliable memory is now spewing errors, and having to run a manual command isn't a great way to handle this IMO.
From what you wrote about IPMI, I'm guessing the enterprise customers with IPMI systems don't have this problem, because IPMI warns them about it?