ZFS checksum errors

manomamdouh83 · July 3, 2025, 4:46pm

Hello,
I built my TN Scale (V 25.04.0) system using 3 brand-new WD Red Pros in a RaidZ1. The mainpool has the 3 drives, and while the storage shows “no errors” under the VDEV, the pool shows as “unhealthy” and each of the drives keeps accumulating more checksum errors over time. The counts match, though, so all 3 drives jump from 10 to 15, then from 15 to 30 errors, etc.

I ran a scrub and did LONG and SHORT SMART drive tests… I am lost as to whether these errors are in the hardware or in my data.

I also replaced the SAS/SATA cables and the SAS controller.

Any help would be appreciated

SmallBarky · July 3, 2025, 5:14pm

A detailed hardware list would help

Thanks to Protopia for the following.
‘I have a standard set of commands I ask people to run to provide a detailed breakdown of the hardware, so please run these and post the output here (with the output of each command inside a separate (</> or Ctrl+e) preformatted text box) so that we can all see the details:’

lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
sudo zpool status -v
sudo zpool import
lspci
sudo storcli show all
sudo sas2flash -list
sudo sas3flash -list

Protopia · July 3, 2025, 5:19pm

Checksum errors can be due to disk hardware, but more often they relate to disk controller errors or overheating, power or SATA cable connections, PSU issues or memory issues. Reseating memory sticks, PCIe cards and power/SATA cables can often stop them from continuing to occur.

After reseating the memory run a memory test for a few hours.

Then do a sudo zpool clear poolname for the pool experiencing errors to reset the error counters and see what happens.
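To make "see what happens" easier to check after the clear, here is a minimal sketch of a filter that prints only the devices whose CKSUM counter is non-zero. The function name cksum_report is made up for illustration, and the sketch assumes the usual NAME STATE READ WRITE CKSUM column layout of the zpool status config section:

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "<device> <cksum count>" for every row whose CKSUM column is not 0.
# Assumes the standard "NAME STATE READ WRITE CKSUM" header layout.
cksum_report() {
  awk '
    # The header row marks the start of the device table.
    $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
    # A blank line ends the device table.
    in_config && NF == 0 { in_config = 0 }
    # Compare as a string so abbreviated counts such as "1.2K" also match.
    in_config && NF >= 5 && $5 != "0" { print $1, $5 }
  '
}
```

It could be used as: sudo zpool status poolname | cksum_report. No output means every counter in the table is still at 0.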

Actually my standard list has evolved and is now:

lsblk -bo NAME,LABEL,MAJ:MIN,TRAN,ROTA,ZONED,VENDOR,MODEL,SERIAL,PARTUUID,START,SIZE,PARTTYPENAME
sudo ZPOOL_SCRIPTS_AS_ROOT=1 zpool status -vLtsc lsblk,serial,smartx,smart
sudo zpool import
lspci
sudo sas2flash -list
sudo sas3flash -list
sudo storcli show all
for disk in /dev/sd*; do sudo zdb -l "$disk"; done
for disk in /dev/sd?; do sudo hdparm -W "$disk"; done
for disk in /dev/sd?; do sudo smartctl -x "$disk"; done

though I normally remove any of these I don’t think will be helpful.

NugentS · July 3, 2025, 5:23pm

Given it's happening to all three disks, you need to look at what is common to the 3 disks.

PSU, cabling and/or the HBA/SATA expansion board are likely causes.

manomamdouh83 · July 18, 2025, 10:41am

Thank you. While running the commands above, one of them reported a file that was actually corrupted. Once I removed it and ran another scrub, everything seems healthy again. I appreciate the quick response.
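For anyone landing here later: of the commands above, zpool status -v is the one that lists files with permanent errors, under a line starting with "errors: Permanent errors". A rough sketch of pulling just those paths out of its output follows. The function name list_damaged_files is hypothetical, and the sketch assumes the path list is the last thing in the output for the pool being checked:

```shell
# Hypothetical helper: reads `zpool status -v` output on stdin and prints
# the paths listed after the "errors: Permanent errors ..." line.
# Assumes nothing else follows the path list for the pool being checked.
list_damaged_files() {
  awk '
    /^errors: Permanent errors/ { found = 1; next }
    # Skip blank lines, strip the indentation, keep spaces inside paths.
    found && NF { sub(/^[ \t]+/, ""); print }
  '
}
```

It could be used as: sudo zpool status -v mainpool | list_damaged_files. An empty result means no files were reported as damaged.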

However, why isn’t there (or maybe there is?) a way to see that file or the same details from a GUI? It feels like this is an error (not some sort of advanced function) and there could be a way to dig deeper from the GUI.

Thanks again y’all.

Arwen · July 18, 2025, 10:52pm

In a way, the TrueNAS GUI is lacking some functionality. I don’t even know if the GUI has the function you are looking for in the request above.

On the other hand, TrueNAS (and FreeNAS before it) was always intended to have Unix Shell access for more detailed troubleshooting.

Some NASes are all about the GUI, and yet don’t cover everything. When odd failure modes occur and are not fixable or troubleshootable from the GUI, those users may be out of luck. Time to restore from backups.

Not saying one philosophy is better than another, just different.

One reason I chose TrueNAS is that the Unix Shell is both readily available and downright useful at times. My prior NAS, an Infrant ReadyNAS 1000S, also had Unix Shell access. I simply could not live with either a NAS having a heavy focus on MS-Windows (which I don't use at home) or limited troubleshooting from any NAS GUI.