ZFS checksum errors

Hello,
I built my TrueNAS SCALE (V 25.04.0) system using 3 brand-new WD Red Pros in a RAIDZ1. The mainpool has the 3 drives, and while the storage page shows “no errors” under the VDEV, the pool shows as “unhealthy” and each of the drives keeps accumulating checksum errors over time. The counts match across drives, so all 3 jump from 10 to 15, then from 15 to 30, and so on.

I ran a scrub and ran both LONG and SHORT SMART drive tests… I am lost as to whether these errors are in the hardware or in my data.

I also replaced the SAS/SATA cables and the SAS controller.

Any help would be appreciated

A detailed hardware list would help

Thanks to Protopia for the following.
‘I have a standard set of commands I ask people to run to provide a detailed breakdown of the hardware, so please run these and post the output here (with the output of each command inside a separate (</> or Ctrl+e) preformatted text box) so that we can all see the details:’

lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
sudo zpool status -v
sudo zpool import
lspci
sudo storcli show all
sudo sas2flash -list
sudo sas3flash -list

Checksum errors can be due to disk hardware, but more often they relate to disk controller errors or overheating, power or SATA cable connections, PSU issues or memory issues. Reseating memory sticks, PCIe cards and power/SATA cables can often stop them from continuing to occur.

After reseating the memory run a memory test for a few hours.

Then do a sudo zpool clear poolname for the pool experiencing errors to reset the error counters and see what happens.
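
To judge whether the counters start climbing again after the clear, it can help to pull just the non-zero CKSUM values out of the status output. This is a minimal sketch, assuming the usual NAME/STATE/READ/WRITE/CKSUM column layout of zpool status; the function name nonzero_cksum is made up for illustration, and poolname is a placeholder as above.

```shell
# Hypothetical helper: reads `zpool status` text on stdin and prints
# "device count" for every row whose CKSUM column is not 0.
# Assumes the standard config table (NAME STATE READ WRITE CKSUM).
nonzero_cksum() {
    awk '
        $1 == "NAME" && $5 == "CKSUM"  { in_cfg = 1; next }  # table header
        in_cfg && NF == 0              { in_cfg = 0 }        # blank line ends the table
        in_cfg && NF >= 5 && $5 != "0" { print $1, $5 }      # row with errors
    '
}

# Usage (poolname is a placeholder):
#   sudo zpool status poolname | nonzero_cksum
```

No output means every device in the table reports 0 checksum errors. Running it again a day or so after the clear shows whether the errors are still accumulating.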

Actually my standard list has evolved and is now:

  • lsblk -bo NAME,LABEL,MAJ:MIN,TRAN,ROTA,ZONED,VENDOR,MODEL,SERIAL,PARTUUID,START,SIZE,PARTTYPENAME
  • sudo ZPOOL_SCRIPTS_AS_ROOT=1 zpool status -vLtsc lsblk,serial,smartx,smart
  • sudo zpool import
  • lspci
  • sudo sas2flash -list
  • sudo sas3flash -list
  • sudo storcli show all
  • for disk in /dev/sd*; do sudo zdb -l "$disk"; done
  • for disk in /dev/sd?; do sudo hdparm -W "$disk"; done
  • for disk in /dev/sd?; do sudo smartctl -x "$disk"; done

though I normally remove any of these I don’t think will be helpful.
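
Several of the tools in the list above (storcli, sas2flash, sas3flash) are often not installed, depending on the controller. One way to run the whole list without tripping over the missing ones is a small wrapper like the sketch below. The function name report is hypothetical, and the sketch assumes a root shell so that the sudo prefix can be dropped.

```shell
# Hypothetical wrapper: print a header line, then run the command if it exists.
# A missing tool is reported and skipped instead of producing a shell error.
report() {
    printf '===== %s =====\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        printf '(skipped: %s not found)\n' "$1"
    fi
}

# Usage, from a root shell:
#   report zpool status -v
#   report sas2flash -list
#   report storcli show all
```

The header lines make it easier to split the combined output into one preformatted text box per command when posting.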


Given it's happening to all three disks, you need to look at what is common to the 3 disks.

PSU, cabling and/or the HBA/SATA expansion board are likely causes.


Thank you. When I ran the commands above, one of them reported a file that was actually corrupted; once I removed it and ran another scrub, everything seems healthy again. I appreciate the quick response.

However, why isn’t there (or maybe there is?) a way to see that file or the same details from the GUI? It feels like this is an error (not some sort of advanced function), and there could be a way to dig deeper from the GUI.

Thanks again y’all.
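
For later readers: the command in the lists above that names damaged files is most likely zpool status -v, which prints them under an "errors: Permanent errors have been detected in the following files:" heading. The sketch below extracts just that list from the output. The function name damaged_files is made up for illustration, and the parsing assumes that heading wording.

```shell
# Hypothetical helper: reads `zpool status -v` text on stdin and prints the
# entries listed under the "Permanent errors" heading, one per line.
damaged_files() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }  # list starts after this line
        $1 == "pool:"               { in_list = 0 }        # next pool section begins
        in_list && NF { sub(/^[ \t]+/, ""); print }        # strip indent, keep spaces in names
    '
}

# Usage:
#   sudo zpool status -v | damaged_files
```

Entries are not always plain paths. Damage in a snapshot or in metadata can show up as a dataset name followed by an object identifier instead of a file name.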

In a way, the TrueNAS GUI is lacking some functionality. I don’t even know if the GUI has the function you are looking for in the request above.

On the other hand, TrueNAS (and FreeNAS before it) was always intended to have Unix Shell access for more detailed troubleshooting.

Some NASes are all about the GUI, and yet don’t cover everything. When odd failure modes occur and are not fixable or troubleshootable from the GUI, those users may be out of luck. Time to restore from backups.

Not saying one philosophy is better than another, just different.


One reason I chose TrueNAS is that the Unix Shell is both readily available and downright useful at times. My prior NAS, an Infrant ReadyNAS 1000S, also had Unix Shell access. I simply could not live with a NAS that has a heavy focus on MS-Windows (which I don't use at home), or with limited troubleshooting from a NAS GUI.