Hi all, I’d like to get some input on what I’m observing on a Dragonfish-24.04.2.1 system.
After running a scrub, the zpool status output looks as shown below. Please note the large checksum error counts, which are the same for both vdev mirror members.
What I’d like to understand is whether these checksum errors have a material impact; I don’t see any effect on the two iSCSI volumes that are part of that pool. Also, the CKSUM error count at the mirror-0 level shows “0”. Does this mean the mirror status is healthy, regardless of the checksum errors on the individual disks?
What would you do if you saw these kinds of stats on your own system? I’m planning to run a RAM test over the weekend to check whether all looks good in that area (32 GB of non-ECC RAM).
I’ve also attached the SMART stats. The disks are somewhat aged, but not ancient. Disk 1 has a high Command_Timeout counter, but it hasn’t budged in ages.
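(For anyone wanting to pull the same data themselves, smartctl per disk does it; device names here are as they appear in the pool output below:)

smartctl -a /dev/sdd
smartctl -a /dev/disk/by-id/ata-WUH721816ALE6L4_2CHD6NLN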
Any input/opinions are welcome.
root@nas3[~]# zpool status Pool1 -v
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 1 days 05:21:41 with 4 errors on Thu Oct 17 03:52:05 2024
remove: Removal of vdev 2 copied 1.62T in 7h50m, completed on Wed Sep 11 17:28:21 2024
        124M memory used for removed device mappings
config:

        NAME                                    STATE     READ WRITE CKSUM
        Pool1                                   ONLINE       0     0     0
          mirror-0                              ONLINE       0     0     0
            sdd2                                ONLINE       0     0  258K
            ata-WUH721816ALE6L4_2CHD6NLN-part2  ONLINE       0     0  258K
        logs
          c8448ba9-714a-4dcf-a35f-92eb393b70af  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x13329>:<0x1>
        <0x1c167>:<0x1>
        Pool1/VMs-LAB:<0x1>
        <0x134b6>:<0x1>
        Pool1/VMs-PROD@auto-2024-10-12_01-00:<0x1>
        Pool1/VMs-PROD:<0x1>
Your disks are running way too hot; the first one is at 51 °C with a recorded max of 58 °C, and that can really damage disks.
I would improve the airflow before testing anything else or starting another scrub!
Another example of why passing SMART tests doesn’t necessarily mean much. CRC errors usually point to a bad cable or port. But those drives are definitely running way too hot, and if the drives are that hot, the HBA (if any) must be hotter, so this may indeed just be a temperature problem. The drives are under their rated max, but I’d hate to see the HBA temps. That case or its airflow needs fixing!
The mirror is made up of “sdd2” and what looks like a partition of a by-id disk name; neither is the partition UUID you’d normally expect to see. What’s the hardware (beyond the non-ECC RAM), and how is it set up?
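You can cross-check what those names actually point to with something like:

lsblk -o NAME,SIZE,SERIAL,PARTUUID
ls -l /dev/disk/by-partuuid

That should show whether “sdd2” and the “-part2” entry really are the two partitions backing the mirror, and what their partition UUIDs are.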
Thank you all for the feedback. Some answers to the various questions:
I removed the case enclosure of my NAS a few days ago, which I think is why the airflow was bad. I’ve re-installed the enclosure, and the disk temperature is now 40 °C. This is a small mid-tower build sitting in my office, so it can’t be too noisy…
I’ve run a memory test, and it reported errors at a memory location; see the attached screenshot.
The motherboard is an ASRock Rack E3C246D4I-2T with a single 32 GB module and a Xeon E-2224 CPU.
The disk names in the output might look odd because this system was upgraded from CORE. I replaced the disks a while ago, and when replacing one of them I messed something up in the TrueNAS GUI, so I had to issue some commands on the command line and ended up with those odd disk names…
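If I ever want to normalize those names, my understanding is that exporting the pool and re-importing it with an explicit device search path should relabel the vdevs (untested on my side; on TrueNAS this should probably go through the GUI’s export/import instead):

zpool export Pool1
zpool import -d /dev/disk/by-partuuid Pool1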
I’ve ordered a new 32 GB ECC RAM module, as there is definitely an issue with the current RAM module…
I totally understand that; my NAS can’t be too noisy either.
I’ve seen a real improvement with a case that has sound-absorbing panels, and of course you can also add those yourself.
Another thing that helps, in my experience, is adding as many fans as you can, big ones, but letting them run as slowly as possible. I have three for intake (two at the front blowing directly on the disks, one on the side for the NVMe and the fanless CPU) and one on top for exhaust; there’s also the fan in the PSU, but that just cools the PSU itself. The max temperature I see is 37-38 °C, and the drives run at 29-31 °C 99% of the time.
Update on this: I’ve replaced the memory with ECC RAM and tested again with memtest; the errors are gone (… as expected).
But that didn’t get rid of the CKSUM and corruption errors in the ZFS pool. I suspect that whatever was corrupted in memory has been persisted to disk…
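For anyone in the same situation, the usual way to check whether such errors are persistent or were a one-off is to clear the counters and scrub again:

zpool clear Pool1
zpool scrub Pool1
zpool status -v Pool1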
So I decided to back up the data in various ways in order to rebuild the pool. The corruption prevented pool replication from working correctly, and copying corrupted files from the mounted file systems was also a bit tricky; a Windows tool called “unstoppable copy” helped with that.
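On the Linux side, rsync can serve a similar purpose, since it logs read errors and moves on to the next file rather than aborting the whole copy. A minimal sketch, assuming a dataset mounted at /mnt/Pool1/data and a backup target at /mnt/backup (both hypothetical paths):

rsync -av --progress /mnt/Pool1/data/ /mnt/backup/data/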
Anyway, it looks like I’m on track to have a clean system up and running again very soon…