No checksum errors but "One or more devices has experienced an error resulting in data corruption"

Adrianthefifth · April 12, 2024, 12:08pm

Hi All,

Sorry for long first post. I’ve been having intermittent problems with random files coming up as “permanent errors” (see export below).

This seems to impact:

Small old files such as mp3 which have been carried across through various ext3, ext4, and NTFS file systems over the last 25 years. On review, when restoring from the (Pre ZFS) backup these files may already be corrupted prior to migration into TrueNAS. I have scrubbed several times since install without error and would have expected any of this corruption to be detected then (if it was going to be at all)
New files, although usually only iso type images, and not every time. These files are usually downloaded to the “download” directory and moved to another location within same pool (but different dataset). Day to day type documents and photos etc. have not been effected (yet?).

Errors are coming up with zero checksum or read/write errors. When the errors occur they impact all previous snapshots containing the affected files, not just the latest snapshots.

To clear the errors I have deleted the impacted files AND all previous snapshots that included the file, then two scrubs (each stopped after a few minutes).
I have done this several times over the last 6 months, and randomly more errors pop up and files are lost.

output from: "sudo zpool status -v"

  pool: datastore
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub canceled on Thu Mar 14 22:28:16 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        datastore                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            1a962e00-4c94-45dd-a560-2b8dffd62cbd  ONLINE       0     0     0
            52d950c3-8a5f-4cfa-8881-79e1456b59a9  ONLINE       0     0     0
            fb23a123-3f22-45a4-9b4b-6f9b4b1354f7  ONLINE       0     0     0
            0fe2b142-a647-44ee-81e8-39be5dd5f85d  ONLINE       0     0     0
            18ecb100-ff58-4306-b488-e21fff64dbe0  ONLINE       0     0     0
            2ebbce27-c886-4011-8048-dcba479e18a3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        datastore/storage/family/user1@DAILY-2024-03-28_01-00:/images/systemrescue-11.00-amd64.iso
        datastore/storage/family/user1@DAILY-2024-03-28_01-00:/images/clonezilla-live-3.1.2-9-i686.iso
        datastore/storage/family/user1@DAILY-2024-03-28_01-00:/images/clonezilla-live-3.1.2-9-amd64.iso
        datastore/storage/family/user1@DAILY-2024-04-03_01-00:/images/systemrescue-11.00-amd64.iso
        datastore/storage/family/user1@DAILY-2024-04-03_01-00:/images/clonezilla-live-3.1.2-9-i686.iso
        datastore/storage/family/user1@DAILY-2024-04-03_01-00:/images/clonezilla-live-3.1.2-9-amd64.iso
        datastore/storage/media@MONTHLY-2024-01-01_03-00:/Music/Mp3/amusicfile.cfa
        datastore/storage/media@MONTHLY-2024-01-01_03-00:/Shows/show1/show1 Season 3/show1.mp4
        datastore/storage/media@MONTHLY-2024-01-01_03-00:/Music/musicvideo.mp4
        datastore/storage/family/user1@DAILY-2024-04-02_01-00:/images/systemrescue-11.00-amd64.iso
        datastore/storage/family/user1@DAILY-2024-04-02_01-00:/images/clonezilla-live-3.1.2-9-i686.iso
        datastore/storage/family/user1@DAILY-2024-04-02_01-00:/images/clonezilla-live-3.1.2-9-amd64.iso
        datastore/storage/family/user1@MONTHLY-2024-04-10_02-00:/images/systemrescue-11.00-amd64.iso
        datastore/storage/family/user1@MONTHLY-2024-04-10_02-00:/images/clonezilla-live-3.1.2-9-i686.iso
        datastore/storage/family/user1@MONTHLY-2024-04-10_02-00:/images/clonezilla-live-3.1.2-9-amd64.iso
        /mnt/datastore/storage/family/user1/images/systemrescue-11.00-amd64.iso
        /mnt/datastore/storage/family/user1/images/clonezilla-live-3.1.2-9-i686.iso
        /mnt/datastore/storage/family/user1/images/clonezilla-live-3.1.2-9-amd64.iso

Hardware details

HP Prodesk G2 i5 6500, (in ATX case with 600W PSU)
OS Version:TrueNAS-SCALE-23.10.0.1
Intel i5-6500 CPU @ 3.20GHz
32GB DDR4 non-ECC 2133MHz
UPS

Storage:

1x120g SSD boot pool (motherboard SATA controller)
LSI 9211-8i SAS controller in IT Mode (all other drives)
Datastore Pool: RaidZ2, 6x6TB, mix of Seagate Ironwolf and WD Red (all CMR).
Native ZFS encryption: Automatically unencrypt on boot from passkey stored on USB Key as per: SOLVED - ZFS Encryption USB auto unlock and mount with pass-phrase as I do with GLI. | TrueNAS Community

I understand it’s not ideal TrueNAS hardware, but I wouldn’t have expected these repeated errors.

I have completed the following troubleshooting:

Memtest 86, multiple run throughs, over about 24 hours, no errors
Checked IT mode on LSI 6Gbps SAS HBA 9211-8i
Checked, reseated all HDD cables, nothing obvious
SMART extended offline, all drives - no errors
SMART short (daily) - no errors

I hope someone can give me some pointers on what to try next.
Only similar reports I can find point to a bug in ZFS native encryption (link below). Could this be related?
https://github.com/openzfs/zfs/issues/12014

Thanks in advance,

AdrianTheFifth

Stux · April 12, 2024, 1:58pm

Quite simply, this shouldn’t be happening.

About the only probable cause (other than a zfs bug) is memory errors, or you had a triple failure on collocated blocks, which seems unlikely, especially with no errors logged per drive.

How often are you scrubbing?

Adrianthefifth · April 12, 2024, 10:34pm

I’m scrubbing monthly. There hasn’t been any repairs made from any of the scrubs.

I’ll try further testing on the memory, tomorrow.

My next plan after that is to remove ZFS encryption (migrate away then restore), and run for a few months to see if any more errors occur.

Davvo · April 13, 2024, 4:47pm

Export and import the pool
Scrub the pool
Share the output of sas2flash -list
Put your signature inside [details="Summary"]This text will be hidden[/details] (not necessarely as the last thing).

I strongly suggest the use of ECC RAM.

Adrianthefifth · April 15, 2024, 1:00pm

Thanks, I’ll give that a try and come back to you with the output. Might take a few days.

Re signature, thanks for the tip. I found the BB Code help page on the old forum but not the new one. Perhaps a link and some suggestions could be added to the new community guidelines page. New users are directed to that with initial posts.

Adrianthefifth · April 24, 2024, 12:48pm

Sorry for the delay, life continues to happen. I’ve completed the troubleshooting tests you suggested:
1. Exported and imported the pool. I had to start and stop a scrub twice to clear the ZFS errors before it would let me export.

2. Pool scrubbed, no errors detected.

adrianthefifth@system:~$ sudo zpool status -v
  pool: datastore
 state: ONLINE
  scan: scrub repaired 0B in 13:09:22 with 0 errors on Fri Apr 19 12:39:15 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        datastore                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            1a962e00-4c94-45dd-a560-2b8dffd62cbd  ONLINE       0     0     0
            52d950c3-8a5f-4cfa-8881-79e1456b59a9  ONLINE       0     0     0
            fb23a123-3f22-45a4-9b4b-6f9b4b1354f7  ONLINE       0     0     0
            0fe2b142-a647-44ee-81e8-39be5dd5f85d  ONLINE       0     0     0
            18ecb100-ff58-4306-b488-e21fff64dbe0  ONLINE       0     0     0
            2ebbce27-c886-4011-8048-dcba479e18a3  ONLINE       0     0     0

errors: No known data errors

3. sas2flash -list

adrianthefifth@system:~$ sudo sas2flash - list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 500605b-0-04d1-3240
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9211-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

Appreciate the comment on ECC ram, I can’t run it on the current hardware unfortunately.

Further testing:
I tried to access the files that had corrupted via SMB (there were a couple I didn’t delete as a test). When I did this the permanent errors came back as before, so it looks like a scrub doesn’t detect them, they only appear when the files are accessed.

I deleted all the known corrupted files and the impacted snapshots. I will monitor over the next few months. I have also left the ZFS encryption in place.

I am still seeing new iso type files randomly corrupting when being moved from one dataset to another. They seem to download and store OK initially, but corrupt when being moved. Could this be linked to the ZFS cache?

winnielinnie · April 24, 2024, 1:10pm

This could be inter-related (or sharing the same underlying cause) as another “silent corruption” issue for upstream ZFS.

Apparently, the silent corruption bug has never been fixed, and still exists (reproducible with synthetic tests) in OpenZFS 2.2.3. (There are parameters to mitigate the issue in the meantime, until a proper fix is discovered.)

In your case, this is not “silent” corruption, per se, since ZFS is detecting it. (Not sure what it’s detecting, since it doesn’t count these as checksum errors?)

I would also consider your network card. Cheap and/or imitation NICs are known to read “corruption”. However, it would not be noticed by ZFS. (Which makes your problem even more bizarre.)

The other culprit is failing/bad RAM, as suggested above. (Or even your HBA.)

Davvo · April 25, 2024, 7:36am

You mean the zeroes replacing bug?

ericloewe · April 25, 2024, 10:19am

To be clear, that has seen a few fixes, and the latest version in the matter branch should have fixed it completely.