ZFS permanent error detected

nikkon · May 16, 2024, 3:58pm

Hi all,
I recently changed all the drives in my NAS and I suspect an error occurred during this change.

zpool status -xv
pool: Tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 03:09:38 with 1 errors on Thu May 16 17:52:59 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    Tank                                      ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        ad5493f3-2be8-450f-be88-df1f197a8232  ONLINE       0     0     6
        6d1648f0-bbe8-4109-a760-878edf75002c  ONLINE       0     0     6
        d9d184ae-08de-4bb5-88a6-8550b452bbb4  ONLINE       0     0     6
        81c604e6-61bd-4a0f-861e-e1d6bcbc5cc5  ONLINE       0     0     6
        820a8773-455b-4a24-b1aa-a29c65e5af1c  ONLINE       0     0     6
        698d56e7-39f4-46df-bfe5-f54f33966611  ONLINE       0     0     6
    cache
      62078fc8-a36a-47b8-bca3-0796ff0f6783    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    Tank/Media/Filme:<0x186>

I’ve run several scrub sessions but the error is still here.
Any guidance on how to fix this?

etorix · May 16, 2024, 4:23pm

The error is in metadata. The only fix is to destroy the pool and restore from a valid backup.

6 checksum errors on each drive could point to an issue with cabling/backplane or an overheating controller. How are the drives connected?

nikkon · May 16, 2024, 4:30pm

Thanks for the quick answer.
I can’t really destroy the pool (there is too much data to move and I have nowhere to put it). Would it be an option to move the content from that folder to a different folder / pool and move it back after I recreate the folder?

winnielinnie · May 16, 2024, 4:33pm

Check / fix cabling, seating, HBA, etc (like @etorix suggested), then run a zpool clear and another full scrub of the pool.

Let’s not be so quick to destroy entire pools because of an unwanted scrub result.
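
A minimal sketch of the clear-and-rescrub sequence suggested above, assuming the pool name `Tank` from the output earlier in the thread. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences, not part of ZFS: with `DRY_RUN=1` (the default here) the script only prints the commands so they can be reviewed before anything touches the pool.

```shell
#!/bin/sh
# Sketch: reset the error counters, run a full scrub, then show the result.
# POOL defaults to "Tank" as in this thread; DRY_RUN and run() are
# illustrative helpers, not ZFS features.
POOL="${POOL:-Tank}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clear_and_scrub() {
    run zpool clear "$POOL"       # reset READ/WRITE/CKSUM counters
    run zpool scrub -w "$POOL"    # full scrub; -w waits for completion (OpenZFS 2.0 and later)
    run zpool status -v "$POOL"   # per-device counters and list of affected files
}

clear_and_scrub
```

Running it with `DRY_RUN=0` as root would execute the three commands for real. The hardware check (cables, seating, HBA) comes first; otherwise the scrub is likely to hit the same errors again.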

winnielinnie · May 16, 2024, 4:37pm

@nikkon You never answered this question: how are the drives connected?

nikkon · May 16, 2024, 4:37pm

Let me do this and see how it goes. I need to check the HBA as well, and all the connectors.

nikkon · May 16, 2024, 4:37pm

My bad, I missed that. I am using an HBA.

Protopia · May 16, 2024, 4:59pm

I suspect that the 6 errors may actually relate to the scrub finding checksum errors.

I assume that Tank/Media/Filme is a directory, which suggests that the directory (which is probably stored in the same way as a file) is corrupt, so you may have lost access to everything inside.

I would start by trying to copy the contents off the pool to somewhere else. You might be more successful by looking inside the .zfs subdirectory of the dataset for the snapshots (which may well contain a valid copy of the directory blocks) and copy the files off that.
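
A rough sketch of copying files out of a snapshot, under a few assumptions. In `zpool status -v` output, an entry of the form `dataset:<0x...>` names a dataset plus an object number whose path could not be resolved, so `Tank/Media/Filme` is treated here as a dataset mounted at `/mnt/Tank/Media/Filme` (an assumption based on the shell prompt later in the thread). The destination path, the snapshot name, and both helper functions are hypothetical.

```shell
#!/bin/sh
# Sketch: copy files out of a read-only snapshot via the hidden .zfs directory.
# All paths are assumptions for illustration; adjust them to the real system.
DATASET_MNT="${DATASET_MNT:-/mnt/Tank/Media/Filme}"   # assumed dataset mountpoint
DEST="${DEST:-/mnt/OtherPool/rescue}"                  # hypothetical destination

# Build the path under which a snapshot's contents are exposed.
snapshot_path() {
    printf '%s/.zfs/snapshot/%s\n' "$DATASET_MNT" "$1"
}

# Copy a snapshot's contents to DEST, preserving permissions and timestamps.
rescue_from_snapshot() {
    src="$(snapshot_path "$1")"
    [ -d "$src" ] || { echo "no such snapshot directory: $src" >&2; return 1; }
    mkdir -p "$DEST" && cp -a "$src"/. "$DEST"/
}

# Usage (not executed here):
#   ls "$DATASET_MNT/.zfs/snapshot"        # list available snapshot names
#   rescue_from_snapshot some-snapshot-name
```

The `.zfs` directory is hidden from `ls` by default but can still be entered by name. Files that reference the damaged block may fail to copy with an I/O error, so the copy should be checked before deleting anything.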

To remove the error you will need to remove the directory and the contained files (and possibly all the snapshots that contain the corrupted directory) and ensure that the blocks they used are returned to the free pool. I have no idea how to do this when a directory is corrupted.

Then, once you have deleted the directory and files contained, you will need to do a scrub to clear the error. I have, however, read somewhere that scrub > export > import > scrub may be needed to clear errors.

nikkon · May 16, 2024, 8:29pm

I have a few issues here. After the last scrub I ended up with fewer errors, but still not good enough.
Problem 1 - my last snapshot is from February. Not a big issue.
Problem 2 - I cannot delete the content.
As the error points at the folder, maybe moving the content somewhere else and deleting the folder would work.

root@tatooine[/mnt/Tank]# zpool status -xv
pool: Tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 03:16:26 with 1 errors on Thu May 16 22:16:40 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    Tank                                      ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        ad5493f3-2be8-450f-be88-df1f197a8232  ONLINE       0     0     2
        6d1648f0-bbe8-4109-a760-878edf75002c  ONLINE       0     0     2
        d9d184ae-08de-4bb5-88a6-8550b452bbb4  ONLINE       0     0     2
        81c604e6-61bd-4a0f-861e-e1d6bcbc5cc5  ONLINE       0     0     2
        820a8773-455b-4a24-b1aa-a29c65e5af1c  ONLINE       0     0     2
        698d56e7-39f4-46df-bfe5-f54f33966611  ONLINE       0     0     2
    cache
      62078fc8-a36a-47b8-bca3-0796ff0f6783    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    Tank/Media/Filme:<0x186>

winnielinnie · May 16, 2024, 8:49pm

Did you run a zpool clear before and after the scrub?

This is looking like an HBA issue, since there’s no way all drives in your RAIDZ1 vdev have the exact same number of checksum errors at the same time.

Maybe it’s an overheating issue, and ironically the scrub is what causes the HBA to run hot enough to be on the brink of checksum errors for some blocks.
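
The "same count on every drive" observation can be checked mechanically. Below is a small sketch that reads `zpool status` output on standard input and reports whether all data devices share one CKSUM value. The function name `cksum_summary` is made up for this example, and the pattern assumes leaf devices are listed by partition UUID, as in the output above; pools that list devices as `sda1` and similar would need a different pattern.

```shell
#!/bin/sh
# Sketch: summarise the CKSUM column of the data devices in `zpool status` output.
# Assumes UUID-style device names, as shown in this thread.
cksum_summary() {
    awk '
        /^[[:space:]]*config:/ { in_cfg = 1; aux = 0; next }
        /^[[:space:]]*errors:/ { in_cfg = 0 }
        # Auxiliary sections (L2ARC, SLOG, spares) are not part of the raidz vdev.
        in_cfg && NF == 1 && $1 ~ /^(cache|logs|spares)$/ { aux = 1; next }
        # Five hyphen-separated hex groups: a partition UUID.
        in_cfg && !aux && NF >= 5 &&
        $1 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ {
            n++; seen[$5]++
            if (n == 1) first = $5
        }
        END {
            if (n == 0) { print "no devices found"; exit 2 }
            distinct = 0
            for (k in seen) distinct++
            if (distinct > 1)        print "mixed: " n " devices"
            else if (first + 0 > 0)  print "uniform: " n " devices, CKSUM=" first
            else                     print "clean: " n " devices, CKSUM=0"
        }'
}

# Usage (not executed here):
#   zpool status -v Tank | cksum_summary
```

A "uniform" result with a non-zero count fits the shared-component theory (HBA, cabling, backplane) better than six independently failing disks.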

nikkon · May 16, 2024, 9:34pm

I did. I will change the cables and swap out the HBA. I’ll be back with updates.

Stux · May 17, 2024, 2:03am

What is the HBA? Has it got the latest IT firmware (assuming it’s an HBA which has IT firmware)? And do you have it actively cooled… assuming it’s an HBA which requires active cooling…

(and it probably does, if it’s a good one, and you’re not using a server chassis)

nikkon · May 22, 2024, 9:59am

I changed the HBA cables.
After the last scrub, ZFS reports green.
No more errors. Feeling a lot better.