ZFS errors (DEGRADED pool)

Gad_Dayan · December 9, 2024, 3:00pm

Hi all,
I’ve started getting ZFS errors on one of my pools (mirrored 4TB). Initially they seemed limited to one specific dataset, which I deleted and re-created. I’ve run long SMART tests, which reported no errors, and scrub tasks, but that doesn’t seem to help.
The error count is the same on both disks, so it looks like files are getting corrupted on both of them.

Any ideas on how to debug and fix this?

zpool status datapool -v
pool: datapool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 21.9M in 01:22:46 with 798 errors on Sun Dec 8 14:22:44 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    datapool                                  DEGRADED     0     0     0
      mirror-0                                DEGRADED     0     0     0
        042bee83-982a-46aa-9670-0220bba72c87  DEGRADED     0     0   112  too many errors
        148eaad7-3b71-48a7-b4c1-749b7bb308ef  DEGRADED     0     0   112  too many errors
    logs
      d00fa804-8c83-48b8-8728-2eadc4dbbf29    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    /mnt/datapool/vmbkup/.chunks/ffac
    /mnt/datapool/vmbkup/.chunks/4676

HoneyBadger · December 9, 2024, 3:05pm

Hey @Gad_Dayan

CKSUM errors can sometimes result from bad cabling or connectivity to a drive, but identical error counts on both drives might point to something at the storage controller. Your most recent scrub also shows 21.9M repaired with 798 errors.
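
A minimal sketch of that comparison, assuming you paste the config table from `zpool status` into a string. The helper names (`cksum_by_device`, `mirror_members_match`) are made up for illustration and are not part of any ZFS tooling. Counts are kept as strings so abbreviated values such as 26.4K compare without conversion.

```python
import re

# Config table copied from the `zpool status` output posted above.
STATUS = """\
    NAME                                      STATE     READ WRITE CKSUM
    datapool                                  DEGRADED     0     0     0
      mirror-0                                DEGRADED     0     0     0
        042bee83-982a-46aa-9670-0220bba72c87  DEGRADED     0     0   112  too many errors
        148eaad7-3b71-48a7-b4c1-749b7bb308ef  DEGRADED     0     0   112  too many errors
    logs
      d00fa804-8c83-48b8-8728-2eadc4dbbf29    ONLINE       0     0     0
"""

# One row: name, state, then the READ, WRITE and CKSUM columns.
# The list of states is an assumption covering the common vdev states.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def cksum_by_device(status_text):
    """Map each device or vdev name to its CKSUM column (as a string)."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(1)] = match.group(5)
    return counts


def mirror_members_match(counts, members):
    """True when all listed members share the same non-zero CKSUM count."""
    values = {counts[name] for name in members}
    return len(values) == 1 and values != {"0"}


counts = cksum_by_device(STATUS)
disks = [
    "042bee83-982a-46aa-9670-0220bba72c87",
    "148eaad7-3b71-48a7-b4c1-749b7bb308ef",
]
if mirror_members_match(counts, disks):
    print("Identical non-zero CKSUM counts: look at what the disks share")
```

Matching counts on both sides of a mirror suggest looking at what the disks have in common (controller, memory, power) rather than at either disk on its own.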

Can you provide the model of drives and the system specs, including the storage controller?

Gad_Dayan · December 9, 2024, 3:13pm

Thanks. Most of the files listed by status -v are Proxmox backups, and verifying those backups fails, so it looks like data loss…
I took a quick look at the cabling and it seems OK, but later I will power off the system and reseat the cables. I’ve seen that suggested in past posts as well.

Seagate Barracuda HDD, ST4000DM004-2U9104, 3.64 TiB

lspci | grep SATA
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller (rev 04)
02:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)

HoneyBadger · December 9, 2024, 3:16pm

Those are shingled (SMR) drives which have a high likelihood of contributing to this problem. Are they connected to the Intel or Marvell controller?

Gad_Dayan · December 9, 2024, 3:19pm

They are connected to the Intel controller, and the pool has an SSD log drive.

HoneyBadger · December 9, 2024, 3:28pm

The Seagate drives are quite likely to be contributing to the problem because of their shingled nature. If they are unable to return data in time this may manifest as a CKSUM error.

The SSD log drive is unlikely to be contributing much of value unless you are connecting from Proxmox over NFS.

Gad_Dayan · December 9, 2024, 3:42pm

Yes, the Proxmox backup uses NFS to mount the dataset.
Would a Seagate IronWolf NAS 4TB SATA III ST4000VN006 be a better choice?

Are there other things to check or consider? The system and this setup ran for several months before I started seeing this error last week.

neofusion · December 9, 2024, 3:52pm

You haven’t really said much about your hardware configuration but, since you ask, why not run a memtest overnight? It can’t hurt and will give you some confidence in the memory’s stability.

etorix · December 9, 2024, 4:17pm

Yes. Anything that is not SMR.

Gad_Dayan · December 18, 2024, 7:07am

Hi there, an update on the issue.
I replaced the hard drives with new Seagate IronWolf 4TB drives (ST4000VN006-3CW104) and resilvered. I deleted the dataset that had the issues and re-created it, but I still got errors.
Next I replaced the drives’ SATA cables and connected the drives to the other controller, and I’m still facing issues with this pool.
My second pool doesn’t suffer from such issues.

Now when trying to identify the problematic files I get an error:

root@truenas[/home/admin]# zpool status datapool -v
pool: datapool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 01:08:44 with 1 errors on Tue Dec 17 22:32:24 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    datapool                                  ONLINE       0     0     0
      mirror-0                                ONLINE       0     0     0
        d913e868-e408-4b14-a1df-8445337f5b7a  ONLINE       0     0 26.4K
        cbd9ed6f-0b27-45d4-83df-bf859d326ebc  ONLINE       0     0 26.4K
    logs
      d00fa804-8c83-48b8-8728-2eadc4dbbf29    ONLINE       0     0     0

errors: List of errors unavailable: no such pool or dataset
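
To compare the two scrubs posted in this thread side by side, here is a rough sketch that pulls the numbers out of the `scan:` line. The function name `parse_scan` is hypothetical, and the pattern only assumes the `HH:MM:SS` duration format seen in these two outputs; a scrub that reports its duration in a different form would not match.

```python
import re

# Matches the "scan:" line format seen in the two outputs in this thread.
SCAN = re.compile(
    r"scrub repaired (\S+) in (\d+):(\d+):(\d+) with (\d+) errors on (.+)"
)


def parse_scan(line):
    """Return repaired size, duration in seconds, error count and end time."""
    match = SCAN.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    return {
        "repaired": match.group(1),
        "seconds": hours * 3600 + minutes * 60 + seconds,
        "errors": int(match.group(5)),
        "finished": match.group(6).strip(),
    }


before = parse_scan(
    "scan: scrub repaired 21.9M in 01:22:46 with 798 errors on Sun Dec 8 14:22:44 2024"
)
after = parse_scan(
    "scan: scrub repaired 0B in 01:08:44 with 1 errors on Tue Dec 17 22:32:24 2024"
)

for label, result in (("old drives", before), ("new drives", after)):
    print(label, result["repaired"], "repaired,", result["errors"], "errors")
```

Errors that persist after both the drives and the cables have been replaced are consistent with the earlier suggestions to look at shared components such as the controller or memory.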

My system is an Intel i5 (second gen, Sandy Bridge) with 32GB RAM (4x8GB):
Intel(R) Core™ i5-2400 CPU @ 3.10GHz
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller (rev 04)
02:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)

duguying · February 20, 2025, 10:03am

I’m running into the same problem. Have you solved it?

Gad_Dayan · March 5, 2025, 5:37pm

Unfortunately nothing helped. I ended up wiping everything, deleting the configuration, and building the pool from scratch, then had to restore the data from backup. It was painful… Since then I have upgraded the drives to “NAS” models.