Hi all,
I’ve started getting ZFS errors on one of my pools (a 4TB mirror). Initially they seemed confined to one specific dataset, which I’ve since deleted and recreated. I’ve run long SMART tests (no errors) and scrub tasks, but neither seems to help.
The error count is the same on both disks, so it looks like files are getting corrupted on both of them.
Any ideas on how to debug and fix this?
zpool status datapool -v
pool: datapool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 21.9M in 01:22:46 with 798 errors on Sun Dec 8 14:22:44 2024
config:
NAME STATE READ WRITE CKSUM
datapool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
042bee83-982a-46aa-9670-0220bba72c87 DEGRADED 0 0 112 too many errors
148eaad7-3b71-48a7-b4c1-749b7bb308ef DEGRADED 0 0 112 too many errors
logs
d00fa804-8c83-48b8-8728-2eadc4dbbf29 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
CKSUM errors can sometimes result from bad cabling or a poor connection to a drive, but identical error counts on both drives might point to a problem at the storage controller. Your scan line also shows that the most recent scrub finished with 798 errors.
Can you provide the model of drives and the system specs, including the storage controller?
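To compare the per-device checksum counts without reading them off by eye, something like the sketch below may help. It is only a rough helper: `cksum_counts` is a made-up name, and it assumes the usual `NAME STATE READ WRITE CKSUM` column order and plain integer counts (large counts that `zpool status` abbreviates, such as `1.2K`, would be skipped).

```shell
#!/bin/sh
# Sketch: read `zpool status` text on stdin and print "device count" for
# every device whose CKSUM column is non-zero.
# Assumption: columns are NAME STATE READ WRITE CKSUM, counts are integers.
cksum_counts() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $5 == "CKSUM" { in_cfg = 1; next }
        # The "errors:" line marks the end of the device table.
        $1 == "errors:" { in_cfg = 0 }
        # Print the device name and checksum count when the count is non-zero.
        in_cfg && NF >= 5 && $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }
    '
}
```

Usage would be along the lines of `zpool status datapool | cksum_counts`. Identical counts on every mirror member suggest a shared cause (controller, RAM, or data that was already bad when written) rather than one failing disk.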
Thanks. Most of the files listed by status -v are Proxmox backups, and verifying those backups fails, so it looks like real data loss…
I took a quick look at the cabling and it seems OK, but I will power off the system later and reseat the connections. I’ve seen the same advice in past posts as well.
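One way to check the cabling beyond a visual inspection is SMART attribute 199 (UDMA_CRC_Error_Count), which many SATA drives increment on link-level transfer errors. A minimal sketch, assuming the drive reports that attribute and that `smartctl -A` prints its usual 10-column attribute table; `crc_errors` is a made-up helper name and `/dev/sda` below is a placeholder:

```shell
#!/bin/sh
# Sketch: print the raw value of SMART attribute 199 from `smartctl -A`
# text on stdin.
# Assumption: standard attribute table where column 1 is the ID and
# column 10 is RAW_VALUE.
crc_errors() {
    awk '$1 == "199" { print $10 }'
}
```

For example: `smartctl -A /dev/sda | crc_errors`. A value that keeps rising between runs would point at the cable or port; a steady zero makes the cabling a less likely culprit.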
lspci | grep SATA
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller (rev 04)
02:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)
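With two SATA controllers in the box, it can be worth confirming which one each disk is actually attached to. The sketch below assumes a Linux system where `/sys/block/sdX` resolves to a path containing the controller's PCI address; `pci_addr_of` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: map each sdX disk to the PCI address of the controller it sits on.
# Assumption: Linux sysfs layout, where the resolved path of /sys/block/sdX
# contains PCI addresses of the form dddd:bb:dd.f.

# Print the last PCI address found in a resolved sysfs path; the last one
# is the device closest to the disk, i.e. the SATA controller.
pci_addr_of() {
    printf '%s\n' "$1" |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' |
        tail -n 1
}

for d in /sys/block/sd*; do
    [ -e "$d" ] || continue   # skip when the glob matched nothing
    printf '%s -> %s\n' "${d##*/}" "$(pci_addr_of "$(readlink -f "$d")")"
done
```

In the output, `0000:00:1f.2` would correspond to the Intel controller and `0000:02:00.0` to the Marvell one in the lspci listing above.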
The Seagate drives are quite likely to be contributing to the problem because of their shingled nature. If they are unable to return data in time this may manifest as a CKSUM error.
The SSD log drive is unlikely to be contributing much of value unless you are connecting from Proxmox over NFS.
You haven’t really said much about your hardware configuration but, since you ask, why not run a memtest overnight? It can’t hurt and will give you some confidence in the stability of the memory.
Hi there, an update on the issue.
I’ve replaced the hard drives with new Seagate IronWolf 4TB drives (ST4000VN006-3CW104) and resilvered. I also deleted the dataset that had the issues and recreated it. I still got errors.
Next I replaced the drives’ SATA cables and connected them to the other controller, and I’m still having issues with this pool.
My second pool doesn’t suffer from such issues.
Now, when I try to identify the problematic files, I get an error:
root@truenas[/home/admin]# zpool status datapool -v
pool: datapool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 01:08:44 with 1 errors on Tue Dec 17 22:32:24 2024
config:
errors: List of errors unavailable: no such pool or dataset
My system is an Intel i5 (second gen) with 32GB RAM (4x8GB)
Intel(R) Core™ i5-2400 CPU @ 3.10GHz
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller (rev 04)
02:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)