Well, that's not the email you want to get at the completion of a resilver onto new, larger drives…
This is quite confusing, since I have never had any errors in this array previously and I run weekly scrubs. I recently added an L2ARC drive, set it to cache metadata only, and run this pre-init script: echo 0 > /sys/module/zfs/parameters/l2arc_headroom
I also added a SLOG. Both the L2ARC and SLOG are used SAS enterprise SSDs, of course. I can't say they are in perfect shape (though they are not reporting any errors), but aren't the SLOG and L2ARC non-critical to the pool? Even if something with them was wonky, would that really result in metadata errors on a scrub/resilver?
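(For completeness, the "metadata only" bit is just the standard dataset property, set along the lines of the following, with pergamum being the pool:)

zfs set secondarycache=metadata pergamum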
Last night I popped in 2 new 8 TB WD Reds (both verified via a badblocks run, no errors found) to replace 2 of my 4 TB drives (I am working through the pool, replacing 4 TB drives with 8 TB ones), and this is the error I woke up to. Prior to the resilver last night there were no errors in zpool status. After the resilver, I am seeing:
  pool: pergamum
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 4.65T in 09:12:38 with 3 errors on Mon Sep 8 09:50:43 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        pergamum                                  ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     6
            670dfb97-13fc-4611-bb0f-6680649d4089  ONLINE       0     0     0
            8c42800d-40d9-432f-b918-bd4138714187  ONLINE       0     0     0
            6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     6
            72baec5e-e358-4bbe-a8b0-dd75494f725d  ONLINE       0     0     6
            8a6e6dd2-465c-4311-b62e-cce797796faf  ONLINE       0     0    12
            7a9b8d5e-a28d-11ee-aaf2-0002c95458ac  ONLINE       0     0     6
            d9238765-4851-48c5-b3cc-1650c8de1364  ONLINE       0     0     0
            d3a5a104-011f-4602-ab04-90149d8863e8  ONLINE       0     0     6
            b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     6
        logs
          d4c96b7f-9ca8-46ab-836a-ca387309ac56    ONLINE       0     0     0
        cache
          8e380a80-b813-448b-9704-ed5689983c76    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x1b>
At this point, I am not entirely sure what to do. I run ECC RAM and an LSI SAS9305-16I in an HL15 case (so the drives are all plugged into a backplane, and the SAS cables have not been physically touched or adjusted in months).
Any thoughts on how I should proceed? At this point I think the best course of action is to shut down the system, but I really don't want to do anything that could result in further damage.
I do try to space out SMART tests, scrubs, etc. so they don't overlap. I will reconsider my schedules once I sort out this issue.
Yes, all 10 drives go to the same HBA, with 4 SAS cables from the HBA to the backplane. Of note, two of the drives without errors are the ones I installed last night; the other drive without errors is an older drive. The drives span a range of ages, as I have had to replace drives over the 10-year life of this system: a few are still original, many have been replaced over the years, a few are only months old, and 2 are brand new as of last night, as previously stated. Trying to provide as much backstory here as possible…
I have 3 fans in the front of the HL15, none in the middle, and a fan directly on top of the PCIe slots pointed at the HBA. That said, I run them at low RPM, which is "fine" for normal workloads. The fans are supposed to ramp up when the hard drives (or CPU) get warm, based on an old TrueNAS script, but I don't think the drives or CPU actually got warm… I bet the HBA did, though.
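For the curious, the gist of that kind of script is roughly the sketch below. This is not my actual script; the device names, the 45 C threshold, and the ipmitool fan-mode bytes (the common Supermicro ones) are placeholders and will differ on other boards.

#!/bin/bash
# Rough sketch of a temperature-driven fan ramp, not the original TrueNAS script.
# Device list, threshold, and the IPMI fan-mode bytes are assumptions.
DRIVES="/dev/sda /dev/sdb /dev/sdc"
HOT=45    # degrees C

max=0
for d in $DRIVES; do
    # ATA-style SMART attribute; SAS drives report temperature differently
    t=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10; exit}')
    [ -n "$t" ] && [ "$t" -gt "$max" ] && max="$t"
done

if [ "$max" -ge "$HOT" ]; then
    ipmitool raw 0x30 0x45 0x01 0x01    # Supermicro "full speed" fan mode (board-specific)
else
    ipmitool raw 0x30 0x45 0x01 0x00    # back to "standard" fan mode
fi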
I just took the top off the case, and I can hold my finger on the HBA heatsink, but the card is also not doing anything right now since I have halted all data to and from the NAS. After another hour of research, I am leaning towards the HBA having overheated while resilvering 2 drives last night (I have done this before without issue, but I have more bays populated in the front now with the SAS SSDs, so overall airflow is likely worse).
Unless you think this is a terrible idea, my plan is to do a scrub with the fans at 100%. Thoughts?
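(For the record, the plan is nothing fancier than the following, with the pool name from the status output above:)

zpool scrub pergamum
zpool status -v pergamum    # watch progress and any new errors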
Fans at 100, everything is cool to the touch when I check…
You anticipate it can clear the metadata error? What I don't fully understand is how it would do that, unless the error is not real? I always fear the long-debunked "scrub of death" in such a situation… but the scrub is 10% done, so I suppose we will see in a few hours.
It’s possible that because 7 disks in a RAIDZ2 vdev “errored” simultaneously, there was not enough parity available to properly read the ZFS metadata at that point in time. This gets registered as a “permanent error”.
If you correct the cause at fault (perhaps the HBA), it’s possible that all 7 “errored” disks will be able to work with ZFS without issue. A full scrub could possibly result in “0 errors”, because you won’t have 7 disks “failing” at the same exact time for the same exact data/metadata.
Even after a scrub that logs "0 errors", you'll likely still see the "permanent errors" and CKSUM counts in zpool status -v, but those can be cleared with zpool clear to reset the warnings, provided there really are no more outstanding errors.
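Concretely, once the scrub comes back clean, something like this should get the pool reporting healthy again (pool name taken from your status output; this is just the standard sequence, nothing exotic):

zpool status -v pergamum    # confirm the scrub finished with no new errors
zpool clear pergamum        # reset the CKSUM counters and the stale error report

If the <metadata>:<0x1b> entry still shows after the clear, it generally drops off once a subsequent scrub completes without re-detecting it.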
Ah, that does make sense. Thanks for the explainer!
I did a zpool clear prior to the scrub, but I suppose if it doesn't clear up the metadata "permanent error", there is no way to clear that fault? Seeing the pool as unhealthy for the rest of my life is probably going to kill me ;(
I was definitely sweating there for a bit. I have had issues in the past due to bad cables or poor plug connections, but usually that's 1, maybe 2 drives. I had never had a metadata issue before, which is quite scary.
But agreed with all of the above; the fact that 7 drives had seemingly related issues all at once… I was worried, but not terrified. Thankfully it worked out.