Permanent errors have been detected in the following files: <metadata>:<0x1b>

Well, that's not the email you want to get at the completion of a resilver onto new, larger drives…

This is quite confusing, since I have never had any errors in my array previously, and I run weekly scrubs. I recently added an L2ARC drive, set it to cache metadata only, and run this pre-init script: echo 0 > /sys/module/zfs/parameters/l2arc_headroom
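For reference, that setup boils down to roughly the following (pool name taken from the status output below; secondarycache is the property that controls metadata-only caching, and as I understand it the tunable just removes the scan limit on the L2ARC feed thread):

    # limit the cache vdev to metadata only (property set on the pool root dataset and inherited)
    zfs set secondarycache=metadata pergamum
    # the pre-init script mentioned above; 0 removes the headroom limit on the L2ARC feed scan
    echo 0 > /sys/module/zfs/parameters/l2arc_headroom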

I also added a SLOG. Both the L2ARC and the SLOG are used enterprise SAS SSDs. I can't say they are in perfect shape (they are not reporting any errors), but aren't SLOG and L2ARC non-critical to the pool? Even if something with them was wonky, would that result in metadata errors on a scrub/resilver?

Last night I popped in 2 new 8 TB WD Reds (both verified via a badblocks run, no errors found) to replace 2 of my 4 TB drives (I am gradually replacing drives from 4s to 8s; the replace procedure is sketched after the status output below), and this is the error I woke up to. Prior to the resilver last night there were no errors in zpool status. Upon the resilver, I am seeing:

  pool: pergamum
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 4.65T in 09:12:38 with 3 errors on Mon Sep  8 09:50:43 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	pergamum                                  ONLINE       0     0     0
	  raidz2-0                                ONLINE       0     0     0
	    ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     6
	    670dfb97-13fc-4611-bb0f-6680649d4089  ONLINE       0     0     0
	    8c42800d-40d9-432f-b918-bd4138714187  ONLINE       0     0     0
	    6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     6
	    72baec5e-e358-4bbe-a8b0-dd75494f725d  ONLINE       0     0     6
	    8a6e6dd2-465c-4311-b62e-cce797796faf  ONLINE       0     0    12
	    7a9b8d5e-a28d-11ee-aaf2-0002c95458ac  ONLINE       0     0     6
	    d9238765-4851-48c5-b3cc-1650c8de1364  ONLINE       0     0     0
	    d3a5a104-011f-4602-ab04-90149d8863e8  ONLINE       0     0     6
	    b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     6
	logs
	  d4c96b7f-9ca8-46ab-836a-ca387309ac56    ONLINE       0     0     0
	cache
	  8e380a80-b813-448b-9704-ed5689983c76    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x1b>
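
For the record, each swap followed the usual badblocks-then-replace pattern, roughly like this (device paths and the old-drive GUID are placeholders, not my actual commands):

    # destructive surface test on the new drive before trusting it
    badblocks -wsv /dev/sdX
    # swap it into the vdev; the old member is referenced by its GUID/partuuid from zpool status
    zpool replace pergamum <old-drive-guid> /dev/disk/by-id/<new-drive>
    zpool status pergamum   # watch the resilver progress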

At this point, I am not entirely sure what to do. I run ECC RAM and an LSI SAS9305-16i HBA in an HL15 case (so the drives are all plugged into a backplane, and the SAS cables have not been physically touched or adjusted in months).

Any thoughts on how I should proceed? At this point I think the best course of action is to shut down the system, but I really don't want to do anything that could result in further damage.

Are those 7 drives with checksum errors on the same HBA?


Not related to your issue, but worth mentioning:

Once a month is probably enough for scrubs. You also want to lessen the chances of resilvers, SMART tests, and scrubs overlapping with each other.
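Purely as an illustration (dates, paths, and devices here are made up; adjust to your system), staggering them in cron might look like:

    # scrub on the 1st of the month, long SMART tests two weeks later
    0 2 1 * * /usr/sbin/zpool scrub pergamum
    0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
    0 3 16 * * /usr/sbin/smartctl -t long /dev/sdb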


I do try to space out SMART tests, scrubs, etc. so they don't overlap. I will reconsider my schedules once I sort out the issue.

Yes, all 10 drives go to the same HBA, with 4 SAS cables from the HBA to the backplane. Of note, two of the drives without errors are the ones I installed last night; the other drive without errors is an older drive. A handful of these drives are new, and many of them are entirely different ages, as I have had to replace drives over the 10-year life of this system. A few are still original, many have been replaced over the years, a few are a few months old, and 2 are brand new as of last night, as previously stated. Trying to provide as much backstory here as possible…

I have 3 fans in the front of the HL15, none in the middle, and a fan directly on top of the PCIe slots pointed at the HBA. That said, I do run them at low RPM, which is "fine" for normal workloads. The fans are supposed to increase in speed when the hard drives (or CPU) get warm, based on an old TrueNAS script, but I don't think the drives or CPU actually got warm… but I bet the HBA did.
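For context, that script is basically this pattern: poll drive temps and bump fan PWM above a threshold. This is only a placeholder sketch (hwmon path, devices, and thresholds are hypothetical, not the real script), but it shows why a hot HBA never triggers a ramp-up: nothing in it looks at the HBA at all.

    #!/bin/sh
    # placeholder sketch of a temperature-based fan ramp; not the actual TrueNAS script
    PWM=/sys/class/hwmon/hwmon2/pwm1            # hypothetical PWM control file
    MAX=0
    for d in /dev/sd?; do
        t=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}')
        [ "${t:-0}" -gt "$MAX" ] && MAX=$t
    done
    if [ "$MAX" -ge 45 ]; then
        echo 255 > "$PWM"                       # drives running hot: full speed
    else
        echo 120 > "$PWM"                       # otherwise stay quiet
    fi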

I just took the top off the case; I can hold my finger on the HBA heatsink, but it's also not doing anything right now since I have halted all data to and from the NAS. After another hour of research, I am leaning towards the HBA having overheated during the resilver of the 2 drives last night (I have done this before without issue, but I have more drives populated in the front now with the SAS SSDs, so overall airflow is likely worse).

Unless you think this is a terrible idea, my plan is to do a scrub with the fans at 100%. Thoughts?

That sounds wise. On top of bringing down the temps, do the usual: check all connections on both ends, and check the seating of the HBA itself.

When multiple drives on the same HBA produce ZFS checksum errors simultaneously, it's usually the HBA and not the drives themselves.

Fans at 100%, everything is cool to the touch when I check…

You anticipate the scrub can clear the metadata error? What I don't fully understand is how it would do that, unless the error is not real? I always fear the long-debunked "scrub of death" in such a situation… but the scrub is 10% done, so I suppose we will see in a few hours.

It's possible that because 7 disks in a RAIDZ2 vdev "errored" simultaneously, there was not enough parity available to properly read the ZFS metadata at that point in time (RAIDZ2 can only reconstruct through 2 failures per stripe, so 7 at once is far beyond what parity can cover). This gets registered as a "permanent error".

If you correct the cause at fault (perhaps the HBA), it’s possible that all 7 “errored” disks will be able to work with ZFS without issue. A full scrub could possibly result in “0 errors”, because you won’t have 7 disks “failing” at the same exact time for the same exact data/metadata.

Even after a scrub that logs “0 errors”, you’ll likely still see “permanent errors” and “CKSUM errors” in a zpool status -v, but that can be cleared with zpool clear to reset the warnings if there really are no more outstanding errors.
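In other words, roughly this sequence once the cooling is sorted (pool name as above):

    zpool scrub pergamum        # full scrub with the fans fixed
    zpool status -v pergamum    # confirm 0 errors and check the logged error list
    zpool clear pergamum        # reset the CKSUM counters and warnings if nothing is outstanding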


Ah, that does make sense. Thanks for the explainer!

I did a zpool clear prior to the scrub, but I suppose if it doesn't clear up the metadata "permanent error", there is no way to clear that fault? Seeing the pool as unhealthy for the rest of my life is probably going to kill me ;(

I would try it again after a full, successful scrub that completes with 0 errors.


Woke up to a fully healthy pool. No errors reported at all, and the permanent metadata error is no longer being reported.

Thanks for the advice!


Sounds like it was the HBA after all. Keep those temps low. :+1:


And one Smiling Mask Award! :clown_face:


I was definitely sweating there for a bit. I have had issues in the past due to bad cables or poor plug connections, but usually that's 1, maybe 2 drives. I had never had a metadata issue before, which is quite scary.

But agreed with all of the above: the fact that 7 drives had seemingly related issues all at once… I was worried, but not terrified. Thankfully it worked out.