Checksum Errors During Resilvering - Help Navigating

Hi all,

Having some issues with checksum errors as I resilver my pool and would like some clarification to make sure I’m on the right path and not going to lose/corrupt data.

The situation: One of the vdevs in my pool is a 4-TB mirror, and I plan to swap both drives in series for 16-TB ones to expand the space in my pool. Full system information is at the end of this post. I have had checksum errors with this system before (without changing hardware) every few months, and usually a cable swap or reseat fixes it. I admit I haven’t been super rigorous about tracking this issue, but I just want to put it out there that this system has a history. All drives are <18 months old, so I think it’s the motherboard if anything, but that’s another story. The last of these errors was over 3 months ago, and the system has run continuously since.

Sequence of events

  1. Ensure system is up to date for current release channel (Dragonfish-24.04.2.5) and that there are no pending alerts/errors
  2. “Offline” the 4-TB drive I intend to replace from the GUI storage menu
  3. Shut down system
  4. Physically remove the 4-TB drive and replace it with a 16-TB drive. I did this with minimal disruption to other system components, and the other 4-TB drive did not get touched at all.
  5. Boot system
  6. Replace the 4-TB drive with the 16-TB drive from the web GUI, which starts the resilvering process (a rough CLI equivalent of steps 2 and 6 is sketched after this list)
  7. I notice the remaining 4-TB drive has two checksum errors. This worries me a little given the system’s history, but I decided to let the resilver complete
  8. Resilver finishes
  9. The 4-TB reference drive and new 16-TB drive each have two checksum errors. I get an alert “One or more devices has experienced an error resulting in data corruption. Applications may be affected.”
  10. Considering how I have solved this problem in the past, I shut down the system, reseat the SATA cables for both drives in question, and boot
  11. The storage menu no longer shows checksum errors for either drive. However, the zpool status command (output below) shows that there is still an error
  12. I started a scrub right before writing this post, since I assumed that would be the next logical step.
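
I did all of this through the web GUI, but for future readers, my understanding is that the rough CLI equivalent of steps 2 and 6 is something like the lines below. The angle-bracket placeholders stand for the partition UUIDs/device names shown in zpool status; I have not actually run these myself.

    # Step 2: take the old 4-TB mirror member offline
    zpool offline Tubby_Tummy <old-4TB-partition-uuid>

    # Step 6: after the physical swap, resilver onto the new 16-TB disk
    zpool replace Tubby_Tummy <old-4TB-partition-uuid> <new-16TB-device>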

I want to mention here that no data was written to the NAS during this process except for internal affairs (e.g. snapshots), and I still have the removed 4-TB drive and have not modified it. I also have a backup of the system but would really prefer to not rebuild a 28-TB pool over my 1-gig connection (yeah yeah I know get better networking).

My questions:
a) Considering the “reference” drive has not had its data changed, are the checksum errors an issue, assuming they don’t reappear during the scrub? My understanding is that checksum errors can be corrected by looking at another copy of the data (like on a parity drive), so while the mirror partner is no longer online, the drive hasn’t had its data changed since it was running with its partner (when I shut it down in step 3). I assume the “Permanent error” in the zpool status report is not going to correct itself after the scrub.

b) If the checksum errors do reappear on those drives, what do I do? I could try reseating cables, or can/should it be solved by reinstalling the other 4-TB drive so the two can reference each other, then retry the new drive installation/resilvering process? How would I go about telling ZFS that a drive it should already recognize is “back” so it knows the two 4-TBs should already match?
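
For what it’s worth, my guess is that if the old drive had only been offlined and not yet replaced, telling ZFS it is “back” would just be a zpool online along these lines (the UUID is a placeholder), but please correct me if that’s wrong:

    # Bring a previously offlined mirror member back online; ZFS should
    # resilver only whatever changed while it was offline
    zpool online Tubby_Tummy <old-4TB-partition-uuid>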

c) Why did the reboot clear the error from the alerts panel, but not the zpool status output at the command line? (I ran the command right after the resilver and again after reseating/rebooting, and it seems weird to have the error cleared in only one place.)

d) Is there anything I can do to prevent these checksum errors in the future? The drives are directly connected to the motherboard with Silverstone SATA cables. I’m sure there are better ones, but they’re not garbage. The system isn’t a hot, noisy, or vibration-heavy environment.

Thanks for reading, and I appreciate any help.

root@Tubby[~]# zpool status -v Tubby_Tummy
  pool: Tubby_Tummy
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Apr 14 22:22:35 2025
        10.6T / 23.3T scanned at 5.00G/s, 1.13T / 23.3T issued at 546M/s
        0B repaired, 4.84% done, 11:48:28 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        Tubby_Tummy                               ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            65b327f5-28de-11ee-b17a-047c16c6facd  ONLINE       0     0     0
            5ee564fd-4a3a-4d4d-826e-08b4bac65d84  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            dcacc904-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
            dc9b3b3c-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            b3cab4b1-3626-11ef-a228-047c16c6facd  ONLINE       0     0     0
            b3dc73fe-3626-11ef-a228-047c16c6facd  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Tubby_Tummy/Plex_Media@Auto_Tubby_Sch05_2024-08-04_00-00:<0x10178>

System Information
TrueNAS Scale Dragonfish-24.04.2.5

MSI Pro B550M-VC WiFi Motherboard
Ryzen 5 4600G CPU
32 GB ECC Memory

2x Kingston SA400S37 240-GB SSD (Boot drives)
1x Intel Optane 128-GB SSD (L2 ARC)

2x Seagate Ironwolf 4-TB HDDs (Swapping to 16-TB versions)
2x Seagate Ironwolf 8-TB HDDs
2x Seagate Ironwolf 16-TB HDDs
(1 pool, 3 mirrored vdevs, all connected directly to the motherboard with SATA cables)

EVGA 220-P2-0650-X1 PSU
Fractal Design Node 804 Case

Update: the scrub is about halfway done and has 9 hours remaining. The reference 4-TB drive and the 16-TB drive now each have two checksum errors, same as I noticed in step 7. The file ZFS reports an error in is an 8-month-old snapshot of my 20-TB media library. I am fine with losing that point in time, but since snapshots are differential, I assume deleting that snapshot would render all future snapshots useless. I have ZFS replication set up to a local backup, and that snapshot on the backup machine uses 333 MiB and references 11.6 TiB. I already manually held it. Is there a way to just replace that snapshot on my main machine from my backup machine (thus copying 333 MiB) rather than restoring the whole 20-TB dataset? I don’t have 20 TB of free space on my main machine, so I would have to directly replace the existing dataset vs. the recommended restore strategy of creating a copy.
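
What I had in mind was an incremental send of just that one snapshot from the backup machine, something along the lines below. The backup pool name, hostname, and base snapshot are placeholders, and I don’t know whether the receive would even be accepted given the destination already has that snapshot and newer ones.

    # On the backup machine: send only the delta between the previous good
    # snapshot and the corrupt one, piped to the main NAS
    zfs send -i Backup_Pool/Plex_Media@<previous-snapshot> \
        Backup_Pool/Plex_Media@Auto_Tubby_Sch05_2024-08-04_00-00 \
        | ssh tubby zfs recv Tubby_Tummy/Plex_Media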

root@Tubby[~]# zpool status -v
  pool: Tubby_Tummy
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Apr 14 22:22:35 2025
        16.3T / 23.3T scanned at 443M/s, 12.8T / 23.3T issued at 348M/s
        1M repaired, 54.99% done, 08:45:07 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        Tubby_Tummy                               ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            65b327f5-28de-11ee-b17a-047c16c6facd  ONLINE       0     0     2
            5ee564fd-4a3a-4d4d-826e-08b4bac65d84  ONLINE       0     0     2
          mirror-1                                ONLINE       0     0     0
            dcacc904-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
            dc9b3b3c-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            b3cab4b1-3626-11ef-a228-047c16c6facd  ONLINE       0     0     1  (repairing)
            b3dc73fe-3626-11ef-a228-047c16c6facd  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Tubby_Tummy/Plex_Media@Auto_Tubby_Sch05_2024-08-04_00-00:<0x10178>

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:25 with 0 errors on Wed Apr  9 03:45:27 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdh3    ONLINE       0     0     0
            sde3    ONLINE       0     0     0

errors: No known data errors

Non-PRO APUs such as your 4600G do not support ECC memory. Your RAM must be running in non-ECC mode.
You do have metadata corruption in a snapshot. Hopefully, the actual data is still sane, but you’re on track to losing data.

Possible suspects would be cables, an overloaded PSU, and failing RAM. I suggest running a few passes of MemTest.

No and no. You cannot move snapshot metadata without the underlying data, but the error is in one snapshot and you can delete this snapshot specifically—the question is whether other snapshots would then inherit the bad metadata.

Thank you for the tips! I will boot MemTest86 off a USB drive tonight to see what happens. I doubt the PSU is the issue since it’s a 650-W unit and the computer draws <100 W. The cables have been an ongoing issue but have never caused data corruption before. I assume it’s the combination of the cables/RAM and swapping a drive (thus breaking the mirror) that caused the data corruption?

So I can learn for the future: how do you know it’s the metadata of the snapshot and not the actual data in the snapshot?

Regardless, I need to deal with this metadata corruption. I am okay with losing the corrupted snapshot, but if I delete it, how can I tell if the other snapshots inherit the bad metadata? Would I need to run a scrub again and see if the next snapshot forward in time has this same error? Do you have any other suggestions on what to do (other than running MemTest, the results of which I will post here)?

How? (Let’s have a talk with the suspect…)

Permanent errors have been detected in the following files:
Tubby_Tummy/Plex_Media@Auto_Tubby_Sch05_2024-08-04_00-00:<0x10178>

Such <0xnumber> entries aren’t real files but ZFS metadata.
Indeed, the test would be to do another scrub after deleting the snapshot and see whether the error was cleared or was inherited by another snapshot.
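
Concretely, something like this from a shell on the NAS (snapshot name taken from your zpool status output):

    # Destroy the snapshot holding the corrupt metadata ...
    zfs destroy Tubby_Tummy/Plex_Media@Auto_Tubby_Sch05_2024-08-04_00-00

    # ... then scrub and check whether the error clears or moves to the
    # next snapshot
    zpool scrub Tubby_Tummy
    zpool status -v Tubby_Tummy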

Every few months I would get 1 checksum error on a drive or two. Various forum posts here led me to believe it was a cable/connection issue, so I would shut down, reseat cables, boot up, and scrub. Sometimes I had to repeat that 2-3 times to fix the error. I wasn’t super systematic about which motherboard ports/cables were problematic, and I did replace a few cables (these from Silverstone). I believe every drive had this error at one point, and since I was swapping which drives were plugged into which ports, I assumed the cables/motherboard ports were to blame. Annoying, but fixable. I now have a map of which motherboard port corresponds to which drive, and will be closely tracking it from here on out. When I did the swap I mentioned in the first post, I was careful not to change any drive/port pairings.
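
For completeness, my understanding is that the CLI version of that routine, after reseating cables and booting, is just resetting the error counters and re-verifying:

    # Reset the READ/WRITE/CKSUM counters once the suspected cable issue is fixed
    zpool clear Tubby_Tummy

    # Re-read and verify everything on disk
    zpool scrub Tubby_Tummy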

Another issue I’m having that may be helpful/related is that I seem to be chewing through boot drives every 6-ish months. I get errors shown in the attached screenshots, and I make them go away by replacing the affected boot drive. I can make another thread for this since it isn’t as big of a deal (it’s worth my time to spend $20 on a new cheap SSD for a temporary fix), but maybe it’s a cable or RAM issue and thus connected to my problem stated in this thread. I’d used cheap Kingston ones, similarly cheap PNY ones that I can’t seem to find online at the moment, and recently this Crucial one. The frequency of the “failure” makes me think it might be the system and not the SSD itself. I still have all the “failed” ones untouched and intend to test them at some point, but that’s low on my priority list.
I will finish the current scrub, do a memory test, and assuming at least one of my two sticks is good, delete the snapshot and rescrub. Thanks again.

Quick update just to leave a paper trail for those who may have the same issue in the future.

I ran two memory tests with MemTest86. The hard drives were unplugged during the tests for good measure. Interestingly, the two runs were only 8 seconds apart in total time over a 4.5-hr run, a 0.05% difference. No memory errors. After the second test I shut down the system and re-plugged the hard drives, making sure to seat the SATA cables well. I kept the same HDD/cable/motherboard-port pairings described earlier in this thread.

Upon booting TrueNAS, I noticed the checksum errors for the drives disappeared (sde and sdh each had two before). The output of zpool status still shows that one snapshot has corrupted metadata as expected (see output).

I deleted that snapshot and ran the command again. The error still shows up, just without the name of the snapshot this time, which I guess makes sense since it was deleted. The address on the end matches. I started a scrub and will report back if it fixes the error. The pool status is listed as unhealthy, but from what I can tell that’s only because of the error in the last scrub.

root@Tubby[~]# zpool status -v
  pool: Tubby_Tummy
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1M in 1 days 03:38:58 with 1 errors on Wed Apr 16 02:01:33 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        Tubby_Tummy                               ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            65b327f5-28de-11ee-b17a-047c16c6facd  ONLINE       0     0     0
            5ee564fd-4a3a-4d4d-826e-08b4bac65d84  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            dcacc904-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
            dc9b3b3c-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            b3cab4b1-3626-11ef-a228-047c16c6facd  ONLINE       0     0     0
            b3dc73fe-3626-11ef-a228-047c16c6facd  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x8a09>:<0x10178>

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:16 with 0 errors on Wed Apr 16 03:45:18 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdg3    ONLINE       0     0     0
            sda3    ONLINE       0     0     0

errors: No known data errors
root@Tubby[~]# 

As @etorix feared, the completed scrub shows that the metadata in the next snapshot is now corrupted. Output of zpool status below. I think I have a few ways to fix this, listed in order of ease, and would like to make sure they would all work.

  1. The dataset in question, Plex_Media, is just movies. I only ripped a few since last August (when the corrupt snapshot was taken), so the fastest way to recover may be to copy the movies I added since August to an external location, roll back to a local snapshot pre-corruption (see the sketch after this list), and then just add the movies back to the NAS. A few hundred GB is a far easier move than 20 TB, which is the second option. I only consider this because the snapshot metadata, not the actual data, is corrupt.

  2. The other option, of course, is to restore the Plex_Media dataset from a recent snapshot stored on the backup machine. No file shuffling this way, but I’m (arguably needlessly) moving 20 TB around.
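
A sketch of option 1, with the last clean snapshot before the corruption as a placeholder. Note that a recursive rollback destroys every snapshot newer than the rollback target, so the movies added since then need to be copied off first.

    # Roll the dataset back to the last snapshot before the corruption.
    # -r destroys all snapshots taken after the rollback target.
    zfs rollback -r Tubby_Tummy/Plex_Media@<last-good-snapshot>

    # Scrub to confirm the corrupt metadata is gone, then copy the new movies back in
    zpool scrub Tubby_Tummy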

I also added the second drive I mentioned wanting to upgrade in my original post, because the corrupt snapshot is now an issue irrespective of another drive swap. That way, if anything goes bad with the second swap, I can restore all potentially corrupt files in one fell swoop. The checksum errors are also back; I’m hoping those go away after a restore and/or more cable reseating.

root@Tubby[~]# zpool status -v
  pool: Tubby_Tummy
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1M in 1 days 03:05:24 with 1 errors on Fri Apr 18 03:02:07 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        Tubby_Tummy                               ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            65b327f5-28de-11ee-b17a-047c16c6facd  ONLINE       0     0     2
            5ee564fd-4a3a-4d4d-826e-08b4bac65d84  ONLINE       0     0     2
          mirror-1                                ONLINE       0     0     0
            dcacc904-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
            dc9b3b3c-2e85-11ee-93a8-047c16c6facd  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            b3cab4b1-3626-11ef-a228-047c16c6facd  ONLINE       0     0     1
            b3dc73fe-3626-11ef-a228-047c16c6facd  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Tubby_Tummy/Plex_Media@Auto_Tubby_Sch05_2024-08-11_00-00:<0x10178>

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:16 with 0 errors on Wed Apr 16 03:45:18 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdg3    ONLINE       0     0     0
            sda3    ONLINE       0     0     0

errors: No known data errors

As long as the error is in snapshots, you can keep deleting snapshots and not lose data.
Assuming you do not care to revert the collection to a previous state (“add only, never delete”?), I’d delete all snapshots and scrub again. If the pool comes back clean, pat yourself on the back and take a new snapshot.
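
A minimal sketch of that, assuming every snapshot to delete lives under Tubby_Tummy/Plex_Media (review the list before piping it into destroy):

    # List every snapshot of the dataset and destroy them one by one
    zfs list -H -t snapshot -o name -r Tubby_Tummy/Plex_Media | xargs -n 1 zfs destroy

    # Scrub; if it comes back clean, take a fresh snapshot
    zpool scrub Tubby_Tummy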