Increasing capacity - 1 x RAIDZ2 | 5 wide | 5.46 TiB: Resilvering repeats on previously successful replacement

Very strange behaviour.

A few days ago I started, with great trepidation, to double the capacity of the above pool using 5 brand new WD Red Pro disks.

The first one, 1/5, completed ok with a couple of CKSUM errors, which I cleared; the pool then ran for a while with no issues, everything online and zero errors. No data was ever affected.
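For anyone wanting the shell equivalent, clearing is just the standard zpool clear:

root@rex[~]# zpool clear re1          # reset READ/WRITE/CKSUM counters for the whole pool
root@rex[~]# zpool status -v re1      # confirm everything is back to zero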

I followed the same process to replace the next drive, 2/5:

  • Offline
  • Physically remove old disk
  • Insert new disk - same slot
  • Hit ‘replace’ and select the drive from the dropdown list (only one entry available, so foolproof, eh?)
  • Hit the button and confirm (the CLI equivalent is sketched below)
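
For reference, the shell equivalent of those steps would be something like this - the partuuid and target device below are placeholders, and on TrueNAS the GUI remains the supported route:

root@rex[~]# zpool offline re1 /dev/disk/by-partuuid/<old-disk-partuuid>
# ...physically swap the disk in the same slot...
root@rex[~]# zpool replace re1 /dev/disk/by-partuuid/<old-disk-partuuid> /dev/sdX
root@rex[~]# zpool status -v re1      # the resilver onto the new disk starts immediately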

After some time the progress indicator showed reasonable values for percentage complete and time remaining. This is where the weird stuff started.

The first disk I had replaced earlier, which had been running successfully for a while, suddenly developed a few CKSUM errors (53) at about the same time the ‘replacing’ vdev showed a similar figure (68). Disk 1/5 then also decided to resilver again, even though it was not part of the current replacement process for 2/5.

Although I was worried, after some research, and having built some faith in the process, I waited for it to complete, at which point (2 days 2 hours later) both resilvering activities finished ok.

Again, the pool ran ok for a while with no errors, and everything was online and active as expected.

I am now replacing 3/5 and the same thing has happened. After about an hour, 2/5 threw some errors and spontaneously started to resilver:

[…]

root@rex[~]# zpool status -v -LP re1
pool: re1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Dec 22 18:43:00 2025
5.03T / 21.6T scanned at 242M/s, 2.99T / 21.6T issued at 144M/s
621G resilvered, 13.85% done, 1 days 13:39:31 to go
config:

    NAME                                                              STATE     READ WRITE CKSUM
    re1                                                               DEGRADED     0     0     0
      raidz2-0                                                        DEGRADED     0     0     0
        /dev/sdh2                                                     ONLINE       0     0     0
        /dev/sdk1                                                     ONLINE       0     0    13  (resilvering)
        /dev/sda2                                                     ONLINE       0     0     0
        /dev/sde1                                                     ONLINE       0     0     0
        replacing-4                                                   DEGRADED     0     0    11
          /dev/disk/by-partuuid/c2400e8f-3476-483d-a675-873e32aff0b8  OFFLINE      0     0     0
          /dev/sdf1                                                   ONLINE       0     0     0  (resilvering)

errors: No known data errors

[…]

As you can see, sdk1 is 2/5 and had been running error-free until I started the replacement for 3/5, whose new disk is sdf1.

Throughout this process, the GUI never showed the repeated resilver, only the current one.

So, I am hoping this is not serious, and that when I finally replace 5/5 I will be able to expand the pool into the new capacity.
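For that final step, my understanding is it is either the autoexpand property or an explicit online -e per disk; a sketch, with the device path as a placeholder:

root@rex[~]# zpool set autoexpand=on re1      # grow the pool automatically once all disks are bigger
root@rex[~]# zpool online -e re1 /dev/sdX1    # or expand each replaced disk explicitly
root@rex[~]# zpool list re1                   # SIZE should then show the doubled capacity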

;-^}

P

What is the model of the new drives? Need to ensure they are not SMR. Also, did you burn the drives in, or just open the box and install? Third, do you have an HBA card, or are you running this off an onboard controller?
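
From a shell, smartctl will show the exact model string, and badblocks is the usual (destructive) burn-in - the device name below is just an example:

root@rex[~]# smartctl -i /dev/sdf       # prints model, serial, firmware
root@rex[~]# badblocks -wsv /dev/sdf    # destructive write test - ONLY on a disk with no data
root@rex[~]# smartctl -t long /dev/sdf  # finish with an extended SMART self-test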

Hi Theo,

Thanks for the response.

The new drives are all WDC_WD120EFBX-68B0EN0 and the old ones are all WDC_WD60EFPX-68C5ZN0, so I am guessing these are all kosher for this application. Although I did not burn them in, I allowed them to acclimatise in the server cabinet for 24 hours before using them.

Since my original post I have shut down all apps, VMs and containers to mitigate the risk of further failure. The ETA is reducing apace and the error count has not risen since. In hindsight, perhaps I should have done this at the outset, but I was pushing the envelope.

Fingers crossed, this will complete successfully. I will keep it quiesced until 5/5 is done and dusted.

The enclosure is a TrueNAS (iX Systems) Mini R. I am hoping these errors are simply a sign that the hardware was struggling, and that reducing the load, as I have, will improve the situation.

MTIA

;-}

P

Ok, so now I am convinced there is a reporting issue in the zpool executable.

Before I quiesced the system to improve resilvering throughput, another disk/vdev threw an error (just the one CKSUM), which immediately caused the system to indicate that disk, too, was being resilvered. So there were then THREE disks in this state.

These are brand new disks, so I cannot imagine them all being this bad right at the outset. This is also borne out by the fact that when the original resilver completed, they ALL completed at the very same time. I suspect any CKSUM error raised by another disk/vdev confuses the status in such a way as to fool zpool into flagging it with the same ‘resilvering’ state as the original replacement process.
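If anyone wants to dig into what actually raises the extra ‘(resilvering)’ flags, the pool event log should show the underlying checksum and scan events; for example:

root@rex[~]# zpool events re1             # summary of recent events for the pool
root@rex[~]# zpool events -v re1 | less   # full detail, including checksum error records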

I have now started 4/5, after clearing the errors and leaving the system in a quiescent state. It is running perfectly, at about four times the speed of the previous replacements:

[…]

pool: re1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Dec 23 23:07:27 2025
6.12T / 21.7T scanned at 1.18G/s, 2.44T / 21.7T issued at 481M/s
500G resilvered, 11.24% done, 11:38:02 to go
config:

    NAME                                                              STATE     READ WRITE CKSUM
    re1                                                               DEGRADED     0     0     0
      raidz2-0                                                        DEGRADED     0     0     0
        replacing-0                                                   DEGRADED     0     0     0
          /dev/disk/by-partuuid/6426081c-a789-42cb-947c-4875d2ea966f  OFFLINE      0     0     0
          /dev/sdh1                                                   ONLINE       0     0     0  (resilvering)
        /dev/sdk1                                                     ONLINE       0     0     0
        /dev/sda2                                                     ONLINE       0     0     0
        /dev/sde1                                                     ONLINE       0     0     0
        /dev/sdf1                                                     ONLINE       0     0     0

errors: No known data errors
Wed Dec 24 00:35:53 CET 2025

root@rex[~]# while sleep 30; do clear; zpool status -v -LP re1; date; done

[…]

So, no errors and running well - exactly as expected. The previous replacements took about 2 days 4 hours, so this is a vast improvement, albeit with very limited access to the system for the duration. A small price to pay.
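For what it is worth, besides quiescing, OpenZFS on Linux also exposes a tunable that trades foreground I/O for resilver speed; a sketch, assuming the stock module parameter (check your version's documentation before touching it):

root@rex[~]# cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms   # default 3000 ms per txg
root@rex[~]# echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
# revert to the default once the resilver completes
root@rex[~]# echo 3000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms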

Aside from all this, I was impressed with the ease of physically replacing the disks. The slots were easy to access and there was surprisingly little dust after 2 years of continuous uptime. The disks were also amazingly cool, barely above room temperature, so the new disks were almost immediately at operating temperature for this enclosure.

;-}

P

Final update:

Well, my euphoria was short-lived on 4/5:

[…]

pool: re1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Dec 23 23:07:27 2025
21.7T / 21.7T scanned, 17.5T / 21.7T issued at 355M/s
3.52T resilvered, 80.65% done, 03:26:27 to go
config:

    NAME                                        STATE     READ WRITE CKSUM
    re1                                         DEGRADED     0     0     0
      raidz2-0                                  DEGRADED     0     0     0
        replacing-0                             DEGRADED     0     0 1.08K
          6426081c-a789-42cb-947c-4875d2ea966f  OFFLINE      0     0     0
          sdh1                                  ONLINE       0     0     0  (resilvering)
        sdk1                                    ONLINE       0     0     1  (resilvering)
        sda2                                    ONLINE       0     0     0
        sde1                                    ONLINE       0     0     0
        sdf1                                    ONLINE       0     0 1.83K  (resilvering)

errors: No known data errors

[…]

However, finally, 5/5 was a different story:

[…]

pool: re1
state: ONLINE
scan: resilvered 4.36T in 16:14:48 with 0 errors on Thu Dec 25 10:32:03 2025
config:

    NAME           STATE     READ WRITE CKSUM
    re1            ONLINE       0     0     0
      raidz2-0     ONLINE       0     0     0
        /dev/sdh1  ONLINE       0     0     0
        /dev/sdk1  ONLINE       0     0     0
        /dev/sda1  ONLINE       0     0     0
        /dev/sdc1  ONLINE       0     0     0
        /dev/sdf1  ONLINE       0     0     0

errors: No known data errors
Thu Dec 25 17:34:11 CET 2025

[…]

Although all the disks are from the same manufacturer, I am now wondering if mixed capacities, geometries and rotational speeds (5400/7200 rpm) may have played a part in this little misadventure.

As 5/5 was the last to be replaced and completed the set of identical drives, this may explain why it ran clean: there were no longer any older, disparate geometries to cause ZFS distress.

I will be performing a similar exercise with the now-retired five 6 TB disks. I have a smaller pool, with much less data stored, which comprises disks of various pedigrees - Seagate, WD and HGST - and mixed rotational speeds. Although it has never failed, the disks are only 4 TB, so replacing them with identical 6 TB drives seems like a good idea. Time will tell.

;-}

P

Epilogue:

Finally, I was able to scrub everything squeaky clean:
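Nothing clever required - just the standard scrub on each pool, then a status check:

root@rex[~]# zpool scrub re1     # likewise for re2 and re3
root@rex[~]# zpool status -v     # with no pool named, reports on all pools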

[…]

pool: re1
state: ONLINE
scan: resilvered 4.36T in 16:14:48 with 0 errors on Thu Dec 25 10:32:03 2025
config:

    NAME           STATE     READ WRITE CKSUM
    re1            ONLINE       0     0     0
      raidz2-0     ONLINE       0     0     0
        /dev/sdi1  ONLINE       0     0     0
        /dev/sdl1  ONLINE       0     0     0
        /dev/sda1  ONLINE       0     0     0
        /dev/sdd1  ONLINE       0     0     0
        /dev/sdg1  ONLINE       0     0     0

errors: No known data errors

pool: re2
state: ONLINE
scan: scrub repaired 0B in 01:21:52 with 0 errors on Sat Dec 27 23:19:21 2025
config:

    NAME           STATE     READ WRITE CKSUM
    re2            ONLINE       0     0     0
      raidz2-0     ONLINE       0     0     0
        /dev/sdk1  ONLINE       0     0     0
        /dev/sdj1  ONLINE       0     0     0
        /dev/sdf1  ONLINE       0     0     0
        /dev/sdc1  ONLINE       0     0     0
        /dev/sdb1  ONLINE       0     0     0

errors: No known data errors

pool: re3
state: ONLINE
scan: scrub repaired 0B in 05:39:20 with 0 errors on Sat Dec 27 09:10:24 2025
config:

    NAME         STATE     READ WRITE CKSUM
    re3          ONLINE       0     0     0
      /dev/sdh2  ONLINE       0     0     0
      /dev/sde1  ONLINE       0     0     0

errors: No known data errors

Mon Dec 29 17:15:46 CET 2025

[…]

But it was a real mixed bag. After much hair-pulling, eye-rolling and gnashing of teeth, I lost some files and had to sacrifice some containers and all the volumes buried in the ‘experimental’ area, as described by the warning:

[…]

Containers powered by Incus are experimental and only recommended for advanced users. Make all configuration changes using the TrueNAS UI. Operations using the command line are not supported.

[…]

I did try the CLI to attempt recovery of the files reportedly with permanent errors. Unfortunately, as there were permanent errors in the metadata, any attempt to use zfs send/recv to copy or restore files was thwarted, as it checks the metadata before anything else.
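For the record, the attempt looked something like this (dataset and snapshot names are placeholders); the send bailed out early on the damaged metadata, exactly as described:

root@rex[~]# zpool status -v re2                          # lists the files with permanent errors
root@rex[~]# zfs snapshot re2/containers@rescue           # snapshot the affected dataset
root@rex[~]# zfs send re2/containers@rescue | zfs recv re1/rescued   # fails once it hits bad metadata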

Luckily, I had heeded the warning, so I did not use containers for anything important - really just as a sandbox. But I tried as much as possible to recover data, as practice and experience for future occasions when I may need it in earnest.

The pool with the worst experience, re2, is now squeaky clean and working well now that it has homogeneous drives inherited from re1, which was the first to be expanded.

Happy New Year.

;-}

/p
