How to recover pool / disks

I caused an issue that required me to reinstall TrueNAS-13.0-U6.1. There were a couple of hard reboots. After re-installation, the pool status was degraded. The pool is raidz1 (I know), 7 x 10TB HDD.

I replaced one disk that was banging around loudly and let it resilver successfully. One other disk also resilvered successfully. zpool status shows four disks with “too many errors”, and there is one disk that doesn’t appear in zpool status at all, but I do see it in the GUI as being in pool “N/A”.

I have run short offline SMART tests on all disks, and they all come back without errors. I have SMART tests scheduled, both short and long, and there are no errors on any disks that I can see. I scrub monthly.
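For reference, a minimal sketch of how those short offline tests can be run and checked from the Core shell (ada2 is just an example device name; repeat for each disk):

# start a short offline self-test on one disk
smartctl -t short /dev/ada2
# a couple of minutes later, review the self-test log and health summary
smartctl -a /dev/ada2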

The disks with “too many errors” … can I just zpool clear those?
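For context, zpool clear only resets the error counters; it does not repair anything. A minimal sketch, using the pool name zpool1 from the status output below:

# clear error counters for the whole pool...
zpool clear zpool1
# ...or for a single device, by its gptid
zpool clear zpool1 gptid/6dfdf696-d2ce-11ea-87fd-d8cb8aa05b52
# then watch whether the errors come back
zpool status -v zpool1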

The disk that is not attached to the pool… what do I do with that one?

root@nas[/]# zpool status
  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors

  pool: zpool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: Message ID: ZFS-8000-9P — OpenZFS documentation
  scan: resilvered 7.34G in 22:11:31 with 0 errors on Sat Apr 6 18:15:31 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool1                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/6dfdf696-d2ce-11ea-87fd-d8cb8aa05b52  DEGRADED     0     0     0  too many errors
            gptid/6e021a55-d2ce-11ea-87fd-d8cb8aa05b52  ONLINE       0     0     0
            gptid/392ebb9c-f090-11ee-8013-a0369f3fee8c  ONLINE       0     0     0
            gptid/6e072261-d2ce-11ea-87fd-d8cb8aa05b52  DEGRADED     0     0     0  too many errors
            gptid/bce14ebd-04f1-11eb-bead-a0369f3fee8c  DEGRADED     0     0     0  too many errors
            gptid/6e09ae36-d2ce-11ea-87fd-d8cb8aa05b52  DEGRADED     0     0     0  too many errors
        logs
          gptid/e8fb6a7f-d8f8-11ea-a4f9-a0369f3fee8c    ONLINE       0     0     0
        cache
          gptid/e802d832-d8f8-11ea-a4f9-a0369f3fee8c    ONLINE       0     0     0

errors: No known data errors
root@nas[/]#

sorry for the formatting…

The first thing to do is to take a full backup of everything. RAIDZ1 with a single drive gone cannot take any more hits. Considering there are errors on the other drives, it is highly probable that you have already lost some data.

So recover whatever can be recovered first and do that right now.

Once that is done, there is little benefit in trying to save a pool that has been damaged to that point. So once you have salvaged whatever you can, destroy the pool and re-create a better one using at least RAIDZ2.
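If you want to do the salvage from the command line, a rough sketch using ZFS replication (the dataset name zpool1/data and the destination pool name backup are hypothetical; substitute your own):

# snapshot the dataset you want to save
zfs snapshot -r zpool1/data@rescue
# send it to another pool, e.g. one built on the new backup disks
zfs send -R zpool1/data@rescue | zfs receive -v backup/data

rsync to any other filesystem also works if a second ZFS pool is not available.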


The pool is now online with six disks instead of seven. All seven disks are now showing no errors. I am waiting for sufficient backup drive capacity to arrive. In the meantime, how can I get that seventh disk back online?

TrueNAS recognizes its disks by their gptid. The fact that the disk is not listed at all by the zpool command suggests that it is not recognized at all. Is the drive listed in the WebUI under Storage / Disks?

If it is, are the details shown correct (serial number, size)?
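If the WebUI is ambiguous, a quick sketch of the same check from the Core shell (ada3 is an example device name):

# list every device the kernel currently sees
camcontrol devlist
# print model, serial number and capacity for one drive
smartctl -i /dev/ada3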

In any case, it is not such a good idea to re-add that drive:
– That drive is known to be defective and will not provide you with safety.
– Your other drives recently produced too many errors, so they are not that reliable either. Resilvering a new drive from them will push them very hard and will likely generate more of these errors.

So for now, it is best to back up all you can and go easy on that pool and those drives.


Yes, and I am confused by this.

Yes. What really confuses me is that zpool status shows the pool is still raidz1-0 with the full capacity I had before the incident, yet without the 7th disk being seen by the pool. How could that be?

root@nas[/]# zpool status
  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors

  pool: zpool1
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 7.34G in 22:11:31 with 0 errors on Sat Apr  6 18:15:31 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool1                                          ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/6dfdf696-d2ce-11ea-87fd-d8cb8aa05b52  ONLINE       0     0     0
            gptid/6e021a55-d2ce-11ea-87fd-d8cb8aa05b52  ONLINE       0     0     0
            gptid/392ebb9c-f090-11ee-8013-a0369f3fee8c  ONLINE       0     0     0
            gptid/6e072261-d2ce-11ea-87fd-d8cb8aa05b52  ONLINE       0     0     0
            gptid/bce14ebd-04f1-11eb-bead-a0369f3fee8c  ONLINE       0     0     0
            gptid/6e09ae36-d2ce-11ea-87fd-d8cb8aa05b52  ONLINE       0     0     0
        logs
          gptid/e8fb6a7f-d8f8-11ea-a4f9-a0369f3fee8c    ONLINE       0     0     0
        cache
          gptid/e802d832-d8f8-11ea-a4f9-a0369f3fee8c    ONLINE       0     0     0

errors: No known data errors
root@nas[/]# 

I suspect that all the errors (with the exception of the one failed disk, which was making clicking sounds) were caused by my rough handling of the box (hard shutdowns), were “transient”, and were not caused by disk failure. When I look at historical SMART tests, there are no errors. When I run short SMART tests, there are no errors.

I agree 100% on all your other points, of course. My data set is around 43TB, and it’s too expensive for me to fully back it up. The really critical stuff is backed up to a 6TB disk in another PC, and I’m waiting for two 20TB disks to arrive tomorrow; I’ll back up the rest then.

You keep mentioning “7 disks”, yet your output shows 6 data disks + 1 SLOG + 1 cache.

Where is the evidence for the existence of this missing 10 TB disk?

If a RAIDZ1 vdev is missing an entire disk, it would not report itself as “ONLINE”, but rather “DEGRADED”. By all means, your vdev (in the latest post) seems to be healthy and fully operational with all six (6) data disks accounted for.


EDIT: Since you are on Core, this can give you an idea of all the available partitions and their respective GPTIDs:

glabel list | grep Name | cut -c 10-

You can compare the list to your pool status output. If all devices (data, SLOG, and cache) are accounted for, then maybe you remembered incorrectly that you used seven (7) data disks for the RAIDZ1 vdev?
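One possible way to automate that comparison, as a sketch (the /tmp file names are arbitrary):

# every gptid label known to GEOM
glabel status | awk '/^gptid\// {print $1}' | sort > /tmp/all_gptids
# every gptid referenced by the pool
zpool status zpool1 | grep -o 'gptid/[0-9a-f-]*' | sort > /tmp/pool_gptids
# labels that exist on disk but are not part of the pool
comm -23 /tmp/all_gptids /tmp/pool_gptids

Boot and swap partition labels will also appear in the difference, so expect a few extra lines.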

root@nas[/]# glabel status
                                      Name  Status  Components
gptid/40536d02-f079-11ee-9cab-a0369f3fee8c     N/A  ada0p1
gptid/392ebb9c-f090-11ee-8013-a0369f3fee8c     N/A  da4p2
gptid/406690f1-f079-11ee-9cab-a0369f3fee8c     N/A  ada1p1
gptid/6e072261-d2ce-11ea-87fd-d8cb8aa05b52     N/A  da3p2
gptid/6e09ae36-d2ce-11ea-87fd-d8cb8aa05b52     N/A  da5p2
gptid/6dfdf696-d2ce-11ea-87fd-d8cb8aa05b52     N/A  ada2p2
gptid/6e07c4c1-d2ce-11ea-87fd-d8cb8aa05b52     N/A  ada3p2
gptid/6e021a55-d2ce-11ea-87fd-d8cb8aa05b52     N/A  ada4p2
gptid/e8fb6a7f-d8f8-11ea-a4f9-a0369f3fee8c     N/A  da0p1
gptid/e802d832-d8f8-11ea-a4f9-a0369f3fee8c     N/A  da1p1
gptid/bce14ebd-04f1-11eb-bead-a0369f3fee8c     N/A  da2p2
gptid/6dffbc93-d2ce-11ea-87fd-d8cb8aa05b52     N/A  ada3p1
root@nas[/]# 

The very last disk, ada3p1, is the disk that appears in the GUI under “Disks”, but not in the zpool. I keep calling it the 7th disk because it’s my 7th 10TB HDD… serial VCG6LSXN.


root@nas[/]# ./DiskInfo.sh

DiskInfo.sh - Mounted Drives on nas.hudsonvalleynetworks.com
TrueNAS-13.0-U6.1 (6bf2413add)
Sun Apr 7 10:56:40 EDT 2024

+========+==========================+==================+============================================+
| Device | DISK DESCRIPTION         | SERIAL NUMBER    | GPTID                                      |
+========+==========================+==================+============================================+
| ada0p1 | SATA3 60GB SSD           | 2020060800051    | gptid/40536d02-f079-11ee-9cab-a0369f3fee8c |
+--------+--------------------------+------------------+--------------------------------------------+
| da4p2  | ATA WDC WD101EFBX-68     | VCJ9AAHP         | gptid/392ebb9c-f090-11ee-8013-a0369f3fee8c |
+--------+--------------------------+------------------+--------------------------------------------+
| ada1p1 | SATA3 60GB SSD           | 2020060500054    | gptid/406690f1-f079-11ee-9cab-a0369f3fee8c |
+--------+--------------------------+------------------+--------------------------------------------+
| da3p2  | ATA WDC WD100EFAX-68     | 1EHLNZPZ         | gptid/6e072261-d2ce-11ea-87fd-d8cb8aa05b52 |
+--------+--------------------------+------------------+--------------------------------------------+
| da5p2  | ATA WDC WD100EFAX-68     | 1EHLL5HZ         | gptid/6e09ae36-d2ce-11ea-87fd-d8cb8aa05b52 |
+--------+--------------------------+------------------+--------------------------------------------+
| ada2p2 | WDC WD101EFAX-68LDBN0    | VCG6YLVN         | gptid/6dfdf696-d2ce-11ea-87fd-d8cb8aa05b52 |
+--------+--------------------------+------------------+--------------------------------------------+
| ada3p2 | WDC WD101EFAX-68LDBN0    | VCG6LSXN         | gptid/6e07c4c1-d2ce-11ea-87fd-d8cb8aa05b52 |
+--------+--------------------------+------------------+--------------------------------------------+
| ada4p2 | WDC WD101EFAX-68LDBN0    | VCG70SUN         | gptid/6e021a55-d2ce-11ea-87fd-d8cb8aa05b52 |
+--------+--------------------------+------------------+--------------------------------------------+
| da0p1  | ATA Crucial_CT500MX2     | 154511017943     | gptid/e8fb6a7f-d8f8-11ea-a4f9-a0369f3fee8c |
+--------+--------------------------+------------------+--------------------------------------------+
| da1p1  | ATA Crucial_CT512MX1     | 14380D48679E     | gptid/e802d832-d8f8-11ea-a4f9-a0369f3fee8c |
+--------+--------------------------+------------------+--------------------------------------------+
| da2p2  | ATA ST10000NM0016-1T     | ZA271L8C         | gptid/bce14ebd-04f1-11eb-bead-a0369f3fee8c |
+--------+--------------------------+------------------+--------------------------------------------+
| ada3p1 | WDC WD101EFAX-68LDBN0    | VCG6LSXN         | gptid/6dffbc93-d2ce-11ea-87fd-d8cb8aa05b52 |
+--------+--------------------------+------------------+--------------------------------------------+
root@nas[/]#


The 60GB disks are my SSD boot mirror, and the Crucials are log and cache.

Also, I had the box open all week... there are definitely seven 10TB HDDs in there, all cabled and powered on... I can run SMART tests without error on all seven disks... but zpool only sees six, and still thinks it's raidz1. That's why I keep asking how to get that disk back in the pool.

Did resilvering with that disk not connected to the pool blow it away somehow?  I didn't realize it was missing until it was too late :(
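A sketch of how the unattached disk can be inspected without touching the pool (ada3 is the device from the glabel output above):

# show the drive's partition layout
gpart show ada3
# dump any ZFS labels present on the data partition
zdb -l /dev/ada3p2

If zdb finds no valid label, that partition is not currently holding data for this pool.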

Are you sure you actually included it the first time you created this RAIDZ1 vdev?

As of this writing, there is no RAIDZ expansion, so once you create a RAIDZ vdev, it’s a done deal.

If it’s with 6 data disks? You’re stuck with a RAIDZ1 of 6 data disks.

If it’s with 7 data disks? You’re stuck with a RAIDZ1 of 7 data disks.

There is no “shrinking” the RAIDZ1 vdev to 6 data disks, down from 7.
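One way to settle the question is to look at the pool's own history, which records the original create command and every device it included. A minimal sketch:

# the first entries show the original 'zpool create' with its member devices
zpool history zpool1 | head -20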

Are you sure you actually included it the first time you created this RAIDZ1 vdev?

Uh… you know, I built this box back in 2019. Now that you mention it, I think that I might have had one of those seven disks as a hot spare and then disabled that at some point. I think I’ve had a senior moment.

My bad. I’ll rebuild as raidz2 this week.


Warm spare it is.

(Not hot, not cold)


Not a fan of warm spares in a home setting where you can readily get to and replace drives. They make a lot more sense for remote systems. I buy drives, I qualify them with the usual approach, then I set them aside until they are needed. That means less power needed, less heat, less wear, and still a reasonable assurance that the drive will function when an old one has to be pulled.
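For what it’s worth, one common way to qualify a spare before shelving it, as a sketch (da6 is a hypothetical device name for the new drive; the read pass is non-destructive but takes many hours on a 10TB disk):

# extended self-test, then check the result once it finishes
smartctl -t long /dev/da6
smartctl -a /dev/da6
# full surface read to force every sector to be touched
dd if=/dev/da6 of=/dev/null bs=1m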
