Need help. I'm concerned I'm about to lose my data

I’m running TrueNAS SCALE 23.10.1.

I was trying to expand my pool ‘Ocean’ from 2x12TB (mirrored) “sda and sdb” + 2x10TB (mirrored) “sdc and sdd” to 2x24TB (mirrored) “sde and sdf” + 2x10TB (mirrored) “sdc and sdd”. In other words, I was replacing the 12 TB mirrored storage with 24 TB mirrored storage, for a total of 34 TB of mirrored storage.

I did a replace of sda with sde and sdb with sdf, one at a time, letting each resilver complete. After this finished, I tried to expand the pool ‘Ocean’ because it did not expand automatically. TrueNAS gave me an error and told me to reboot, so I did. However, upon reboot I noticed that drive ‘sde’, one of the new 24 TB drives, was reading as ‘Unavailable’ in the pool’s ‘Manage Devices’ screen, even though it responded to SMART tests just fine. I tried onlining it through the command line, but that didn’t work. Finally I tried ‘Detach’ on ‘sde’ in the GUI, reasoning that ‘sdf’ and my ‘sdc’/‘sdd’ mirror were healthy, so I could expand onto the single drive first, then re-mirror onto ‘sde’, and worst case I could simply run with the three-drive setup while I waited for a replacement if ‘sde’ turned out to be bad. However, when I hit ‘Expand’ for the pool in the three-drive configuration, TrueNAS threw an error and again told me to reboot. When I rebooted, the pool had ZERO drives assigned to it. When I checked the disk list, all of the drives were still there, but sde read as assigned to pool ‘N/A’ and the three ‘healthy’ drives were reading as assigned to pool ‘Ocean (Exported)’.
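For reference, the rough command-line equivalent of the replace-and-grow steps I was attempting would be something like the lines below. I did everything through the web UI, so this is only a sketch of what I believe happens underneath, using my device names:

zpool replace Ocean sda sde      # swap the first 12 TB drive for a 24 TB drive, wait for the resilver
zpool replace Ocean sdb sdf      # swap the second 12 TB drive, wait for the resilver again
zpool set autoexpand=on Ocean    # allow the vdev to grow to the size of the new drives
zpool online -e Ocean sde        # expand into the new space (what the GUI ‘Expand’ button should be doing)
zpool online -e Ocean sdf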

So now I’m very concerned that if I do the wrong thing, I’m going to lose my data. Does anyone have any advice on how to recover to a healthy 4-disk mirrored setup?

Before anything else, did you create a checkpoint on the pool before proceeding with the expansion?
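For future reference, a checkpoint is a whole-pool rewind point (unlike a snapshot, which is per dataset). Before risky vdev surgery it goes roughly like this; syntax is from memory, so double-check the zpool-checkpoint man page before relying on it:

zpool checkpoint Ocean                       # take a checkpoint of the whole pool
zpool checkpoint -d Ocean                    # discard it once the operation succeeded
zpool export Ocean                           # to roll back instead: export first...
zpool import --rewind-to-checkpoint Ocean    # ...then import, rewinding to the checkpoint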

No. I’m familiar with snapshots, but checkpoints seem to be different and aren’t something I was aware of.

Do you remember what the error was?


Is there any HBA or “SATA card” involved?

What do you mean it responded to SMART tests? From the GUI you ran a short test for sde?


Please try to avoid just winging it on the command line.

What did you try? Did you actually specify the kernel identifier name sde or the PARTUUID?


As far as I know, TrueNAS does not allow you to do such actions on a degraded pool. (Or maybe it does now?)


Without committing to anything, what is the output of

zpool import

Just like that. No flags or parameters or devices.

Do you remember what the error was?

It flashed up on the screen; I should have taken a screenshot of it. It said something about the drive still being in use and being unable to expand, and recommended a reboot before making any further changes.

Is there any HBA or “SATA card” involved?

All my drives are connected directly to the motherboard.

What do you mean it responded to SMART tests? From the GUI you ran a short test for sde?

Correct. I ran a short test for drive ‘sde’.

What did you try? Did you actually specify the kernel identifier name sde or the PARTUUID?

I wasn’t really winging it; I was trying to follow some examples from the old forums. In any case, I used the PARTUUID, and the command reported that the drive had been onlined but that the pool would remain degraded.
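From memory it was something along these lines; the placeholder below stands in for the actual PARTUUID I used, which I didn’t write down:

zpool online Ocean <PARTUUID-of-the-unavailable-data-partition>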

As far as I know, TrueNAS does not allow you to do such actions on a degraded pool. (Or maybe it does now?)

Don’t know what to tell you; there was a ‘Detach’ button for the Unavailable drive.

Output of zpool import:

root@truenas[~]# zpool import                           
   pool: Ocean
     id: 12947942587710781215
  state: UNAVAIL
status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        Ocean                                     UNAVAIL  insufficient replicas
          mirror-0                                ONLINE
            f00e5072-d12c-11ea-94de-40167e27b4a2  ONLINE
            f0136e31-d12c-11ea-94de-40167e27b4a2  ONLINE
          1b97be74-19e9-48f5-9ead-844b66857f35    UNAVAIL  invalid label

Really appreciate your help, regardless!

I was referring to the “Expand” part, later in the quote. Sorry for not being more specific.

Regardless, you shouldn’t “detach” a drive from a mirror vdev, unless you wish to convert it into a stripe.
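The difference, roughly, with placeholder names:

zpool offline Ocean <disk>                      # temporary: the disk is still a mirror member and can be onlined again
zpool detach Ocean <disk>                       # permanent: the disk is removed from the vdev; with one disk left, the mirror becomes a plain stripe
zpool attach Ocean <remaining-disk> <new-disk>  # re-adds a second disk and resilvers to rebuild the mirror

‘Offline’ is the reversible one; ‘detach’ is a one-way door.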


This is unnerving. There are only three drives, one of which has an “invalid label”, and I’m assuming the other one was tossed to the void when you “detached” it. :fearful:


This is out of my comfort level.

@HoneyBadger, maybe there’s a safe approach to this?

My guess is that mirror-1 was already running “degraded” with a single working drive, and then that working drive was supposedly detached, leaving only the “invalid label” drive as a sole stripe. Basically, mirror-1 was effectively destroyed.

Thank you for your help. I hope there is a safe path back to my data.

Just as a note, I’m almost certain that I ‘detached’ the drive that was originally labeled Unavailable, not the ‘healthy’ one by mistake. I checked, and the serial number of the drive you see in the zpool import output matches the drive that was reading ‘Healthy’ before I detached the other one.

I’ve given up on trying to figure out a way to recover my data directly. I’m just going to wipe the disks, rebuild the mirror with the new disks, and restore as much data as possible from my old server to my new one. I’m losing a couple of years of data by doing this, since I stopped updating my old server several years ago, but nothing critical. All my critical stuff has an offsite backup I can restore from. It’s just a bummer and a frustration is all.

Lessons learned:
A) Use checkpoints!
B) Don’t ever just detach a drive from a mirror
C) Update your independent backup more often than once every few years.

Have you formatted the disks yet? If not, don’t.

OK, so I’m betting the error was the same old crap:

“Partition(s) 1, 4 on /dev/sdwhatever have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.”

To be clear, this is an old problem; the Expand button seems to be broken every other release of TrueNAS, and I would not rely on it. I’ve found the only reliable way is to offline the disk, resize the partition, and ‘online -e’ it; anything else turns into a gamble of missing devices. There is an unofficial guide on this from back in the day, in case anyone else searches for this error before rebooting and disaster strikes.
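From memory (so treat this as a sketch with placeholder names, and check it against the current docs before running anything), the manual procedure goes roughly:

zpool offline Ocean <PARTUUID-of-data-partition>    # cleanly take the disk out of the pool
parted /dev/sdX resizepart 2 100%                   # grow the ZFS data partition to fill the disk
partprobe /dev/sdX                                  # make sure the kernel re-reads the partition table
zpool online -e Ocean <PARTUUID-of-data-partition>  # bring it back and expand into the new space

one disk at a time, letting the pool resilver back to healthy in between.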

At this point, don’t try to write new partitions under any circumstances. But can you post the partition layout of the unhappy disk? The output of fdisk /dev/sdwhatever -l and lsblk -o NAME,SIZE,PARTUUID would do.

I haven’t yet, as I’m at my day job. Is there something else I can try to rescue my data?

Thanks for your help!

I believe this is exactly the error that I received.

Here is the requested output.

root@truenas[~]# fdisk /dev/sdg -l          
Disk /dev/sdg: 21.83 TiB, 24000277250048 bytes, 46875541504 sectors
Disk model: WDC WD240KFGX-68
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: E8F82843-98C0-48B5-ADD6-CCDD218DEE22

Device     Start         End     Sectors  Size Type
/dev/sdg1   2048 46875541470 46875539423 21.8T Solaris /usr & Apple ZFS
root@truenas[~]# lsblk -o NAME,SIZE,PARTUUID
NAME       SIZE PARTUUID
sda        9.1T 
├─sda1       2G f0100702-d12c-11ea-94de-40167e27b4a2
└─sda2     9.1T f0136e31-d12c-11ea-94de-40167e27b4a2
sdb      223.6G 
├─sdb1     512K caf343a4-797a-11ec-9af9-40167e27b4a2
├─sdb2   207.6G cb011827-797a-11ec-9af9-40167e27b4a2
└─sdb3      16G cafa654b-797a-11ec-9af9-40167e27b4a2
  └─sdb3    16G 
sdc      953.9G 
└─sdc1   953.9G 81cca4d9-6136-11ec-b493-40167e27b4a2
sdd       21.8T 
sde        1.8T 
├─sde1       2G 3305e5f9-5898-11ec-b86d-40167e27b4a2
└─sde2     1.8T 330a7238-5898-11ec-b86d-40167e27b4a2
sdf      953.9G 
└─sdf1   953.9G 297a3e90-7997-11ec-848f-40167e27b4a2
sdg       21.8T 
└─sdg1    21.8T 1b97be74-19e9-48f5-9ead-844b66857f35
sdh        9.1T 
├─sdh1       2G f00ae4a1-d12c-11ea-94de-40167e27b4a2
└─sdh2     9.1T f00e5072-d12c-11ea-94de-40167e27b4a2
sdi        9.1T 
sdj       10.9T 
├─sdj1       2G a04109d4-4a30-11eb-9b24-40167e27b4a2
└─sdj2    10.9T a052e97b-4a30-11eb-9b24-40167e27b4a2
sdk        3.6T 
├─sdk1       2G 34f9e0db-7b92-47c8-802b-2fd67d2be544
└─sdk2     3.6T 2e1056c7-7c94-4da8-83f4-d1ffb79bcfba
sdl       10.9T 
├─sdl1       2G a049b2da-4a30-11eb-9b24-40167e27b4a2
└─sdl2    10.9T a055c34f-4a30-11eb-9b24-40167e27b4a2

If sdg is the UNAVAIL drive, where is the other 24 TB drive?

sdg is correct. It matches the unavailable drive label I get from zpool import command.
The other drive you’re looking for is in the list, it’s sdd. This should be the drive I ‘detached’ because IT was initially reading as unavailable. (I think the error effected me twice, once when I tried to expand the full mirrored pool, and once after I detached the first unavailable disk and tried to expand the pool into the remaining disk).

If it helps, sdj and sdl are the drives the two 24TB drives were meant to replace and should have the same (mirrored) data.

@jcarrut2 please post the output of the following command as a codeblock:

sudo sfdisk -d /dev/sdg

Assuming sdg is still the remaining, non-detached drive with the unique ID of 1b97be74-19e9-48f5-9ead-844b66857f35 in your zpool import output.
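The nice thing about the sfdisk dump format is that it can be saved and replayed later, e.g. (file name is just an example):

sudo sfdisk -d /dev/sdg > sdg-table.backup
sudo sfdisk /dev/sdg < sdg-table.backup

so keep a copy of whatever it prints before anything gets rewritten.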

Also - what model are the 24TB drives, assuming they’re identical?

root@truenas[~]# sudo sfdisk -d /dev/sdg
label: gpt
label-id: E8F82843-98C0-48B5-ADD6-CCDD218DEE22
device: /dev/sdg
unit: sectors
first-lba: 34
last-lba: 46875541470
sector-size: 512

/dev/sdg1 : start=        2048, size= 46875539423, type=6A898CC3-1DD2-11B2-99A6-080020736631, uuid=1B97BE74-19E9-48F5-9EAD-844B66857F35

The 24TB drives are identical WD Red Pros, WD240KFGX.

Info received, please stand by.

@jcarrut2 I need to step away for a bit but I’ll be back later tonight. Pretty sure I can restore the previous label and get you back up and running.

No problem! I sincerely appreciate the help! Will check back later.