TrueNAS Scale 25.04.0 Upgrade disaster

Do I want to leave this checked?

Uncheck the first two.

admin@truenas[~]$ sudo zdb -l /dev/disk/by-partuuid/e4bb08a2-9a5d-4a24-9b8b-3ab220bb0267
------------------------------------
LABEL 0 
------------------------------------
    version: 5000
    name: 'WizzPool'
    state: 1
    txg: 12007450
    pool_guid: 3518430912930309335
    errata: 0
    hostid: 1780220649
    hostname: 'truenas'
    top_guid: 11842154841165029957
    guid: 18340154160196067035
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 11842154841165029957
        nparity: 1
        metaslab_array: 133
        metaslab_shift: 34
        ashift: 12
        asize: 11995904212992
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 18340154160196067035
            path: '/dev/disk/by-partuuid/e4bb08a2-9a5d-4a24-9b8b-3ab220bb0267'
            whole_disk: 0
            DTL: 90826
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 15295080939741160980
            path: '/dev/disk/by-partuuid/02f5451d-1e09-400a-9e08-2ba73150618f'
            whole_disk: 0
            DTL: 90825
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 1389250103858833512
            path: '/dev/disk/by-partuuid/c1e813a1-9f62-4450-88e0-1a7c64def8a3'
            whole_disk: 0
            DTL: 90824
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3 
admin@truenas[~]$ sudo zdb -l /dev/disk/by-partuuid/02f5451d-1e09-400a-9e08-2ba73150618f
------------------------------------
LABEL 0 
------------------------------------
    version: 5000
    name: 'WizzPool'
    state: 1
    txg: 12007450
    pool_guid: 3518430912930309335
    errata: 0
    hostid: 1780220649
    hostname: 'truenas'
    top_guid: 11842154841165029957
    guid: 15295080939741160980
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 11842154841165029957
        nparity: 1
        metaslab_array: 133
        metaslab_shift: 34
        ashift: 12
        asize: 11995904212992
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 18340154160196067035
            path: '/dev/disk/by-partuuid/e4bb08a2-9a5d-4a24-9b8b-3ab220bb0267'
            whole_disk: 0
            DTL: 90826
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 15295080939741160980
            path: '/dev/disk/by-partuuid/02f5451d-1e09-400a-9e08-2ba73150618f'
            whole_disk: 0
            DTL: 90825
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 1389250103858833512
            path: '/dev/disk/by-partuuid/c1e813a1-9f62-4450-88e0-1a7c64def8a3'
            whole_disk: 0
            DTL: 90824
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3 

All the same TXG of 12007450

I suppose it’s had… a lot of TXGs since it was 11851573? This is all so confusing. :dizzy_face:

Go ahead and import the pool normally with the GUI and check that everything works and your data is there.


Everything seems to be functioning correctly. I can access my files. My computer found the backup files and is now running a backup.

I’m extremely relieved right now!!!


What a relief! :partying_face:

I wish we knew what happened and if it’s related to the others who faced something like this.

This is a good time to make a backup plan if you don’t already.

I really appreciate the hand holding through this! Couldn’t have done it without your help!


Minor remark: the errors after your first import could have been avoided if you had used the proper

zpool import -o altroot=/mnt ...

instead of just

zpool import ...


I’m assuming you’re referring to the dataset error… Yeah, it would be nice to understand the options for the zpool command. The documentation I could find was not very helpful.

Thank you!

Should I upgrade my pool now? Of course after it’s done with the scrub…

Upgrades cannot be undone. If you ever need to import your pool on an older system or an earlier version of ZFS, you won’t be able to.

Only upgrade if you really need the new features.
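
If you do decide to upgrade later, it doesn’t hurt to look before you leap - roughly like this (just a sketch, using your pool name from the zdb output above):

sudo zpool status WizzPool   # confirm the scrub has finished and the pool is healthy
sudo zpool upgrade           # with no arguments, this only lists pools that have features left to enable
sudo zpool upgrade WizzPool  # the irreversible step: enables all supported feature flags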


It is just what the UI does when importing a pool. You should do the same when using the command line.

You might have noticed that in TrueNAS all pools are mounted at /mnt/<poolname> instead of /<poolname>. That’s accomplished with that altroot option.

When you just did an import without properly specifying altroot, the system tried to mount your pool at /WizzPool but failed because / is read-only:

root@truenas:~# mkdir /WizzPool
mkdir: cannot create directory ‘/WizzPool’: Read-only file system
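
With the altroot set, the same import lands under /mnt instead - roughly like this (a sketch only, not something I ran on your box; -R /mnt is the common shorthand, which also sets cachefile=none):

sudo zpool import -o altroot=/mnt WizzPool   # or: sudo zpool import -R /mnt WizzPool
zfs get mountpoint WizzPool                  # should now report /mnt/WizzPool rather than /WizzPool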

HTH,
Patrick


Gotcha. Thanks for the lesson!

OK, I will hold off then. Thanks again!

Never. I had a membership to the beta test club years ago and realized it was a horrible club to be a part of if you’re relying on anything.

Great work on resolving this.

It seems like the problem is that on one of the upgrades, one of the HDDs dropped out of the pool, fell behind in TXGs, and could no longer be imported back into it.

So the question is: when did this happen, on which upgrade, and why?

Issues like this are by their nature hard to diagnose because of the urgency of trying to recover data. :confused:

Superficially, a similar pattern seemed to run through four different cases.

I’m not trying to sound ungrateful or rude to anyone in particular, and I can understand how people must feel in such situations, but if they reuse or format the drives, it becomes impossible to diagnose.

I wonder if a dd of the first 32 MB of each drive could provide some hints? I’m not sure if anyone has yet uploaded an attachment to @HoneyBadger regarding this.

I think you’re right, though. “Something” caused one of the drives to stall long enough that its TXG lagged behind the rest.
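
For anyone who wants to check for that quickly, something along these lines should work (a rough sketch reusing the zdb -l calls from earlier in this thread, with this pool’s member part-UUIDs):

for p in e4bb08a2-9a5d-4a24-9b8b-3ab220bb0267 \
         02f5451d-1e09-400a-9e08-2ba73150618f \
         c1e813a1-9f62-4450-88e0-1a7c64def8a3; do
    # print each member's label txg; a disk that stalled would show a lower number than the rest
    printf '%s: ' "$p"
    sudo zdb -l "/dev/disk/by-partuuid/$p" | grep -m1 ' txg:'
done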

It’s not really ZFS that requires this.

TrueNAS uses an “altroot” of /mnt for all storage pools. This is not a ZFS default.

@pmh was pointing out that if you use the command-line, but you don’t specify the “altroot” to use /mnt, then it will confuse the middleware and any configured shares or apps that are expecting your dataset mountpoints to start with /mnt.

Using the GUI automatically does this. That’s why you should always stick to the GUI for pool and dataset operations, unless you need to bypass the GUI or middleware for troubleshooting.


Nothing yet, likely for the same reason already stated:

To really solve this is going to require the ability to reproduce it on demand.

With that said, if you happen to be suffering from this, you can run dd if=/dev/sda1 of=sda1.32m bs=1M count=32 to grab the first 32 MB of device sda1 - assuming that’s your ZFS partition, as identified by lsblk - and then copy the resulting file somewhere safe.
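
If you have several member disks, a small loop saves some typing - a rough sketch where sda1, sdb1, and sdc1 are placeholders to swap for whatever partitions lsblk actually shows on your system:

for part in sda1 sdb1 sdc1; do
    # capture the first 32 MB of each suspected ZFS member partition
    sudo dd if="/dev/$part" of="$part.32m" bs=1M count=32
done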


Got it. I was probably looking in the wrong spot for the right documentation. My bad! :disappointed: