Power Failure - Lost Pool, failing on Pool Import

TrueNAS SCALE 24.10.0.2
Dell R720XD w/ VMware
TrueNAS SCALE running in a VM, 4 x 8TB disks attached to the VM for ZFS

Had a power flip and the whole server went off, even with battery backup. The pool went completely offline and lost its configuration.

zpool import shows the pool is there, but it fails to import (through either the GUI or the CLI).

Error through GUI on Import Pool:

Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 256, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 112, in main_worker
    res = MIDDLEWARE._run(*call_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 46, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 34, in _call
    with Client(f'ws+unix://{MIDDLEWARE_RUN_DIR}/middlewared-internal.sock', py_exceptions=True) as c:
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 40, in _call
    return methodobj(*params)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 183, in nf
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 211, in import_pool
    with libzfs.ZFS() as zfs:
  File "libzfs.pyx", line 534, in libzfs.ZFS.__exit__
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 231, in import_pool
    zfs.import_pool(found, pool_name, properties, missing_log=missing_log, any_host=any_host)
  File "libzfs.pyx", line 1374, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1402, in libzfs.ZFS.__import_pool
  File "libzfs.pyx", line 663, in libzfs.ZFS.get_error
  File "/usr/lib/python3.11/enum.py", line 717, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/enum.py", line 1133, in __new__
    raise ve_exc
ValueError: 2095 is not a valid Error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 533, in _run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/import_pool.py", line 113, in import_pool
    await self.middleware.call('zfs.pool.import_pool', guid, opts, any_host, use_cachefile, new_name)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1626, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1465, in _call
    return await self._call_worker(name, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1471, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1377, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1361, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 2095 is not a valid Error

Error through the CLI:

root@truenas[~]# zpool status
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:43 with 0 errors on Fri Dec 6 03:45:45 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sdb3      ONLINE       0     0     0

errors: No known data errors
root@truenas[~]# zpool import
pool: ZFS_1
id: 13968705838132466553
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

    ZFS_1                                     ONLINE
      raidz1-0                                ONLINE
        3918cc4c-fe14-4bae-b3bd-bc7bd0dd496c  ONLINE
        c85c9272-4d63-4362-b432-00eec3aa93ce  ONLINE
        d9a75f50-1e69-406e-96eb-89ee585391fb  ONLINE
        5e440e2d-af36-44ef-b0c2-d12fd11d26bd  ONLINE

root@truenas[~]# zpool import ZFS_1
cannot import 'ZFS_1': insufficient replicas
Destroy and re-create the pool from
a backup source.
root@truenas[~]# zpool import -f ZFS_1
cannot import 'ZFS_1': insufficient replicas
Destroy and re-create the pool from
a backup source.
root@truenas[~]# zpool import -F ZFS_1
cannot import 'ZFS_1': I/O error
Destroy and re-create the pool from
a backup source.
root@truenas[~]#

Also, I don't have a "backup source", so I'm not sure what to do about that. I thought that since it can see the pool information on the disks, it would be able to recreate it very much like importing a traditional RAID array.

Didn't follow the 3-2-1 rule on backups? Sorry to read that.

For the future: those Delta hot-swappable power supplies have active power factor correction (PFC). What model is the UPS, and does it have pure sine wave output?

There are two types of UPS output: simulated sine wave (a stepped approximation) and true/pure sine wave, and the difference matters a great deal. A pure sine wave UPS is required for devices that use active PFC power supplies, while a stepped sine wave UPS may cause these devices to shut down.

Well, I back up to an external 12TB WD drive daily with Acronis, so I have a recent backup of the most critical data.

It is an APC Back-UPS 1300. I had a power issue in a strong windstorm not long ago and the batteries were fine; apparently they gave up this go-round. My fault for not testing them more frequently.

Here is my take on this:

From a practical perspective, to get up and running again in the minimum time, you should probably destroy and recreate the pool and restore from backup.

That said, ZFS is supposed to be resilient to failure because it is transaction based and so:

  1. To avoid this happening again, we should be trying to understand why it happened - is there some sort of hardware configuration issue that made ZFS less resilient; and

  2. Also to avoid this happening again, in addition to replacing the UPS battery we should be looking at why TrueNAS didn't realise the battery was failing and shut down earlier, before it failed (i.e. can we improve the UPS service setup - see the example after this list); and

  3. Because ZFS is transaction based, it may well still be possible to bring the existing pool back online at a prior transaction point that is later than the last external backup, and this might be quicker than restoring from backup.
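
On point 2: when the UPS is connected over USB and the TrueNAS UPS service (which uses NUT underneath) is enabled, the reported battery state can be checked from the shell. A minimal sketch, assuming the UPS identifier configured in the service is the default "ups":

upsc ups ups.status       # OL = on line, OB = on battery, LB = low battery
upsc ups battery.charge   # remaining charge, in percent
upsc ups battery.runtime  # estimated runtime, in seconds

A battery that reports a full charge but drops straight to low battery the moment the mains fails is a battery due for replacement.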

This looks like the best solution. For this, and for the sake of investigating what looks like a bug in SCALE, it would be best if @HoneyBadger could advise.
In the meantime, you can try:
sudo zpool import -fn ZFS_1
and
sudo zpool import -fFn ZFS_1
The -n ensures that the commands are safe (a dry run, nothing is changed); we're interested in the output, or lack thereof. Depending on it, and with expert advice, you might then perform an actual forced import (-f) or rewind (-F).


I fully agree with @etorix's view that if a straightforward zpool import ZFS_1 doesn't work, then we should NOT be running more forceful real attempts to import the pool with -f or -F, for fear of such attempts making the pool state worse, and instead should only be doing "what would happen if" attempts by adding the -n flag.


Hey @Antieon

Can you expand on this piece:

Specifically, how were the disks attached to the VM? Storage controller passthrough (good), individual RAW device passthrough (less good, possibly bad), or VMDKs on individual VMFS datastores for each drive (definitely not good)?
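
One quick, if rough, way to tell from inside the VM is to look at the disk model strings - VMDK-backed disks generally report themselves as VMware "Virtual disk", whereas disks behind a passed-through HBA show their real model and serial numbers:

lsblk -o NAME,MODEL,SERIAL,SIZE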

sudo zpool import -fFn ZFS_1 threw no errors, so this should be bringing the pool online in a read-only state?

root@truenas[~]# zpool status
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:43 with 0 errors on Fri Dec 6 03:45:45 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sda3      ONLINE       0     0     0

errors: No known data errors
root@truenas[~]# zpool import
pool: ZFS_1
id: 13968705838132466553
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

    ZFS_1                                     ONLINE
      raidz1-0                                ONLINE
        3918cc4c-fe14-4bae-b3bd-bc7bd0dd496c  ONLINE
        c85c9272-4d63-4362-b432-00eec3aa93ce  ONLINE
        d9a75f50-1e69-406e-96eb-89ee585391fb  ONLINE
        5e440e2d-af36-44ef-b0c2-d12fd11d26bd  ONLINE

root@truenas[~]# sudo zpool import -fFn ZFS_1
root@truenas[~]# zpool status
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:43 with 0 errors on Fri Dec 6 03:45:45 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sda3      ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#

VMDKs on a single VMFS. Basically I have this R720XD with 12 x 8TB drives in RAID6; I created 4 x 8TB VMDKs and built a pool on those.

When an -n command gives no news, it's good news. It means "ZFS should be able to import the pool, but didn't do it because you passed the -n parameter."

Re-running the command without that parameter (e.g. zpool import -fF ZFS_1) should bring the pool back into the system and make it visible from zpool status -v. Be aware, however, that the capital -F works by rewinding past the most recent transactions and discarding them.
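
As a less invasive variant (a sketch only, not something that has been run in this thread), the rewind can also be combined with a read-only import, so nothing is written back to the pool while you inspect it; the altroot under -R is an arbitrary choice here:

sudo zpool import -o readonly=on -R /mnt/recovery -fF ZFS_1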

This falls under the "definitely not good" configuration, I'm afraid. :grimacing: ZFS is basically at the mercy of your hardware-controlled RAID6 for data consistency and performance, and you also carry the extra space overhead of running RAIDZ1 on top of RAID6.
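
To put rough numbers on that overhead, assuming all four 8TB VMDKs live on the 12-drive RAID6: RAID6 across 12 x 8TB leaves roughly 80TB usable at the controller level, the 4 x 8TB VMDKs present 32TB of that to the VM, and RAIDZ1 over those four "disks" gives up another ~8TB to parity, for roughly 24TB of pool capacity. Parity is effectively being paid twice for the same data.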

Unfortunately with this configuration you don't really have an option to move it "in-place" - and the configuration of the R720xd backplane means you can't really "split" the backplane into two different HBAs.

See the article at Yes, You Can (Still) Virtualize TrueNAS for some details, but in general, the "ZFS on HW RAID" configuration you have is rather fragile, and dependent on your PERC card continuing to behave itself under unexpected power loss.


Sadly, it fails when I remove the -n from the command.

root@truenas[~]# sudo zpool import -fFn ZFS_1
root@truenas[~]# sudo zpool import -fF ZFS_1
cannot import 'ZFS_1': I/O error
Destroy and re-create the pool from
a backup source.
root@truenas[~]#

That's what I was afraid of… at this point, sadly, I have a lot of data recovery to do… going to have to re-architect this now. Thank you for all the help.

You can try forcing it to go further back with -fFX, but there may be further problems as a result.

Determines whether extreme measures to find a valid txg should take place. This allows the pool to be rolled back to a txg which is no longer guaranteed to be consistent. Pools imported at an inconsistent txg may contain uncorrectable checksum errors.
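
Spelled out, and on the understanding that -X is very much a last resort, that would look something like:

sudo zpool import -fFX ZFS_1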

Running that command now… it hasn't errored out, has been running a few minutes… guessing ZFS is analyzing transactions on the back end to see where / whether it can roll back.

Or.

It locked up… either or.

The -X option can make it take a very long time (think "hours" at least) to import the pool. You may be able to track disk I/O with iostat from a second SSH session opened to TrueNAS, or by inspecting it at the hypervisor level.
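
A minimal sketch of that monitoring, assuming the iostat utility (from sysstat) is available inside the VM; if it is not, the hypervisor's performance graphs for the datastore serve the same purpose:

iostat -x 5    # extended per-device statistics every 5 seconds; watch the devices backing the pool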


It seems like the cause was virtualisation outside the recommended configuration, i.e. creating a RAIDZ1 pool on VMDKs which were themselves on a hardware RAID6 virtual device.


Looks like iostat stopped reading (roughly 21.5 TiB); now there is a pool.import_find task running. I assume this process has to finish before I can import the pool?