Pool online - unable to import

Hi All,

After a recent power outage (I believe the UPS safely shut down my TrueNAS server and JBOD), a mirrored NVMe pool (which only contains iXApplications) has exported itself and I’m having great difficulty re-importing it. For clarity, these are NVMe drives connected directly to the motherboard, not in the JBOD enclosure (which is connected via HBA and is rock solid).

So far I’ve tried:

  1. Importing via the GUI. Initially the pool didn’t appear in the GUI’s import list at all; manually exporting it has now made it show up, but the import still fails with “I/O Error”.
  2. Forcing an import via the CLI with zpool import -f -R /mnt fasttank (a couple of further variants I’m considering are sketched just below this list).
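The further variants I have in mind (taken straight from the zpool-import(8) man page, so a sketch of what I plan to try rather than anything I’ve confirmed helps here) just point the scan at the partitions explicitly, in case the default /dev scan is tripping over a stale device node:

 # scan only the two NVMe data partitions rather than all of /dev
 zpool import -d /dev/nvme1n1p2 -d /dev/nvme2n1p2 -R /mnt fasttank

 # the same, but using persistent names in case the nvmeXn1 numbering changed after the outage
 zpool import -d /dev/disk/by-id -R /mnt fasttank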

Zpool import shows:

root@truenas[~]# zpool import
pool: fasttank
id: 2982151281266948475
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

    fasttank       ONLINE
      mirror-0     ONLINE
        nvme1n1p2  ONLINE
        nvme2n1p2  ONLINE

smartctl shows the same for both disks:

root@truenas[~]# smartctl -a /dev/nvme1n1       
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 500GB
Serial Number:                      S3EUNB0J533939H
Firmware Version:                   3B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            498,540,187,648 [498 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5571b180c2
Local Time is:                      Thu Sep 26 09:49:00 2024 AWST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    9%
Data Units Read:                    12,135,312 [6.21 TB]
Data Units Written:                 25,157,633 [12.8 TB]
Host Read Commands:                 56,532,998
Host Write Commands:                481,064,541
Controller Busy Time:               799
Power Cycles:                       1,352
Power On Hours:                     14,325
Unsafe Shutdowns:                   283
Media and Data Integrity Errors:    0
Error Information Log Entries:      2,304
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               49 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       2304     0  0x0015  0x4004      -            0     1     -  Invalid Field in Command
  1       2303     0  0x3013  0x4004      -            0     1     -  Invalid Field in Command
  2       2302     0  0x100c  0x4004      -            0     1     -  Invalid Field in Command
  3       2301     0  0x100b  0x4004      -            0     1     -  Invalid Field in Command
  4       2300     0  0x0011  0x4004      -            0     1     -  Invalid Field in Command
  5       2299     0  0x0004  0x4004      -            0     1     -  Invalid Field in Command
  6       2298     0  0x0001  0x4004      -            0     1     -  Invalid Field in Command
  7       2297     0  0x0019  0x4004      -            0     1     -  Invalid Field in Command
  8       2296     0  0x0008  0x4016  0x004            0     1     -  Invalid Namespace or Format
  9       2295     0  0x0001  0x4004      -            0     1     -  Invalid Field in Command
 10       2294     0  0x100c  0x4004      -            0     1     -  Invalid Field in Command
 11       2293     0  0x0009  0x4004      -            0     1     -  Invalid Field in Command
 12       2292     0  0x0016  0x4004      -            0     1     -  Invalid Field in Command
 13       2291     0  0x1004  0x4004      -            0     1     -  Invalid Field in Command
 14       2290     0  0x1007  0x4004      -            0     1     -  Invalid Field in Command
 15       2289     0  0x000b  0x4004      -            0     1     -  Invalid Field in Command
... (48 entries not read)

Self-tests not supported

Error from GUI is:

Error: concurrent.futures.process.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 227, in import_pool
    zfs.import_pool(found, pool_name, properties, missing_log=missing_log, any_host=any_host)
  File "libzfs.pyx", line 1369, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1397, in libzfs.ZFS.__import_pool
libzfs.ZFSException: cannot import 'fasttank' as 'fasttank': I/O error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 256, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 112, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 46, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 34, in _call
    with Client(f'ws+unix://{MIDDLEWARE_RUN_DIR}/middlewared-internal.sock', py_exceptions=True) as c:
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 40, in _call
    return methodobj(*params)
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 191, in nf
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 207, in import_pool
    with libzfs.ZFS() as zfs:
  File "libzfs.pyx", line 529, in libzfs.ZFS.__exit__
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 231, in import_pool
    raise CallError(f'Failed to import {pool_name!r} pool: {e}', e.code)
middlewared.service_exception.CallError: [EZFS_IO] Failed to import 'fasttank' pool: cannot import 'fasttank' as 'fasttank': I/O error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 469, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 511, in __run_body
    rv = await self.method(*args)
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 187, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 47, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/import_pool.py", line 113, in import_pool
    await self.middleware.call('zfs.pool.import_pool', guid, opts, any_host, use_cachefile, new_name)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1564, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1425, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1431, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1337, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1321, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_IO] Failed to import 'fasttank' pool: cannot import 'fasttank' as 'fasttank': I/O error

The error from the CLI (the same with the force argument) is unhelpful:

 root@truenas[~]# zpool import fasttank   
 cannot import 'fasttank': I/O error
         Destroy and re-create the pool from
         a backup source.
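
Before trying anything destructive I also intend to attempt a read-only import (again only a sketch based on the zpool-import(8) man page, not something I’ve verified helps with this particular failure), since it avoids replaying the intent log and writes nothing back to the suspect devices:

 zpool import -o readonly=on -R /mnt fasttank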

I’m speculating:
I’m too ignorant to interpret the error properly, but the pool is clearly still visible, so the on-disk labels (‘header’) appear intact, and I’m hoping this is some kind of caching/metadata error caused by the power loss. The pool had started throwing very minor checksum errors over the last three months (one or two a month), so perhaps the NVMe controller on the motherboard is nearing end of life and something didn’t get written before an unplanned shutdown (i.e. the UPS didn’t trigger a controlled shutdown after all). Both NVMe disks report SMART health as PASSED, yet both also log the same “Invalid Namespace or Format” and “Invalid Field in Command” errors, so something has been corrupted.

Anyway, I’m hoping someone has a fix that doesn’t involve wiping the disks and recreating the pool, although from reading the forums it seems these types of failures can take out a ZFS pool with some regularity…
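
If anyone can sanity-check the approach: to test the “something didn’t get written” theory I was planning to dump the on-disk labels and then dry-run a rewind import, roughly as below (this is just my reading of the zdb(8) and zpool-import(8) man pages, so treat it as a sketch and correct me if it’s wrong):

 # print the ZFS labels on each data partition; the txg values should show
 # whether the two sides of the mirror agree
 zdb -l /dev/nvme1n1p2
 zdb -l /dev/nvme2n1p2

 # dry-run a rewind import: -F discards the last few transactions if needed,
 # -n only reports whether that would make the pool importable
 zpool import -F -n -R /mnt fasttank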

System specs:

Threadripper 1950X
ASUS ROG Zenith Extreme
128GB Non-ECC Gskill 3466MHz
ASUS 1080Ti
Pool 1 - 2 x 512GB Samsung 960 NVMe - motherboard
Pool 2 - 8 x 16TB Seagate Ironwolf NAS HDD JBOD
Boot - 1 x 240GB WD Blue SSD - SATA to motherboard
LSI SAS9300-8e HBA (pool 2)
10GbE Aquantia NIC
1000W Seasonic Prime
UPS

How long have you owned these Samsung EVOs?

“Percentage Used” is already at 9%?

Can you post the SMART info for both NVMes?

Hi Winnie,

Since 2017. Yes, it was around 9%, which I’d put down to the various things iXApplications houses on the pool after a prune in August.

I’ve shown the output from smartctl -a /dev/nvme1n1 above; below is nvme2n1. Is this what you’re after?

root@truenas[~]# smartctl -a /dev/nvme2n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 500GB
Serial Number:                      S3EUNB0J533942M
Firmware Version:                   3B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            499,657,641,984 [499 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5571b180c5
Local Time is:                      Thu Sep 26 12:03:43 2024 AWST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    10%
Data Units Read:                    18,179,300 [9.30 TB]
Data Units Written:                 36,883,059 [18.8 TB]
Host Read Commands:                 71,523,007
Host Write Commands:                525,346,539
Controller Busy Time:               1,019
Power Cycles:                       1,352
Power On Hours:                     14,720
Unsafe Shutdowns:                   283
Media and Data Integrity Errors:    0
Error Information Log Entries:      2,306
Warning  Comp. Temperature Time:    1
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               50 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       2306     0  0xa01d  0x4004      -            0     1     -  Invalid Field in Command
  1       2305     0  0x0001  0x4004      -            0     1     -  Invalid Field in Command
  2       2304     0  0x0011  0x4004      -            0     1     -  Invalid Field in Command
  3       2303     0  0x000d  0x4004      -            0     1     -  Invalid Field in Command
  4       2302     0  0x0005  0x4004      -            0     1     -  Invalid Field in Command
  5       2301     0  0x0010  0x4004      -            0     1     -  Invalid Field in Command
  6       2300     0  0x4015  0x4004      -            0     1     -  Invalid Field in Command
  7       2299     0  0x0010  0x4004      -            0     1     -  Invalid Field in Command
  8       2298     0  0x0008  0x4016  0x004            0     1     -  Invalid Namespace or Format
  9       2297     0  0x000d  0x4004      -            0     1     -  Invalid Field in Command
 10       2296     0  0x0010  0x4004      -            0     1     -  Invalid Field in Command
 11       2295     0  0x3013  0x4004      -            0     1     -  Invalid Field in Command
 12       2294     0  0x201c  0x4004      -            0     1     -  Invalid Field in Command
 13       2293     0  0x0019  0x4004      -            0     1     -  Invalid Field in Command
 14       2292     0  0x0005  0x4004      -            0     1     -  Invalid Field in Command
 15       2291     0  0x000d  0x4004      -            0     1     -  Invalid Field in Command
... (48 entries not read)

Self-tests not supported

Looks like there was a solution for this on TrueNAS CORE that involved playing with some top-level vdev tunables and then importing the pool read-only, but the commands below don’t work on SCALE (my guess at the SCALE equivalent follows them).

sysctl vfs.zfs.max_missing_tvds=1
sysctl vfs.zfs.spa.load_verify_metadata=0
sysctl vfs.zfs.spa.load_verify_data=0
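
As far as I can tell from the zfs(4) man page, the SCALE (Linux) equivalents are OpenZFS module parameters under /sys/module/zfs/parameters, so the CORE recipe would translate to something like the following. I haven’t confirmed this on my box yet, so it’s a sketch only:

 # Linux equivalents of the FreeBSD sysctls above (OpenZFS module parameters)
 echo 1 > /sys/module/zfs/parameters/zfs_max_missing_tvds
 echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
 echo 0 > /sys/module/zfs/parameters/spa_load_verify_data

 # then attempt the read-only import, as in the CORE thread
 zpool import -o readonly=on -R /mnt fasttank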

It’s quite infuriating that a mirrored pool can be taken out by a power outage in this way.

Heading into 8 years? Count your blessings and stay vigilant.


No, “Percentage Used” is not about the actual capacity in use on the SSD. It’s the drive’s own estimate of how much of its rated endurance has been consumed through wear. So the firmware reckons roughly 10% of the drive’s life is already used up, which might hint at a dying SSD (especially if it went from 0% to 10% in a short amount of time). The figure to watch alongside it is “Available Spare”: once that starts dropping below 100%, the drive is having to remap into its reserved spare blocks.

EDIT: For comparison, I have two Samsung 970 EVOs, which I use as a “fast pool” in a simple two-way mirror.

Since 2021, I’ve had them online 24/7.

One of them shows 1% used; the other shows 0%.
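
If you want to watch the trend rather than a single snapshot, something as simple as this in a daily cron job is enough (the log path and device names here are just examples, adjust to suit):

 # append the wear indicators to a log so a sudden jump stands out
 date >> /root/nvme-wear.log
 smartctl -A /dev/nvme1n1 | grep -E 'Percentage Used|Available Spare' >> /root/nvme-wear.log
 smartctl -A /dev/nvme2n1 | grep -E 'Percentage Used|Available Spare' >> /root/nvme-wear.log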

Ah yes, I’m with you: wear, bad blocks etc.

Unfortunately I wasn’t tracking the rate at which that figure was climbing, and you have a point about the rate being the real indicator, so the drives could certainly be on the way out. That said, “Percentage Used” is only just at the 9–10% mark and “Available Spare” is still at 100%, so I don’t think wear is the cause of the current problem, but in preparation for EE I’ll replace the drives anyway.

I wouldn’t say these drives have had a hard life over eight years, given there are only ~15k power-on hours and ~19TB written on each, against a 300TB endurance rating and an MTBF of 1.5M hours.
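
For anyone checking my arithmetic: smartctl’s “Data Units” are 1,000 x 512-byte blocks, so the written figure for the busier drive works out roughly as below (back-of-the-envelope only):

 # Data Units Written x 512,000 bytes per unit
 echo $((36883059 * 512000))   # 18884126208000 bytes, i.e. ~18.9 TB written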