Pool OFFLINE after scrub task commenced?

Hey everyone, I’ve had a TrueNAS Scale setup working well for me over the past couple of weeks, but tonight I’ve encountered some (possibly severe) issues. I’ve been running with 4 disks, set up as 2x mirrored VDEVs in the same pool, and all was well until a scheduled scrub task kicked off. Immediately I received a notification that:

Pool HDDs state is SUSPENDED: One or more devices are faulted in response to IO failures.
The following devices are not healthy:

  • Disk 11454120585917499236 is UNAVAIL
  • Disk 16516640297011063002 is UNAVAIL

I checked ‘zpool status’ and it showed that mirror-0 was online, but mirror-1 was unavailable. It struck me as strange that both drives went offline at the same time; I would have thought the chances of both disks failing at the same moment would be pretty slim. The two drives in that mirror were disks I had used lightly for a couple of years, both flawlessly and without any SMART errors.

I rebooted from the GUI, and after quite a lengthy reboot (during which the monitor connected to my NAS displayed a few error messages along the lines of ‘Failed unmounting var.mount’ and other similar messages), my system came back online.

However, now when I run ‘zpool status’, the following displays:

pool: HDDs
id: 4963705989811537592
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using the ‘-f’ flag.
see: (insert github link here)
config:

    HDDs                                      FAULTED  corrupted data
      mirror-0                                ONLINE
        496fbd23-654e-487a-b481-17b50a0d7c3d  ONLINE
        232c74aa-5079-420d-aacf-199f9c8183f7  ONLINE

I am quite new to TrueNAS, but I do know that what I’m seeing here is definitely not good (and quite possibly very, very bad). Before I do anything that might cause further damage to what remains of my setup, is there anything I should try or attempt? Any help would be greatly appreciated.

System info: running an Aoostar WTR Pro (Intel N100) with 32 GB of RAM; further specs are available on their website, which I cannot link to in this post.

My boot disk is a 256GB NVMe drive which has no issues.

HDDs:

1x ST16000NM000J-2TW103
1x WUH721816ALE6L4
1x ST16000NE000-2RW103
1x ST16000NM001G-2KK103

How are these drives connected to the system?

All four drives are connected via SATA using the four internal disk bays in the WTR Pro. I’ve been trying to chase up details on the exact SATA controller in use, but it looks like there’s not a ton of info available online. If the controller information would be of use, I can try to figure this out once I return home in a few hours.
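
In the meantime, if it helps, my understanding is that something along these lines should identify the controller from the TrueNAS shell once I’m back at the machine (treat it as a sketch; I haven’t run it on this box yet):

# List PCI storage controllers and the kernel driver bound to them
lspci -k | grep -i -A 3 -E 'sata|ahci'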

If it only happens during a scrub, it means the cabling, power, or controller is faulty.

Make sure the power is good as well; I once had a drive throwing weird errors due to iffy power from cheap Molex connectors…
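
A quick way to sanity-check the cabling/power side is the SMART interface CRC counter; from memory it looks roughly like this (attribute names vary a bit between vendors, so double-check against your drives):

# A rising UDMA_CRC_Error_Count usually points at a bad cable or connector rather than the disk itself
for d in /dev/sd[a-d]; do
    echo "== $d"
    smartctl -A "$d" | grep -i -E 'crc|command_timeout'
done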

Gotcha, I’ll look into that. Once I get home I’ll repurpose my desktop PC into a test machine and see if I can import the pool there.

Just for my own peace of mind too: if I reboot/shut down at this point, I won’t be causing any more issues, will I?

I’ve created a new TrueNAS installation on a separate PC with known working SATA/power cabling, and while all four of my disks are recognised, when I run zpool import this is my output:

pool: HDDs
id: 4963705989811537592
state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the ‘-f’ flag.
see: Message ID: ZFS-8000-EY — OpenZFS documentation
config:

    HDDs                                      FAULTED  corrupted data
      mirror-0                                ONLINE
        496fbd23-654e-487a-b481-17b50a0d7c3d  ONLINE
        232c74aa-5079-420d-aacf-199f9c8183f7  ONLINE

Unfortunately, on my original installation I also had a second vdev, ‘mirror-1’, in the pool, made up of the two disks that the fault notification flagged when the scrub task started.

Unsure if this is relevant information, but I was running my original pool with just the one mirror-0 for a few weeks to store the bulk of my data, before adding the second vdev around a week ago for some extra space for other miscellaneous files.

EDIT: I decided to try rebooting with only the two ‘mirror-1’ vdev drives connected, and zpool import showed this message:

zpool import

pool: HDDs
id: 4963705989811537592
state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the ‘-f’ flag.
see: Message ID: ZFS-8000-EY — OpenZFS documentation
config:

    HDDs                                      FAULTED  corrupted data
      mirror-0                                DEGRADED
        496fbd23-654e-487a-b481-17b50a0d7c3d  UNAVAIL
        232c74aa-5079-420d-aacf-199f9c8183f7  ONLINE

Interesting - still seems to pick up mirror-0 even with none of the mirror-0 drives connected. I tried once more, but this time with the ‘online’ drive from the above test and one of the other mirror-0 drives:

zpool import

pool: HDDs
id: 4963705989811537592
state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the ‘-f’ flag.
see: Message ID: ZFS-8000-EY — OpenZFS documentation
config:

    HDDs                                      FAULTED  corrupted data
      mirror-0                                ONLINE
        496fbd23-654e-487a-b481-17b50a0d7c3d  ONLINE
        232c74aa-5079-420d-aacf-199f9c8183f7  ONLINE

Running it with the other two drives instead produced a message advising there were no pools available.

Forgive me if this is irrelevant information, but I figure more info is better than less.

Another update - after inserting my disks back into my first system with the original TrueNAS installation, under Storage it tells me:

Disks with exported pools: 4

Does this indicate that while they don’t show up when running zpool commands in the Shell, the 2x disks from the mirror-1 vdev are still being recognised?

I also ran some additional commands to help gather more info:

zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:26 with 0 errors on Wed Jan 22 03:45:27 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0
lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    1  14.6T  0 disk 
└─sda1        8:1    1  14.6T  0 part 
sdb           8:16   1  14.6T  0 disk 
└─sdb1        8:17   1  14.6T  0 part 
sdc           8:32   1  14.6T  0 disk 
└─sdc1        8:33   1  14.6T  0 part 
sdd           8:48   1  14.6T  0 disk 
└─sdd1        8:49   1  14.6T  0 part 
nvme1n1     259:0    0 931.5G  0 disk 
nvme0n1     259:1    0 238.5G  0 disk 
├─nvme0n1p1 259:2    0     1M  0 part 
├─nvme0n1p2 259:3    0   512M  0 part 
└─nvme0n1p3 259:4    0   238G  0 part
lsblk -o NAME,MODEL,SERIAL,LABEL,UUID,PARTUUID,TYPE
NAME      MODEL SERIAL LABEL   UUID                                 PARTUUID                             TYPE
sda       WUH72 2CKNL6                                                                                   disk
└─sda1                 HDDs    4963705989811537592                  232c74aa-5079-420d-aacf-199f9c8183f7 part
sdb       ST160 ZL21HS                                                                                   disk
└─sdb1                                                                                                   part
sdc       ST160 ZL2Q1V                                                                                   disk
└─sdc1                                                                                                   part
sdd       ST160 ZR521C                                                                                   disk
└─sdd1                 HDDs    4963705989811537592                  496fbd23-654e-487a-b481-17b50a0d7c3d part
nvme1n1   SPCC  240460                                                                                   disk
nvme0n1   SPCC  210402                                                                                   disk
├─nvme0n1p1                                                         bf8b306a-17d3-474a-91c1-d2a7de30e971 part
├─nvme0n1p2            EFI     8B8C-4F26                            8c298b13-b7ff-4a8f-aa73-11ccc3edeb47 part
└─nvme0n1p3            boot-pool 10646407501232066962               3f331e64-bc2d-4f15-928c-f081425c49eb part
zdb -U /data/zfs/zpool.cache
HDDs:
    version: 5000
    name: 'HDDs'
    state: 0
    txg: 290165
    pool_guid: 4963705989811537592
    errata: 0
    hostid: 1637503756
    hostname: 'Vault'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 4963705989811537592
        create_txg: 4
        com.klarasystems:vdev_zap_root: 66
        children[0]:
            type: 'mirror'
            id: 0
            guid: 12557942224269859001
            metaslab_array: 128
            metaslab_shift: 34
            ashift: 12
            asize: 16000893845504
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 67
            children[0]:
                type: 'disk'
                id: 0
                guid: 13392533498850484953
                path: '/dev/disk/by-partuuid/496fbd23-654e-487a-b481-17b50a0d7c3d'
                whole_disk: 0
                DTL: 110916
                create_txg: 4
                com.delphix:vdev_zap_leaf: 68
            children[1]:
                type: 'disk'
                id: 1
                guid: 17992915048327931901
                path: '/dev/disk/by-partuuid/232c74aa-5079-420d-aacf-199f9c8183f7'
                whole_disk: 0
                DTL: 110915
                create_txg: 4
                com.delphix:vdev_zap_leaf: 69
        children[1]:
            type: 'mirror'
            id: 1
            guid: 3323707249957188009
            metaslab_array: 335
            metaslab_shift: 34
            ashift: 12
            asize: 16000893845504
            is_log: 0
            create_txg: 290163
            com.delphix:vdev_zap_top: 222
            children[0]:
                type: 'disk'
                id: 0
                guid: 11454120585917499236
                path: '/dev/disk/by-partuuid/6be30b9d-db27-409c-895e-9990ab79e974'
                whole_disk: 0
                create_txg: 290163
                com.delphix:vdev_zap_leaf: 223
            children[1]:
                type: 'disk'
                id: 1
                guid: 16516640297011063002
                path: '/dev/disk/by-partuuid/31c323a0-f0fa-42f7-a92f-69f97c646ea2'
                whole_disk: 0
                create_txg: 290163
                com.delphix:vdev_zap_leaf: 224
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
        com.klarasystems:vdev_zaps_v2

ZFS_DBGMSG(zdb) START:
metaslab.c:1682:spa_set_allocator(): spa allocator: dynamic

ZFS_DBGMSG(zdb) END

If any more commands could help, let me know and I’ll run them also.
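
For reference, one more thing I can run is a quick loop that dumps the ZFS vdev label from each data partition, in case that shows whether the mirror-1 disks still identify themselves as pool members (assuming zdb is happy reading the raw partitions directly):

# Print the ZFS label (pool name, pool/vdev GUIDs) stored on each partition
for p in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1; do
    echo "===== $p"
    zdb -l "$p"
done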

Well, I guess you did not export the pool, so TrueNAS tries to import it according to the config it has stored, and realizes something is wrong.

There are no partition UUIDs for sdb1 and sdc1; that is likely why TrueNAS does not find them. Maybe it is possible to reconstruct them, but I never had to, so I am not sure if or how to achieve that.

That would have me a bit worried, running TrueNAS on a device without knowing the exact hardware.

ZFS scrubs and resilvers put a lot of strain on SATA controllers; that is why cheap SATA cards sometimes start to produce errors, at least that has been the experience in the past.
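
If gdisk can still see the GUIDs in the partition table, something along these lines might be enough to read them back and get udev to recreate the by-partuuid links; completely untested on my end, so treat it as a sketch rather than a recipe:

# Show the unique partition GUID straight from the GPT (partition 1 of sdb, assuming sgdisk is installed)
sgdisk --info=1 /dev/sdb

# If the GUID is there but /dev/disk/by-partuuid is stale, re-read the table and let udev rebuild its symlinks
partprobe /dev/sdb
udevadm trigger --subsystem-match=block && udevadm settle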

After spending a few days troubleshooting via Reddit, and learning more about ZFS than I ever thought I would, I seem to have found a way to get BOTH mirror vdevs back online, with all the data they contain apparently intact! Big thanks to @Protopia for patiently fielding my questions and providing invaluable help.

==============================

In case it helps anyone reading this thread in the future, I’ll share the steps I took to retrieve the data here:

Although lsblk and blkid did not seem to be able to read the PARTUUID values, gdisk indicated that the values and the partition type were definitely still present in the partition table. I spent a while dancing around trying to work out why this might be happening; troubleshooting steps included running partprobe to re-read the partition table, and noticing that zdb -l /dev/sdb1 showed only a single LABEL 0 entry, whereas zdb -l /dev/sda1 showed both a LABEL 0 and a LABEL 1:

zdb -l /dev/sdb1

LABEL 0

version: 5000
name: 'HDDs'
state: 0
txg: 290165
pool_guid: 4963705989811537592
errata: 0
hostid: 1637503756
hostname: 'Vault'
top_guid: 3323707249957188009
guid: 16516640297011063002
vdev_children: 2
vdev_tree:
    type: 'mirror'
    id: 1
    guid: 3323707249957188009
    metaslab_array: 335
    metaslab_shift: 34
    ashift: 12
    asize: 16000893845504
    is_log: 0
    create_txg: 290163
    children[0]:
        type: 'disk'
        id: 0
        guid: 11454120585917499236
        path: '/dev/disk/by-partuuid/6be30b9d-db27-409c-895e-9990ab79e974'
        whole_disk: 0
        create_txg: 290163
    children[1]:
        type: 'disk'
        id: 1
        guid: 16516640297011063002
        path: '/dev/disk/by-partuuid/31c323a0-f0fa-42f7-a92f-69f97c646ea2'
        whole_disk: 0
        create_txg: 290163
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
    com.klarasystems:vdev_zaps_v2
labels = 0 1 2 3 
zdb -l /dev/sda1

LABEL 0

version: 5000
name: 'HDDs'
state: 0
txg: 379469
pool_guid: 4963705989811537592
errata: 0
hostid: 1637503756
hostname: 'Vault'
top_guid: 12557942224269859001
guid: 17992915048327931901
vdev_children: 2
vdev_tree:
    type: 'mirror'
    id: 0
    guid: 12557942224269859001
    metaslab_array: 128
    metaslab_shift: 34
    ashift: 12
    asize: 16000893845504
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 13392533498850484953
        path: '/dev/disk/by-partuuid/496fbd23-654e-487a-b481-17b50a0d7c3d'
        whole_disk: 0
        DTL: 110916
        create_txg: 4
    children[1]:
        type: 'disk'
        id: 1
        guid: 17992915048327931901
        path: '/dev/disk/by-partuuid/232c74aa-5079-420d-aacf-199f9c8183f7'
        whole_disk: 0
        DTL: 110915
        create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
    com.klarasystems:vdev_zaps_v2
labels = 0 2 

LABEL 1

version: 5000
name: 'HDDs'
state: 0
txg: 290164
pool_guid: 4963705989811537592
errata: 0
hostid: 1637503756
hostname: 'Vault'
top_guid: 12557942224269859001
guid: 17992915048327931901
vdev_children: 2
vdev_tree:
    type: 'mirror'
    id: 0
    guid: 12557942224269859001
    metaslab_array: 128
    metaslab_shift: 34
    ashift: 12
    asize: 16000893845504
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 13392533498850484953
        path: '/dev/disk/by-partuuid/496fbd23-654e-487a-b481-17b50a0d7c3d'
        whole_disk: 0
        DTL: 110916
        create_txg: 4
    children[1]:
        type: 'disk'
        id: 1
        guid: 17992915048327931901
        path: '/dev/disk/by-partuuid/232c74aa-5079-420d-aacf-199f9c8183f7'
        whole_disk: 0
        DTL: 110915
        create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
    com.klarasystems:vdev_zaps_v2
labels = 1 3 

I noticed that while ls -l /dev/disk/by-partuuid/ didn’t show the missing partitions, ls -l /dev/disk/by-id/ did:

ls -l /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 Jan 27 11:55 232c74aa-5079-420d-aacf-199f9c8183f7 -> ../../sda1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 3f331e64-bc2d-4f15-928c-f081425c49eb -> ../../nvme1n1p3
lrwxrwxrwx 1 root root 10 Jan 27 11:55 496fbd23-654e-487a-b481-17b50a0d7c3d -> ../../sdd1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 8c298b13-b7ff-4a8f-aa73-11ccc3edeb47 -> ../../nvme1n1p2
lrwxrwxrwx 1 root root 15 Jan 27 11:55 bf8b306a-17d3-474a-91c1-d2a7de30e971 -> ../../nvme1n1p1

ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 9 Jan 27 20:39 ata-ST16000NE000-2RW103_ZL2Q1VVR -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan 27 20:39 ata-ST16000NE000-2RW103_ZL2Q1VVR-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 Jan 27 11:55 ata-ST16000NM000J-2TW103_ZR521CDT -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan 27 11:55 ata-ST16000NM000J-2TW103_ZR521CDT-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 9 Jan 27 20:29 ata-ST16000NM001G-2KK103_ZL21HSD1 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 27 20:29 ata-ST16000NM001G-2KK103_ZL21HSD1-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 Jan 27 11:55 ata-WUH721816ALE6L4_2CKNL63J -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 27 11:55 ata-WUH721816ALE6L4_2CKNL63J-part1 -> ../../sda1
lrwxrwxrwx 1 root root 13 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177 -> ../../nvme1n1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177-part1 -> ../../nvme1n1p1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177-part2 -> ../../nvme1n1p2
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177-part3 -> ../../nvme1n1p3
lrwxrwxrwx 1 root root 13 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177_1 -> ../../nvme1n1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177_1-part1 -> ../../nvme1n1p1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177_1-part2 -> ../../nvme1n1p2
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_210402295190177_1-part3 -> ../../nvme1n1p3
lrwxrwxrwx 1 root root 13 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_240460155221007 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Jan 27 11:55 nvme-SPCC_M.2_PCIe_SSD_240460155221007_1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Jan 27 11:55 nvme-eui.32343034010000004ce0001835323231 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Jan 27 11:55 nvme-nvme.10ec-323130343032323935313930313737-53504343204d2e32205043496520535344-00000001 -> ../../nvme1n1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-nvme.10ec-323130343032323935313930313737-53504343204d2e32205043496520535344-00000001-part1 -> ../../nvme1n1p1
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-nvme.10ec-323130343032323935313930313737-53504343204d2e32205043496520535344-00000001-part2 -> ../../nvme1n1p2
lrwxrwxrwx 1 root root 15 Jan 27 11:55 nvme-nvme.10ec-323130343032323935313930313737-53504343204d2e32205043496520535344-00000001-part3 -> ../../nvme1n1p3
lrwxrwxrwx 1 root root 9 Jan 27 20:29 wwn-0x5000c500c3802153 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 27 20:29 wwn-0x5000c500c3802153-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 Jan 27 11:55 wwn-0x5000c500db60e9eb -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan 27 11:55 wwn-0x5000c500db60e9eb-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 9 Jan 27 20:39 wwn-0x5000c500e5a14487 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan 27 20:39 wwn-0x5000c500e5a14487-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 Jan 27 11:55 wwn-0x5000cca2a1f3a23e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 27 11:55 wwn-0x5000cca2a1f3a23e-part1 -> ../../sda1

So I figure, if by-id is recognising the partitions, can I import the pool using those values, instead of the PARTUUID values?
Using zpool import -d /dev/disk/by-id HDDs:

cannot mount '/HDDs': failed to create mountpoint: Read-only file system
Import was successful, but unable to mount some datasets
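
(In hindsight, a slightly safer first attempt would probably have been a read-only import, so nothing could be written to the pool until I was sure the data was intact; if I understand the flags correctly, it would look something like this:)

# Read-only import: scan /dev/disk/by-id for pool members but disallow any writes to the pool
zpool import -o readonly=on -d /dev/disk/by-id HDDs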

zpool status now showed me the magic output I had been waiting for:

  pool: HDDs
 state: ONLINE
  scan: scrub repaired 0B in 21:53:45 with 0 errors on Thu Jan  9 21:53:47 2025
config:

	NAME                              STATE     READ WRITE CKSUM
	HDDs                              ONLINE       0     0     0
	  mirror-0                        ONLINE       0     0     0
	    wwn-0x5000c500db60e9eb-part1  ONLINE       0     0     0
	    wwn-0x5000cca2a1f3a23e-part1  ONLINE       0     0     0
	  mirror-1                        ONLINE       0     0     0
	    wwn-0x5000c500e5a14487-part1  ONLINE       0     0     0
	    wwn-0x5000c500c3802153-part1  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:26 with 0 errors on Wed Jan 22 03:45:27 2025
config:

	NAME         STATE     READ WRITE CKSUM
	boot-pool    ONLINE       0     0     0
	  nvme1n1p3  ONLINE       0     0     0

errors: No known data errors

After manually correcting some mountpoints (zfs set mountpoint=/mnt/HDDs HDDs and zfs set mountpoint=/mnt/.ix-apps HDDs/ix-apps), my data and apps are all now visible and completely accessible.
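
(Side note for anyone following along: I’ve since read that importing with an altroot sidesteps the ‘Read-only file system’ mountpoint problem, because every dataset gets mounted under the altroot instead of at its recorded path. Untested by me, but it should look roughly like this:)

# Import with an altroot so all datasets mount under /mnt rather than at their stored mountpoints
zpool import -R /mnt -d /dev/disk/by-id HDDs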

==============================

In the process of troubleshooting, I determined that the SATA controller in my Aoostar WTR Pro is none other than the ASMedia ASM1064, which I have now learned is woefully underpowered for the job.

Sadly I’m going to have to ditch my little WTR Pro, and instead build something much more reliable myself.

Thanks to all who read my earlier post or replied with troubleshooting assistance!