Help! Pool disappeared & Disks Showing as Unassigned

I have a TrueNAS SCALE installation that I set up about a month ago. Recently I got an alert that my pool was degraded. When I was able to get back and check it, two of the disks were showing as FAULTED, and then another one also came up as FAULTED. I restarted the system, and after that the pool was not showing at all, with the drives showing up as unassigned drives on the storage dashboard.

OS: TrueNAS-SCALE-24.04.2
CPU: i5-6600K
Memory: 16GB DDR4
System Disk: 500GB SATA SSD (onboard controller)
Pool: 5x Seagate IronWolf Pro 16TB (ST16000NT001) - RAIDZ1
Storage Controller: LSI 9300-16i (IT mode)

Short SMART tests on all drives (run before the failure) showed no issues.

Can I recover my pool / data? I have included some of the logs from when the issue appears to have started. My best guess is that there is an issue with the HBA, but I am hoping for some advice and guidance before trying anything. I know that if there is a chance of getting my data back, it relies on me not doing anything silly now.

I have another NAS with the same HBA in it that I can borrow if needed.

If anyone can provide any insight or assistance in recovering my situation it would be very much appreciated!!

Sep 28 05:08:15 nas-01 kernel: pcieport 0000:00:1b.4: AER: Corrected error message received from 0000:00:1b.4
Sep 28 05:08:15 nas-01 kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 28 05:08:15 nas-01 kernel: pcieport 0000:00:1b.4: device [8086:a2eb] error status/mask=00000001/00002000
Sep 28 05:08:15 nas-01 kernel: pcieport 0000:00:1b.4: [ 0] RxErr (First)
Sep 28 05:26:43 nas-01 kernel: pcieport 0000:00:1b.4: AER: Corrected error message received from 0000:00:1b.4
Sep 28 05:26:43 nas-01 kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 28 05:26:43 nas-01 kernel: pcieport 0000:00:1b.4: device [8086:a2eb] error status/mask=00000001/00002000
Sep 28 05:26:43 nas-01 kernel: pcieport 0000:00:1b.4: [ 0] RxErr (First)
Sep 28 06:38:16 nas-01 kernel: pcieport 0000:00:1b.4: AER: Corrected error message received from 0000:00:1b.4
Sep 28 06:38:16 nas-01 kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 28 06:38:16 nas-01 kernel: pcieport 0000:00:1b.4: device [8086:a2eb] error status/mask=00000001/00002000
Sep 28 06:38:16 nas-01 kernel: pcieport 0000:00:1b.4: [ 0] RxErr (First)
Sep 28 07:50:31 nas-01 kernel: pcieport 0000:00:1b.4: AER: Corrected error message received from 0000:00:1b.4
Sep 28 07:50:31 nas-01 kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 28 07:50:31 nas-01 kernel: pcieport 0000:00:1b.4: device [8086:a2eb] error status/mask=00000001/00002000
Sep 28 07:50:31 nas-01 kernel: pcieport 0000:00:1b.4: [ 0] RxErr (First)
Sep 28 09:43:06 nas-01 kernel: pcieport 0000:00:1b.4: AER: Corrected error message received from 0000:00:1b.4
Sep 28 09:43:06 nas-01 kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 28 09:43:06 nas-01 kernel: pcieport 0000:00:1b.4: device [8086:a2eb] error status/mask=00000001/00002000
Sep 28 09:43:06 nas-01 kernel: pcieport 0000:00:1b.4: [ 0] RxErr (First)
Sep 28 15:08:32 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 15:08:32 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 15:08:32 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 15:08:32 nas-01 kernel: sd 0:0:4:0: [sde] tag#4157 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Sep 28 15:08:32 nas-01 kernel: sd 0:0:4:0: [sde] tag#4157 Sense Key : Medium Error [current]
Sep 28 15:08:32 nas-01 kernel: sd 0:0:4:0: [sde] tag#4157 Add. Sense: Unrecovered read error
Sep 28 15:08:32 nas-01 kernel: sd 0:0:4:0: [sde] tag#4157 CDB: Read(16) 88 00 00 00 00 00 07 f6 82 b8 00 00 00 40 00 00
Sep 28 15:08:32 nas-01 kernel: zio pool=pool-01 vdev=/dev/disk/by-partuuid/e6193543-b551-4653-816b-f7c33d94df73 error=61 type=1 offset=68398968832 size=32768 flags=1573248
Sep 28 17:32:55 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:32:55 nas-01 kernel: sd 0:0:4:0: [sde] tag#4626 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Sep 28 17:32:55 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:32:55 nas-01 kernel: sd 0:0:4:0: [sde] tag#4626 Sense Key : Medium Error [current]
Sep 28 17:32:55 nas-01 kernel: sd 0:0:4:0: [sde] tag#4626 Add. Sense: Unrecovered read error
Sep 28 17:32:55 nas-01 kernel: sd 0:0:4:0: [sde] tag#4626 CDB: Read(16) 88 00 00 00 00 00 0b b8 eb 08 00 00 02 00 00 00
Sep 28 17:32:55 nas-01 kernel: zio pool=pool-01 vdev=/dev/disk/by-partuuid/e6193543-b551-4653-816b-f7c33d94df73 error=61 type=1 offset=100692004864 size=262144 flags=1074267296
Sep 28 17:34:11 nas-01 kernel: Adding 16777212k swap on /dev/mapper/sdf4. Priority:-2 extents:1 across:16777212k SS
Sep 28 17:35:14 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:35:14 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:35:14 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:35:14 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:35:14 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4694 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Sep 28 17:35:14 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4694 Sense Key : Medium Error [current]
Sep 28 17:35:14 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4694 Add. Sense: Unrecovered read error
Sep 28 17:35:14 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4694 CDB: Read(16) 88 00 00 00 00 00 00 6e b3 b0 00 00 07 30 00 00
Sep 28 17:35:17 nas-01 kernel: zio pool=pool-01 vdev=/dev/disk/by-partuuid/89b5831d-d307-4e01-9a02-f3b1835c2e5f error=61 type=1 offset=3712442368 size=1048576 flags=1074267312
Sep 28 17:35:25 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:35:25 nas-01 kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Sep 28 17:35:25 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4674 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Sep 28 17:35:25 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4674 Sense Key : Medium Error [current]
Sep 28 17:35:25 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4674 Add. Sense: Unrecovered read error
Sep 28 17:35:25 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4674 CDB: Read(16) 88 00 00 00 00 00 00 6e b9 f0 00 00 00 40 00 00
Sep 28 17:35:25 nas-01 kernel: zio pool=pool-01 vdev=/dev/disk/by-partuuid/89b5831d-d307-4e01-9a02-f3b1835c2e5f error=61 type=1 offset=3713261568 size=32768 flags=1605809
Sep 28 17:35:27 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4709 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Sep 28 17:35:27 nas-01 kernel: sd 0:0:0:0: [sdb] tag#4709 Sense Key : Medium Error [current]

** many similar entries removed **

Sep 28 17:50:20 nas-01 kernel: sd 0:0:0:0: device_block, handle(0x0009)
Sep 28 17:50:22 nas-01 kernel: sd 0:0:0:0: device_unblock and setting to running, handle(0x0009)
Sep 28 17:50:22 nas-01 kernel: sd 0:0:0:0: [sdb] Synchronizing SCSI cache
Sep 28 17:50:22 nas-01 kernel: sd 0:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 28 17:50:22 nas-01 kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Sep 28 17:50:22 nas-01 kernel: mpt3sas_cm0: removing handle(0x0009), sas_addr(0x4433221102000000)
Sep 28 17:50:22 nas-01 kernel: mpt3sas_cm0: enclosure logical id(0x500062b200263780), slot(0)
Sep 28 17:50:22 nas-01 kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )
Sep 28 17:50:30 nas-01 kernel: Adding 16777212k swap on /dev/mapper/sdf4. Priority:-2 extents:1 across:16777212k SS
Sep 28 17:51:06 nas-01 netdata[1540425]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:51:29 nas-01 netdata[1540621]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:51:52 nas-01 netdata[1540844]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:52:14 nas-01 netdata[1541090]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:52:37 nas-01 netdata[1541303]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:53:00 nas-01 netdata[1541498]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:53:23 nas-01 netdata[1541705]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:53:33 nas-01 kernel: task:txg_sync state:D stack:0 pid:1458 ppid:2 flags:0x00004000
Sep 28 17:53:33 nas-01 kernel: Call Trace:
Sep 28 17:53:33 nas-01 kernel:
Sep 28 17:53:33 nas-01 kernel: __schedule+0x349/0x950
Sep 28 17:53:33 nas-01 kernel: schedule+0x5b/0xa0
Sep 28 17:53:33 nas-01 kernel: schedule_timeout+0x98/0x160
Sep 28 17:53:33 nas-01 kernel: ? __pfx_process_timeout+0x10/0x10
Sep 28 17:53:33 nas-01 kernel: io_schedule_timeout+0x50/0x80
Sep 28 17:53:33 nas-01 kernel: __cv_timedwait_common+0x12a/0x160 [spl]
Sep 28 17:53:33 nas-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Sep 28 17:53:33 nas-01 kernel: __cv_timedwait_io+0x19/0x20 [spl]
Sep 28 17:53:33 nas-01 kernel: zio_wait+0x10f/0x220 [zfs]
Sep 28 17:53:33 nas-01 kernel: dsl_pool_sync_mos+0x37/0xa0 [zfs]
Sep 28 17:53:33 nas-01 kernel: dsl_pool_sync+0x3b9/0x410 [zfs]
Sep 28 17:53:33 nas-01 kernel: spa_sync_iterate_to_convergence+0xdc/0x200 [zfs]
Sep 28 17:53:33 nas-01 kernel: spa_sync+0x30a/0x5e0 [zfs]
Sep 28 17:53:33 nas-01 kernel: txg_sync_thread+0x1ec/0x270 [zfs]
Sep 28 17:53:33 nas-01 kernel: ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Sep 28 17:53:33 nas-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Sep 28 17:53:33 nas-01 kernel: thread_generic_wrapper+0x5b/0x70 [spl]
Sep 28 17:53:33 nas-01 kernel: kthread+0xe5/0x120
Sep 28 17:53:33 nas-01 kernel: ? __pfx_kthread+0x10/0x10
Sep 28 17:53:33 nas-01 kernel: ret_from_fork+0x31/0x50
Sep 28 17:53:33 nas-01 kernel: ? __pfx_kthread+0x10/0x10
Sep 28 17:53:33 nas-01 kernel: ret_from_fork_asm+0x1b/0x30
Sep 28 17:53:33 nas-01 kernel:

** many similar entries removed **

Sep 28 17:53:33 nas-01 kernel: Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
Sep 28 17:53:43 nas-01 kernel: mpt3sas_cm0: handle(0x9) sas_address(0x4433221102000000) port_type(0x1)
Sep 28 17:53:45 nas-01 netdata[1541965]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:53:51 nas-01 kernel: mpt3sas_cm0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
Sep 28 17:53:59 nas-01 kernel: mpt3sas_cm0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
Sep 28 17:54:08 nas-01 netdata[1542174]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:54:08 nas-01 kernel: mpt3sas_cm0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
Sep 28 17:54:29 nas-01 kernel: scsi 0:0:5:0: attempting task abort!scmd(0x00000000e6c941da), outstanding for 45800 ms & timeout 20000 ms
Sep 28 17:54:29 nas-01 kernel: scsi 0:0:5:0: tag#4687 CDB: Inquiry 12 00 00 00 24 00
Sep 28 17:54:29 nas-01 kernel: scsi target0:0:5: handle(0x0009), sas_address(0x4433221102000000), phy(2)
Sep 28 17:54:29 nas-01 kernel: scsi target0:0:5: enclosure logical id(0x500062b200263780), slot(0)
Sep 28 17:54:29 nas-01 kernel: scsi target0:0:5: enclosure level(0x0000), connector name( )
Sep 28 17:54:31 nas-01 netdata[1542374]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.
Sep 28 17:54:33 nas-01 kernel: scsi 0:0:5:0: task abort: SUCCESS scmd(0x00000000e6c941da)
Sep 28 17:54:33 nas-01 kernel: end_device-0:5: add: handle(0x0009), sas_addr(0x4433221102000000)
Sep 28 17:54:33 nas-01 kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Sep 28 17:54:33 nas-01 kernel: mpt3sas_cm0: removing handle(0x0009), sas_addr(0x4433221102000000)
Sep 28 17:54:33 nas-01 kernel: mpt3sas_cm0: enclosure logical id(0x500062b200263780), slot(0)
Sep 28 17:54:33 nas-01 kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )
Sep 28 17:54:54 nas-01 netdata[1542578]: CONFIG: cannot load cloud config ‘/var/lib/netdata/cloud.d/cloud.conf’. Running with internal defaults.

Is this a bare metal install or in a VM?
How are you cooling the HBA?
Output of zpool status please.
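
From a shell on the NAS, something like this should do it (the `-v` is optional, it just adds more error detail):

```
zpool status -v
```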

If 2 drives are really faulted in a raidz1, your data is gone.

Yes, I would try swapping the HBA.


And possibly zpool import if the pool is not already showing in status.
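
Roughly (pool name taken from your logs, adjust if yours differs) — note that running `zpool import` with no arguments only lists pools available for import, it doesn't change anything:

```
# list any pools the system can see but has not imported
zpool import

# if pool-01 shows up and looks sane, import it by name
zpool import pool-01
```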

Wrap the output in triple backticks ``` so the formatting is preserved.

And of course, I will add the usual comment: always do backups! A backup (or two, or three) is always necessary, even with RAIDZ3, four-way mirrors, etc.

Bare metal. I don’t seem to be able to boot it now; it is stuck on “Job systemd-tmpfiles-setup.service/start running (1h 20min 00s / no limit)”.

I can try swapping out the HBA tomorrow.

Is it possible to move the drives over to my other working NAS and import the pool there? And is that a good idea to try if all else fails? I set up a small test system before building the real one, and importing a pool seemed to work fine there. I want to be sure before trying it with this one, though!

OK, I swapped over the HBA and I am just getting the same errors as before (critical medium error). Previously TrueNAS wouldn’t start even after 90 minutes of waiting, but this time it did start after about 10 minutes.

The pool is now showing, although degraded. One disk is unavailable (this disk shows as an Unassigned Disk in the storage dashboard). 2 show as degraded and 2 as online.

All disks are showing at least some errors (there were none on any of them before this issue started).

What is my best course of action from here? Should I try to bring sdc and sdf back online?
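
(I’m assuming that would be something along these lines, using the partuuids of the two degraded disks from the status output below; I’m not even sure I have the sdc/sdf mapping right, and I haven’t run anything yet.)

```
zpool online pool-01 26d0cd57-3493-4545-ae98-bc32ee63526f
zpool online pool-01 e6193543-b551-4653-816b-f7c33d94df73
```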

I have another NAS that I can back up data to if I can access this degraded pool, I really don’t want to do anything silly at this point though!

Any advice would be very much appreciated!!!

zpool status:

pool: pool-01
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Oct 6 10:59:09 2024
9.60T / 60.4T scanned at 10.8G/s, 0B / 51.5T issued
0B resilvered, 0.00% done, no estimated completion time
config:

    NAME                                      STATE     READ WRITE CKSUM
    pool-01                                   DEGRADED     0     0     0
      raidz1-0                                DEGRADED     0     0     0
        89b5831d-d307-4e01-9a02-f3b1835c2e5f  UNAVAIL  5.27K 4.28K     0
        94d0758c-44ab-42af-b41d-129969551861  ONLINE       0     0    13
        26d0cd57-3493-4545-ae98-bc32ee63526f  DEGRADED     0     0    13  too many errors
        e6193543-b551-4653-816b-f7c33d94df73  DEGRADED    37     0    13  too many errors
        99c6c170-a493-439a-ad2a-83a9db1df9a2  ONLINE       0     0    13

Having the exact same number of checksum errors on all disks can point to faulty cabling or a faulty controller.

Again, how were you cooling your 9300-16i?

There is a 120mm fan mounted directly underneath it, running at 100%. The heatsink doesn’t seem to get any hotter than ‘very slightly warm’.

  • Then I would copy all the important data off it ASAP.
  • Run long SMART tests on the drives (example commands below).
  • Replace the faulted drive if needed.
  • Run a scrub.
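
Rough sketch of the SMART test and scrub steps (device names here are just examples, check yours with lsblk or the Disks page first):

```
# start a long SMART self-test on each data drive (takes many hours on 16TB disks)
smartctl -t long /dev/sdb
smartctl -t long /dev/sdc
smartctl -t long /dev/sdd
smartctl -t long /dev/sde
smartctl -t long /dev/sdf

# check the results once a test has finished
smartctl -a /dev/sdb

# then scrub the pool
zpool scrub pool-01
```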

Maybe a silly question, but my datasets are gone. I can see and browse the pool in the shell, but in the Datasets section of the UI I just get the default message inviting me to create a pool.