Boot-pool Degraded After Upgrade to Dragonfish

I just upgraded my system from Cobia (23.10.2) to Dragonfish (24.04.1.1).

All went smoothly with the exception that my boot-pool mirror was degraded for some reason:

CRITICAL

Boot pool status is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

zpool status confirmed this:

 pool: boot-pool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using zpool online' or replace the device with
	'zpool replace'.
  scan: scrub repaired 0B in 00:00:15 with 0 errors on Wed Jun  5 03:45:16 2024
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   DEGRADED     0     0     0
	  mirror-0  DEGRADED     0     0     0
	    sda3    REMOVED      0     0     0
	    sdc3    ONLINE       0     0     0

‘zpool online boot-pool sda3’ brought it back online, so there’s no apparent harm done. It was just a little disconcerting!
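
For anyone who would rather not read the status table by eye, here is a minimal Python sketch that scans `zpool status` text for rows whose state is not ONLINE. The function name `non_online_rows` is my own, and the sample text is copied from the output above. It assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout and is not anything TrueNAS ships.

```python
def non_online_rows(status_text):
    """Return (name, state) pairs for config rows whose state is not ONLINE."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # header row: the table starts here
            continue
        if not in_config:
            continue
        if not fields:                # a blank line ends the table
            in_config = False
            continue
        if len(fields) >= 2 and fields[1] != "ONLINE":
            rows.append((fields[0], fields[1]))
    return rows


# Sample taken from the zpool status output earlier in this post.
SAMPLE = """\
config:

\tNAME        STATE     READ WRITE CKSUM
\tboot-pool   DEGRADED     0     0     0
\t  mirror-0  DEGRADED     0     0     0
\t    sda3    REMOVED      0     0     0
\t    sdc3    ONLINE       0     0     0
"""

if __name__ == "__main__":
    for name, state in non_online_rows(SAMPLE):
        print(name, state)
```

On a live system the text could come from `subprocess.run(["zpool", "status", "boot-pool"], capture_output=True, text=True).stdout` instead of the sample string.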

What I find particularly odd, though, is that /dev/sda is the boot device. So it was obviously online enough for the BIOS to boot from it, but it “failed” at some point subsequent to it booting.

You also need to remember that your boot drive consists of more than just the boot-pool partition, and that the boot-pool mirror only covers the boot-pool itself and not anything else on the boot drive.

I have no idea how you set up your sdc to mirror sda1/2, nor how the TN upgrade process handles mirroring of things like UEFI, GRUB etc., which are outside the boot-pool.

Thanks, I’m not 100% clear on how the boot drive is laid out, either (online documentation seems to be non-existent).

The boot-pool was initially set up at install to use /dev/sda and /dev/sdb, with no clever/alternate options as it was my first few hours using TrueNAS Scale and I was (still am!) a relative newbie.

With the system’s habit of renaming devices when it reboots, the second disk in the mirror has been sdb, sdd, sde and sdc throughout its short life.

FWIW, I’ve rebooted the server a few times today and it’s come up with the mirror intact each time. Today for the first time, though, it’s actually renamed the first disk in the mirror, too, so it’s now sde and sdc. I was under the impression that the boot disk would always be sda, but I guess not.
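
Since the sdX letters move around between boots, it can help to look disks up by a stable name instead. A minimal Python sketch, assuming a Linux system where udev populates /dev/disk/by-id with symlinks to the kernel device nodes. The function name `by_id_map` is hypothetical, and the alias names you see will depend on your hardware.

```python
import os


def by_id_map(by_id_dir="/dev/disk/by-id"):
    """Map each kernel device name (e.g. 'sde') to its sorted by-id aliases."""
    mapping = {}
    for alias in sorted(os.listdir(by_id_dir)):
        # Each entry is a symlink; resolve it to the real device node.
        target = os.path.realpath(os.path.join(by_id_dir, alias))
        mapping.setdefault(os.path.basename(target), []).append(alias)
    return mapping


if __name__ == "__main__":
    # Only attempt the lookup where the directory actually exists.
    if os.path.isdir("/dev/disk/by-id"):
        for dev, aliases in sorted(by_id_map().items()):
            print(dev, aliases)
```

The aliases usually embed the model and serial number or WWN, so the same physical disk can be recognised whichever letter it gets on a given boot.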

Okay, follow-up question: the actual boot device (what used to be /dev/sda but is now /dev/sde; I kept track of the serial numbers!) is now showing 18 checksum errors from a scrub, and appears to have been stuck in a smartctl long test for over 24 hours.

I’m running Multi-Report every day, with a short test daily Tues-Sun, and a long test on Mondays. Disks are scrubbed every 28 days.

I suspect it may be unhappy and want to replace it. I’ve run another scrub today, with no errors, and a short test which passed.

As you (@Protopia) mention that the boot drive contains more than just the boot-pool, what is the procedure for replacing a boot device?

Output of 'smartctl -x /dev/sde', for reference:
root@eurybia[/mnt/Pool1/homes/garyp]# smartctl -x /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               NETAPP
Product:              X439_S16331T6AMD
Revision:             NA04
Compliance:           SPC-4
User Capacity:        1,600,321,314,816 bytes [1.60 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538a75801c60
Serial number:        S20JNWAG800454
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Jun 11 09:44:46 2024 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     29 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2015
Accumulated start-stop cycles:  262
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      23312.546           0
write:         0        0         0         0          0      28514.214           0
verify:        0        0         0         0          0     247014.160           0

Non-medium error count:       58

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64651                 - [-   -    -]
# 2  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 3  Background short  Completed                   -   64601                 - [-   -    -]
# 4  Background short  Completed                   -   64577                 - [-   -    -]
# 5  Background short  Completed                   -   64553                 - [-   -    -]

Long (extended) Self-test duration: 3600 seconds [60.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 64652:23 [3879143 minutes]
    Number of background scans performed: 80,  scan progress: 90.93%
    Number of background medium scans performed: 80
Device does not support General statistics and performance logging

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 3
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5002538a75801c61
    attached SAS address = 0x5001438007bdbba6
    attached phy identifier = 3
    Invalid DWORD count = 31086
    Running disparity error count = 33568
    Loss of DWORD synchronization count = 3
    Phy reset problem count = 1
    Phy event descriptors:
     Received ERROR count: 12744
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 3171
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 3
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5002538a75801c62
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
    Phy event descriptors:
     Received ERROR count: 0
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 0
     Received SSP frame error count: 0
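
To compare the phy counters across reboots or cable swaps, they can be pulled out of the `smartctl -x` text. A minimal Python sketch, assuming the SAS layout shown above (a `relative target port id = N` header per port, followed by `name = value` lines). `phy_error_counts` is a hypothetical helper, not part of smartmontools. These are link-level counters reported by the phy, so they describe the SAS connection rather than the flash media.

```python
import re

# Counter names exactly as they appear in the smartctl output above.
COUNTERS = (
    "Invalid DWORD count",
    "Running disparity error count",
    "Loss of DWORD synchronization count",
    "Phy reset problem count",
)


def phy_error_counts(smart_text):
    """Return {port_id: {counter_name: value}} parsed from smartctl -x text."""
    ports = {}
    current = None
    for line in smart_text.splitlines():
        m = re.match(r"\s*relative target port id = (\d+)", line)
        if m:
            current = ports.setdefault(int(m.group(1)), {})
            continue
        if current is None:
            continue                  # not inside a port section yet
        m = re.match(r"\s*(.+?) = (\d+)\s*$", line)
        if m and m.group(1) in COUNTERS:
            current[m.group(1)] = int(m.group(2))
    return ports


# Excerpt from the output above.
SAMPLE = """\
relative target port id = 1
    Invalid DWORD count = 31086
    Running disparity error count = 33568
    Loss of DWORD synchronization count = 3
    Phy reset problem count = 1
relative target port id = 2
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
"""

if __name__ == "__main__":
    for port, counts in phy_error_counts(SAMPLE).items():
        print(port, counts)
```

Saving the result after each reboot and diffing it would show whether the counts on port 1 are still climbing or were a one-off.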

Stick the new device in, then use the GUI to replace it.

System Settings → Boot → Boot Pool Status, select device to replace, click ⋮ then replace… etc


Ah, nice! I’d wondered why the boot-pool wasn’t represented in similar fashion in the Storage settings area.