Boot-pool Degraded After Upgrade to Dragonfish

I just upgraded my system from Cobia (23.10.2) to Dragonfish (24.04.1.1).

All went smoothly with the exception that my boot-pool mirror was degraded for some reason:

CRITICAL

Boot pool status is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

zpool status confirmed this:

 pool: boot-pool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using zpool online' or replace the device with
	'zpool replace'.
  scan: scrub repaired 0B in 00:00:15 with 0 errors on Wed Jun  5 03:45:16 2024
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   DEGRADED     0     0     0
	  mirror-0  DEGRADED     0     0     0
	    sda3    REMOVED      0     0     0
	    sdc3    ONLINE       0     0     0

'zpool online boot-pool sda3' brought it back online, so there's no apparent harm done. It was just a little disconcerting!
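For anyone who wants to spot this from a script rather than wait for the GUI alert, here is a minimal sketch that lists every row of the config table whose state is not ONLINE. The function name `degraded_devices` is made up for illustration, and it assumes the NAME/STATE column layout shown in the `zpool status` output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS or TrueNAS): reads `zpool status`
# text on stdin and prints "<name> <state>" for each config row whose
# STATE column is anything other than ONLINE.
degraded_devices() {
    awk '
        # The config table starts at its header row.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # A blank line ends the table.
        in_cfg && NF == 0 { in_cfg = 0 }
        # Report rows that have a state column and are not ONLINE.
        in_cfg && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage (needs a ZFS system and suitable privileges):
#   zpool status boot-pool | degraded_devices
```

Against the output above this should report boot-pool, mirror-0 and sda3. Note that it would also list things like spares in other pools, since their state is not ONLINE either.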

What I find particularly odd, though, is that /dev/sda is the boot device. So it was obviously online enough for the BIOS to boot from it, but it "failed" at some point subsequent to it booting.

You also need to remember that your boot drive consists of more than just the boot-pool partition, and that the boot-pool mirror only covers the boot-pool itself and not anything else on the boot drive.

I have no idea how you set up your sdc to mirror sda1/2 nor how the TN upgrade process handles mirroring e.g. UEFI, Grub etc. which are outside the boot-pool.

Thanks, I'm not 100% clear on how the boot drive is laid out, either (online documentation seems to be non-existent).

The boot-pool was initially set up at install to use /dev/sda and /dev/sdb, with no clever/alternate options as it was my first few hours using TrueNAS Scale and I was (still am!) a relative newbie.

With the system's habit of renaming devices when it reboots, the second disk in the mirror has been sdb, sdd, sde and sdc throughout its short life.

FWIW, I've rebooted the server a few times today and it's come up with the mirror intact each time. Today for the first time, though, it's actually renamed the first disk in the mirror, too, so it's now sde and sdc. I was under the impression that the boot disk would always be sda, but I guess not.
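Since the sdX names move around between boots, tracking disks by serial number is more reliable. A small sketch, assuming `lsblk` reports a serial for the disk in question (the string it shows may not match what `smartctl` prints for every drive or controller); the helper name `name_for_serial` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "NAME SERIAL" rows on stdin (as printed by
# `lsblk -dn -o NAME,SERIAL`) and prints the kernel name of the disk
# whose serial equals the first argument.
name_for_serial() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Usage on a live system, with a serial you have noted down:
#   lsblk -dn -o NAME,SERIAL | name_for_serial S20JNWAG800454
```

It prints nothing if no disk reports that serial, which is worth checking for before acting on the result.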

Okay, follow-up question: the actual boot device (what used to be /dev/sda but is now /dev/sde; I kept track of the serial numbers!) is now showing 18 checksum errors from a scrub, and appears to be stuck in a smartctl long test for over 24hrs.

I'm running Multi-Report every day, with a short test daily Tues-Sun, and a long test on Mondays. Disks are scrubbed every 28 days.

I suspect it may be unhappy and want to replace it. I've run another scrub today, with no errors, and a short test which passed.

As you (@Protopia) mention that the boot drive contains more than just the boot-pool, what is the procedure for replacing a boot device?

Output of 'smartctl -x /dev/sde', for reference:
root@eurybia[/mnt/Pool1/homes/garyp]# smartctl -x /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               NETAPP
Product:              X439_S16331T6AMD
Revision:             NA04
Compliance:           SPC-4
User Capacity:        1,600,321,314,816 bytes [1.60 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538a75801c60
Serial number:        S20JNWAG800454
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Jun 11 09:44:46 2024 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     29 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2015
Accumulated start-stop cycles:  262
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      23312.546           0
write:         0        0         0         0          0      28514.214           0
verify:        0        0         0         0          0     247014.160           0

Non-medium error count:       58

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64651                 - [-   -    -]
# 2  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 3  Background short  Completed                   -   64601                 - [-   -    -]
# 4  Background short  Completed                   -   64577                 - [-   -    -]
# 5  Background short  Completed                   -   64553                 - [-   -    -]

Long (extended) Self-test duration: 3600 seconds [60.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 64652:23 [3879143 minutes]
    Number of background scans performed: 80,  scan progress: 90.93%
    Number of background medium scans performed: 80
Device does not support General statistics and performance logging

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 3
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5002538a75801c61
    attached SAS address = 0x5001438007bdbba6
    attached phy identifier = 3
    Invalid DWORD count = 31086
    Running disparity error count = 33568
    Loss of DWORD synchronization count = 3
    Phy reset problem count = 1
    Phy event descriptors:
     Received ERROR count: 12744
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 3171
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 3
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5002538a75801c62
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
    Phy event descriptors:
     Received ERROR count: 0
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 0
     Received SSP frame error count: 0
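The SAS port log at the end of that output carries per-phy link counters that are easy to lose in the wall of text. Below is a sketch for pulling them out so successive runs can be compared. The function name `phy_error_counters` is made up, it assumes the line format shown above, and whether the non-zero counts on phy 0 have anything to do with the checksum errors is not established here.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -x` text on stdin and prints the
# link error counters from the SAS SSP port log, tagged with the phy
# identifier they belong to.
phy_error_counters() {
    awk '
        # Remember the current phy; skip the "attached phy identifier" line.
        /^[ \t]*phy identifier = / { phy = $NF }
        /Invalid DWORD count|Running disparity error count|Loss of DWORD synchronization count|Phy reset problem count/ {
            split($0, kv, " = ")          # kv[1] = label, kv[2] = value
            sub(/^[ \t]+/, "", kv[1])     # drop leading indentation
            print "phy " phy ": " kv[1] " = " kv[2]
        }
    '
}

# Usage (needs root and smartmontools):
#   smartctl -x /dev/sde | phy_error_counters
```

Saving the result with a date stamp each day would show whether the counts are still climbing or are left over from an earlier event.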

stick new device in, use gui to replace it

System Settings → Boot → Boot Pool Status, select device to replace, click ⠇ then replace… etc


Ah, nice! I'd wondered why the boot-pool wasn't represented in similar fashion in the Storage settings area.