I just upgraded my system from Cobia (23.10.2) to Dragonfish (24.04.1.1).
All went smoothly with the exception that my boot-pool mirror was degraded for some reason:
CRITICAL
Boot pool status is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
zpool status confirmed this:
pool: boot-pool
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0B in 00:00:15 with 0 errors on Wed Jun 5 03:45:16 2024
config:
        NAME          STATE     READ WRITE CKSUM
        boot-pool     DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            sda3      REMOVED      0     0     0
            sdc3      ONLINE       0     0     0
"zpool online boot-pool sda3" brought it back online, so there's no apparent harm done. It was just a little disconcerting!
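For anyone who wants to check this from a script rather than by eye, here is a minimal sketch that picks out every pool member whose state is not ONLINE from `zpool status` text. The function name `non_online_members` is made up for this example, and the parsing assumes the column layout shown in the output above (name first, state second, inside the `config:` section).

```python
def non_online_members(status_text):
    """Return (name, state) pairs for config entries whose state is not ONLINE."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "config:":
            in_config = True
            continue
        if fields[0] == "errors:":
            # The "errors:" line follows the config table in zpool status output.
            in_config = False
            continue
        if not in_config or fields[0] == "NAME":
            continue
        if len(fields) >= 2 and fields[1] != "ONLINE":
            found.append((fields[0], fields[1]))
    return found


# Sample text copied from the status output in this post.
sample = """\
  pool: boot-pool
 state: DEGRADED
config:
        NAME          STATE     READ WRITE CKSUM
        boot-pool     DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            sda3      REMOVED      0     0     0
            sdc3      ONLINE       0     0     0
"""

for name, state in non_online_members(sample):
    print(name, state)
```

On a live system the text could come from `subprocess.run(["zpool", "status", "boot-pool"], capture_output=True, text=True).stdout` instead of the sample string.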
What I find particularly odd, though, is that /dev/sda is the boot device. So it was obviously online enough for the BIOS to boot from it, but it "failed" at some point subsequent to it booting.
You also need to remember that your boot drive consists of more than just the boot-pool partition, and that the boot-pool mirror only covers the boot-pool itself and not anything else on the boot drive.
I have no idea how you set up your sdc to mirror sda1/2, nor how the TrueNAS upgrade process handles mirroring things such as UEFI and GRUB, which are outside the boot-pool.
Thanks, I'm not 100% clear on how the boot drive is laid out, either (online documentation seems to be non-existent).
The boot-pool was initially set up at install to use /dev/sda and /dev/sdb, with no clever/alternate options as it was my first few hours using TrueNAS Scale and I was (still am!) a relative newbie.
With the system's habit of renaming devices when it reboots, the second disk in the mirror has been sdb, sdd, sde and sdc throughout its short life.
FWIW, I've rebooted the server a few times today and it's come up with the mirror intact each time. Today for the first time, though, it's actually renamed the first disk in the mirror, too, so it's now sde and sdc. I was under the impression that the boot disk would always be sda, but I guess not.
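Since the sdX names move around between boots, one way to keep track of which physical disk is which is to read the persistent symlinks that udev normally creates under /dev/disk/by-id on Linux. The sketch below is only an illustration: `stable_names` is a hypothetical helper name, and the exact identifier strings that appear depend on the drives and controller.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent identifier to the kernel name it currently points at."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        if os.path.islink(link):
            # realpath follows the symlink, e.g. to /dev/sde; keep only "sde".
            mapping[entry] = os.path.basename(os.path.realpath(link))
    return mapping


# Only attempt the listing where the directory actually exists.
if os.path.isdir("/dev/disk/by-id"):
    for ident, dev in stable_names().items():
        print(f"{ident} -> {dev}")
```

Running it before and after a reboot should show the same identifiers pointing at different sdX names, which matches the renaming described above.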
Okay, follow-up question: the actual boot device (what used to be /dev/sda but is now /dev/sde; I kept track of the serial numbers!) is now showing 18 checksum errors from a scrub, and appears to be stuck in a smartctl long test for over 24hrs.
I'm running Multi-Report every day, with a short test daily Tues-Sun, and a long test on Mondays. Disks are scrubbed every 28 days.
I suspect it may be unhappy and want to replace it. I've run another scrub today, with no errors, and a short test which passed.
As you (@Protopia) mention that the boot drive contains more than just the boot-pool, what is the procedure for replacing a boot device?
Output of 'smartctl -x /dev/sde', for reference:
root@eurybia[/mnt/Pool1/homes/garyp]# smartctl -x /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X439_S16331T6AMD
Revision: NA04
Compliance: SPC-4
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5002538a75801c60
Serial number: S20JNWAG800454
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Tue Jun 11 09:44:46 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Percentage used endurance indicator: 0%
Current Drive Temperature: 29 C
Drive Trip Temperature: 60 C
Manufactured in week 31 of year 2015
Accumulated start-stop cycles: 262
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      23312.546           0
write:         0        0         0         0          0      28514.214           0
verify:        0        0         0         0          0     247014.160           0
Non-medium error count: 58
SMART Self-test log
Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                  number   (hours)
# 1  Background short  Completed                        -     64651              - [-   -    -]
# 2  Background long   Self test in progress ...        -       NOW              - [-   -    -]
# 3  Background short  Completed                        -     64601              - [-   -    -]
# 4  Background short  Completed                        -     64577              - [-   -    -]
# 5  Background short  Completed                        -     64553              - [-   -    -]
Long (extended) Self-test duration: 3600 seconds [60.0 minutes]
Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 64652:23 [3879143 minutes]
Number of background scans performed: 80, scan progress: 90.93%
Number of background medium scans performed: 80
Device does not support General statistics and performance logging
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 3
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: power on
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538a75801c61
attached SAS address = 0x5001438007bdbba6
attached phy identifier = 3
Invalid DWORD count = 31086
Running disparity error count = 33568
Loss of DWORD synchronization count = 3
Phy reset problem count = 1
Phy event descriptors:
Received ERROR count: 12744
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 3171
Received SSP frame error count: 0
relative target port id = 2
generation code = 3
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5002538a75801c62
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
Phy event descriptors:
Received ERROR count: 0
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 0
Received SSP frame error count: 0
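The per-phy counters in the SAS port log above are easier to compare between runs if they are pulled out as numbers. This is a rough sketch that does so from `smartctl -x` text. `phy_error_counts` is a hypothetical helper name, and the parsing assumes the "name = value" layout shown in this output. The counters are link-level counts, so a value that keeps rising may point at the link (cable, backplane, expander) rather than the flash itself, but that is an inference and not something the output proves.

```python
import re

# Counter names as they appear in the port log above.
COUNTERS = (
    "Invalid DWORD count",
    "Running disparity error count",
    "Loss of DWORD synchronization count",
    "Phy reset problem count",
)


def phy_error_counts(smartctl_text):
    """Return one dict per phy: {'phy': id, <counter name>: value, ...}."""
    phys = []
    current = None
    for raw in smartctl_text.splitlines():
        line = raw.strip()
        # Anchored at line start, so "attached phy identifier" is not matched.
        m = re.match(r"phy identifier = (\d+)$", line)
        if m:
            current = {"phy": int(m.group(1))}
            phys.append(current)
            continue
        if current is None:
            continue
        for name in COUNTERS:
            m = re.match(re.escape(name) + r" = (\d+)$", line)
            if m:
                current[name] = int(m.group(1))
    return phys


# Sample lines copied from the smartctl output in this post.
sample = """\
phy identifier = 0
attached phy identifier = 3
Invalid DWORD count = 31086
Running disparity error count = 33568
Loss of DWORD synchronization count = 3
Phy reset problem count = 1
phy identifier = 1
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
"""

for phy in phy_error_counts(sample):
    nonzero = {k: v for k, v in phy.items() if k != "phy" and v > 0}
    print("phy", phy["phy"], nonzero)
```

Saving the result each day and diffing it against the previous run would show whether the counts on phy 0 are still climbing or are left over from an earlier event.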