I upgraded from 23.10 to 24.04 this morning. My server is pretty basic, with no apps or VMs, just 4 zpools of various types: fast (single NVMe SSD), big (8TB HDD 2-disk mirror), medium (RAIDZ with crappy consumer SATA SSDs), and slow (six old 2TB HDDs in a RAIDZ2 configuration).
After the upgrade, on the first boot, everything seems to be working except that the slow pool is gone. The UI section for it just shows an error icon (with "Pool contains OFFLINE Data VDEVs" hover text) and all the disks are gone.
On the command line, I see this:
root@truenas:/home/admin# zpool import
   pool: slow
     id: 8876966371717654632
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        slow                                      UNAVAIL  insufficient replicas
          raidz2-0                                UNAVAIL  insufficient replicas
            3fcf7700-3773-490e-8266-4865f2dd654f  UNAVAIL
            802388e3-4276-4c70-9376-68f94875f599  UNAVAIL
            2fe2321e-bb47-4f74-8834-36667371f943  UNAVAIL
            5f846717-e8e0-40e0-955a-eaa46129024a  ONLINE
            63513c5f-4213-4fce-9e32-f46c4b7d5d0b  ONLINE
            09296c5c-ea1d-46a7-99e6-3f920b08e880  UNAVAIL
So four of the disks seem to be missing. And indeed, using lsblk, they appear to be missing from the system entirely:
root@truenas:/home/admin# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda             8:0    0 465.8G  0 disk
└─sda1          8:1    0 465.8G  0 part
sdb             8:16   0 465.8G  0 disk
├─sdb1          8:17   0     1M  0 part
├─sdb2          8:18   0   512M  0 part
├─sdb3          8:19   0 449.3G  0 part
└─sdb4          8:20   0    16G  0 part
  └─sdb4      253:0    0    16G  0 crypt [SWAP]
sdc             8:32   0 465.8G  0 disk
└─sdc1          8:33   0 465.8G  0 part
sdd             8:48   0 465.8G  0 disk
└─sdd1          8:49   0 465.8G  0 part
sde             8:64   0 465.8G  0 disk
└─sde1          8:65   0 465.8G  0 part
sdf             8:80   0   7.3T  0 disk
└─sdf1          8:81   0   7.3T  0 part
sdg             8:96   0   1.8T  0 disk
└─sdg1          8:97   0   1.8T  0 part
sdh             8:112  0   7.3T  0 disk
└─sdh1          8:113  0   7.3T  0 part
sdi             8:128  0   1.8T  0 disk
└─sdi1          8:129  0   1.8T  0 part
zd0           230:0    0   200G  0 disk
nvme0n1       259:0    0 953.9G  0 disk
├─nvme0n1p1   259:2    0   512M  0 part
├─nvme0n1p2   259:3    0   500M  0 part
└─nvme0n1p3   259:4    0 952.9G  0 part
nvme1n1       259:1    0   3.6T  0 disk
└─nvme1n1p1   259:5    0   3.6T  0 part
root@truenas:/home/admin#
I don’t have previous lsblk output to compare against, but there used to be sdj, sdk, sdl, and sdm.
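My plan, unless someone has a better idea, is to check whether the kernel even detected those disks during boot. I haven't dug into this yet, so the exact grep patterns below are just my guesses at what to look for:

root@truenas:/home/admin# dmesg | grep -iE 'ata[0-9]+|scsi|sd[a-z]'    # SATA link / disk detection messages from this boot
root@truenas:/home/admin# journalctl -k -b -1 | grep -iE 'ata|error'   # kernel log from the previous boot, if the journal is persistent
root@truenas:/home/admin# ls -l /dev/disk/by-partuuid/                 # the partition UUIDs zpool import is looking for should appear here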
I found the timing (first boot after the upgrade) too suspicious to think it is unrelated. Nevertheless, I unplugged and re-plugged all the SATA and power cables and booted again. No change.
What should I do next?
I could imagine some scenario where a regular Linux machine needs a config update to support more than N drives, but that seems unlikely for TrueNAS. The drives are connected to a PCIe SATA expansion card, but so are the drives for the big pool, which seems to be working fine.
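If it helps, I can post the controller details too. I'm assuming something like this would identify the card and confirm the kernel bound a driver to it (I don't know offhand which AHCI/ASMedia/JMicron chip it uses):

root@truenas:/home/admin# lspci -nnk | grep -iA3 sata   # SATA controller(s) and the driver in use
root@truenas:/home/admin# ls /sys/class/ata_port/       # how many ATA ports the kernel registered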
Luckily for me, the slow pool is basically just a junk pool for testing. But I would like to understand what happened and how best to debug it. (And, if there is a bug involved, to generate a good bug report.)
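On the bug-report angle, my guess is that evidence like the following would be worth collecting and attaching, assuming the missing disks turn out to be a kernel/driver enumeration issue rather than ZFS itself (corrections welcome):

root@truenas:/home/admin# journalctl -k -b 0 > kernel-current-boot.log     # kernel log from this boot
root@truenas:/home/admin# journalctl -k -b -1 > kernel-previous-boot.log   # previous boot, if the journal persists across reboots
root@truenas:/home/admin# lspci -nnk > lspci.txt                           # controller model and driver binding
root@truenas:/home/admin# zpool import -d /dev/disk/by-partuuid            # what ZFS can see, by the same IDs as in the output above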