I upgraded from 23.10 to 24.04 this morning. My server is pretty basic, with no apps or VMs, just 4 zpools of various types: fast (single NVMe SSD), big (8TB HDD 2-disk mirror), medium (RAIDZ with crappy consumer SATA SSDs), and slow (six old 2TB HDDs in a RAIDZ2 configuration).
After the upgrade, on the first boot, everything seems to be working except that the slow pool is gone. The UI section for it just shows an error icon (with "Pool contains OFFLINE Data VDEVs" hover text) and all the disks are gone.
On the command line, I see this:
root@truenas:/home/admin# zpool import
   pool: slow
     id: 8876966371717654632
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        slow                                      UNAVAIL  insufficient replicas
          raidz2-0                                UNAVAIL  insufficient replicas
            3fcf7700-3773-490e-8266-4865f2dd654f  UNAVAIL
            802388e3-4276-4c70-9376-68f94875f599  UNAVAIL
            2fe2321e-bb47-4f74-8834-36667371f943  UNAVAIL
            5f846717-e8e0-40e0-955a-eaa46129024a  ONLINE
            63513c5f-4213-4fce-9e32-f46c4b7d5d0b  ONLINE
            09296c5c-ea1d-46a7-99e6-3f920b08e880  UNAVAIL
So four of the disks seem to be missing. And indeed, using lsblk, they appear to be missing from the system entirely:
root@truenas:/home/admin# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda             8:0    0 465.8G  0 disk
└─sda1          8:1    0 465.8G  0 part
sdb             8:16   0 465.8G  0 disk
├─sdb1          8:17   0     1M  0 part
├─sdb2          8:18   0   512M  0 part
├─sdb3          8:19   0 449.3G  0 part
└─sdb4          8:20   0    16G  0 part
  └─sdb4      253:0    0    16G  0 crypt [SWAP]
sdc             8:32   0 465.8G  0 disk
└─sdc1          8:33   0 465.8G  0 part
sdd             8:48   0 465.8G  0 disk
└─sdd1          8:49   0 465.8G  0 part
sde             8:64   0 465.8G  0 disk
└─sde1          8:65   0 465.8G  0 part
sdf             8:80   0   7.3T  0 disk
└─sdf1          8:81   0   7.3T  0 part
sdg             8:96   0   1.8T  0 disk
└─sdg1          8:97   0   1.8T  0 part
sdh             8:112  0   7.3T  0 disk
└─sdh1          8:113  0   7.3T  0 part
sdi             8:128  0   1.8T  0 disk
└─sdi1          8:129  0   1.8T  0 part
zd0           230:0    0   200G  0 disk
nvme0n1       259:0    0 953.9G  0 disk
├─nvme0n1p1   259:2    0   512M  0 part
├─nvme0n1p2   259:3    0   500M  0 part
└─nvme0n1p3   259:4    0 952.9G  0 part
nvme1n1       259:1    0   3.6T  0 disk
└─nvme1n1p1   259:5    0   3.6T  0 part
root@truenas:/home/admin#
I don’t have previous lsblk output to compare against, but there used to be sdj, sdk, sdl, and sdm.
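My plan, unless someone has a better idea, is to check whether the kernel even detected those disks during boot. I haven't dug into this yet, so the exact grep patterns below are just my guesses at what to look for:

root@truenas:/home/admin# dmesg | grep -iE 'ata[0-9]+|scsi|sd[a-z]'    # SATA link / disk detection messages from this boot
root@truenas:/home/admin# journalctl -k -b -1 | grep -iE 'ata|error'   # kernel log from the previous boot, if the journal is persistent
root@truenas:/home/admin# ls -l /dev/disk/by-partuuid/                 # the partition UUIDs zpool import is looking for should appear here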
I found the timing (first boot after the upgrade) too suspicious to think it is unrelated. Nevertheless, I unplugged and re-plugged all the SATA and power cables and booted again. No change.
What should I do next?
I could imagine some scenario where a regular Linux machine needs a config update to support more than N drives, but that seems unlikely for TrueNAS. The drives are connected to a PCIe SATA expansion card, but so are the drives for the big pool, which seems to be working fine.
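If it helps, I can post the controller details too. I'm assuming something like this would identify the card and confirm the kernel bound a driver to it (I don't know offhand which AHCI/ASMedia/JMicron chip it uses):

root@truenas:/home/admin# lspci -nnk | grep -iA3 sata   # SATA controller(s) and the driver in use
root@truenas:/home/admin# ls /sys/class/ata_port/       # how many ATA ports the kernel registered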
Luckily for me, the slow pool is basically just a junk pool for testing. But I would like to understand what happened and how best to debug it. (And, if there is a bug involved, to generate a good bug report.)
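On the bug-report angle, my guess is that evidence like the following would be worth collecting and attaching, assuming the missing disks turn out to be a kernel/driver enumeration issue rather than ZFS itself (corrections welcome):

root@truenas:/home/admin# journalctl -k -b 0 > kernel-current-boot.log     # kernel log from this boot
root@truenas:/home/admin# journalctl -k -b -1 > kernel-previous-boot.log   # previous boot, if the journal persists across reboots
root@truenas:/home/admin# lspci -nnk > lspci.txt                           # controller model and driver binding
root@truenas:/home/admin# zpool import -d /dev/disk/by-partuuid            # what ZFS can see, by the same IDs as in the output above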