Lost NVMe on reboot

I needed to shut my system down to install a GPU for Plex transcoding. After rebooting, my special metadata vdev showed as degraded, with one of the 3 NVMe devices showing UNAVAIL.

Running lsblk showed the disk present but without a partition, like this (a mocked-up display, because the original scrolled off the terminal). I can’t remember exactly what size it reported, but it was not 1.8T.

nvme1n1     259:0    0   1.8T  0 disk
└─nvme1n1p1 259:1    0   1.8T  0 part
nvme0n1     259:2    0   1.8T  0 disk
└─nvme0n1p1 259:4    0   1.8T  0 part
nvme4n1     259:?    0   ????  0 disk

lspci correctly shows all the devices:

root@truenas[~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
00:08.0 VGA compatible controller: Intel Corporation DG2 [Arc A310] (rev 05)
00:09.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]
00:0a.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]
00:0b.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA Controller [AHCI mode] (rev 09)
00:0c.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5765DL NVMe SSD Controller (DRAM-less) (rev 01)
00:0d.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
00:0e.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5765DL NVMe SSD Controller (DRAM-less) (rev 01)
00:0f.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]
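
For anyone wanting to match the nvmeN names to the PCI addresses above, here is a minimal sketch. It assumes the kernel exposes the usual `address` and `state` attributes under /sys/class/nvme. The function name `list_nvme_ctrls` and its optional directory argument are my own invention for illustration, not part of any tool.

```shell
#!/bin/sh
# Print "<controller> <pci-address> <state>" for each NVMe controller.
# The optional first argument is a hypothetical override of the sysfs
# directory; with no argument it defaults to /sys/class/nvme.
list_nvme_ctrls() {
    root="${1:-/sys/class/nvme}"
    for ctrl in "$root"/nvme*; do
        # Skip the literal pattern when nothing matches the glob.
        [ -d "$ctrl" ] || continue
        name=$(basename "$ctrl")
        # Fall back to "unknown" if an attribute is missing or unreadable.
        addr=$(cat "$ctrl/address" 2>/dev/null || echo unknown)
        state=$(cat "$ctrl/state" 2>/dev/null || echo unknown)
        printf '%s %s %s\n' "$name" "$addr" "$state"
    done
}
```

Calling `list_nvme_ctrls` with no argument reads the live system. A controller that failed to initialise may not appear in the list at all, in which case the missing PCI address is the one to look at.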

Trying to look at the device and then create a GPT label with parted gave errors:

root@truenas[~]# parted /dev/nvme4n1
Warning: Error fsyncing/closing /dev/nvme4n1: Input/output error
Retry/Ignore? r
Warning: Error fsyncing/closing /dev/nvme4n1: Input/output error
Retry/Ignore? i
GNU Parted 3.5
Using /dev/nvme4n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Error: Input/output error during write on /dev/nvme4n1
Retry/Ignore/Cancel? q
parted: invalid token: q
Retry/Ignore/Cancel? c
(parted) q
Warning: Error fsyncing/closing /dev/nvme4n1: Input/output error
Retry/Ignore?
Retry/Ignore? i
root@truenas[~]#

There were also errors written to /var/log/syslog at this point. There were none written during the boot process.

I then tried to remove the device and rescan the pci bus:

root@truenas[~]# echo 1 | sudo tee /sys/bus/pci/devices/0000:00:0f.0/remove
1
root@truenas[~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0  16.4T  0 disk
├─sda1        8:1    0     2G  0 part
└─sda2        8:2    0  16.4T  0 part
sdb           8:16   0  16.4T  0 disk
├─sdb1        8:17   0     2G  0 part
└─sdb2        8:18   0  16.4T  0 part
sdc           8:32   0  16.4T  0 disk
├─sdc1        8:33   0     2G  0 part
└─sdc2        8:34   0  16.4T  0 part
sdd           8:48   0  16.4T  0 disk
├─sdd1        8:49   0     2G  0 part
└─sdd2        8:50   0  16.4T  0 part
sde           8:64   0  16.4T  0 disk
└─sde1        8:65   0  16.4T  0 part
sdf           8:80   0  16.4T  0 disk
└─sdf1        8:81   0  16.4T  0 part
sr0          11:0    1   1.9G  0 rom
xvda        202:0    0   120G  0 disk
├─xvda1     202:1    0     1M  0 part
├─xvda2     202:2    0   512M  0 part
└─xvda3     202:3    0 119.5G  0 part
zd0         230:0    0   512G  0 disk
zd16        230:16   0    15T  0 disk
nvme1n1     259:0    0   1.8T  0 disk
└─nvme1n1p1 259:1    0   1.8T  0 part
nvme0n1     259:2    0   1.8T  0 disk
└─nvme0n1p1 259:4    0   1.8T  0 part
nvme3n1     259:5    0 476.9G  0 disk
└─nvme3n1p1 259:7    0 476.9G  0 part
nvme2n1     259:6    0 476.9G  0 disk
└─nvme2n1p1 259:8    0 476.9G  0 part
root@truenas[~]# echo 1 | sudo tee /sys/bus/pci/rescan
1
root@truenas[~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0  16.4T  0 disk
├─sda1        8:1    0     2G  0 part
└─sda2        8:2    0  16.4T  0 part
sdb           8:16   0  16.4T  0 disk
├─sdb1        8:17   0     2G  0 part
└─sdb2        8:18   0  16.4T  0 part
sdc           8:32   0  16.4T  0 disk
├─sdc1        8:33   0     2G  0 part
└─sdc2        8:34   0  16.4T  0 part
sdd           8:48   0  16.4T  0 disk
├─sdd1        8:49   0     2G  0 part
└─sdd2        8:50   0  16.4T  0 part
sde           8:64   0  16.4T  0 disk
└─sde1        8:65   0  16.4T  0 part
sdf           8:80   0  16.4T  0 disk
└─sdf1        8:81   0  16.4T  0 part
sr0          11:0    1   1.9G  0 rom
xvda        202:0    0   120G  0 disk
├─xvda1     202:1    0     1M  0 part
├─xvda2     202:2    0   512M  0 part
└─xvda3     202:3    0 119.5G  0 part
zd0         230:0    0   512G  0 disk
zd16        230:16   0    15T  0 disk
nvme1n1     259:0    0   1.8T  0 disk
└─nvme1n1p1 259:1    0   1.8T  0 part
nvme0n1     259:2    0   1.8T  0 disk
└─nvme0n1p1 259:4    0   1.8T  0 part
nvme3n1     259:5    0 476.9G  0 disk
└─nvme3n1p1 259:7    0 476.9G  0 part
nvme2n1     259:6    0 476.9G  0 disk
└─nvme2n1p1 259:8    0 476.9G  0 part
root@truenas[~]#

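The remove did drop the device, but the rescan didn’t bring it back. According to /var/log/syslog, the rescan did rediscover the PCI function at 00:0f.0, but the NVMe controller never became ready and the driver gave up after 20 seconds: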

Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: [144d:a80c] type 00 class 0x010802 PCIe Endpoint
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: BAR 0 [mem 0xf1f14000-0xf1f17fff 64bit]
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:0f.0 (capable of 63.012 Gb/s with 16.0 GT/s PCIe x4 link)
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: BAR 0 [mem 0xf1c08000-0xf1c0bfff 64bit]: assigned
Nov 29 17:02:33 truenas kernel: nvme nvme4: pci function 0000:00:0f.0
Nov 29 17:02:53 truenas kernel: nvme nvme4: Device not ready; aborting initialisation, CSTS=0x0
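
To make sense of the CSTS value in that last line, here is a small sketch that decodes the low bits of the NVMe Controller Status register as I understand them from the NVMe specification: bit 0 is RDY (ready), bit 1 is CFS (controller fatal status) and bits 3:2 are SHST (shutdown status). The function name `decode_csts` is just something I made up for this post.

```shell
#!/bin/sh
# Decode the low bits of an NVMe CSTS register value (hex or decimal).
decode_csts() {
    val=$(( $1 ))
    rdy=$(( val & 1 ))            # bit 0: controller ready
    cfs=$(( (val >> 1) & 1 ))     # bit 1: controller fatal status
    shst=$(( (val >> 2) & 3 ))    # bits 3:2: shutdown status
    printf 'RDY=%d CFS=%d SHST=%d\n' "$rdy" "$cfs" "$shst"
}

decode_csts 0x0   # prints "RDY=0 CFS=0 SHST=0"
```

If I am reading this correctly, CSTS=0x0 means the controller answered register reads but never set RDY and did not flag a fatal error. That is different from a device that has dropped off the bus entirely, which would normally read back as all ones.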

This is TrueNAS SCALE 25.04.2.6 virtualised under XCP-ng 8.3, with the main disk controller and all NVMe drives passed through directly.

Am I right in thinking my next step is another reboot to see what happens? Could the addition of the GPU (also passed through) have anything to do with this?