I needed to shut my system down to install a GPU for Plex transcoding. After rebooting, my special metadata vdev showed as degraded, with one of its three NVMe devices reported as UNAVAIL.
Running lsblk showed the disk present but without a partition, like this (a mocked-up display, because the original scrolled off the terminal). I can't remember exactly what size it reported, but it was not 1.8T:
nvme1n1 259:0 0 1.8T 0 disk
└─nvme1n1p1 259:1 0 1.8T 0 part
nvme0n1 259:2 0 1.8T 0 disk
└─nvme0n1p1 259:4 0 1.8T 0 part
nvme4n1 259:? 0 ???? 0 disk
lspci correctly shows all the devices:
root@truenas[~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
00:08.0 VGA compatible controller: Intel Corporation DG2 [Arc A310] (rev 05)
00:09.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]
00:0a.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]
00:0b.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA Controller [AHCI mode] (rev 09)
00:0c.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5765DL NVMe SSD Controller (DRAM-less) (rev 01)
00:0d.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
00:0e.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5765DL NVMe SSD Controller (DRAM-less) (rev 01)
00:0f.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]
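To tie a kernel name such as nvme4 to one of the PCI addresses above, sysfs can be read directly. This is a minimal sketch, assuming the usual Linux layout where /sys/class/nvme/<name>/device links to the PCI function. The function name nvme_pci_map and its optional root argument are made up for illustration and are not part of any tool:

```shell
#!/usr/bin/env bash
# Print each NVMe controller name next to the PCI address it sits on.
# The optional first argument is a hypothetical override of the sysfs root,
# present only so the function can be pointed at a test tree.
nvme_pci_map() {
    local root="${1:-/sys}"
    local ctrl
    for ctrl in "$root"/class/nvme/nvme*; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$ctrl/device" ] || continue
        # The "device" symlink resolves to the PCI function directory,
        # whose basename is the address, e.g. 0000:00:0f.0
        printf '%s %s\n' "$(basename "$ctrl")" \
            "$(basename "$(readlink -f "$ctrl/device")")"
    done
}

nvme_pci_map
```

Note that sysfs prints the address with a domain prefix (0000:00:0f.0) whereas lspci shows it without (00:0f.0). A controller that failed to initialise will likely not be listed at all.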
Opening the device in parted and then trying to create a GPT label both gave I/O errors:
root@truenas[~]# parted /dev/nvme4n1
Warning: Error fsyncing/closing /dev/nvme4n1: Input/output error
Retry/Ignore? r
Warning: Error fsyncing/closing /dev/nvme4n1: Input/output error
Retry/Ignore? i
GNU Parted 3.5
Using /dev/nvme4n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Error: Input/output error during write on /dev/nvme4n1
Retry/Ignore/Cancel? q
parted: invalid token: q
Retry/Ignore/Cancel? c
(parted) q
Warning: Error fsyncing/closing /dev/nvme4n1: Input/output error
Retry/Ignore?
Retry/Ignore? i
root@truenas[~]#
There were also errors written to /var/log/syslog at this point. There were none written during the boot process.
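For anyone wanting to pull only the relevant lines out of the log, a small filter along these lines might help. It is a sketch only: the helper name nvme_log_lines is hypothetical, and it matches any line that mentions "nvme" or the given PCI address as a literal string.

```shell
#!/usr/bin/env bash
# Print log lines that mention NVMe or a given PCI address.
#   $1 = log file to search (for example /var/log/syslog)
#   $2 = PCI address to look for (for example 0000:00:0f.0)
nvme_log_lines() {
    local logfile="$1" addr="$2"
    # -F treats both patterns as fixed strings, so the dots and colons
    # in the PCI address are not interpreted as regex characters.
    grep -F -e 'nvme' -e "$addr" "$logfile"
}
```

It would be called as: nvme_log_lines /var/log/syslog 0000:00:0f.0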
I then tried to remove the device and rescan the PCI bus:
root@truenas[~]# echo 1 | sudo tee /sys/bus/pci/devices/0000:00:0f.0/remove
1
root@truenas[~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 16.4T 0 disk
├─sda1 8:1 0 2G 0 part
└─sda2 8:2 0 16.4T 0 part
sdb 8:16 0 16.4T 0 disk
├─sdb1 8:17 0 2G 0 part
└─sdb2 8:18 0 16.4T 0 part
sdc 8:32 0 16.4T 0 disk
├─sdc1 8:33 0 2G 0 part
└─sdc2 8:34 0 16.4T 0 part
sdd 8:48 0 16.4T 0 disk
├─sdd1 8:49 0 2G 0 part
└─sdd2 8:50 0 16.4T 0 part
sde 8:64 0 16.4T 0 disk
└─sde1 8:65 0 16.4T 0 part
sdf 8:80 0 16.4T 0 disk
└─sdf1 8:81 0 16.4T 0 part
sr0 11:0 1 1.9G 0 rom
xvda 202:0 0 120G 0 disk
├─xvda1 202:1 0 1M 0 part
├─xvda2 202:2 0 512M 0 part
└─xvda3 202:3 0 119.5G 0 part
zd0 230:0 0 512G 0 disk
zd16 230:16 0 15T 0 disk
nvme1n1 259:0 0 1.8T 0 disk
└─nvme1n1p1 259:1 0 1.8T 0 part
nvme0n1 259:2 0 1.8T 0 disk
└─nvme0n1p1 259:4 0 1.8T 0 part
nvme3n1 259:5 0 476.9G 0 disk
└─nvme3n1p1 259:7 0 476.9G 0 part
nvme2n1 259:6 0 476.9G 0 disk
└─nvme2n1p1 259:8 0 476.9G 0 part
root@truenas[~]# echo 1 | sudo tee /sys/bus/pci/rescan
1
root@truenas[~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 16.4T 0 disk
├─sda1 8:1 0 2G 0 part
└─sda2 8:2 0 16.4T 0 part
sdb 8:16 0 16.4T 0 disk
├─sdb1 8:17 0 2G 0 part
└─sdb2 8:18 0 16.4T 0 part
sdc 8:32 0 16.4T 0 disk
├─sdc1 8:33 0 2G 0 part
└─sdc2 8:34 0 16.4T 0 part
sdd 8:48 0 16.4T 0 disk
├─sdd1 8:49 0 2G 0 part
└─sdd2 8:50 0 16.4T 0 part
sde 8:64 0 16.4T 0 disk
└─sde1 8:65 0 16.4T 0 part
sdf 8:80 0 16.4T 0 disk
└─sdf1 8:81 0 16.4T 0 part
sr0 11:0 1 1.9G 0 rom
xvda 202:0 0 120G 0 disk
├─xvda1 202:1 0 1M 0 part
├─xvda2 202:2 0 512M 0 part
└─xvda3 202:3 0 119.5G 0 part
zd0 230:0 0 512G 0 disk
zd16 230:16 0 15T 0 disk
nvme1n1 259:0 0 1.8T 0 disk
└─nvme1n1p1 259:1 0 1.8T 0 part
nvme0n1 259:2 0 1.8T 0 disk
└─nvme0n1p1 259:4 0 1.8T 0 part
nvme3n1 259:5 0 476.9G 0 disk
└─nvme3n1p1 259:7 0 476.9G 0 part
nvme2n1 259:6 0 476.9G 0 disk
└─nvme2n1p1 259:8 0 476.9G 0 part
root@truenas[~]#
The remove did drop the device, but after the rescan it did not reappear in lsblk. The following was logged to /var/log/syslog at the time:
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: [144d:a80c] type 00 class 0x010802 PCIe Endpoint
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: BAR 0 [mem 0xf1f14000-0xf1f17fff 64bit]
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:0f.0 (capable of 63.012 Gb/s with 16.0 GT/s PCIe x4 link)
Nov 29 17:02:33 truenas kernel: pci 0000:00:0f.0: BAR 0 [mem 0xf1c08000-0xf1c0bfff 64bit]: assigned
Nov 29 17:02:33 truenas kernel: nvme nvme4: pci function 0000:00:0f.0
Nov 29 17:02:53 truenas kernel: nvme nvme4: Device not ready; aborting initialisation, CSTS=0x0
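On reading CSTS=0x0 in the last line: in the NVMe specification, the Controller Status register carries RDY (ready) in bit 0, CFS (controller fatal status) in bit 1 and SHST (shutdown status) in bits 3:2. A value of 0x0 therefore means the controller never reported ready and did not flag a fatal error either. The driver gave up after the 20 seconds between the 17:02:33 and 17:02:53 entries. A small decoder, with the hypothetical name decode_csts, makes the bit layout explicit:

```shell
#!/usr/bin/env bash
# Decode the low bits of an NVMe CSTS (Controller Status) register value.
#   $1 = register value, decimal or hex (for example 0x0)
decode_csts() {
    local v=$(( $1 ))   # shell arithmetic accepts 0x-prefixed hex
    # bit 0 = RDY, bit 1 = CFS, bits 3:2 = SHST
    printf 'RDY=%d CFS=%d SHST=%d\n' \
        $(( v & 1 )) $(( (v >> 1) & 1 )) $(( (v >> 2) & 3 ))
}

decode_csts 0x0
```

This only interprets the number the kernel already printed. It does not talk to the device.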
This is TrueNAS SCALE 25.04.2.6 virtualised under XCP-ng 8.3, with the main disk controller and all NVMe drives passed through directly.
Am I right in thinking my next step is another reboot to see what happens? Could the addition of the GPU (also passed through) have anything to do with this?