Hi all,
My system froze while I was copying files over SMB. When I looked at the monitor, it was showing an error like the one in the original post of this thread: "Problem with NVME driver/ boot-pool", posted by amigo3271 (topic ID: 20579). (I can’t post links, so I can’t link directly to the post.)
So I rebooted, at which point it gave me the following error:
nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
nvme nvme0: Disabling device after reset failure: -19
zio pool=boot-pool vdev=/dev/nvme0n1p2 error=5 type=1 offset=37130704384 size=93184 flags=1573248
zio pool=boot-pool vdev=/dev/nvme0n1p2 error=5 type=2 offset=154624455680 size=3072 flags=1589374
// It then repeats the type=1 error with various offsets, sizes and flags
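In case it helps anyone reading the zio lines above, here is a minimal Python sketch that decodes them. The function name `decode_zio_line` and the regex are my own for illustration. The sketch assumes `error` is a standard errno value and that `type` follows the OpenZFS zio type numbering (only the low, stable values are listed).

```python
import errno
import re

# Assumed mapping from the OpenZFS zio type enum; higher values are left out
# because their names have changed between releases.
ZIO_TYPES = {0: "null", 1: "read", 2: "write", 3: "free", 4: "claim"}

# Matches the "zio pool=... vdev=... error=... type=..." lines pasted above.
LINE_RE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def decode_zio_line(line):
    """Return a dict describing one zio error line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "pool": m["pool"],
        "vdev": m["vdev"],
        # Translate the numeric errno into its symbolic name, e.g. 5 -> EIO.
        "errno": errno.errorcode.get(int(m["error"]), "unknown"),
        "op": ZIO_TYPES.get(int(m["type"]), "other"),
        "offset": int(m["offset"]),
        "size": int(m["size"]),
    }


if __name__ == "__main__":
    sample = (
        "zio pool=boot-pool vdev=/dev/nvme0n1p2 error=5 type=1 "
        "offset=37130704384 size=93184 flags=1573248"
    )
    print(decode_zio_line(sample))
```

Read this way, the two lines are an I/O error (EIO) on a read and on a write to nvme0n1p2, which fits the controller having dropped off the bus rather than a ZFS-level problem.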
I then tried to reboot into an older boot config (I figured it wouldn’t change anything, but I was just troubleshooting) and it didn’t even get that far; instead it dropped me into the BusyBox initramfs shell. After that, even the default boot config only loaded the initramfs and no longer made it as far as the error above.
I then tried to load a config with debug mode, and it brought me back to the original error about the controller being down.
The drive is mirrored, but I had to get to work, so I haven’t tried unplugging nvme0 to see whether the other drive is still able to boot. Based on the errors, however, I don’t think that will boot either.
Anyway, it seems pretty clear that either the drive or the controller(?) has gone bad. Since it’s the boot pool, I can’t get into the OS to try the “nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off” parameters, or at least I’m not familiar with how to do so with the boot pool down and only the initramfs accessible. I’m hoping for some verification that my troubleshooting is leading me down the right path, and for input on whether replacing the drive will fix it (with an OS reinstall, obviously) or whether there’s more going on here.
System is an iXsystems R50 (out of warranty, unfortunately) and is/was running TrueNAS Core EE with a boot pool of two mirrored NVMe drives and a separate pool for the data, so I’m hopeful I’ll be able to just reimport the data drives after a new install. If not, I get to put my backup plans into practice, hah.
Thanks!