NVMe drive gone

Been running TrueNAS SCALE for a while now: started on 22.12.4.2 (2023-10), upgraded to 23.10.1.3 (2024-02), then 23.10.2 (2024-04), and upgraded to 24.04.1 last week (I waited because of the IPv6 bug).

System specs:

  • CPU: Intel i5 13500
  • Mobo: ASRock Z790M-ITX WiFi
  • Mem: 2x32GB DDR5-5600 kit (Corsair)
  • SSD: 2x WD Black SN770 1TB (1 is boot, the other is the missing drive)
  • HDD: 4x 8TB WD Red SATA
  • GPU: ASRock Intel Arc A380 (Low Profile 6GB) - currently out of the system

I’d had some stability issues since Feb/March on 23.10.2, but could never really find the cause after a reboot. Upgraded to 24.04.1 about 2 weeks ago; it ran fine up until Wednesday, when the system completely locked up and wouldn’t even POST. The issue seems to have been caused by the Intel Arc A380 GPU; after removing it the system boots normally… except one of my NVMe drives is MIA. This NVMe was part of a single-drive ZFS pool hosting the disks for my VMs, so no redundancy (I’m working on an HA Proxmox cluster and had intended to back the disks up there; TrueNAS was only meant to host a Docker with Jellyfin and tadrr).

The drive is still listed in the BIOS and under lspci.

zpool list only shows NVME0-Data, not NVME1-Data.

nvme list only shows the boot NVMe.

I’ve also removed the drive from the system, put it in a USB-to-NVMe adapter, and connected it to my (Windows) laptop. Device Manager and Disk Management see the drive just fine, including the partitions.

Next I tried a bootable Ubuntu USB stick with the NVMe in the USB cradle, connected to the original hardware. The drive shows up, zpool import even lists the pool, and I can mount the volume. Tried it again with the drive back in the motherboard, booting from the Ubuntu USB: same thing, the drive is visible, zpool import shows the pool, and I can mount the volume.
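For anyone else poking at a suspect drive this way: importing the pool read-only keeps ZFS from writing anything to it while you look around. A minimal sketch, using the pool name NVME1-Data from above; the mountpoint is arbitrary:

```shell
# Import the suspect pool read-only so ZFS never writes to the failing drive.
# NVME1-Data is the pool name from this thread; /mnt/recovery is an arbitrary altroot.
sudo mkdir -p /mnt/recovery
sudo zpool import -o readonly=on -R /mnt/recovery NVME1-Data

# Check the error counters before trusting any data you copy off.
sudo zpool status -v NVME1-Data
```

Read-only import also avoids replaying the intent log, which matters if the drive is failing on writes.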

Booted the TrueNAS backup… still missing. Even booted back into 23.10.2, but no dice.

I have no idea how to move forward. Would love to recover the VMs’ disks, but I’m very much stuck.

With TrueNAS running and the NVMe drives in their proper slots, what is the result of lsblk?

And what is the result of zpool status?

Damn, thought I covered my bases :sweat_smile:

lsblk (2nd NVMe not present)

zpool status (also not present)

And I forgot to mention: the VM disks are zvols (if that matters; I’m still learning ZFS).

So in TrueNAS SCALE (Debian Linux) the device is shown in lspci but not in lsblk, but in Ubuntu it is working just fine.
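When a controller shows up in lspci but no /dev/nvme* node appears, the kernel log usually says why (a probe failure, controller reset timeouts, etc.). A quick look, best run as root; the device names in the output are examples only:

```shell
#!/bin/sh
# List whatever NVMe device nodes the kernel actually created (may be none).
nodes=$(ls /dev/nvme* 2>/dev/null || echo "none")
echo "nvme device nodes: $nodes"

# Look for probe errors, controller timeouts, or "Removing after probe failure".
dmesg 2>/dev/null | grep -i nvme | tail -n 20
```

If the TrueNAS kernel logs a probe failure while Ubuntu’s kernel doesn’t, that points at a kernel/firmware interaction rather than a dead drive.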

I’m flummoxed.

Ah great, it’s not just me then. :smiley:

I suspect either the NVMe is dying (somehow caused by the GPU acting funky) and TrueNAS maybe marked it as dead/damaged or something (?), or something is wrong with the mobo/CPU.

But before I continue troubleshooting the hardware, I really want to back up the zvols for my VMs (I have a few important ones on there, and having them down is starting to get really annoying).

I just can’t figure out how to back up/copy/move the zvols. I found guides that say to do a zfs snapshot, but that just hangs. (Maybe because the NVMe is actually dying…?)
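For the record, since zvols don’t show up under a mountpoint the way filesystems do: they can be copied off either as a ZFS stream with zfs send, or as a raw block image from /dev/zvol. A sketch with made-up names; NVME1-Data/haos is a placeholder for whatever zfs list actually shows:

```shell
# See which zvols exist on the pool.
zfs list -t volume

# Option 1: snapshot the zvol and save it as a ZFS stream file.
zfs snapshot NVME1-Data/haos@rescue
zfs send NVME1-Data/haos@rescue > /mnt/backup/haos.zfs

# Option 2: raw block copy; conv=noerror,sync keeps dd going past bad reads
# (bad sectors come out as zeros instead of aborting the whole copy).
dd if=/dev/zvol/NVME1-Data/haos of=/mnt/backup/haos.raw bs=1M conv=noerror,sync status=progress
```

If the snapshot hangs because the drive is failing, the raw dd route may still limp through and salvage most of the image.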

If you’re doing this on a separate system in a separate OS and it is still failing, then the good news is you found the problem; the bad news is that your assumption is likely correct…

Haven’t tried a different system yet with the zvol snapshot.

Have 2 older boards lying around; gonna have a look tomorrow to see if one of those has an NVMe slot. Don’t really trust the NVMe-to-USB adapter I got last week (the USB-C connector is way too loose; I can only get a reliable connection in one system).

I was hoping it was just some setting in TrueNAS that marked the drive as failed or something… but based on the reactions I’m 99% sure the drive is dead/dying.

Update:
One of the systems I still had lying around has an NVMe slot, so I popped the drive in there. The drive got recognized in the BIOS, and this board has an option to run a drive diagnostic; I ran that and the drive passed.

Booted into Ubuntu from USB, tried to mount the ZFS dataset, and immediately got “unknown error” warnings and system halts.

After a few tries I decided to install Ubuntu on a spare SATA SSD. I managed to mount the ZFS dataset and actually copy 3 of the zvols over to a backup disk. After that I got errors that the zvols were mounted, and the system became unresponsive and even failed to reboot.

I definitely suspect the NVMe drive is faulty; probably something like heat from usage causing the errors over time. (Although there is a heat spreader and airflow on the drives.)

First I’m going to try to import the disks onto TrueNAS and see if I can at least get my Home Assistant VM back up and create a good backup.
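If the zvols were copied off as zfs send stream files, they can be replayed onto the healthy pool with zfs receive; a raw dd image can instead be attached to a VM as a disk image. A sketch, assuming hypothetical file and dataset names from a send-to-file backup (NVME0-Data is the surviving pool from this thread):

```shell
# Replay a saved ZFS stream onto the surviving pool (names are examples).
zfs receive NVME0-Data/haos < /mnt/backup/haos.zfs

# The restored zvol then appears as a block device that TrueNAS can attach to the VM.
ls /dev/zvol/NVME0-Data/
```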