Running TrueNAS SCALE for a while now. Started on 22.12.4.2 (2023-10), upgraded to 23.10.1.3 (2024-02), then 23.10.2 (2024-04), and upgraded to 24.04.1 last week (I waited because of the IPv6 bug).
System specs:
CPU: Intel i5 13500
Mobo: ASRock Z790M-ITX WiFi
Mem: 2x 32GB DDR5-5600 kit (Corsair)
SSD: 2x WD Black SN770 1TB (one is the boot drive, the other is the one that went missing)
HDD: 4x 8TB WD Red SATA
GPU: ASRock Intel Arc A380 (Low Profile, 6GB) - currently out of the system
Had some stability issues since Feb/March on 23.10.2, but could never really find the cause after a reboot. Upgraded to 24.04.1 about 2 weeks ago and it ran fine up until Wednesday, when the system completely locked up and wouldn’t even POST. Seems like the issue was caused by the Intel Arc A380 GPU; after removing it the system boots normally… except one of my NVMe drives is MIA. This NVMe was a single-disk ZFS pool hosting the disks for my VMs, so no redundancy (I’m working on an HA Proxmox cluster and had intended to back the disks up there; TrueNAS was only meant to host Docker with Jellyfin and Tdarr).
The disk is still listed in the BIOS and under lspci.
I’ve also removed the drive from the system, put it in a USB-to-NVMe adapter and connected it to my (Windows) laptop. Device Manager and Disk Management see the drive just fine, including the partitions.
Next I tried with a bootable Ubuntu USB stick and the NVMe in the USB cradle connected to the original hardware. The drive shows up, zpool import even lists the zpool, and I can mount the volume. Tried it with the disk back in the motherboard and the Ubuntu USB: same thing, the drive is visible, zpool import shows the zpool, and I can mount the volume.
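For anyone retracing this, the live-USB check was roughly along these lines (just a sketch; "vmpool" stands in for the actual pool name, and the read-only import is a precaution so nothing gets written back to a possibly failing drive):

    # list importable pools found on the attached disks
    sudo zpool import
    # import read-only under an alternate root
    sudo zpool import -o readonly=on -R /mnt/recovery vmpool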
Booted TrueNAS back up… still missing. Even booted back into 23.10.2, but no dice.
I have no idea how to move forward. Would love to recover the VM disks, but I’m very much stuck.
I suspect either the NVMe is dying (somehow caused by the GPU acting up) and TrueNAS maybe marked it as dead/damaged or something (?), or something is wrong with the mobo/CPU.
But before I continue troubleshooting the hardware I really want to back up the zvols for my VMs (I have a few important ones on there and having them down is starting to get really annoying).
I just can’t figure out how to back up/copy/move the zvols. Found guides that say to do a zfs snapshot, but that just hangs. (Maybe because the NVMe is actually dying…?)
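For context, the approach those guides describe looks roughly like this (a sketch with placeholder names; the dd fallback assumes the zvol’s block device appears under /dev/zvol and just skips unreadable blocks):

    # snapshot the zvol (this is the step that hangs for me)
    sudo zfs snapshot vmpool/vm-disk1@rescue
    # stream the snapshot to a file on a healthy disk
    sudo zfs send vmpool/vm-disk1@rescue > /mnt/backup/vm-disk1.zfs
    # fallback: raw copy of the zvol block device, skipping bad reads
    sudo dd if=/dev/zvol/vmpool/vm-disk1 of=/mnt/backup/vm-disk1.raw bs=1M conv=noerror,sync status=progress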
If you’re doing this on a separate system in a separate OS and it is failing: the good news is you found the problem, the bad news is that your assumption is likely correct…
Haven’t tried a different system yet with the zvol snapshot.
Have 2 older boards lying around, gonna have a look tomorrow whether one of those has an NVMe slot. Don’t really trust the NVMe-to-USB adapter I got last week (the USB-C connector is way too loose; I can only get a reliable connection in one system).
I was hoping it was just some setting in TrueNAS that marked the drive as failed or something… but based on the reactions I’m 99% sure the drive is dead/dying.
Update:
One of the systems I still had lying around had an NVMe slot, so I popped the drive in there. The drive got recognized in the BIOS, and this board had an option to run a drive diagnostic; ran that and the drive passed.
Booted into Ubuntu from USB, tried to mount the ZFS dataset and immediately got "unknown error" warnings and system halts.
After a few tries I decided to install Ubuntu to a spare SATA SSD, managed to mount the ZFS dataset and actually copy 3 of the zvols over to a backup disk. After that I got errors that the zvols were mounted, and the system became unresponsive and even failed to reboot.
I definitely suspect the NVMe drive is faulty, probably something like heat from usage causing the errors over time (although there is a heat spreader and airflow on the drives).
First I’m going to try and import the disks into TrueNAS and see if I can at least get my Home Assistant VM back up and create a good backup.
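Roughly what that restore should look like, depending on how the copies were made (a sketch; the dataset names and the 32G size are placeholders, Zfs-Data is my main RAIDZ1 pool):

    # if the copy is a zfs send stream: receive it as a new zvol on the main pool
    sudo zfs receive Zfs-Data/homeassistant-disk < /mnt/backup/vm-disk1.zfs
    # if it is a raw image: create a zvol of at least the original size and write the image into it
    sudo zfs create -V 32G Zfs-Data/homeassistant-disk
    sudo dd if=/mnt/backup/vm-disk1.raw of=/dev/zvol/Zfs-Data/homeassistant-disk bs=1M status=progress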
Interesting, I do recognize a few things mentioned in that topic, and it would explain why the SN770 tested just fine and also seems to work just fine outside of ZFS (it’s currently running in the Proxmox cluster I’m building to host my VMs).
But I’m still not entirely sure about the mainboard or the Intel i5 13500 (BIOS/microcode has not been updated yet with regard to the 13th-gen issues), and both the SN770 and the Arc A380 GPU run fine in other systems… but both stopped working in the NAS.
And to top it all off, my main Z1 array (4x 8TB WD Red) also keeps spewing out checksum errors.
  pool: Zfs-Data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 760K in 07:58:01 with 0 errors on Fri Aug 30 16:53:06 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Zfs-Data                                  DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            1ebdd8b3-3af3-49ce-8c25-d7427e8fd1ba  DEGRADED     0     0   152  too many errors
            2fb5ebc4-b771-4f36-a163-0a05d29360eb  DEGRADED     0     0   152  too many errors
            bd56c110-73d3-43f4-a354-274f9a182157  DEGRADED     0     0   152  too many errors
            9f43ae68-4930-43ea-a00f-ae34790cfe29  DEGRADED     0     0   152  too many errors
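For anyone checking the same thing, the usual way to see whether these keep coming back (zpool clear only resets the error counters, it doesn’t repair anything):

    # show which files, if any, are affected
    zpool status -v Zfs-Data
    # reset the counters and re-scrub to see whether the errors return
    zpool clear Zfs-Data
    zpool scrub Zfs-Data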
I’m almost at a point where I can shut my NAS down and test for a while without breaking the entire network and home automation.
It’s been a while, but I wanted to leave an update for anyone finding this topic in the future.
There is definitely something off with the WD Black SN770 when used with ZFS (as mentioned in another reply here).
Long story short, I’ve pretty much redone my entire home setup at this point. The SN770s are being used in 2 Proxmox hosts with ext4 file systems and have been doing perfectly fine for the last couple of months. I’ve replaced them with SN850Xs in my TrueNAS build and haven’t had any issues in the last month. Even the errors on my 4x WD Red ZFS pool are gone.
I’ve only encountered one other weird thing: when replacing the NVMe in the M2_2 slot on the mainboard (the one that supports both PCIe and SATA), it would get recognized in the BIOS and other OSes, but not by TrueNAS for some reason. It did show up when running lspci, but would not show up as a storage device. Eventually I just did a reset on TrueNAS and restored from backup; after that the disk was detected just fine, could be imported and synced with the ZFS pool. This did not happen when replacing the disk in the other slot (PCIe only)… I guess it’s some weird issue between my specific board (ASRock Z790M-ITX) and TrueNAS. (All BIOS/firmware was up to date.)
Thanks to everyone who responded and sent me in the right direction after I made my post.