TrueNAS Pools OFFLINE After Every Reboot – Disks Not Recognized

Hello there, this is my first post, so forgive me if I get anything wrong.

First of all, my problem:
Each time I reboot Proxmox or TrueNAS, my pools become corrupted in some way, and TrueNAS is not able to recognize the disks to assign them to the pool.

Here’s an example of the error:

What I have
I’m running TrueNAS Scale, up to date, virtualized on Proxmox, also up to date.
This is a home server with the following specs:

Motherboard: Gigabyte B560M D3H
CPU: i5-11400
GPU: NVIDIA 2060
RAM: 32GB
Proxmox boot: 256GB NVMe
For Proxmox VMs: 512GB NVMe
Disks: 6 x WD40EFAX, connected directly to the motherboard

VM Configuration:
The QEMU setup is very straightforward:

root@pve:~# cat /etc/pve/qemu-server/113.conf
agent: 1
boot: order=scsi0
cores: 4
cpu: x86-64-v2-AES
memory: 16384
meta: creation-qemu=8.0.2,ctime=1696522098
name: truenas
net0: virtio=BC:24:11:2F:52:39,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
#For True nas boot
scsi0: VMs:vm-113-disk-1,size=32G
#Disk for pools
scsi1: /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N0_WD-WX62D30RDL6E,backup=0,serial=WD-WX62D30RDL6E
scsi2: /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N0_WD-WX62D302FTVU,backup=0,serial=WD-WX62D302FTVU
scsi3: /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N0_WD-WX52D108SP45,backup=0,serial=WD-WX52D108SP45
scsi4: /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N0_WD-WX52D103RPE6,backup=0,serial=WD-WX52D103RPE6
scsi5: /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N1_WD-WX12D415NN4K,backup=0,serial=WD-WX12D415NN4K
scsi6: /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N0_WD-WX52D105RK2R,backup=0,serial=WD-WX52D105RK2R
scsihw: virtio-scsi-single
smbios1: uuid=3b4c0821-16c0-47e8-a897-4fba42f3e7ee
sockets: 2
startup: order=2,up=30
vmgenid: 214b4ed9-2ae1-4db1-ad0a-d9973d57aaf0

Storage Setup:
I have two pools due to different disk read/write speeds:
Archivum (RAIDZ2): using disks 1, 2, 4, and 6
Volatile (Stripe): using disk 3 or 5 (I can’t remember exactly)
One of the disks serves as a hot spare.
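
In plain zpool terms (TrueNAS builds this from the UI, the device names below are placeholders, and I'm assuming the spare is attached to Archivum), the layout described would look roughly like this:

zpool create Archivum raidz2 sdb sdc sdd sde   # 4-disk RAIDZ2
zpool add Archivum spare sdf                   # hot spare
zpool create Volatile sdg                      # single-disk stripe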

Additional Info:
I’ve run SMART tests on all disks, and they all passed without any issues.
I’ve noticed that after every reboot, TrueNAS has trouble recognizing the disks and I don’t know how to manually reassign them.

Any ideas on what could be causing this issue?
I’m not sure if it’s related to the VM configuration in Proxmox or something with the way the disks are passed through to TrueNAS. Any advice or guidance would be much appreciated!

I’m kind of a noob, so I’m not really sure what kind of logs I need to provide.
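
For reference, here is a minimal sketch of the commands that usually gather this kind of info; device names like /dev/sda are placeholders, the lsblk/ls lines run on the Proxmox host, and the zpool/smartctl lines run in the TrueNAS shell:

# On the Proxmox host: confirm the disks and their by-id links are visible
lsblk -o NAME,SIZE,SERIAL,MODEL
ls -l /dev/disk/by-id/ | grep WD40EFAX

# In the TrueNAS shell: check what ZFS can still see
zpool status
zpool import        # with no arguments this only lists importable pools, it changes nothing

# SMART long self-test on one disk at a time, then read back the results
smartctl -t long /dev/sda
smartctl -a /dev/sda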

Potentially related to


Also potentially related to other users losing their pools in virtualised TrueNAS, including when properly passing through a SAS HBA. Do I understand that you’re passing individual disks to TrueNAS rather than your SATA controller? :scream: Don’t do that!

Very curious how you get your pool back after reboot, as this might help others:

I’m inclined to suggest that the “solution” is to run TrueNAS bare metal and not virtualised on Proxmox.


Thanks for the link. Fortunately my important stuff was backed up. I decided to write off the replaceable data and start again (bare metal this time). I did pass through controllers, not disks, but I’ll consider this a cheap lesson and intend to reduce my risk going forward.


I just discovered the problem: none of my disks has a correct partition table after the reboot… Why, you ask? I don’t know; I only know that all the data is lost. :sweat_smile:

root@pve:~# parted /dev/sda print
Error: /dev/sda: unrecognised disk label
Model: ATA WDC WD40EFAX-68J (scsi)
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: unknown
Disk Flags:
root@pve:~#

I’m going to cry in a corner, see you in a couple of days. By the way, what did you mean when you said:

Do I understand that you’re passing individual disks to TrueNAS rather than your SATA controller? :scream: Don’t do that!

Is there a way to pass through the SATA controller directly and then set up the RAID in TrueNAS?

Thanks for your time

With virtualised TrueNAS you should pass through the drive controller. Typically, this is done by passing through a SAS HBA, but if the hypervisor only uses NVMe drives, you can pass through the chipset SATA controller (depending on proper IOMMU grouping).
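
To make that concrete, here is a rough sketch of controller passthrough on Proxmox. The PCI address 0000:00:17.0 is only an example (a common slot for Intel chipset SATA controllers, but check your own lspci output), and IOMMU/VT-d must be enabled in the BIOS and on the kernel command line first:

# On the Proxmox host: find the SATA controller's PCI address and its IOMMU group
lspci -nn | grep -i sata
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#*/iommu_groups/}; g=${g%%/*}
    printf 'IOMMU group %s: %s\n' "$g" "$(lspci -nns "${d##*/}")"
done

# Example only: hand the whole controller (assumed here at 0000:00:17.0) to VM 113
qm set 113 -hostpci0 0000:00:17.0

# The individual scsi1..scsi6 disk lines would then be removed from 113.conf,
# since TrueNAS will see the drives directly through the controller.

The catch is that every drive on that controller disappears from the host, which should be fine here since Proxmox boots from its own NVMe.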


Yeah, this is what I’m suspecting here. I’ve noticed a significant uptick of this since a recent update (8.1, maybe?) where Proxmox seems to have gotten more aggressive at forcibly claiming pools that aren’t IOMMU isolated.
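
A quick way to test that hypothesis on the Proxmox host itself (nothing here modifies the pool):

# If the host has grabbed the pool, it will show up here
zpool status

# Pools the host can see but has not imported
zpool import

# ZFS import services that run at host boot
systemctl list-units 'zfs-import*' --all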

@raulzgz I would try taking both of your NVMe devices out for the moment, booting TrueNAS bare-metal from a separate device (even a temporary USB) and see if you can import it.
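
If you try that, the check itself is quick; a sketch using the pool name from earlier in the thread (-f is only needed if ZFS complains the pool was last used by another system):

# From the bare-metal TrueNAS shell (or any live Linux with ZFS tools)
zpool import               # scans attached disks and lists importable pools
zpool import -f Archivum   # import by name; -f overrides the "last used by another system" check
zpool status Archivum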

Further muddying the waters is the fact that you’re using 6x WD Red SMR drives (WD40EFAX), which are known to throw Sector ID Not Found (IDNF) errors when writing.

The data is probably NOT lost.

A GPT disk has primary and backup Partition Tables.

If the primary partition table is corrupt, it looks like the data is lost; however, you can use the gdisk utility to recover the primary table from the backup and bring the ZFS disk back online. Do this for all the disks and TrueNAS should be able to import the pool again. :smiley:
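
A rough sketch of that gdisk recovery path, assuming the backup table at the end of each disk is still intact; /dev/sda is a placeholder, repeat for each disk, and imaging the disk first is cheap insurance:

gdisk /dev/sda
# at the gdisk prompt:
#   r   enter the recovery/transformation menu
#   c   load the backup partition table from disk (rebuilds the main table)
#   p   print the recovered table and check it looks like a ZFS member disk
#   w   write the repaired table to disk and exit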

In addition to transient errors, the primary issues with SMR drives are that they cannot sustain heavy write workloads, and that in a redundant pool they will take forever and a day to resilver if the pool ever gets degraded. WD is even on official record as saying not to use WD Red SMR drives for ZFS pools, and to use only WD Red Plus or Pro for ZFS.