I have a degraded drive with some SMART errors and wanted some advice before I start just tossing out commands.
E5-2690 v4
X99-A mobo
2x 32GB 2400 ECC
1x 500GB SSD (boot drive for Proxmox and LVM storage for some VMs)
3x 14TB HDDs
1000W PSU
root@truenas[/]# zpool status -v nasPool
pool: nasPool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.
Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
scan: scrub repaired 0B in 01:38:48 with 0 errors on Mon Sep 2 12:00:20 2024
config:
        NAME                                      STATE     READ WRITE CKSUM
        nasPool                                   DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            12885282463843026577                  DEGRADED     0     0     0  was /dev/disk/by-partuuid/b5e0ac35-363f-4778-a6aa-b7ea3170056f
            cb83140b-70cd-4d00-aa5b-7d7b12d3480a  ONLINE       0     0     0
            424bae2a-a6f3-47c6-8177-3392631f04f4  ONLINE       0     0     0

errors: No known data errors
^ might have some spelling errors, it was grabbed with a text extractor
/dev/sda - working drive
looks similar to/the same as sdc, just a newer drive; limited to 2 links
/dev/sdb - degraded
/dev/sdc - working drive
My basic understanding is that the drive itself is fine and I'm just going to have to rebuild my pool; I'm not 100% sure how to do that properly.
I'm also just curious about any info on the SMART errors or what might have caused them.
Edit:
TrueNAS VM lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 32G 0 disk
├─sda1 8:1 0 1M 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 31.5G 0 part
sdb 8:16 0 465.8G 0 disk
├─sdb1 8:17 0 1007K 0 part
├─sdb2 8:18 0 1G 0 part
└─sdb3 8:19 0 464.8G 0 part
sdc 8:32 0 12.7T 0 disk
├─sdc1 8:33 0 2G 0 part
│ └─md127 9:127 0 2G 0 raid1
│ └─md127 253:0 0 2G 0 crypt
└─sdc2 8:34 0 12.7T 0 part
sdd 8:48 0 12.7T 0 disk
├─sdd1 8:49 0 2G 0 part
│ └─md127 9:127 0 2G 0 raid1
│ └─md127 253:0 0 2G 0 crypt
└─sdd2 8:50 0 12.7T 0 part
Proxmox shell lsblk (relevant drives):
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 12.7T 0 disk
├─sda1 8:1 0 2G 0 part
└─sda2 8:2 0 12.7T 0 part
sdb 8:16 0 12.7T 0 disk
├─sdb1 8:17 0 2G 0 part
└─sdb2 8:18 0 12.7T 0 part
sdc 8:32 0 12.7T 0 disk
├─sdc1 8:33 0 2G 0 part
└─sdc2 8:34 0 12.7T 0 part
In my /etc/pve/qemu-server/x.conf for TrueNAS I have the drives passed through as /dev/sdX, and if I'm understanding what I'm looking at correctly, the device names have changed on the host, so my paths are no longer correct.
Whether or not this is my issue, how do I go about passing through by-id? I think I get the idea, but I don't want to just be guessing while there's data present.
Sounds like a likely culprit.
The general recommendation is to not pass drives through like that at all.
Rather, connect the drives to an HBA and pass the whole controller through. Also, blacklist the controller in Proxmox to make sure Proxmox doesn’t try to import the ZFS pool. If Proxmox tries to import a ZFS pool at the same time as TrueNAS or vice versa, bad things happen.
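For reference, "blacklisting" here just means keeping the Proxmox host's own driver off the HBA so only the VM ever touches it. A rough sketch, assuming an LSI-style HBA; the vendor:device ID and driver name below are placeholders you would take from your own lspci output:

# on the Proxmox host: find the HBA and its [vendor:device] ID
lspci -nn | grep -i -e sas -e raid
# bind it to vfio-pci instead of its normal driver (ID is a placeholder)
echo "options vfio-pci ids=1000:0072" > /etc/modprobe.d/vfio.conf
# stop the host driver from claiming it at boot (driver name depends on the HBA)
echo "blacklist mpt3sas" > /etc/modprobe.d/blacklist-hba.conf
update-initramfs -u -k all

After a reboot the controller (and everything on it) is invisible to Proxmox, so the host can never try to import the pool.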
In the two SMART reports you posted, the only SMART errors present (on drive 9JGHTE2T) are many years old and unlikely to be related to any current issues. Since they are CRC errors, I'm guessing you had a bad cable or connection for a short while.
A few closing comments:
No long SMART tests have been recorded on either drive. Long tests have a chance of alerting you to errors sneaking up on you; automating TrueNAS to run at least one per month is what I would personally consider a minimum.
Not posting your full hardware information makes it harder for people to assist.
In Proxmox we have 3x 12.7TB drives.
In TrueNAS we have 2x 12.7TB drives and 2 smaller drives.
I suspect that you have labelled these the wrong way around, but they still don’t make sense.
The pastebin SMART output for /dev/sdb shows power-on hours at 39,671 (c. 4.5 years), but the errors shown were logged at only 1,154 hours (i.e. more than 3 years ago), so this doesn't seem to be the cause of the drive dropping out. Similarly, all the error attributes are still at zero, so these are not the cause either.
There are no long SMART tests in the log. For the future, you should run a long test on each drive periodically.
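If you want to kick one off manually from a shell rather than via the TrueNAS periodic tasks, something like this works (the device name is just whatever the disk currently shows up as):

smartctl -t long /dev/sdb       # start the extended self-test; it runs on the drive in the background
smartctl -l selftest /dev/sdb   # check the self-test log once it has had time to finish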
So, not a lot to go on to tell us:
Why the drive dropped out;
Whether it was a glitch and the drive is still OK, or the drive is no longer reliable
I am not sure what will happen if you reconfigure to pass through the HBA rather than drives. Will TrueNAS ZFS still find and mount the ZFS pool or not?
If you are going to have to move the data off and rebuild (because of passing through the HBA), I would suggest that you might be better to think about running TrueNAS SCALE native and running any additional VMs with TrueNAS virtualisation (and not running Proxmox at all).
With the benefit of hindsight, is this really the case? Proxmox has added complexity, and a lack of expertise seems to have resulted in an incorrect configuration of how the drives are passed through to TrueNAS, so (with hindsight) was this really preferable?
They're labelled correctly. I think my issue is the drives on the host changing device names, so my passthrough failed; I think that's my only problem.
That makes sense. I think a few unrelated things all happened at the same time, leaving me extra confused while troubleshooting.
Long tests are planned and will be scheduled
I think while doing some other work the drive got slightly unplugged, and since reseating it has been assigned a different device name on the host.
My current plan, unless otherwise suggested, is to change my passthrough to use by-id/by-uuid.
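Roughly what I have in mind on the Proxmox host, with the VM ID, SCSI slot and disk ID as placeholders for the real values:

ls -l /dev/disk/by-id/ | grep -v part    # list the stable IDs for the whole disks
qm set <vmid> -scsi1 /dev/disk/by-id/ata-<MODEL>_<SERIAL>

qm set just rewrites that line in /etc/pve/qemu-server/x.conf, so the VM should get the same physical disk no matter how the host enumerates them.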
The TrueNAS ZFS pool uses UUIDs, so if the same drives change position that isn’t a problem, but if you pass through drives from Proxmox to TrueNAS by device name, and then a device name changes, an existing drive might not get passed through whilst a different drive does get passed through. In this event, TrueNAS would think that a drive has disappeared - and this appears to fit the symptoms you have reported.
In which case, pass the drive through again, and you should be able to issue a ZFS REPLACE through the UI and resilver the existing drive back into the RAIDZ1 pool.
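For what it's worth, the UI is the safer route in TrueNAS, but the underlying operation is roughly the following; the numeric GUID is the one from your zpool status output, and the new device path is a placeholder:

zpool replace nasPool 12885282463843026577 /dev/disk/by-partuuid/<new-partuuid>
zpool status -v nasPool    # watch the resilver progress

If the same disk comes back with its original partition label, ZFS may simply bring it online and resilver on its own without needing a replace at all.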
For my purposes, yes. In a best-use-case/best-configuration sense, no.
I wanted the added complexity. I think this has turned out to be a rather simple problem, but with data being involved (although, to clarify, no sensitive or important data) I was looking for a second opinion before exacerbating the problem.
If the 500GB SSD is not connected via SATA, then you can probably pass the chipset SATA controller into the TrueNAS VM. This is the only reliable way to virtualize TrueNAS.
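A rough sketch of that, assuming the chipset SATA controller sits in its own IOMMU group; the PCI address below is a placeholder from lspci:

lspci -nn | grep -i sata                # identify the chipset SATA controller's address
qm set <vmid> -hostpci0 0000:00:1f.2    # pass the whole controller into the TrueNAS VM

Everything hanging off that controller (including the boot SSD, if it were SATA) disappears from the host, which is why the SSD needs to live elsewhere.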
It seems that if you get your drive passed through again, it should recover the pool.
Proxmox can corrupt a ZFS pool when you pass individual data disks through like this.
Passthrough via by-id for the time being until I transition to bare metal.
TN automatically started to resilver; hopefully it all comes out the other end fine.
Appreciate the help