I’m having some difficulties with my server and have therefore rebooted it a few times. The issue is that I added 2 new drives and one is not being recognized. I’m still troubleshooting that, but so far it seems 2 drive bays in the server (a Dell R540) are not working.
My issue is that, seemingly every time I reboot, TrueNAS picks a drive and says it needs to be resilvered. The 2 new drives are still in the system, but I never added either of them to my pool.
Is this expected behavior, or does it perhaps add some useful information to my troubleshooting?
No, it is definitely not normal for a vdev in a pool to resilver after a reboot.
What is your pool layout? (Output of zpool status please.)
How are the existing drives wired to the server?
Make, model & firmware revision of the SAS controller please.
What are the 2 new drives?
SAS, SATA or NVMe?
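If it helps, most of that can be gathered from a TrueNAS SCALE (Linux) shell with something like the commands below; this is only a rough sketch, the equivalents on CORE/FreeBSD differ (camcontrol, sesutil, etc.), and /dev/sda is just an example device name:

    zpool status -v                        # pool layout, errors, and any resilver in progress
    lspci -nn | grep -iE 'sas|raid'        # identify the SAS/RAID controller
    dmesg | grep -i mpt3sas | head         # LSI/Broadcom HBAs usually log their firmware version here
    lsblk -o NAME,MODEL,SERIAL,TRAN,SIZE   # list drives with transport type (sata/sas/nvme)
    smartctl -i /dev/sda                   # per-drive model, serial and firmware (repeat per drive)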
Hardware RAID controllers are known to cause problems with ZFS because they re-order writes; ZFS issues its writes in a specific order to preserve data integrity.
You have a spare drive in use. I’m not sure why ZFS would want to resilver repeatedly.
My recommendation is to either replace the failing disk, in which case your hot spare will be returned to available status.
Or promote the hot spare to a permanent pool disk by detaching the failing disk. Afterwards you will not have a hot spare, but you can pull the failing disk to test whether it should be considered dead, and you can always add another hot spare later.
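For reference, here is a rough sketch of both options at the command line, assuming a hypothetical pool named tank and placeholder device names (check zpool status for the real ones):

    # Option 1: replace the failing disk; once the resilver completes,
    # the hot spare automatically returns to AVAIL
    zpool replace tank <failing-disk> <replacement-disk>

    # Option 2: promote the in-use hot spare to a permanent pool member
    # by detaching the failing disk it is covering for
    zpool detach tank <failing-disk>

    # Optionally, add a new hot spare later
    zpool add tank spare <new-disk>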
Thanks for the advice. The strange thing is that the “failing” drive is one that I newly added to the server. I never added it to a pool. The other strange thing is that if I move that drive to a different bay in the chassis, it works. I verified this by moving around the 2 “new to me” drives.
At one point when checking drives I did accidentally remove one of the active drives, which, understandably, started a resilver. I waited for that to completely finish before doing anything else. I then put another “new to me” drive into the problem bay and that kicked off a resilver. Unfortunately, I didn’t have a good handle on which drives were where, and I’m not 100% certain which drive caused the issue.
I now have my drives labeled by serial number, and I’m keeping a visual map of where the problem drive is each time I reboot (only once so far) to see what happens.
I have a SystemRescue live USB so I can do some testing tonight and take TrueNAS out of the picture.
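The rough plan from SystemRescue is to map serial numbers to device names and then exercise each suspect drive, along these lines (the device names below are placeholders):

    lsblk -o NAME,MODEL,SERIAL,SIZE   # map each device name to its serial number
    ls -l /dev/disk/by-path/          # show which controller/slot path each device sits behind
    smartctl -i /dev/sdX              # confirm the drive in the suspect bay responds at all
    smartctl -t long /dev/sdX         # start a long SMART self-test on a suspect drive
    smartctl -a /dev/sdX              # check the self-test result and error counters afterwards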
It may be possible to clear the failed-drive flag using the HBA’s boot-time menu; it could be worth 30 minutes or so of poking around to find out. The $1,000 tool from Dell may just be an OS-level “live” version of the same capability.
I’ve had to use that kind of boot-time menu for other reasons, for example on a Cisco server’s SAS controller (also LSI-based) to set boot drives.
It turns out that, in my panic when the 2nd drive didn’t work, I didn’t keep good track of my testing. I also didn’t believe I had gotten a second bad drive.
Well, after about 4 hours of very methodical testing and keeping good track of serial numbers, it does seem that I did get 2 bad drives.
I am still investigating whether a firmware issue is causing different behavior with the drive’s power-disable pin and keeping it from spinning up.
Ah, yes, that is an annoying one. Lots of people deal with it by either:
Using special power cables that don’t pass 3.3 V through, like a Molex-to-SATA power adapter, which can’t supply 3.3 V (the 4-pin Molex connector has two ground lines, one 12 V line and one 5 V line).
Taping over the 3.3 V pins on the drive’s power connector.
Plus, you can check the drive make and model’s user manual to see whether it supports the power-disable (PWDIS) pin function.
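And after applying either workaround, something like this from a Linux shell (where /dev/sdX is a placeholder) can confirm whether the drive actually powered up and was enumerated:

    dmesg | grep -iE 'sd[a-z]|scsi'   # look for the drive being detected by the kernel
    lsblk -o NAME,MODEL,SERIAL        # check that the expected serial number shows up
    smartctl -i /dev/sdX              # verify the drive responds to commands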