I have a pool of 60 drives: 5 raidz2 vdevs of 11 disks each, with all other drives as spares. As of this morning, one of the vdevs reported thousands of errors (across all drives). Now the pool is offline and resilvering, and about 30 drives are reported as faulted. I am not super worried as I have a backup, but I have to be able to trust this thing. I am willing to wait for the resilver, but is there anything I should be looking at?
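For readers weighing how serious "30 drives faulted" is on this layout, here is a minimal Python sketch of the arithmetic. It uses only the numbers from the post; the function name is made up for illustration, and it assumes the best case where every spare is among the faulted drives.

```python
# Layout from the post: 60 drives, 5 raidz2 vdevs of 11 disks, the rest spares.
TOTAL_DRIVES = 60
VDEVS = 5
DISKS_PER_VDEV = 11
PARITY = 2  # raidz2 tolerates 2 failed disks per vdev

spares = TOTAL_DRIVES - VDEVS * DISKS_PER_VDEV


def min_worst_vdev_failures(faulted: int) -> int:
    """Smallest possible failure count in the hardest-hit vdev (pigeonhole bound)."""
    # Best case: every spare is among the faulted drives.
    in_vdevs = max(0, faulted - spares)
    # Ceiling division: spreading failures evenly still leaves this many in one vdev.
    return -(-in_vdevs // VDEVS)


worst = min_worst_vdev_failures(30)
print(f"spares: {spares}")
print(f"at least {worst} faulted disks in some vdev; raidz2 tolerates {PARITY}")
print("pool can stay online:", worst <= PARITY)
```

Even in the most favourable spread, at least one vdev loses more disks than raidz2 can absorb, which fits the pool going offline rather than merely degraded.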
Give us your full hardware details, OS Version, etc. How are drives attached? HBA, backplane, cables. Is there any pattern to the drives and groupings?
The server is a Dell R820 with 4x E5-4657L v2 and 768 GB of RAM. The drives are attached through 2 HBAs (LSI SAS 9305-16E) to an HGST Data60 v2. The enclosure is zoned into two 30-drive groups. The original plan was to use dual-port SAS drives, but I ended up getting a good deal on SATA, so the dual redundant cards are worthless as SATA doesn't do dual zoning. I have to assume a cable or controller broke, as exactly 30 drives are faulted or removed, but the whole TrueNAS box is now locked up so I can't see anything. I am going out to the site to see if the console is still functioning and to do a physical check.
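The "exactly 30 drives" observation is the kind of pattern asked about above. A rough Python sketch of that check follows. The slot numbering and the 0-29 / 30-59 zone split are assumptions for illustration only; the real mapping depends on how the enclosure is zoned.

```python
from collections import Counter


def zone_of(slot: int, zone_size: int = 30) -> str:
    """Map a slot number to a zone label; the 0-29 / 30-59 split is an assumption."""
    return "A" if slot < zone_size else "B"


def faults_by_zone(faulted_slots):
    """Count faulted slots per zone."""
    return Counter(zone_of(s) for s in faulted_slots)


# Hypothetical example: every faulted slot sits in the second zone.
faulted = list(range(30, 60))
counts = faults_by_zone(faulted)
print(dict(counts))

# If all faults fall in one zone, suspect the shared path (cable, HBA, zoning)
# before suspecting 30 individual drives.
single_path_suspect = len(counts) == 1
print("all faults behind one zone:", single_path_suspect)
```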
Adding info:
I pulled one of the HBAs and attached both cables to a single HBA. All drives showed in the BIOS, but half didn't show in TrueNAS. I rezoned the enclosure to present all drives to a single host and now all the drives show. After a couple of restarts the pool now shows as healthy.
Now I have a new issue. I have a spare drive that shows as unavailable and one disk showing as unassigned (SDL); I'm assuming it's the same disk as the spare (SDP). If I try to import the unassigned disk, it fails saying it's already in use.
The pool is resilvering and will be done by tomorrow morning. I am willing to wait and do another restart, but I need to know if I should be doing anything else. I don't want to lose spares to a software issue. I do have more spares not connected that I could swap in, but I don't want to run into an issue 3 years from now when I swap a used one back in.
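One way to keep an eye on the spares while the resilver runs is to read the `spares` section of `zpool status`. The Python sketch below parses that section from text. The sample output is illustrative and not taken from this system; the pool name `tank` and the device names are placeholders, and on a real TrueNAS box the names may look different.

```python
# Illustrative `zpool status` output; not captured from the poster's system.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  raidz2-0  ONLINE       0     0     0
\t    sda     ONLINE       0     0     0
\t    sdb     ONLINE       0     0     0
\tspares
\t  sdp       UNAVAIL
\t  sdq       AVAIL

errors: No known data errors
"""


def spare_states(status_text: str) -> dict:
    """Return {device: state} for the lines under the 'spares' heading."""
    states = {}
    in_spares = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields == ["spares"]:
            in_spares = True
            continue
        if in_spares:
            if not fields:  # a blank line ends the config section
                break
            states[fields[0]] = fields[1] if len(fields) > 1 else "UNKNOWN"
    return states


for dev, state in spare_states(SAMPLE).items():
    print(f"{dev}: {state}")
```

To use it on a live system, feed it the text from `zpool status <pool>` instead of `SAMPLE`. Any spare that is not `AVAIL` (or `INUSE` during a replacement) is worth a closer look once the resilver finishes.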
While I have a resolution, I don't really know what the problem was. I pulled both HBAs and moved just one to a new slot (I needed the room for an A380 anyway). I reprogrammed the zoning on my 4U60 to be a single zone (all 60 drives to a single host), restarted about 5 times, and everything just showed up.
It was a little weird, as it said I had one drive missing (a spare) and one drive not assigned (I can't find where it was). I tried to assign it and got errors about it being in the pool. I got frustrated, walked away for the night, and came back to all green checkmarks and all drives in the right spot.
Guessing here. Do you have any power problems upon booting? Do you have the drives set to start in staggered sets so you don’t have one big power draw?
Have you run SMART long tests on all your drives? Have you used flash3sas (edit: sas3flash) to check the status of the controller cards and confirmed the firmware is current?
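For future readers, the checks above map to a handful of commands. The Python sketch below only builds and prints them (a dry run) so nothing is executed by accident. The device names are hypothetical; `smartctl -t long`, `smartctl -l selftest`, and `sas3flash -listall` are the standard invocations, but verify them against your installed versions.

```python
import shlex


def smart_long_test_cmd(device: str) -> list:
    """Command that starts a long (extended) SMART self-test on one drive."""
    return ["smartctl", "-t", "long", device]


def smart_results_cmd(device: str) -> list:
    """Command that prints the self-test log once the test has finished."""
    return ["smartctl", "-l", "selftest", device]


def hba_list_cmd() -> list:
    """Command that lists the SAS3 controllers along with their firmware versions."""
    return ["sas3flash", "-listall"]


devices = ["/dev/sda", "/dev/sdb"]  # hypothetical device names
commands = [smart_long_test_cmd(d) for d in devices] + [hba_list_cmd()]
for cmd in commands:
    print(shlex.join(cmd))  # dry run: print instead of executing
```

Long tests on large drives can take many hours, so the results command is something to run later rather than right after starting the test.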
Thanks for the reply.
No power issues. The HGST 4U60 is pretty smart. It does staggered spin-up and at max load draws only 480 watts. It's on its own double-conversion online 3000xli UPS, so it has super silky smooth power. I have also run 8 of these on one host in a rack with InfiniBox's proprietary software on a 20 amp 240 V circuit. This one has a dedicated 20 amp 240 V circuit just for the enclosure and host. The host has its own 3000xli that it shares with two 40Gb switches. The rest of the servers in the rack share 2 more 3000xli units and 2 more 20 amp 240 V circuits. So that is a total of 60 amps to the rack and a max draw of around 15 amps if I push all the servers with artificial tests.
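As a quick sanity check on those figures, here is the arithmetic in Python. It uses only the numbers quoted in the post and assumes watts equal volts times amps (power factor of 1), which is a simplification.

```python
VOLTS = 240
ENCLOSURE_MAX_WATTS = 480   # max load quoted for the enclosure
CIRCUITS = 3                # three 20 A circuits feed the rack
AMPS_PER_CIRCUIT = 20
RACK_MAX_DRAW_AMPS = 15     # peak draw quoted under artificial load

# Simplification: assumes power factor 1, so W = V * A.
enclosure_amps = ENCLOSURE_MAX_WATTS / VOLTS
rack_capacity_amps = CIRCUITS * AMPS_PER_CIRCUIT
utilisation = RACK_MAX_DRAW_AMPS / rack_capacity_amps

print(f"enclosure draw: {enclosure_amps:.1f} A")
print(f"rack capacity: {rack_capacity_amps} A, utilisation: {utilisation:.0%}")
```

On these numbers the enclosure pulls about 2 A of its dedicated 20 A circuit, so a power shortfall at spin-up looks unlikely.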
I have run long SMART tests on every drive and they came back clean.
I have not run flash3sas, but I do know I have the newest firmware, as Avago stopped developing for this card a few years ago.
What firmware is your HBA currently running?
It should be on the version from this thread: LSI 9300-xx Firmware Update | TrueNAS Community
Not trying to be pedantic here but it’s sas3flash. Just don’t want future readers getting confused.
That was my fault on the name. Past, present and future readers will be confused.
Interesting. I didn't know people were making custom firmware. Does this actually apply to SCALE? I have not noticed the controller resetting.
I have the latest from Avago (16.00.11.00).
As far as I am aware yes.