That’s a separate conversation for another day.
I would move these drives into your other unit and see if it plays nice. If the other system is running CORE then you will most likely need to change it to SCALE as your zpool probably won’t import. You could do this by either upgrading CORE to SCALE or installing SCALE on a new boot device.
Once it’s in the other system, if it springs back to life, don’t try to replace any drives just yet; you will need to let it figure itself out first. Try hard to get us that zpool status output if/when it’s up and running.
This is one option. I’m a little hesitant to do this knowing there are a few drives that appear to have faulted in some way and won’t let me read much using smartctl. I do have another, older PowerEdge 420 system connected to a few PowerVault enclosures that most of the data still resides on. I’m debating whether I should just get these units back to square one and move things over again.
I did put in for UPS units to be on order.
And I know the usual response would be “Shouldn’t you have had them already?”
The answer is yes. We did. We had a standing unit 2 racks wide that was designed to be the UPS for multiple racks to keep things powered. But that thing is old at this point and no longer functional. It’s acting more like a big power strip now. We didn’t have any room for a single rackmounted UPS until recently, when some equipment was decommissioned due to consolidation.
That and the usual “We don’t have money for IT needs right now”. We all know how it goes. 
Sure, that’s obviously your shout. Those faulted drives may well be the result of a faulty HBA, so bear that in mind. You ‘might’ find that with different hardware everything jumps back to life.
True. It might be. The 2nd unit was online and running without issues. No problems. The question is how do I identify which card it is? I have a spare card I can drop in if that is the case. I suspected HBA card issues in the past with temps and had a command to check them but I can’t recall what it is now.
I think by moving the drives to the other system you may be able to confirm or deny that this system does indeed have a HW problem. You may also be able to recover the pool, thus giving you more options. Only after that would I go down the road of trying to identify the hardware issue, but that’s just me.
Ran into this last year for a government operation. They never factored in replacing the batteries every 5 years. It cost them a lot more to replace everything, as production stopped for several days until they could repair it enough to limp along, then several more days to replace the unit entirely.
Your company needs to factor in routine maintenance, which includes replacement of limited-life parts.
If the server is that important, it might be a good opportunity to buy a service package from iXsystems. It will depend on how valuable the data is and if you feel you can fix this problem.
Let me ask a few questions:
- You have a full backup on a second server? Yes
- Do you have a backup of your TrueNAS configuration file? If not, you should maintain one in a corporate environment; even home users should save a copy periodically.
- Can you say that there is some common physical relationship between the drives that appear to be failing? Do they share the same data or power connection, same backplane, anything? Look for anything that could be a pattern.
- Have you recently had other similar failures in this NAS?
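If it helps with spotting a pattern, here is a rough Python sketch (my own illustration, not a TrueNAS tool) that groups disks by the PCI address of the controller they hang off, using the link names under /dev/disk/by-path on a Linux-based system such as SCALE. The function names and the sample device names are made up, and the exact by-path naming depends on the HBA and driver, so treat the regex as an assumption to adjust.

```python
import os
import re
from collections import defaultdict

# Assumed naming: by-path links start with "pci-<domain>:<bus>:<slot>.<func>-"
PCI_PREFIX = re.compile(r"^pci-([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])-")


def read_by_path_links(directory="/dev/disk/by-path"):
    """Return {link_name: kernel_device_name}, skipping partition links."""
    links = {}
    for name in os.listdir(directory):
        if re.search(r"-part\d+$", name):
            continue
        target = os.path.realpath(os.path.join(directory, name))
        links[name] = os.path.basename(target)
    return links


def group_by_controller(links, suspect_devices):
    """Group disks by controller PCI address and flag the suspect ones."""
    groups = defaultdict(lambda: {"disks": set(), "suspect": set()})
    for name, dev in links.items():
        match = PCI_PREFIX.match(name)
        if not match:
            continue  # not a PCI-attached path in the assumed format
        entry = groups[match.group(1)]
        entry["disks"].add(dev)
        if dev in suspect_devices:
            entry["suspect"].add(dev)
    return dict(groups)


# Hypothetical sample data; on a real system use read_by_path_links() instead.
sample = {
    "pci-0000:03:00.0-sas-phy0-lun-0": "sda",
    "pci-0000:03:00.0-sas-phy1-lun-0": "sdb",
    "pci-0000:04:00.0-sas-phy0-lun-0": "sdc",
}
report = group_by_controller(sample, {"sda", "sdb"})
for addr, info in sorted(report.items()):
    print(addr, len(info["disks"]), "disks,", len(info["suspect"]), "suspect")
```

If all of the suspect drives land under one PCI address, that card (or its cabling and backplane) is the first thing I would look at. If they are spread evenly across cards, the HBA theory gets weaker.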
Not being able to see some of the outputs is really hurting the help you need. Yes, I’m stressing this topic. While I do not have a vested interest in your system operating or in losing or saving data, I still think that if we can help someone, we should try.
I do understand the security thing all too well, you are bound by the rules. Hopefully you can convince the management and/or security officer to examine the data and then let you release it. And if this is company policy, then they need to write in exceptions where it makes sense to maintain the servers. Even in the government we had policies on the processes to release data that could be authorized to be released.
Just a slight update here. The UPS units ordered have arrived, so that is an obvious improvement. I dropped the drives into the other unit and the Resilver DID finish. However, it’s going through a scrub now and reporting back many permanent errors. So far they are all on a single dataset. I don’t expect that to continue. A long way to go through this scrub.
Just to reply to this, even if a bit late:
- You have a full backup on a second server? Yes and No. This was a newish transition. The data on here is mostly from another location that is still functional. I can pull the data back in again with some loss. It isn’t a backup server but is intended to be a new production one.
- Do you have a backup of your TrueNAS configuration file? No. This was a recent setup as mentioned in #1.
- Can you say that there is some common physical relationship between the drives that appear to be failing? I am thinking the problem is what someone else mentioned with a failing HBA. I have a spare HBA card but I need to determine which card it is of the 4.
- Have you recently had other similar failures in this NAS? No recent failure in this NAS. There were prior issues in the original design of this NAS that have been addressed by the manufacturer. Specifically it had to do with power delivery to the backplanes being insufficient due to using a single power rail with 4 connectors. This has been corrected by using a rail that is now split to 2x2.
Ok so Resilver finished. Scrub finished. 2877 files with errors. Most of it I can tolerate/fix. A lot of the listed files are the snapshot locations. The pool is still in a degraded state, I assume because of the spares that are in use. How do I add these other drives back into the pool as spares? I did this in Core but the UI is different here, and it’s not immediately apparent how that works.
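As an aside on the error list: with this many entries it helps to separate the snapshot entries from the live files. Below is a minimal Python sketch, assuming the output of `zpool status -v` for a single pool has been saved as text and that the error list is the last thing in it. The function names and the sample paths are made up for illustration. The classification rules (a leading `/` means a live file, an `@` in the dataset part means a snapshot, anything else is an object ZFS could not resolve to a path) are my reading of the usual output format, not an official parser.

```python
from collections import Counter

MARKER = "errors: Permanent errors have been detected"


def error_entries(status_text):
    """Return the entries listed after the permanent-errors marker line."""
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(MARKER):
            return [l.strip() for l in lines[i + 1:] if l.strip()]
    return []


def classify(entry):
    """Label one entry as 'snapshot', 'live', or 'unresolved'."""
    if "/.zfs/snapshot/" in entry:
        return "snapshot"  # snapshot reached through a mounted .zfs directory
    if entry.startswith("/"):
        return "live"  # ordinary path in the mounted filesystem
    if "@" in entry.split(":", 1)[0]:
        return "snapshot"  # pool/dataset@snapname:...
    return "unresolved"  # e.g. <0x21>:<0x5> or pool/dataset:<0x1f4>


# Hypothetical sample text, not taken from this pool.
sample_status = """\
errors: Permanent errors have been detected in the following files:

        /mnt/tank/media/file1.bin
        tank/media@auto-2024-01-01:/file1.bin
        tank/media:<0x1f4>
        <0x21>:<0x5>
"""

counts = Counter(classify(e) for e in error_entries(sample_status))
for kind, n in sorted(counts.items()):
    print(kind, n)
```

The snapshot entries generally go away once the affected snapshots are destroyed and a later scrub completes, so the "live" count is the one that shows how much actually needs restoring.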