TrueNAS Scale - Drive/Pool issues after power problems

Good morning!

I am having issues with our TrueNAS Scale unit after some power disruptions over the weekend. This is TrueNAS Scale 25.04.0. Momentary blips caused problems. I am using a dRAID vdev that is 60 drives wide with 7 spares. These are 14 TB drives, 407.09 TB usable in total. Because of the power blips I have 4 spares currently in use and 4 that are available to add back into the pool. I checked the drives and they are reporting no SMART issues and no errors. When I try to re-add them I get the error:

Item#0 is not valid per list types: [EINVAL] datavdevs.draid_data_disks:null not allowed.
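For context on what usually resolves this kind of situation: with dRAID, a distributed spare returns to spare duty on its own once the device it replaced is either detached or replaced, rather than being "added back" through the pool-creation path. A minimal sketch using only standard zpool commands; the pool name `tank` and the GUID are placeholders, and resilvering should be complete before touching anything:

```shell
# Identify the spare-N groupings and the members they replaced.
zpool status tank

# If the original drive is actually healthy, detaching it by GUID hands
# its slot back and releases the distributed spare (GUID is a placeholder).
zpool detach tank 1234567890123456

# If the drive really failed, replace it with a new device instead; the
# spare is freed once the replacement finishes resilvering.
zpool replace tank 1234567890123456 /dev/sdz
```

On SCALE the middleware expects to drive these operations itself, so where the UI offers a working Replace/Detach action on the spare, prefer that over the shell.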

This is where I will tell you that you must have a UPS. My personal criterion in selecting a UPS is how long it can keep the system running without power. I want 10 minutes or more. I live in a very electrical-storm-heavy area; we have them almost every day during the summer months. I have had thunderstorms and lightning hit very close, [flash of light, Boom, Shit my pants]. Okay, not the last thing, but less than a second after the flash came the boom, so it’s very close.

Anyway, I want a UPS that lasts long enough for my system to power down. I let it wait 30 seconds, then start the shutdown process. Better safe, you know. With that many drives, I suspect you need a rather large UPS that costs a lot of money, but isn’t the data worth it? And if you have a UPS already, time to find out why you had any problems at all.

Did you search for the error message? I found a few hits, one from last month with the same error. Not sure if it will help; quite honestly this is not my strong area and I don’t want to give you bad advice. I have never used dRAID.

I was just dealing with one person yesterday who said the SMART was good; we asked for the last test, and it was over 10,000 hours old. A bit out of date. If you are only looking at the SMART status, that only means the drive has not completely burned up. Maybe you did look at the actual SMART data, but when was the last long test performed? When was the last scrub performed?
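To make the "stale test" check concrete, here is one way to compare the drive's current power-on hours against the hours recorded for its last extended self-test. The two sample lines below are invented but follow the shape `smartctl -a` prints; on a live system you would pipe `smartctl -a /dev/sdX` into the same awk filters instead of the heredoc-style variable:

```shell
# Invented sample of the two relevant smartctl -a lines:
sample='  9 Power_On_Hours   0x0032   085   085   000   Old_age   Always   -   23120
# 1  Extended offline    Completed without error       00%     12804         -'

# Current power-on hours (last field of the Power_On_Hours attribute row)
current=$(echo "$sample" | awk '/Power_On_Hours/ {print $NF}')

# Power-on hours at which the last extended (long) self-test completed
last_test=$(echo "$sample" | awk '/Extended offline/ {print $9}')

echo "$(( current - last_test )) hours since the last long test"
```

A fresh long test is kicked off with `smartctl -t long /dev/sdX`, and pool-level data verification with `zpool scrub <pool>` (never while a resilver is running).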

Best of luck to you.

I’m not very familiar with dRAID; however, I did once have a 90-bay JBOD lose 30 of its drives due to a power/connection issue, and the 4 hot spares in the head kicked in. All was fine in the end once I got the JBOD back up and running, but I had to let the hot-spare swap-in complete before it realised the other drives were fine and the spares went back to being spares again. Can you share the output of zpool status?

I’m not able to share the zpool status output directly because this is a disconnected network. Most of the drives are resilvering. The status shows 2 1/2 days to go: 1 TB scanned out of 89 TB. I have maybe 2 drives that are faulted now (they are the ones where smartctl reports being unable to read SMART status).

I pulled one of the faulted drives and replaced it (I have tons of physical spares). Before I did that I was getting a constant message on the console:

kernel: sd 11:0:1:0: device_unblock and setting to running

This would scroll by once every 5–10 seconds.

Do I just need to let the resilvering process finish before doing anything further?

Yes, I would.


So checking zpool status this morning, there appears to be no movement. I’m still at 998G of 89.0T scanned. The speed has dropped to 13.6M/s, with an estimated time of over 78 days.

What should I do? Restart the system?
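For what it’s worth, the estimate quoted above is internally consistent with the throughput, not a display glitch. A back-of-the-envelope check (treating zpool’s G/T as GiB/TiB and M/s as roughly MB/s):

```shell
# 998 GiB scanned of 89.0 TiB total, currently moving at 13.6 MB/s:
awk -v scanned_gib=998 -v total_tib=89.0 -v speed_mbs=13.6 'BEGIN {
    remaining_mib = (total_tib * 1024 - scanned_gib) * 1024
    seconds = remaining_mib / speed_mbs   # MiB ~= MB is fine for a rough ETA
    printf "about %.0f days remaining\n", seconds / 86400
}'
# Prints about 79 days -- matching the "over 78 days" estimate.
```

So the real question is why throughput collapsed; `zpool iostat -v <pool> 5` shows per-disk activity and would reveal whether one slow or resetting drive is stalling the whole resilver.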

Is the pool still accessible?

Do you have a current backup of the data?

I’m still tempted to suggest you leave things alone for now. Estimates can be wildly off sometimes.

Is the pool accessible? When I run zpool status I see it’s there. If I go to Datasets in the UI it shows “Create Pool”.

Ah ok that doesn’t sound good.

So you can’t access any of your data currently?

Does zpool status show the pool as online?

The pool is listed as “Degraded”.

Status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

The front dashboard is acting like it’s never going to finish loading. The pool is listed there, with a degraded status as well.

Ok, so that suggests the pool is still operational; however, you’re saying you can’t see any of your datasets in the UI, correct?

When you click on the dataset tab can you see any of them?

What do you use this for, i.e. SMB, NFS, iSCSI? Are any of those services still working?

The Datasets UI element only shows a “Create Pool” button in the center, so no, the UI isn’t seeing it. It functions as an SMB share, and I’ve disabled the service for now so no data changes can occur while it does its thing.

Ok thanks.

So I think what has happened is the pool vanished from the system during your power outage and after reboot the UI is now confused.

Under the ‘Storage’ tab can you see your pool listed?

Yes, it does. It also shows I have 5 unused disks that need to be added back into the pool to replace spares. I’ve done this successfully in the past on TrueNAS Core, but elements here are different and I’m still unsure how, or even if, I should proceed yet.

So, as you currently can’t see or access your datasets from the UI, that suggests the pool isn’t properly imported.

Personally I would try and export the pool from the UI and see what happens. If that goes well then try to import it again via the UI and see how things look.
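If the UI export/import stalls or misbehaves, the state can also be sanity-checked from a shell. The pool name `tank` below is a placeholder; note that on SCALE the middleware normally drives import/export itself, so the UI remains the preferred path when it works:

```shell
# Cleanly export the pool (fails if datasets are busy, which is itself
# useful diagnostic information).
zpool export tank

# With no arguments, list pools visible for import and their health.
zpool import

# Import it again; -R /mnt sets the altroot so datasets mount under /mnt,
# which is where TrueNAS expects pools to live.
zpool import -R /mnt tank
```

A pool imported behind the middleware's back can leave the UI out of sync, which is why exporting and re-importing through the UI, as suggested above, is worth trying first.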

Ok, working on it. How long does that generally take? It’s sitting at 40%, “Reconfiguring system dataset”.

Not long, normally about a minute tops.

If this sits here for too long any suggestions?

Just wait.

An old IT colleague of mine, back when I was much younger than I am now, used to tell me that in situations like this you either go and have a fag or, in this modern world, a cup of coffee, and wait.