Power Outage - Pool Offline - Can't Import Via GUI

Sadly, only your memory could confirm. I swear that, from what I remember of TrueNAS CORE, disks 100% showed individually.

Yes, we’d just undo the passthrough in the settings of the HBA, but at this point I don’t want to comment on whether this would make things worse if it has ALWAYS been presented as a single disk to TrueNAS. I’m not going to recommend this any further for now - this territory feels outside of what I’d normally be comfortable suggesting if your pool has always existed as a single drive to TrueNAS.

If we do have ZFS mixed with a RAID controller, I’m not comfortable confirming anything. To me, there is nothing inherently risky about exporting a pool (as long as you don’t set the delete data option in the command) and then re-importing it. However, I’m no longer comfortable advising on next steps and defer to those smarter than me to give you advice.

Edit: HoneyBadger & Stux coming to the rescue - I defer to them entirely.


zpool import with no parameters will tell you that a pool can be imported - ONLINE in the status means it’s theoretically present, but as others have indicated you have a hardware RAID5 with 6x8T drives.
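As a minimal sketch of reading that listing, the helper below (a hypothetical name, not part of ZFS) pulls the pool name and state out of `zpool import` output, assuming the usual `pool:` / `state:` labels. Keep in mind that ONLINE here only reflects what ZFS can see, which in this setup is the single virtual disk and not the physical drives behind the RAID controller.

```shell
# Reads `zpool import` output on stdin and prints "<pool> <state>" per pool.
# importable_pools is a hypothetical helper name used for illustration.
importable_pools() {
    awk '$1 == "pool:"  { pool = $2 }
         $1 == "state:" { print pool, $2 }'
}

# Usage on the TrueNAS shell (listing only, nothing is imported):
#   zpool import | importable_pools
```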

The first step I would take is to restart the system and try to enter the H710p BIOS - check there to see if it offers any information on the single 40T “virtual drive” being online and healthy, as well as the backing physical disks.

Unfortunately with this configuration, ZFS can’t provide any redundancy across the six disks - it’s only able to see a single drive.

Converting the controller to passthrough mode will result in the data definitely being unavailable.

The command that @Stux posted will also work - if it does permit the pool to be mounted (and it is visible in a zpool status -v) then that’s good - but I recommend cleanly shutting the system down and checking the BIOS regardless to see if there are any iDRAC warnings (a bad battery on your H710 perhaps)


Try zpool import Photoshoots -R /mnt

Not sure why the GUI is not allowing it to be imported.


There is something in the documents about the following. Maybe check its status too while in BIOS?

Controller Cache Preservation

The controller is capable of preserving its cache in the event of a system power outage or improper system shutdown. The PERC H710, H710P, and H810 controllers are attached to a Battery Backup Unit (BBU) that provides backup power during system power loss to preserve the controller's cache data.

Cache Preservation With Non-Volatile Cache (NVC)

In essence, the NVC module allows controller cache data to be stored indefinitely. If the controller has data in the cache memory during a power outage or improper system shutdown, a small amount of power from the battery is used to transfer cache data to a non-volatile flash storage where it remains until power is restored and the system is booted.

Recovering Cache Data

The dirty cache LED that is located on the H710 and H810 cards can be used to determine if cache data is being preserved.

If a system power loss or improper system shutdown has occurred:
1. Restore the system power.
2. Boot the system.
3. To enter the BIOS Configuration Utility, select Managed Preserved Cache in the controller menu.

If there are no virtual disks listed, all preserved cache data has been written to disk successfully.

Hi, thanks for the help.

The command that @Stux posted will also work - if it does permit the pool to be mounted (and it is visible in a zpool status -v) then that’s good - but I recommend cleanly shutting the system down and checking the BIOS regardless to see if there are any iDRAC warnings (a bad battery on your H710 perhaps)

I’ve never accessed the iDRAC before. I’ll be in there in the morning but might need to set that up to gain access (?). Or can I access all this through the CLI directly on the server too?

Should I try zpool import Photoshoots -R /mnt now or wait until I have checked the BIOS first?

You can access this directly from the system console, but you would need to reboot the system to enter the BIOS.

You may be able to see it through ipmitool sel list in the TrueNAS console though if Dell supports that.

root@truenas[~]# ipmitool sel list
   1 | 08/19/2022 | 15:00:13 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 03/08/2023 | 04:30:38 | Temperature #0x04 | Lower Non-critical going low | Asserted
   3 | 03/08/2023 | 09:07:24 | Temperature #0x04 | Lower Non-critical going low | Deasserted
   4 | 12/01/2023 | 06:30:26 | Temperature #0x04 | Lower Non-critical going low | Asserted
   5 | 12/01/2023 | 07:30:25 | Temperature #0x04 | Lower Non-critical going low | Deasserted
   6 | 12/01/2023 | 08:24:51 | Temperature #0x04 | Lower Non-critical going low | Asserted
   7 | 12/01/2023 | 09:27:26 | Temperature #0x04 | Lower Non-critical going low | Deasserted
   8 | 12/02/2023 | 05:15:51 | Temperature #0x04 | Lower Non-critical going low | Asserted
   9 | 12/02/2023 | 12:35:54 | Temperature #0x04 | Lower Non-critical going low | Deasserted
   a | 01/15/2024 | 03:15:35 | Temperature #0x04 | Lower Non-critical going low | Asserted
   b | 01/15/2024 | 09:21:04 | Temperature #0x04 | Lower Non-critical going low | Deasserted
   c | 01/17/2024 | 22:30:31 | Temperature #0x04 | Lower Non-critical going low | Asserted
   d | 01/18/2024 | 10:10:51 | Temperature #0x04 | Lower Non-critical going low | Deasserted
   e | 06/10/2024 | 18:43:10 | Memory #0x1b | Transition to Non-critical from OK | Asserted
   f | 06/19/2024 | 23:58:03 | Memory #0x1b | Transition to Critical from less severe | Asserted
root@truenas[~]#

Nothing to do with the RAID card there unfortunately. You may need to cable the iDRAC port into your network and let it pick up a DHCP lease.
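Since most of that event log is repeated temperature threshold entries, a small filter can make the remaining events easier to spot. This is only a sketch: `sel_without_temperature` is a hypothetical helper name, and it assumes the pipe-separated layout seen in this thread, with the sensor name in the fourth column.

```shell
# Reads `ipmitool sel list` output on stdin and drops every entry whose
# sensor column (4th pipe-separated field) starts with "Temperature".
sel_without_temperature() {
    awk -F'|' '
        NF >= 4 {
            sensor = $4
            gsub(/^[ \t]+|[ \t]+$/, "", sensor)   # trim surrounding spaces
            if (sensor !~ /^Temperature/) print
        }'
}

# Usage on the TrueNAS console (needs ipmitool and a BMC that answers):
#   ipmitool sel list | sel_without_temperature
```

Piping the command output directly avoids the line wrapping that a terminal window adds, so each event stays on one line for the filter.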

Give the zpool import Photoshoots -R /mnt a try and see if it complains - if it tells you that you have corrupted data, you may need to look at rewinding.


Guess this isn’t what I wanted it to say
image

Nope, that’s what you don’t want to see unfortunately. You may have to check in the BIOS/UEFI to get into the PERC RAID card setup screen and have a look at what it thinks the state of your virtual and physical disks is.

A fault after power loss makes me think that perhaps the battery backup or the non-volatile cache on your controller is bad.

Can you run zdb -l /dev/da1p2 and post the output?

image

Morning. After a relatively sleepless night I’m heading in to the office now.

I’ll try to access the BIOS via reboot as per below but I’m not 100% sure what I’m looking for…

check there to see if it offers any information on the single 40T “virtual drive” being online and healthy, as well as the backing physical disks.

and …

You may have to check in the BIOS/UEFI to get into the PERC RAID card setup screen and have a look at what it thinks the state of your virtual and physical disks is.

I’ll check through the steps in SmallBarky’s post above too, re: the Controller Cache Preservation, then update.


Considering getting in touch with someone today that can help walk me through it but I did try a few businesses yesterday with not much luck. If any of you are available for a call today that would be a big help, I’ve a budget to cover time.

If not, could you recommend a UK-based company that has expertise in this? I was looking at these (Contact | Haptic Networks) and will likely call this morning…

Thanks again for the help. It’s a massive support. :heart:

I would try reaching out to iX Systems directly. There is a UK (London) number if you expand the International Phone Numbers.

Hi, just an update from this morning.

I now have access to the iDRAC thanks to the efforts of the tech engineer that came out this morning and the really helpful support tech from bargainhardware.co.uk where we initially bought the hardware from.

Unfortunately, there isn’t anything showing in the iDRAC logs about errors or faults. The discs are all showing as status ‘unknown’; see below.

The controller is showing as status ‘unknown’ too.

The controller is showing that it’s set up in RAID-6, which tallies up with the number and size of the drives vs what is showing as the size of the disc in TrueNAS too.

7 x 7.2 TB (per drive) - 14.4 TB (two parity drives) = 36 TB
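The same capacity check can be written as a quick calculation: RAID-6 gives up two drives' worth of space to parity, so seven 7.2 TB drives leave five drives of usable space. The function name below is made up for illustration, and sizes are passed in tenths of a TB so that plain integer shell arithmetic is enough.

```shell
# RAID-6 usable capacity = (number of drives - 2 parity drives) * drive size.
# raid6_usable_tenths is a hypothetical helper; sizes are in tenths of a TB.
raid6_usable_tenths() {
    drives=$1          # total drives in the array
    size_tenths=$2     # per-drive size in tenths of a TB (7.2 TB -> 72)
    echo $(( (drives - 2) * size_tenths ))
}

usable=$(raid6_usable_tenths 7 72)
echo "usable: $(( usable / 10 )).$(( usable % 10 )) TB"   # prints: usable: 36.0 TB
```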

The support guy couldn’t really take this forward as there appears to be no error on his end. I’ve ordered a replacement controller and drive in case we find either has failed, but the support guy seems to think that if I have to replace or reconfigure the controller then I’ll lose all the data. Could someone confirm?

Is there anything that I can investigate from the iDRAC or from BIOS (I’m in building with the server) to find out anything else that could help diagnose this?

I’ve spoken with about 5 companies, had people out to the building but haven’t moved much further forward. :cry:

PS - Below is the hardware from the system.

I think this is data loss at this point. You just confirmed that the system has a hardware RAID-6 layer between the TrueNAS/ZFS filesystem and the disks.

I can really only point you towards the documents and hardware guides at this point so you can learn about TrueNAS and ZFS.

I’m not sure when he will come online today but we can try…

@HoneyBadger What do you think? I am not sure what advice to even give.

Man I don’t want to sound like a jerk, but whoever originally set you up with a RAID card NOT flashed to IT mode, and also purposefully set to RAID-6 & then decided to feed it to ZFS screwed you. HARD. They have a fatal misunderstanding of how to set up ZFS on the most basic level.

I have no clue if it is possible now to recover anything because of this - which is why I said earlier that I have no advice I could give that wouldn’t risk everything & that I’m not comfortable advising further.

I don’t know if you have any chance to recover your data - if you do, back it up off of that setup asap. On the next run, please, use an HBA flashed to IT mode or just connect the drives to the motherboard & let ZFS handle making of raidz1/2/3/mirrors. Fully. It needs complete access to the drives.

SmallBarky gave you iX’s number, I’d say they are your only slim hope, but even then, the way this was originally set up was fatal.

Your individual drives are showing Online at least, but the status being missing and tagged as <?> for the controller itself is concerning. At the least, it’s showing the battery as Ready/Online as well.

Let’s get back into TrueNAS and try zpool import Photoshoots -fFn -R /mnt to simulate a “rollback import” - if it doesn’t throw the same integrity-check error, that means it might succeed here.
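For anyone following along, here is a hedged sketch of that simulated import wrapped in a guard, so the command is printed first and only executed when explicitly requested. The `run` wrapper and the `APPLY` variable are assumptions added for illustration; the pool name and altroot come from this thread. The options are placed before the pool name, which is the order the zpool-import man page documents.

```shell
POOL="Photoshoots"   # pool name from this thread
ALTROOT="/mnt"       # altroot used in this thread

# Print the command by default; only execute it when APPLY=1 is set.
# (run and APPLY are hypothetical conveniences, not part of ZFS.)
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# -f forces the import even if the pool appears potentially active,
# -F requests recovery mode (may discard the last few transactions),
# -n together with -F only reports whether recovery would succeed.
simulate_rewind_import() {
    run zpool import -f -F -n -R "$ALTROOT" "$POOL"
}

simulate_rewind_import
```

Running it as-is only echoes the command; `APPLY=1` would be needed to actually call zpool, and with -n present that call still makes no changes to the pool.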


Hi,

Ok, will give this a try now.

So, it’s showing as per the below.

image

Just to confirm that date/time is around the time that the power went down. I can’t confirm if it was before or after though. :woozy_face:

Let’s go to the next step and mount read-only:

zpool import Photoshoots -fF -R /mnt -o readonly=on

If that takes, try a zpool status -v Photoshoots and post the results, then browse into the /mnt/Photoshoots directory and try to ls some contents. Find the most recent files or folders, see if the contents are readable. You may not be able to mount them as a network share but you can copy them off over SCP if you have SSH access.
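A rough way to do the readability check on recent files is sketched below. It assumes the pool is mounted at /mnt/Photoshoots as in this thread; `check_recent_readable` is a hypothetical helper that reads each recently modified file once and reports whether the read succeeded. It only reads, so it is compatible with a read-only import, but a successful read does not prove the file contents are what you expect.

```shell
# Reads every regular file under a directory that was modified within the
# last N days and prints "OK" or "FAIL" per file. Read-only: nothing is
# written to the directory being checked.
check_recent_readable() {
    dir=$1     # directory to scan, e.g. /mnt/Photoshoots
    days=$2    # only check files modified within this many days
    find "$dir" -type f -mtime -"$days" -exec sh -c '
        for f in "$@"; do
            if cat "$f" > /dev/null 2>&1; then
                echo "OK   $f"
            else
                echo "FAIL $f"
            fi
        done' sh {} +
}

# Usage once the pool is imported read-only:
#   check_recent_readable /mnt/Photoshoots 7
```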

Loss of that ten seconds of data is probably a given at this point.

It looks like you might have a bad NV cache module on the controller - however, I would be exceedingly cautious when performing the swap. Carefully review the Dell manual and processes for disabling and fully flushing the cache (which will also hurt performance on the virtual disk) before fully exporting the pool and shutting the system down.

Unfortunately there is no way to convert the system to individual disks/passthrough or “IT/Initiator-Target” mode while keeping the data intact, because it’s a single virtual disk. You’d need to be able to hold the entirety of that ~40TB volume on an external set of disk(s) - ideally with redundancy - in order to be able to swap out the controller or reflash it to IT mode, and then copy it all back.
