Metadata Special vDEV question - dead HBA

I know that a loss of the special vDEV will result in a loss of the pool.
However, in that context people always talk about redundancy, i.e. having too many sVDEV disks die.

But what would happen in the following example:

  • HBA1: spinning HDDs Pool
  • HBA2: SSDs metadata sVDEV

The HBA2 just dies but the SSDs are okay.
So I could just replace the HBA, but is the pool still lost?

I’ll let the data loss experts here weigh in, but based on personal experience, as long as you replace the HBA and can reconnect the “missing” drives, the pool will import just fine. In my instance, it was a loose electrical connector that led to multiple drives losing power.

The GUI unhelpfully told me to kill the pool and restore from backup. Crucially, it did not tell me which drives were missing from the pool; I had to deduce that myself (that improvement opportunity became a feature request).

But I figured there was no way that many drives had failed at once, so I went over all the electrical connections and found the issue, and then the pool imported flawlessly. See here: Descent into Unhappiness | TrueNAS Community

I doubt it makes one bit of difference whether the “missing” drives are part of a sVDEV or a VDEV: once you exceed the allowable number of drives lost, you’re not going to be able to import the pool. For a Z3 VDEV like mine that would be 4 drives; for the 3-way sVDEV mirror, it would be all three.
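
To make that counting rule concrete, here is a minimal sketch in Python. The function names and layout strings are made up for illustration and are not part of ZFS or TrueNAS; the 8-wide Z3 in the usage lines is an assumed width, since the actual width was not stated. It only models the rule described above: a RAIDZn vdev tolerates n unreachable members, an N-way mirror tolerates N-1, and the pool needs every vdev, special vdevs included.

```python
def tolerated_losses(layout: str, width: int) -> int:
    """How many members one vdev can lose and still be usable.

    Hypothetical helper for illustration only; not a ZFS API.
    """
    if layout == "mirror":
        return width - 1           # an N-way mirror survives while one copy remains
    if layout in ("raidz1", "raidz2", "raidz3"):
        return int(layout[-1])     # RAIDZn tolerates n missing members
    if layout == "stripe":
        return 0                   # a single disk or stripe has no redundancy
    raise ValueError(f"unknown layout: {layout}")


def pool_importable(vdevs: list[tuple[str, int, int]]) -> bool:
    """vdevs holds (layout, width, missing) per vdev; every vdev must survive."""
    return all(
        missing <= tolerated_losses(layout, width)
        for layout, width, missing in vdevs
    )


# Assumed example: 8-wide Z3 data vdev plus a 3-way mirrored sVDEV.
print(pool_importable([("raidz3", 8, 3), ("mirror", 3, 2)]))
print(pool_importable([("raidz3", 8, 0), ("mirror", 3, 3)]))
```

Note that "missing" here only means unreachable. A dead HBA makes every drive behind it unreachable at once, but the data on those drives is untouched, which is why reconnecting them brings the count back to zero.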

Thanks Constantin!
I will try to test this on my test system and will report back how TrueNAS behaved. :slight_smile:

Although I don’t use metadata vdevs and therefore have not seen this exact scenario before, I have lost a large chunk of drives in a pool due to a JBOD issue, and the pool happily recovered when the drives came back online. I believe the pool suspends itself when this happens, thus protecting itself.

So on my test system I had the HDDs connected to the SATA ports of the mainboard, building the main pool/vdev.
On an HBA I had 4 SSDs in a 2x2 mirror for the metadata sVDEV.

I then ripped that HBA out of the mainboard, which caused the pool to get suspended.
So far so good, so I initiated a shutdown of the system.

Problem is, it does not actually shut down. It is trying to unmount the pool and fails.
It’s been in this state for 10 minutes now.

Guess I will open a ticket, as the only way out seems to be a power cycle.

But I also have good news: after I reconnected the HBA and booted up the system, the pool was working just fine! :slight_smile:

You really do not want to make electrical changes to the main board, motherboard, etc. while the system is on. Shut down the NAS, turn off the computer, and only then undo or redo connections.

The only exceptions are USB / SATA / FireWire connections that are built for it. Even then it depends on the hardware: for example, the backplanes on my HDD tower lack the decoupling capacitors that would make hot swaps advisable, so I generally avoid that.

Other hot-swap setups feature those capacitors and can “take” a HDD dropping out electrically just fine without affecting adjoining HDDs on the bus.

I know, this is a test system where it’s a no-brainer. No value lost if it dies. :wink:
I did want to see how it responds when the HBA “dies” while the system is running and data is actively written to the pool.

Now this is hilarious: my Jira ticket was closed because “unsupported development tools were enabled”.
Which is not true, this was a plain and simple 26.04 nightly build install, so I guess they don’t want feedback on their nightlies.

I do not run nightly or beta builds, but those would be considered “development”. I do believe nightly builds have a feedback “airplane” at the top of the GUI to submit feedback on the nightly directly to the developers. The production JIRA would not be the correct place.

I suggest reading up on COW (copy-on-write) systems and how they respond to power interruptions. ZFS writes new data to a new location and only updates pointers if the write succeeds, meaning a power loss leaves the original data intact and the filesystem consistent. Any data not fully committed is lost, much like in any other filesystem, so that is what happens if you pull power on something. ZFS was purposefully designed and implemented this way to deal with both unexpected power losses and crashes, as well as resuming scrubs or replacements after a graceful shutdown and reboot, again without any data loss. If there was “data in flight” when you yanked the card, then those blocks are lost.
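
As a rough illustration of the copy-on-write idea, here is a toy model in Python. It is a sketch only: the class, its fields, and the `crash_before_commit` flag are invented for this example and say nothing about how ZFS is actually implemented internally. It just shows the ordering described above, where new data lands in a new location first and the pointer is switched only after that write has succeeded.

```python
class ToyCowStore:
    """Toy copy-on-write store; illustrative only, not ZFS internals."""

    def __init__(self, initial: bytes) -> None:
        self.blocks = {0: initial}   # block id -> contents
        self.root = 0                # the single "pointer" to the live block
        self._next_id = 1

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        new_id = self._next_id
        self._next_id += 1
        self.blocks[new_id] = data   # step 1: new data goes to a new location
        if crash_before_commit:
            # Simulated power loss or yanked card: the pointer is never updated.
            raise RuntimeError("simulated crash before commit")
        self.root = new_id           # step 2: switch the pointer after the write

    def read(self) -> bytes:
        return self.blocks[self.root]


store = ToyCowStore(b"old data")
try:
    store.write(b"in-flight data", crash_before_commit=True)
except RuntimeError:
    pass                             # the in-flight write is lost
print(store.read())                  # still the original, consistent contents
```

In this toy, the interrupted write leaves an orphaned block behind but never touches the block the pointer refers to, which is the property the paragraph above relies on.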

I would not pull hardware from a system that is on or even plugged in, as lots of motherboards, and server motherboards in particular, “sleep” and are still partially powered when off. It’s a good way to fry something and end up needing new hardware. Hot-swap backplanes are an exception, since drives in them are expected to be pulled at random and recover, as are the other hardware connections already mentioned, such as USB.

I can say that if you have a properly set up and connected expansion chassis and you forget to turn it on (it should be on first, before booting the main server) or it otherwise loses power, it’s no biggie to just restore power to it, and the pool should come back online.

As I said, this was a test system…. I am not insane enough to try this on a system that has any value.

The purpose of this was to simulate an HBA giving up while data was being written to the pool…..

The result was partially what I expected, the pool got suspended.

What was unexpected was that the system is unable to shut down, as it tries to unmount the pool that is missing the special metadata vdev.

This hardware failure does not seem to be handled properly by TrueNAS, as the only solution is to disconnect the power.

What are you expecting if an HBA dies? I don’t think any other OS will behave differently unless the hardware and software were specifically designed for that. The TrueNAS Enterprise version and their hardware might do redundant HBAs and failover. You would have to contact their sales for info. Not many of us have experience with the Enterprise version and their hardware.
You are really asking about high availability and enterprise features.

I expect that initiating “shutdown” via the WebGUI actually shuts the system down. :wink:

What happens when the HBA connecting the special vDEVs dies:
Pool gets suspended (as expected)

What happens when you then initiate the shutdown via the WebGUI:
The system does not shut down; it appears to get stuck trying to unmount the suspended pool (see the screenshot in the Jira issue). The only way to “shut down” the system is to unplug the power.

The HBA dying is about the same as a processor or other critical part failing. I wouldn’t expect graceful shutdown at that point.

No, that is not the same at all, as the entire system would go down then.

TrueNAS does not freeze, it does not crash, it properly suspends the pool, and the WebGUI can still be accessed.

But when trying to shut the system down to work on the issue, it simply fails to unmount the pool / the special vdevs and thus does not fully shut down.