Zfs pool failure - any recovery options without a backup?

Hey everyone, I’m in a tough spot and could really use some guidance. I manage a small cluster with multiple ZFS pools, and one of them recently became unmountable. Unfortunately, we didn’t have a proper backup strategy; we relied only on snapshots, which are now inaccessible.

Here’s what happened:

  • The pool was made up of several mirrored vdevs.
  • Two drives failed in the same vdev, making the pool unimportable.
  • Running zpool import shows the pool in a degraded state, but it won’t mount.
  • We attempted zpool import -f, but it didn’t work.

Is there any possible way to recover the data, or is it permanently lost? If recovery is still an option, what steps should we take? Also, once we get past this, what’s the best backup strategy to avoid a similar situation in the future? Any advice from those with experience in ZFS failures would be greatly appreciated!

Welcome to the forums - sorry to hear of the event that has brought you here.

Your best option is likely to be the Windows recovery software Klennet ZFS Recovery. Its test version is free to use and will provide a report indicating the expected degree of recovery from the non-importable pool. You can then decide whether it is worth it to you to purchase the software.
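Before spending money, it may also be worth trying ZFS’s built-in last-resort import options, if you haven’t already. This is only a sketch (the pool name "tank" is a placeholder), and the read-only flag matters: you don’t want to write anything to a damaged pool.

```shell
# Try importing read-only first, so nothing is written to the pool:
zpool import -o readonly=on tank

# If that fails, -F asks ZFS to rewind to an earlier transaction group,
# discarding the last few seconds of writes; -n does a dry run first:
zpool import -F -n tank
zpool import -F -o readonly=on tank
```

With a whole mirror vdev missing, these may well fail, which is where Klennet comes in. If any of them succeeds, copy the data off immediately before attempting anything else.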
There have been several positive reports here in recent years.

With respect to backup, there are numerous postings here on this topic; use the search feature to find them and scan them at your leisure. Replicating snapshots to another system is a good starting point for consideration. The location of that backup system then becomes important: consider resistance to fire, flood, and similar disasters, as well as network connectivity.
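As a minimal sketch of what snapshot replication looks like, assuming a dataset tank/data and a backup host reachable over SSH (all names here are placeholders):

```shell
# Take a snapshot and send the full dataset the first time:
zfs snapshot tank/data@backup-1
zfs send tank/data@backup-1 | ssh backuphost zfs receive -u backup/data

# Thereafter, send only the changes between snapshots (incremental):
zfs snapshot tank/data@backup-2
zfs send -i tank/data@backup-1 tank/data@backup-2 | \
    ssh backuphost zfs receive -u backup/data
```

Tools like syncoid, or the built-in TrueNAS replication tasks, automate exactly this loop, including snapshot pruning.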

Good luck with your decisions. Forum members will help you with further answers to your questions after you have digested some of the online info.

Replicating to another system to have a backup would be a good start.

For now, we’d need full hardware details on the failed pool and its drives.

Thanks, I should, of course, have asked for that in my post…

I don’t have any suggestions on recovery…

However, it is my opinion (not necessarily shared by others) that 2-way mirror vdevs using huge disks, like 10 TByte, have become too risky. While the old claim that “RAID-5 / RAID-Z1 is dead” when using >1 TByte disks is not exactly true, the underlying concept has some merit.

I can’t remember the error rate that generated the old “RAID-5 is dead” trope; perhaps it was 1 error in 10^14 bits read for consumer disks and 1 in 10^15 for enterprise disks. But, again in my opinion, this also applies when dealing with mirrors.
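To put a rough number on that: assuming the 1-in-10^14-bit consumer figure, reading an entire 10 TByte disk has roughly a coin-flip chance of hitting at least one unrecoverable read error (URE), and a mirror rebuild after one disk fails is exactly that full read of the surviving disk. A back-of-the-envelope check (enterprise disks at 10^15 come out around 8%):

```shell
# Chance of >= 1 unrecoverable read error (URE) when reading a full
# 10 TByte disk, assuming a 1-in-10^14-bit error rate (consumer spec):
awk 'BEGIN {
    bits = 10e12 * 8              # 10 TBytes expressed in bits
    p = 1 - (1 - 1e-14) ^ bits    # P(at least one URE over the read)
    printf "%.0f%%\n", p * 100
}'
# Prints: 55%
```

To be fair, ZFS usually survives a single bad sector during a mirror rebuild (it loses that one block, not the pool), so this is a risk of some data loss, not of certain total loss.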

You don’t mention your Mirror disk sizes…


One other comment: you don’t mention your ZFS pool scrub schedule. Scrubs can find checksum or read errors and correct the affected block if redundancy is available.

My current media server has 2 storage devices. I take part of each for the OS root pool and Mirror it. Then use the rest in a stripe configuration for the media, (no redundancy but I have reasonable backups). For the first 5 or so years I’d get a failed video file every now and then. (Video files were larger so statistically more likely to have a bad block or checksum error.)

But I had haphazard scrubs, manual, when I remembered. So, could go 3 months without one. Eventually I made a fancy scrub script and scheduled the scrubs every 2 weeks, (all pools, so root pool gets scrubbed too).
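For anyone wanting the same effect, a minimal version of such a scrub-everything script might look like this (a hypothetical sketch; on TrueNAS the built-in scrub tasks in the UI are the usual way):

```shell
#!/bin/sh
# Start a scrub on every imported pool, skipping pools that are
# already scrubbing. Schedule from cron, e.g. every two weeks:
#   0 3 1,15 * * /root/scrub-all.sh
for pool in $(zpool list -H -o name); do
    if zpool status "$pool" | grep -q "scrub in progress"; then
        echo "$pool: scrub already running, skipping"
    else
        zpool scrub "$pool"
        echo "$pool: scrub started"
    fi
done
```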

The side effect of that scrubbing seems to be early catching of bad blocks. For the last 5 years, (and yes, that media server is 10 years old), I have had no bad blocks, even in the media pool.

Not sure what to make of it, other than ZFS is great.

Well, that is not quite true. I believe, without proof, that failing (but not yet failed) blocks are detected by the storage device because of the ZFS scrubs, and the device corrects the problem itself, either by re-writing the sector or sparing it out, then supplies the good block to ZFS. Which means ZFS does not record any errors. I guess I could look at the SMART outputs… but I am lazy and plan to replace that media server when I get a round TUIT.


I am a hater of the old “RAID-5 is dead” concept that people take as fact. That being said, there was definitely some element of truth to it, just not the blanket conclusion. But I agree, mirrors are similar to RAID-Z1, at least parity-wise. Though the rebuild is vastly faster if one disk dies, so the risk is likely lower than with RAID-Z1.

As far as risk, it’s the standard tradeoff. One should have backups no matter which pool type they are using. The tradeoffs are worth it for me; I prefer the performance. No backups is of course a horrible idea no matter what type of pool you are dealing with.

And definitely, regular scrubs are a very good idea, along with actually reading any error emails promptly and acting on them. It’s surprising sometimes how many people have errors for months before they act. I would expect a mirrored vdev to be very unlikely to fail when someone responds immediately to any error, does regular scrubs, has a UPS, avoids dirty power-offs (though that’s not supposed to matter much for ZFS), and replaces drives after errors. Failure is still possible, for sure.

As far as the OP’s question about backups: as has been pointed out, off-site replication is one very useful method. I also replicate some data to a rotated set of drives stored in a bank vault, and I use Kopia on SCALE to do a different type of off-site backup for certain data. I prefer a variety of methods, similar to what the 3-2-1 strategy proposes: three copies of your data, on two different types of media, with one copy off-site.

Thank you for your reply! Recently, I’ve been seeing more backup vendors starting to support ZFS, like Bacula or Commvault. Do you have any experience with them? Which might be better? I’ll focus on trying Klennet and see if it can help recover the data.

I don’t have any experience with them. I have two NASes, one of which is a replication target of the other. I also have USB drives for off-site storage.

I have not gone through the evaluation and selection process of a commercial backup product, but I suspect that would be significantly influenced by the choice of medium.

I see that @Arwen has posted above - had she not I would have recommended that you look up her posts in the Forum Archive - she has long practiced backup in various formats and posted thoughtful and informative information about her practices and experiences, such as this Resource: How to: Backup to local disks | TrueNAS Community

I hope that Klennet meets your needs. Please report your experience with it.

I vaguely remembered that this backup company supports ZFS. A quick check just now confirmed that they do. Not sure about pricing or competitiveness. Here is a link:
https://www.rsync.net/

As for my local backups, they work well. If someone can fit their backup onto a single large external drive, that can be a good option, even taking the backup disk off-site afterwards. Plus, nothing stops someone from rotating through several external backup disks.

It should even be possible to break up a larger pool into single-external-disk segments for backing up, even if the external disks are not filled. Or one could use smaller disks (that you have hanging around) for smaller pools or datasets.
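As a concrete sketch of the single-external-disk approach (the device name and pool names below are placeholders):

```shell
# One-time setup: create a single-disk pool on the external drive.
zpool create backup1 /dev/da5

# Each backup run: snapshot, send recursively, then export the pool
# so the drive can be unplugged and taken off-site.
zfs snapshot -r tank@offsite-1
zfs send -R tank@offsite-1 | zfs receive -uF backup1/tank
zpool export backup1
```

Subsequent runs can use an incremental send (zfs send -R -I between the previous and current snapshot) so only the changes are written to the external disk.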
