Viewing data errors on remote, untrusted, encrypted dataset - best practice

Hi all, so my remote backup system is working like a dream; however, I just got off a call where the screenshot below was shown to me. Normally I would just decrypt the dataset, list which file is the problem, delete it, and restore it. However, this setup has some other challenges, and I'm not sure what the protocol should be.

For example, I could just copy and paste the decryption key onto the remote side, because while it 'is' untrusted, it's actually a friend whom I do trust. That's the simplest option, but not technically desirable. Is there a way to grant just this capability to a remote user?

Also, once we resolve how to know which file it is and where it’s from - there are other questions.

I assume this file will belong to a dataset with a snapshot that I will have to roll back to in order to fix it, since all of its subsequent snapshots rely on the earlier ones. If so, is there some kind of incremental sync I can do afterwards so that the data is re-sent from the source? Or do I need to do a 'from scratch' backup of that specific dataset?

I have had more heat-related problems with LSI cards, which have now all been swapped out for third-party (slower but more reliable) SATA cards. This caused a number of issues while I was away overseas, which corrupted some data. What happens to the backup if the source has corrupted data? Does it just send the corrupted data as-is, so I can go back to an earlier version and it's all good? Or, if I'm doing a raw send (not sure that I am, though), does it replace the original good version in the backup with the corrupted version? Are there any scenarios here I need to be aware of?

How do you recommend I fix the remote backup target?

Thanks!

Not really, because your friend is the ultimate administrator of the server. They can delete or revoke users and privileges any time they want.

You could, through a VPN, SSH into a command-line with a username that is allowed to unlock the dataset. (Your friend would need to create this user, and grant it allow privileges for the datasets.)
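As a rough sketch of the delegation step, assuming a hypothetical dataset tank/backups/mydata and a hypothetical user restoreuser (your friend would run this as root on the remote box; note that on Linux an unprivileged user may still need sudo for the actual mount, even with the permission delegated):

# create a dedicated user for this (name is just an example)
useradd -m restoreuser

# delegate only what's needed to unlock, mount and clean up this dataset
zfs allow -u restoreuser load-key,mount,destroy tank/backups/mydata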

From here, you unlock the dataset (with a keystring or passphrase), then check the output of zpool status -v, then delete or replace the relevant files and snapshots, and finally lock the dataset again.
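In command terms, that sequence looks roughly like this (same hypothetical dataset name as above; adjust the key handling to however your key is actually stored):

zfs load-key tank/backups/mydata       # prompts for the passphrase, or add -L file:///path/to/keyfile
zfs mount tank/backups/mydata
zpool status -v tank                   # note the paths listed under "errors:"
# delete or replace the offending files and any snapshots still referencing them, then:
zfs unmount tank/backups/mydata
zfs unload-key tank/backups/mydata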


Most likely it is the data blocks that are affected. ZFS is designed to keep multiple copies of metadata, on top of what the vdev redundancy already provides.

This means that any snapshots which point to these data blocks will need to be destroyed (even after you delete the offending file), which unfortunately might overlap with other snapshots you want to keep.
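To illustrate with made-up pool, dataset and snapshot names, the error list usually points you at what has to go:

zpool status -v tank
#   errors: Permanent errors have been detected in the following files:
#           tank/backups/mydata@auto-2025-01-01:/path/to/file
# each listed snapshot still referencing the bad blocks has to be destroyed,
# in addition to deleting the file from the live filesystem:
zfs destroy tank/backups/mydata@auto-2025-01-01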


There's a newer feature in ZFS, which was never really "completed", that theoretically lets you do this. The -c flag for zfs receive lets you apply a "corrective stream": a zfs send from the source supplies "good" copies of data blocks, which the destination uses to replace (in place) its "bad" data blocks.

It’s not very intuitive, since you need to find the smallest incremental send (between two snapshots) that includes this file. Otherwise, you’ll have to send a massive full replication just to hopefully repair a single file.

You can use zfs diff or “guess” where you think this file might have first been introduced, so that you can specify an incremental stream that will contain the file’s data blocks, while making sure the stream isn’t too large.

Caveat: Corrective streams cannot repair metadata, only data blocks.
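Roughly, the repair would look like the sketch below, with hypothetical dataset, snapshot and host names (tank/mydata, @A/@B, backupbox, backuppool/mydata). The assumptions: the corrupted snapshot was replicated to the backup and so exists there with the same identity, the damaged file's blocks were written between @A and @B on the source, and the backup's OpenZFS version supports corrective receive (zfs receive -c):

# on the source box: confirm which incremental range actually touches the file
zfs diff tank/mydata@A tank/mydata@B

# send that range and apply it as a corrective (healing) receive on the backup;
# the target snapshot must already exist on the backup side
zfs send -i tank/mydata@A tank/mydata@B | \
    ssh backupbox zfs receive -c backuppool/mydata@B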


So, just updating here: I ended up needing to get onto the box directly. If I had had direct SSH access to it, I could have passed commands over an SSH session, but I don't, and it didn't seem worth setting up.

Anyway, more has come out of this that needs attention. Firstly, even after confirming the encrypted pool was unlocked (and mounted, for good measure, to confirm it was working), zpool status -v still did not tell me which files needed to be restored. I tried to find the error just now to paste here, but unfortunately I couldn't find it.

Prior to this whole thing happening, a scrub was done in case it fixed anything, and after this decryption exercise another scrub was done. There are now no errors present at all. I have since read that multiple scrubs can make the error go away. What I'm not clear on is whether this means the disk error was actually repaired, or whether there is a bug in the scrub system that effectively makes ZFS treat the corrupted data as normal data and therefore no longer report it as an issue.

Does anyone know what multiple scrubs do to a pool when it has an error, and why the error goes away?

I’m suspicious now and am thinking I might need to delete the whole backup and re-replicate it. Which is rather annoying.

Thanks.

It’s likely that the subsequent scrub used a “good” copy of the block (or “parity”, if it’s RAIDZ) to fix the corrupt block. I can’t say for sure why it takes more than one scrub to clear the error from the zpool status -v command, but that’s usually what people experience.

Think of the status output of “errors” and “corrupt files” as more of a “log” than an active status. It can be “outdated”.

If a full scrub completes, which results in no errors or corrupted files, then it means that it either repaired the corruption or that the corrupted block/file has since been removed.
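In command terms, a minimal sketch of verifying this on the backup pool (hypothetical pool name tank):

zpool status -v tank      # the "errors:" section is a log of what earlier scans found
zpool scrub tank          # repairs from redundancy/parity where it can
# wait for the scrub to complete, then:
zpool status -v tank      # entries that were repaired (or whose blocks were since freed)
                          # typically drop off after a completed scrub
# zpool clear tank        # optional: resets the per-device error counters only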

Thanks, so you're saying people usually see the errors disappear after multiple scrubs, and I should just be glad I'm running ZFS because it saved me again? That's what I want it to be… I wonder if there's something someone should post about upstream, as it doesn't really instil confidence.