Pool is suspended and zpool commands hang

zaxmyth · September 24, 2024, 1:16pm

Thanks for the help, everybody.

@Protopia sorry about the zpool status -v output. I didn’t have the ability to copy and paste the output from the computer so I had to type it out. I should have at least mentioned that in that reply.

Last night I moved the drives into a new case with a new backplane, new SATA cables, and no expander and the results were unfortunately, the same…

The good news is I guess I don’t have any bad hardware but the bad news is I’m not exactly sure what to do next - although it seems to probably require restoration…

Any tips or pointers to documentation that you all think would be appropriate would be appreciated.

I do have regular snapshots on most of the datasets - is it possible to try rolling back before I start doing anything more drastic?

Also, is it possible to determine which dataset that missing metadata file is on?

Protopia · September 24, 2024, 3:13pm

Yes - rolling back a dataset from a snapshot should fix the metadata issue identifying the dataset because snapshots are block based and the metadata block is therefore part of the snapshot and would be rolled back. My first assumption (and it is ONLY an assumption) is that this will be part of the same dataset as the corrupted files.

Equally, as Stux pointed out, metadata blocks hold 2 or 3 copies of the metdata (as well as being RAIDZ or mirror protected) - and I am unsure whether you get a metadata error if only the primary copy is corrupt or if all the copies are corrupt and / or whether a scrub will fix the metadata error.

So having verified by a SMART LONG test that the drives are working fine, I would start by running a scrub on the pool and then review the output to see whether anything has been fixed.

If you still have a metadata error then working out which dataset it is part of would seem to be the next action to attempt.

etorix · September 24, 2024, 3:31pm

Ideally, commands would be run from a ssh session and the output cleanly copied out from shell to be pasted. Failing that, take screenshots.

You may try to delete the damaged files, the assfected snapshots, and all other snapshots the files are in. Then scrub. If that clears the error, restore the files from backup.
Otherwise, destroy and restore the entire pool.

I suppose so, from zdb and with extensive knowledge or ZFS data structures on drive. (That is: Pointless for mere mortals…)

Protopia · September 24, 2024, 5:00pm

Except that you DON’T want to delete snapshots that might enable you to roll back to a point that the affected pool / dataset wasn’t corrupt.

Stux · September 24, 2024, 7:13pm

There is no guarantee that rolling a dataset back will do anything except lose data written after the rollback point.

The problem is you don’t know what metadata is corrupt, so you don’t know how to get rid of it.

Arwen · September 24, 2024, 7:46pm

If ZFS is reporting Metadata corruption, that is ALL copies of that specific block of Metadata are corrupt.

During normal operation or during a scrub, if ZFS detects bad Metadata, it will use any redundancy to fix it. That is normally available extra copies, Mirroring or RAID-Zx. I think I have personally seen this on my non-redundant Media pool. It must have used a standard Metadata copy to fix it, since there was no vDev redundancy.

Please note, this Dataset attribute can disable extra copies of Metadata;
redundant_metadata=all|most|some|none
with the default of “all”. The manual page for “all” option seems badly worded. I might submit a GitHub issue on it. (The manual pages were changed a while back, so perhaps it got messed up.)

For a pool with redundant vDevs to have Metadata corruption, something bad must have happened. Perhaps something like this, (not saying the original poster had any of these);

Software bug, perhaps in I/O, (aka SATA), drivers
Disk controller (partial), failure, (aka SATA controller overheated writing garbage)
Lack of regular scrubs so that many more than normal bit-rot blocks accumulate.
Memory errors

The last worries me. I think with the massive amount of data ZFS is now storing world wide, and the large user base with Non-ECC memory servers, it is my opinion we are starting to see occasional pool corruption because of memory errors. I am not saying that is the case for the original poster. But it is possible.

PS - ZFS RAID redundancy is at the vDev level, (Virtual Device). Not the pool level. A pool might have only 1 data vDev, but a pool is not limited to 1 data vDev.

Protopia · September 24, 2024, 8:35pm

@Arwen

What about using snapshots to go back in time to a point where there wasn’t any metadata corruption (or indeed any corruption)?

And how can we identify which dataset the metadata is related to?

Does ZFS and / or TrueNAS actually allow you to have different shaped data vDevs in the same pool? I can’t see any fundamental reason why it wouldn’t work even though it would make sense for consistent performance and recovery for all data vDevs in the same pool to have the same shape.

Stux · September 24, 2024, 8:59pm

Absolutely.

Mirror mixed with a stripe, a small raidz1 and a giant raidz3 if you want. Unadvisable of course

Stux · September 24, 2024, 9:00pm

I was going to say something similar, a software bug (in zfs) or a memory error.

I think the sata controller failure may have been caught by UDMA CRC… or not

Arwen · September 25, 2024, 1:18am

Snapshots only make certain data & metadata read only. So a snapshot could easily be corrupted either due to bit-rot, (data going bad on storage device), or written as bad due to hardware fault or software bug.

I don’t know the tools well enough to determine what metadata goes to which dataset or snapshot. It is probably “zdb”.

ZFS definitely YES. TrueNAS, not so much today.

Their was a serious problem with new users where they tried to extend their RAID-Zx vDev by 1 disk. (Before RAID-Zx expansion… users “assumed” it was possible.) Instead, they turned their redundant pool with RAID-Zx vDev(s) into a mostly redundant pool, but with a single disk stripe. Loss of that single disk stripe meant loss of the pool.

I think TrueNAS is attempting to prevent that kind of mistake today. Though perhaps Core might have less hand holding than SCALE, since SCALE is getting more GUI and code changes.

Arwen · September 25, 2024, 1:34am

That is a good point. I don’t know SATA wire protocol, just that SAS protocol has much higher reliability built into the wire part.

A quick search indicates that a CRC32 protects SATA data, (and I assume commands), over the wire. But, if the SATA controller chip is over heating, the corruption could possibly occur before the checksum is computed. Still causing bad data to be written.

Stux · September 25, 2024, 2:34am

Anyway, conclusion is the same, this is a messed up situation and shouldn’t happen except with either a software or hardware fault… that resulted in all copies of the metadata being corrupt, which seems unlikely without a massive amount of errors occurring, which did not occur.

Could be a memory error, but without ECC you can never tell.

zaxmyth · September 27, 2024, 1:48pm

My current guess is that the problem was created because of the SATA expander or potentially a memory issue. It likely worked most of the time but at some point It may have written some garbage that caused the corruption.

I do regular scrubs and had seen some some errors in the past but subsequent scrubs had seemed to resolve them. I was not sure what was going or what to think/do about it because it seemed intermittent and self-resolving, but in retrospect it was probably trying to tell me something was actually wrong - I just couldn’t figure out what. Like many new ZFS users, there are sometimes some hard lessons you need to learn

I had plans to build a new server as soon as a construction project in my house is done (because the space under construction is where the servers go…) but this is making me move that project forward a bit. All the parts are ordered and should all arrive within the next week. I’ll build up the new system and migrate the data over that I can. New system will have ECC and plenty of proper SATA connections.

Thanks for all of the help and advice along the way.

zaxmyth · September 27, 2024, 2:09pm

I did open a bug as @Protopia suggested for anybody interested

https://ixsystems.atlassian.net/jira/software/c/projects/NAS/issues/NAS-131418