Pool is suspended and zpool commands hang

My TrueNAS Scale (Dragonfish) install has a single pool (“pool-01”) composed of a single vdev with 5 x 16TB drives in raidz2.

Yesterday, I noticed all IO to the pool had stopped working and I was seeing

"has been suspended (uncorrectable I/O failure)"

I [probably stupidly] rebooted the system to see if that would resolve the issue or at least help me get more information, but now once the system boots to a shell (which takes about 15 minutes) I am unable to get any information about the pool as zpool status commands just hang.

I’m also seeing core dumps in dmesg and messages about middlewared and zpool being blocked for more than 241 seconds.

I’ve checked the SMART test results on the drives and I think they’re OK, but clearly something is wrong somewhere.

I have offsite backups of the data so I can (hopefully) restore, but at the moment I’m a bit confused about why ZFS would be in such a state that I can’t get any information about the pool.

I’m clearly pretty new to ZFS, so any advice or guidance on what I can try and what my options are would be helpful.

Normally you would want to use zpool import for a pool that is unavailable. However, something worse seems to be going on, as zpool status should not hang.
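With no pool name, zpool import only scans the disks and lists anything importable without actually importing it, so it is safe to run even on a sick pool. A rough sketch (the read-only import is optional and only worth trying if the pool shows up):

# Scan for pools that are visible but not currently imported; makes no changes
zpool import

# If the pool is listed, a read-only import lets you look around without writing anything
zpool import -o readonly=on -R /mnt pool-01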

Please list all the hardware details, especially:

  • Make & Model of disks
  • How are the disks wired to the server, (system board SATA ports?)

As for rebooting, ZFS was specifically designed NOT to lose data on graceless, (aka hard, or power loss), reboots. However, existing hardware faults can cause problems on reboot. And occasionally a device that is close to failing will fail on power cycle or reboot.

Thanks Arwen!

  • I have 5 x Seagate IronWolf Pro 16TB
  • 3 of the disks in the pool are wired to the SATA ports on a GIGABYTE B550I AORUS PRO AX. There is a 500GB SSD wired to one of the onboard SATA ports that is used as a boot drive and is not part of the pool. 2 of the disks in the pool are connected via a SilverStone Technology ECS07 5-Port SATA expansion card via nvme.

This is likely the problem. And it’s not “nvme”, it’s an M.2 M-Key slot for PCIe connectivity, (or another type of M.2 keyed slot with PCIe).

While it seems like a straightforward SATA expansion chip, (without a Port Multiplier built in), heat or electrical connectivity may be an issue.

What does zpool import say?


I suspected that expander might be the problem. Because I have 5 drives in raidz2, I tried removing the SATA port expander (which has 2 drives connected) and rebooted, hoping that I might be able to interact with the pool in a degraded state. Unfortunately, all zpool commands continued to hang; neither zpool status nor zpool import ever returns.

I’ve also tried re-seating all the cables and drives - same results.

Then I don’t know what to do next. It is almost like you have a hardware problem not related to the SATA expansion card.

Perhaps someone else will have a suggestion.

Thanks, Arwen.

Here is what I tried next:

  • I removed the boot drive and put it in another system - it booted right up (somewhat unsurprisingly)
  • Next, I moved all of the drives in the pool over. The motherboard on this system has 6 SATA ports so I had enough directly on the board for all of the drives without the expander.

Unfortunately, I get the same behavior: long boot with errors and panics, pool in suspended state, and zpool commands that hang.

So, it’s [probably] not the original motherboard that is the problem, and it’s [probably] not the SATA expander that is the problem (although I understand those are generally flaky and suspect, it doesn’t seem to be the culprit here).

The only remaining options are that one or more of the drives in the pool have failed or that the SATA cables are bad. Unfortunately, I don’t think I have 6 extra SATA cables lying around to test that theory with. But I’ll dig through some boxes and see what I can find tomorrow.

Any other ideas?

Questions for the ZFS experts here:
In my raidz2 pool of 5 drives, what would you expect to happen:

  • if 1 drive failed?
  • if 2 drives failed?
  • if more than 2 drives failed?

Is there a drive failure scenario where you would expect zpool commands to be completely unresponsive? I’m working under the assumption that in the 1 or 2 drive failure scenario zpool would still work and I’d be able to attempt to replace and resilver drives. If more than 2 failed, would you expect the behavior I’m getting? I’m just wondering whether I’m even looking in the right direction, because the lack of response from zpool is confusing.

I think it is worth trying to identify whether any drives have failed or are failing by running SMART tests on them. I would personally…

Run a SHORT test on each SATA drive in the raidz2 array - you can run them in parallel and they should take c. 10 minutes. The SMART tests are entirely internal to the drive, and a bad SATA cable shouldn’t impact them provided the cable is good enough for the short commands and responses to get through.

With the SHORT test results in hand, I might then decide whether it was likely a SATA cable issue or a hard drive issue, and I might try running a SMART LONG test on each drive (providing that I wasn’t worried that such a LONG test might precipitate a drive failure).
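Roughly what that looks like from the shell, assuming the five data drives are /dev/sda through /dev/sde - adjust to the actual device names on your system:

# Kick off a SHORT self-test on each pool drive; the test runs inside the drive itself
for d in /dev/sd{a..e}; do smartctl -t short "$d"; done

# Ten minutes or so later, check the self-test log and the error-related attributes
for d in /dev/sd{a..e}; do
  echo "=== $d ==="
  smartctl -l selftest "$d"
  smartctl -A "$d" | grep -E 'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done

# The LONG version is the same command with -t long (expect the better part of a day on 16TB drives)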

Thanks Protopia.

I had run some short tests previously and all “passed” but did it again just now. I’m no expert in interpreting all of the results, but looking at Raw_Read_Error_Rate, Seek_Error_Rate, and Reallocated_Sector_Ct, they are 0 on all of the drives.

I’ll go ahead and kick off the long tests and see what happens. smartctl -c estimates it’ll take about 24 hours.

I’m still puzzled as to why zpool commands are completely hung. Any ideas about what would cause this sort of situation?

Running ps -ef | grep -i zfs shows that a zpool import command is running:

zpool import 657923847382948383 -R /mnt -m -f -o cachefile=/data/zfs/zpool.cache

Would having a running import cause other zpool commands to hang? I tried kill -9 on the process…but it is still there.
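Side note on the kill -9: a process that is blocked on I/O inside the kernel sits in uninterruptible sleep (the 'D' state) and signals cannot reach it, which would explain why the import refuses to die. A rough way to confirm, with <PID> standing in for the real process ID:

# STAT shows 'D' for uninterruptible sleep; WCHAN shows the kernel function it is waiting in
ps -o pid,stat,wchan:32,cmd -p <PID>

# On most kernels you can also dump the kernel stack it is stuck in (needs root)
cat /proc/<PID>/stack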

I once had a really odd problem many years ago. (Before ZFS…)

There was this theoretical parallel SCSI problem where a single disk can hang the entire SCSI bus. Let us be clear, I never heard of it happening. But, theoretically it could. When a co-worker called and said they had a problem with their software RAID-5 array where none of the 7 disks showed up, I thought of that problem.

I explained this “theoretical” problem and the fix, it worked. So no more theory, proven fact to me.

Basically a single disk was so badly damaged, (during a data center move), that it hung the old parallel SCSI bus. I explained pulling 1 disk out at a time and running “probe-scsi-all”. (This was a Sun SPARC server using OpenBoot firmware…) They found the bad disk, were able to get the RAID-5 up in degraded mode, then replaced the disk and resync’ed it. All good then.


So, how does this story affect you, @zaxmyth?

Well, if your import fails, you might try removing 1 disk at a time. Then re-run the ZFS import. If not good, put that removed disk back in, and try another disk. You only have 5 disks to try.

If that does not work, well, you could try 2 disks at a time in the various unique combinations. With RAID-Z2 you can “lose” 2 disks and still have both a functioning pool and fully intact data.

As to WHY, well, if one or more disks in the pool are corrupt, it is usually a hardware fault. ZFS is incredibly robust in protecting data. But data written out of order, like with a hardware RAID controller, or corrupted by an over-heated SATA port multiplier, still gets onto a disk.

To prevent that, regular ZFS scrubs help detect and correct the problem.

Good luck.


I would think that zfs hangs are normally caused by hardware issues.

How long did you wait for the “Hang” to resolve?

@Protopia - The result of the long test for all 5 drives was Completed without error.

@Stux - I let the command run overnight once - so maybe 10 hours?

I’ll start pulling out drives 1 by 1 and see what happens.


No, do NOT start pulling drives to see what happens. Random attempts to bring a pool back online is VERY likely to make it irrecoverable.

By all means turn the system off, and reseat all drives and all SATA connectors before you turn it back on again, but having a drive removed whilst powered on will quite likely make things worse.


I was doing all the disk swapping with the system powered off, so no hot-swapping. :-1:

Some progress.

I disabled ix-zfs.service and rebooted - I wanted to avoid the zpool import command running. I think that is what was hanging the subsequent zpool commands.
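For reference, disabling it was nothing fancy - just the standard systemd commands (assuming, as on my install, the unit is literally named ix-zfs.service):

# Stop the automatic pool import from running at the next boot
systemctl disable ix-zfs.service

# ...and to put things back to normal later
systemctl enable ix-zfs.service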

Without ix-zfs I was able to boot quickly, get to a shell, and issue zpool commands. I was also able to recreate the hanging situation by manually running zpool import pool-01 -R /mnt -m -f -o cachefile=/data/zfs/zpool.cache, which is the command that appears to be started via ix-zfs.service. It spits out WARNING: Pool 'pool-01' has encountered an unrecoverable I/O failure and has been suspended and then never returns.

Next I tried

echo 1 > /sys/module/zfs/parameters/zfs_recover
zpool import pool-01 -R /mnt -o readonly=on

And that successfully imports and mounts the pool and datasets, so now I can at least see some of the data. That’s as far as I have gotten, but it feels like progress.

Thoughts on next steps? I haven’t been able to import in RW mode and I still haven’t managed to find an actual hardware failure that I can attribute to this.

zpool status -v does show problems with the pool:

One or more devices has experienced an error resulting in data corruption. Applications may be affected

and

errors: Permanent errors have been detected in the following files: 
<metadata>:<0x3c>
pool-01/security@auto-2024-09-04...some blue iris bvr file i don't care about 
pool-01/security@auto-2024-09-04...some other blue iris bvr file i don't care about

@zaxmyth Zach

When you post output from commands can you please post the full output and not just what you personally think is relevant.

I would like to see the full output of the zpool status -v command please.

It is heartening that you have made progress in getting the system running and getting better diagnostics of the issue.

(What is annoying is that the ix-zfs.service is not handling the situation well. I would consider this a bug that needs to be reported, and so if you can preserve any logs and detailed diagnostic command outputs etc. that might indicate why the ix-zfs service is failing and allow iX to recreate the problem that would be helpful for when you report this as a bug.)
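Something along these lines should capture the relevant evidence before it rotates away (a sketch, assuming the systemd journal as on SCALE):

# Save the boot-time import attempt and the kernel hung-task / panic messages
journalctl -b -u ix-zfs.service > ix-zfs-service.log
journalctl -b -k > kernel-boot.log
dmesg > dmesg.txt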

The good news is that both:

  1. All the drives are good - suggesting an issue with the SATA cables (most likely) or the SATA controller.

  2. The pool errors are only on two unimportant files.

So what we now need to do is to work on:

  1. Eliminating future I/O errors. My advice is to change out all the cables before trying anything new - relatively low cost and worth eliminating as an issue.
  2. Getting the pool errors cleared and bringing the pool back online… I am thinking a scrub and then a zpool clear as a way forwards (rough sketch below), but wait to see what others think.
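Very roughly, the sequence I have in mind once the pool can be imported read-write again (a sketch only - wait for consensus before running it):

# Scrub the whole pool, then watch for new or repaired errors
zpool scrub pool-01
zpool status -v pool-01

# Once the scrub finishes cleanly, clear the logged errors
zpool clear pool-01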

The unimportant files are unimportant.

The important thing is

<metadata>:<0x3c>

which is corrupted metadata.

Unusual to get corrupted metadata. It’s copied 2 or 3 times and protected with raidz2.
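For the curious, the extra metadata copies are controlled by the redundant_metadata property, which defaults to all; checking it costs nothing and works even on the read-only import:

zfs get redundant_metadata pool-01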

Ah - I missed that. That is much more of a problem. Will a scrub fix that or is it a death rattle for the pool?

I think it’s possible you could fix it by deleting the right dataset.

But, I think it’s backup and restore time.
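Given the pool does import read-only, a minimal sketch of evacuating the data before rebuilding (the paths, dataset, and host names below are placeholders, not taken from this thread):

# Plain files can be copied straight off the read-only mount
rsync -aHX --progress /mnt/pool-01/ /mnt/evacuation-disk/pool-01/

# Existing snapshots can also be replicated as-is with zfs send / zfs receive, e.g.
#   zfs send -R pool-01/somedataset@some-existing-snap | ssh backuphost zfs receive -F backuppool/somedataset
# Note: new snapshots cannot be created while the pool is imported read-only.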