Scrub task stuck

can you rotate the cooler 90 degrees :slight_smile:

Sometimes the mounting hardware is square.

can you rotate the cooler 90 degrees :slight_smile:

If memory serves me right, I think that’s indeed possible in this case. However, that would make for a very weird airflow.

I ordered a replacement cooler already; if it arrives before the weekend, I’ll go with that. If not, maybe I won’t be patient enough and will try to move the current one :wink:

It’d be fine :wink:

Might even get extra air over to your slots.

Of course I wasn’t patient and switched everything around yesterday as suggested, turning the current cooler 90 degrees. The HBA is now in the PCIe x16 slot.

I’m very happy to report that last night the scrub then ran through, this time for real, without any disks getting removed.

I’ve celebrated prematurely before, so before announcing success I’ll run concurrent SMART long tests and then reboot all apps, but I’m mildly optimistic :slight_smile:

2 Likes

Caution is the word of the day.
I don’t expect PCIe x8 vs. x16 to be an issue beyond throughput; a PCIe x16 slot is expected to support any lane width down to x1, and ZFS shouldn’t be any the wiser.
It is possible that having to shuffle cables around allowed connector contacts to be reseated. That is something that often seems to be overlooked with server interconnects.
It is also possible the CPU cooler was letting the CPU hit high temperatures. Did you clean up and reapply the thermal compound between the CPU and heatsink?
If your HBA issue was due to overheating, then I would put the fault on the company that manufactures those cards.
Anyway, I would suggest you keep a close eye on your system and as long as you have proper backup/replication in place, I wouldn’t mind going through a series of scrubs just to understand if the fix is really a fix or just a fluke.
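For reference, a manual scrub round is just the following (the pool name below is a placeholder):

    # start a scrub, then keep an eye on progress and per-device error counters
    sudo zpool scrub tank
    sudo zpool status -v tank

Any device dropping out, or the READ/WRITE/CKSUM counters climbing during those runs, will show up in that status output.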

2 Likes

He was using a PCIe x8 card in a PCIe x4 slot - I think we have reason to be optimistic now.

1 Like

This doesn’t change how backward compatibility of PCIe works. If not x16, then we still have x8, which supports down to x1 without reliability issues, only reduced throughput.
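If in doubt, the negotiated width can be checked directly; on a Linux-based install something like this should show it (the PCI address is just a placeholder, use whatever lspci reports for the HBA, and the filter terms are only a guess at how the card is named):

    # find the HBA's PCI address first
    lspci | grep -iE 'sas|lsi|scsi'
    # then compare what the card can do (LnkCap) with what was negotiated (LnkSta)
    sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'

If LnkSta shows a narrower width than LnkCap, the link simply trained with fewer lanes; it should still be reliable, just slower.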

2 Likes

Wouldn’t throughput be a possible issue when the HBA is close to fully loaded with only half of its requested bandwidth and a scrub task running? I could imagine a world where drives time out due to the limited throughput, causing ZFS to panic instead of the HBA throttling drive speed to match the available bandwidth.

That being said, I agree that further testing will be necessary, if nothing else.
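If it really were drives timing out, I’d expect traces of it around the moment ZFS drops the disk; on a Linux-based install something like this is where I’d look (the grep pattern is only a rough guess at typical messages):

    # recent kernel messages about resets, timeouts or I/O errors
    sudo dmesg -T | grep -iE 'timeout|reset|i/o error'
    # ZFS's own event log, including why a vdev was faulted or removed
    sudo zpool events -v | less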

2 Likes

I have no idea what PCIe is supposed to support in such a case. Intuitively it makes sense to me that it wouldn’t be happy about the situation. On the other hand, of course, it would’ve been nicer if it just hadn’t worked at all in a ‘hidden’ PCIe x4 instead of giving random errors. Maybe that’s exactly your point - that it’s still supposed to work, so it had no reason to ‘complain’ :slight_smile:

It is possible that having to shuffle cables around allowed connector contacts to be reseated. That is something that often seems to be overlooked with server interconnects.

The only cable change since the last time it broke on me during every scrub was unplugging the Mini SAS side of the cables when moving the HBA. That could’ve been it, sure. All other cable changes I had already gone through before.

It is also possible the CPU cooler was letting the CPU hit high temperatures. Did you clean up and reapply the thermal compound between the CPU and heatsink?

I’m confident this isn’t the issue. I’ve always been monitoring the CPU temperature and it’s far from critical in any way. This cooler could probably run the CPU passively if I really wanted. With the fan going, it’s just 5 degrees or so above ambient. Even at very high loads I have never seen it exceed 50 degrees (or else all the temperature sensors in the CPU report completely bogus numbers).
The thermal paste still looked to be in very good condition, and since I never had heat issues, I didn’t actually replace it.
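(By monitoring I just mean the usual sensor readout from the shell, roughly like this, assuming lm-sensors is available; labels and exact values will differ per board:)

    # lm-sensors readout of CPU/package temperatures
    sensors
    # or keep refreshing it while under load
    watch -n 2 sensors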

If your HBA issue was due to overheating, then I would put the fault on the company that manufactures those cards.

Well, it depends on whether they are supposed to work in a non-server chassis with much less airflow :wink: It could’ve been completely unrelated to heat, since adding the fan in the end didn’t change much - I was overconfident there.

Anyway, I would suggest you keep a close eye on your system and as long as you have proper backup/replication in place, I wouldn’t mind going through a series of scrubs just to understand if the fix is really a fix or just a fluke.

Absolutely. Still nothing so far, with the SMART extended tests at 60%. Whatever changed, it at least made the system a lot more stable. Previously it would’ve been long gone by now.
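(The progress check is just the per-disk smartctl output; device names here are placeholders and the exact wording differs a bit between SATA and SAS drives:)

    # start a long self-test on one disk
    sudo smartctl -t long /dev/sda
    # later: check progress and the results of past self-tests
    sudo smartctl -a /dev/sda | grep -A1 'Self-test execution status'
    sudo smartctl -l selftest /dev/sda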


What I still don’t understand is why it is/was always the same disk that got kicked out (maybe the cable contact could explain that), and why this issue only occurred once before (checking my logs, about 6 months ago), while now I could trigger it 100% of the time a scrub ran (bandwidth, now that the pool has reached a certain size?).

I don’t believe it to be a throughput issue.

Fair - maybe the HBA just needed to be reseated, and the ‘solution’ so far simply worked out because moving it happened to re-seat the card.

I have little news to report, which is good news in this case, I’d say.

I have since

  • successfully completed concurrent SMART extended tests on all of the pool’s disks
  • restarted and extensively used all apps which access the pool
  • run another scrub
  • put four of the pool’s disks back onto a single PSU cable and cleaned up the case and cabling in general, which meant unplugging various connectors again
  • run another, final scrub.

No more issues during all of this.
Thanks a lot to everyone for contributing their time to help me debug this issue, I really appreciate it :slightly_smiling_face:

3 Likes

Congratulations.
On a cautionary note, try not to force a reboot at the first sign of trouble. While ZFS is fairly resilient, it could make things worse.

1 Like

Remember to mark the topic as solved!

1 Like