Can you rotate the cooler 90 degrees?
Sometimes the mounting hardware is square.
If memory serves me right, I think that's indeed possible in this case. However, that would make for a very weird airflow.
I already ordered a replacement cooler; if it arrives before the weekend, I'll go with that. If not, maybe I won't be patient enough and will try to move the current one.
It'd be fine.
Might even get extra air over to your slots.
Of course I wasn't patient and switched everything around yesterday as suggested, turning the current cooler 90 degrees. The HBA is now in the PCIe x16 slot.
I'm very happy to report that last night the scrub then ran through, this time for real, without any disks getting removed.
I've celebrated prematurely before, so before announcing success I'll run concurrent SMART long tests and then reboot all apps, but I'm mildly optimistic.
Caution is the word of the day.
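For the concurrent SMART long tests mentioned above, here is a minimal sketch of how they could be queued on several disks at once, assuming smartmontools is installed; the device paths are placeholders and need to be adapted (the drives run the test internally, so starting them back to back is effectively concurrent):

```python
# Minimal sketch: queue SMART extended (long) self-tests on several disks.
# Assumes smartmontools is installed; the device list below is a placeholder.
import subprocess

DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # adjust to your system

for disk in DISKS:
    # '-t long' asks the drive to start an extended self-test; smartctl
    # prints the estimated completion time when the test is accepted.
    result = subprocess.run(["smartctl", "-t", "long", disk],
                            capture_output=True, text=True)
    print(f"--- {disk} ---")
    print(result.stdout.strip() or result.stderr.strip())
```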
I don't expect PCIe x8 vs x16 to be an issue except for throughput performance; otherwise, a PCIe x16 slot is expected to support any lane width down to PCIe x1, and ZFS shouldn't be any wiser.
It is possible that having to shuffle cables around allowed connector contacts to be reseated, something that seems to be overlooked in server interconnects.
It is also possible the CPU cooler was causing the CPU to hit high temperatures. Did you clean up and reapply the thermal compound between the CPU and heatsink?
If your HBA issue was due to overheating, then I would put the fault on the company that manufactures those cards.
Anyway, I would suggest you keep a close eye on your system, and as long as you have proper backup/replication in place, I wouldn't mind going through a series of scrubs just to understand if the fix is really a fix or just a fluke.
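A rough sketch of that "series of scrubs" idea, assuming a pool called tank (a placeholder) and root privileges: start a scrub, wait for it to finish, then print the health summary, a few times in a row.

```python
# Sketch: run a few consecutive scrubs and report pool health after each one.
# Pool name and polling interval are assumptions; requires appropriate privileges.
import subprocess, time

POOL = "tank"  # assumed pool name

def scrub_once(pool: str) -> None:
    subprocess.run(["zpool", "scrub", pool], check=True)
    # Poll 'zpool status' until the scrub no longer reports as in progress.
    while True:
        status = subprocess.run(["zpool", "status", pool],
                                capture_output=True, text=True).stdout
        if "scrub in progress" not in status:
            break
        time.sleep(60)
    # '-x' prints only unhealthy pools, so an all-clear is easy to spot.
    health = subprocess.run(["zpool", "status", "-x", pool],
                            capture_output=True, text=True).stdout
    print(health.strip())

for _ in range(3):  # a few back-to-back scrubs to see if the fix holds
    scrub_once(POOL)
```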
He was using a PCIe x8 card in a PCIe x4 slot - I think we have reason to be optimistic now.
This doesn't change how the backward compatibility of PCIe works. If not x16, then we still have x8, which should support down to x1 without reliability issues, except for the target throughput performance.
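For what it's worth, on Linux the negotiated link doesn't have to be guessed: the kernel exposes the current and maximum PCIe link width and speed in sysfs. A small sketch, assuming the (placeholder) PCI address below is the HBA; the real address can be found with lspci.

```python
# Sketch: read the negotiated PCIe link width/speed of a device from sysfs.
from pathlib import Path

HBA_ADDR = "0000:01:00.0"  # assumed PCI address of the HBA (check with lspci)
dev = Path("/sys/bus/pci/devices") / HBA_ADDR

for attr in ("current_link_width", "max_link_width",
             "current_link_speed", "max_link_speed"):
    f = dev / attr
    if f.exists():
        print(f"{attr}: {f.read_text().strip()}")
```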
Wouldn't throughput be a possible issue when we have the HBA close to fully loaded with only half of its requested bandwidth & a scrub task running? I could imagine a world where drives time out due to limited throughput, causing ZFS to panic, instead of the HBA throttling drive speed to match the available bandwidth.
That being said, I agree that further testing will be necessary if nothing else.
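A quick back-of-the-envelope sketch of that bandwidth question; every number in it is an assumption (PCIe generation, negotiated lane count, drive count, per-drive throughput), so treat it as a way to sanity-check headroom rather than a statement about this particular setup.

```python
# Sketch: compare rough PCIe link bandwidth against aggregate scrub demand.
PCIE_GBPS_PER_LANE = {2: 0.5, 3: 0.985, 4: 1.969}  # approx. usable GB/s per lane

gen, lanes = 3, 4              # assumption: an x8 card negotiated down to x4 in a Gen3 slot
drives, mb_per_drive = 8, 250  # assumption: HDDs streaming sequentially during a scrub

link_gbps = PCIE_GBPS_PER_LANE[gen] * lanes
demand_gbps = drives * mb_per_drive / 1000

print(f"link: ~{link_gbps:.1f} GB/s, scrub demand: ~{demand_gbps:.1f} GB/s")
print("link looks like the bottleneck" if demand_gbps > link_gbps
      else "plenty of headroom; the link itself shouldn't starve the drives")
```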
I have no idea what PCIe is supposed to support in such a case. Intuitively it makes sense to me that it wouldn't be happy about the situation. On the other hand, of course, it would've been nicer to just have it not work at all in a "hidden" PCIe x4 instead of giving random errors. Maybe that's exactly your point - that it's still supposed to work, so it had no reason to "complain".
It is possible that having to shuffle cables around allowed connector contacts to be reseated, something that seems to be overlooked in server interconnects.
The only cable change since the last time it broke on me during every scrub was unplugging the Mini SAS side of the cables when moving the HBA. That could've been it, sure. All the other cable changes I had already gone through before.
It is also possible the CPU cooler was causing the CPU to hit high temperatures. Did you clean up and reapply the thermal compound between the CPU and heatsink?
I'm confident this isn't the issue. I've always been monitoring CPU temperature and it's far from critical in any way. This cooler could probably run the CPU passively if I really wanted it to. With the fan going, it's just 5 degrees or so above ambient. Even at very high loads I have never seen it exceed 50 degrees (or else all the temperature sensors in the CPU report completely bogus numbers).
The thermal paste still looked to be in very good condition, and since I never had heat issues, I actually didn't replace it.
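As an aside on that kind of temperature monitoring, here is a minimal sketch of reading the same sensor data from Python with psutil (assuming it is installed; sensors_temperatures() is Linux-only and depends on which hwmon sensors the board exposes).

```python
# Sketch: dump whatever temperature sensors the kernel exposes, via psutil.
import psutil

temps = psutil.sensors_temperatures()
for chip, readings in temps.items():
    for r in readings:
        label = r.label or chip
        crit = f" (critical {r.critical}°C)" if r.critical else ""
        print(f"{label}: {r.current}°C{crit}")
```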
If your HBA issue was due to overheating, then I would put the fault on the company that manufactures those cards.
Well, that depends on whether they are supposed to work in a non-server chassis with much less airflow. It could've been completely unrelated to heat, since adding the fan in the end didn't change much - I was overconfident there.
Anyway, I would suggest you keep a close eye on your system, and as long as you have proper backup/replication in place, I wouldn't mind going through a series of scrubs just to understand if the fix is really a fix or just a fluke.
Absolutely. Still nothing so far, with the SMART extended tests at 60%. Whatever changed, it at least made the system a lot more stable. Previously it would've been long gone by now.
What I still don't understand is why it is/was always the same disk that got kicked out (maybe the cable contact could explain that), and why this issue only occurred once before (checking my logs, about 6 months ago) while now I could get it 100% of the time a scrub was triggered (bandwidth, now that the pool has reached a certain size??).
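For watching those extended tests tick along, a small sketch that pulls the "Self-test execution status" block out of smartctl -c; the device names are placeholders and smartmontools is assumed to be installed.

```python
# Sketch: report SMART self-test progress for a list of disks.
import subprocess

DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # placeholder device list

for disk in DISKS:
    out = subprocess.run(["smartctl", "-c", disk],
                         capture_output=True, text=True).stdout
    lines = out.splitlines()
    for i, line in enumerate(lines):
        if "Self-test execution status" in line:
            # The status text usually wraps onto the following indented lines.
            status = " ".join(l.strip() for l in lines[i:i + 3])
            print(f"{disk}: {status}")
            break
```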
I don't believe it to be a throughput issue.
Fair - maybe the HBA just needed to be reseated then & the "solution" so far just kinda worked out, as it happened to re-slot the card.
I have little news to report, which is good news in this case, I'd say.
I have since
No more issues during all of this.
Thanks a lot to everyone for contributing their time to help me debug this issue, I really appreciate it
Congratulations.
On a cautionary note, try not to force a reboot at the first sign of trouble. While ZFS is fairly resilient, a forced reboot could make things worse.
Remember to mark the topic as solved!