Hi there,
I have an issue that has been plaguing me for weeks now that I’ve been trying to solve with the help of ServeTheHome forums. I’m hoping someone here who might be more familiar with TrueNAS can help figure out what’s going on. I apologize in advance for the wall of text. I hope it isn’t too rambly.
My problem presents itself like this: every so often, I start getting timeouts on my disks that cause the entire array to basically grind to a halt for a few minutes. Then it recovers and seems like all is fine until it happens again. It affects all drives (I have 8) randomly. The issue usually looks something like this (this is a small snippet taken from dmesg from the latest occurrence):
[86899.065541] sd 2:0:4:0: attempting task abort!scmd(0x00000000f412cd61), outstanding for 31192 ms & timeout 30000 ms
[86899.065550] sd 2:0:4:0: [sdg] tag#2548 CDB: Read(16) 88 00 00 00 00 00 46 4e 8a 18 00 00 07 e8 00 00
[86899.065553] scsi target2:0:4: handle(0x001c), sas_address(0x3474f4f584b3480c), phy(12)
[86899.065557] scsi target2:0:4: enclosure logical id(0x3474f4f584b3483e), slot(12)
[86899.065560] scsi target2:0:4: enclosure level(0x0000), connector name( C0 )
[86899.491231] sd 2:0:4:0: task abort: SUCCESS scmd(0x00000000f412cd61)
[86899.826771] sd 2:0:4:0: Power-on or device reset occurred
[86931.840247] sd 2:0:4:0: attempting task abort!scmd(0x000000007be636df), outstanding for 32000 ms & timeout 30000 ms
[86931.840261] sd 2:0:4:0: [sdg] tag#2499 CDB: Read(16) 88 00 00 00 00 00 46 4e 92 00 00 00 07 e8 00 00
[86931.840268] scsi target2:0:4: handle(0x001c), sas_address(0x3474f4f584b3480c), phy(12)
[86931.840275] scsi target2:0:4: enclosure logical id(0x3474f4f584b3483e), slot(12)
[86931.840281] scsi target2:0:4: enclosure level(0x0000), connector name( C0 )
[86932.239236] sd 2:0:4:0: task abort: SUCCESS scmd(0x000000007be636df)
[86932.572170] sd 2:0:4:0: Power-on or device reset occurred
[86964.606651] sd 2:0:0:0: attempting task abort!scmd(0x00000000f0b15c9f), outstanding for 31188 ms & timeout 30000 ms
[86964.606665] sd 2:0:0:0: [sdc] tag#2558 CDB: Write(16) 8a 00 00 00 00 06 db d6 86 98 00 00 00 20 00 00
[86964.606670] scsi target2:0:0: handle(0x0018), sas_address(0x3474f4f584b34808), phy(8)
[86964.606677] scsi target2:0:0: enclosure logical id(0x3474f4f584b3483e), slot(8)
[86964.606682] scsi target2:0:0: enclosure level(0x0000), connector name( C0 )
[86964.987181] sd 2:0:0:0: task abort: SUCCESS scmd(0x00000000f0b15c9f)
[86964.987193] sd 2:0:0:0: attempting task abort!scmd(0x000000005dc3fd26), outstanding for 31604 ms & timeout 30000 ms
[86964.987198] sd 2:0:0:0: [sdc] tag#2497 CDB: Write(16) 8a 00 00 00 00 06 db d6 86 c0 00 00 00 50 00 00
[86964.987200] scsi target2:0:0: handle(0x0018), sas_address(0x3474f4f584b34808), phy(8)
[86964.987204] scsi target2:0:0: enclosure logical id(0x3474f4f584b3483e), slot(8)
[86964.987207] scsi target2:0:0: enclosure level(0x0000), connector name( C0 )
[86964.987210] sd 2:0:0:0: No reference found at driver, assuming scmd(0x000000005dc3fd26) might have completed
[86964.987212] sd 2:0:0:0: task abort: SUCCESS scmd(0x000000005dc3fd26)
[86965.446238] sd 2:0:0:0: Power-on or device reset occurred
The big weird thing is the pattern to this issue. I can go a day or two completely error free, or it can happen a few times a day. The kicker is, if you look at the time each episode starts, it falls on EXACTLY (and I do mean exactly) a multiple of 90 minutes (5400 seconds) from when the system was last booted. For example, it might happen at 4.5 hrs, 12 hrs, 25.5 hrs since the last boot. But literally NEVER at any other time. It is ALWAYS a multiple of 90 minutes since the last boot. Not every multiple, but always a multiple.
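For anyone who wants to check the same arithmetic against their own dmesg output, here is a rough Python sketch. It assumes the default dmesg format with a leading `[seconds.microseconds]` uptime stamp and only looks at the "attempting task abort" lines; the function name `abort_offsets` is just made up for illustration. Keep in mind that the snippet above is a small excerpt from the middle of an episode, so the offsets it prints are not the episode start, and each abort is logged roughly 30 seconds after the command was issued.

```python
import re

PERIOD_S = 5400  # 90 minutes, the interval described in the post

# Matches the leading "[seconds.micros]" uptime stamp that dmesg prints.
STAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")

def abort_offsets(lines, period=PERIOD_S):
    """Return (uptime, multiple, seconds_past_multiple) for each task-abort line."""
    results = []
    for line in lines:
        if "attempting task abort" not in line:
            continue
        match = STAMP_RE.match(line)
        if not match:
            continue  # line has no uptime stamp, skip it
        uptime = float(match.group(1))
        multiple = int(uptime // period)  # most recent 90-minute mark
        results.append((uptime, multiple, uptime - multiple * period))
    return results

# Three of the abort lines from the dmesg snippet above.
sample = [
    "[86899.065541] sd 2:0:4:0: attempting task abort!scmd(0x00000000f412cd61), outstanding for 31192 ms & timeout 30000 ms",
    "[86931.840247] sd 2:0:4:0: attempting task abort!scmd(0x000000007be636df), outstanding for 32000 ms & timeout 30000 ms",
    "[86964.606651] sd 2:0:0:0: attempting task abort!scmd(0x00000000f0b15c9f), outstanding for 31188 ms & timeout 30000 ms",
]

for uptime, multiple, offset in abort_offsets(sample):
    print(f"{uptime:.1f}s -> {multiple} x 90 min + {offset:.1f}s")
```

On the sample lines, all three aborts land a little over 8 minutes past the 16th multiple (16 x 5400 = 86400 seconds, i.e. 24 hours of uptime), which fits an episode that starts on the multiple and runs for 5-10 minutes.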
Yesterday I actually had a drive fail (I dropped one on concrete while moving drives around and it lasted a few days before giving up, RIP and not a good time to be needing to shop for a new spare …) so I swapped it out for a replacement and started the resilvering. During resilvering, the errors kicked off exactly on EVERY 90 minute interval from when I booted the system. Each episode lasted about 5-10 minutes and was followed by a completely error-free 80-85 minutes. But it was resilvering the entire time, so it’s not like loads were higher at some point. They were high for 24 hours and the issue only happened on every 90 minute interval.
My fundamental question for you all is: is there some scheduled task in TrueNAS that runs on a 90 minute interval? I have tried to use journalctl to find something but I can’t find anything kicking off every 90 minutes and I’m not sure where else to look. Or even what could cause something like this.
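One way to hunt for a software culprit would be to export the journal as JSON and see which units log right after each 90-minute mark. Below is a minimal Python sketch of that idea, under some assumptions: it relies on the `__MONOTONIC_TIMESTAMP`, `_SYSTEMD_UNIT` and `SYSLOG_IDENTIFIER` fields that `journalctl -o json` emits, it treats the monotonic timestamp as "seconds since boot", and the 60-second window and the `sources_near_period` name are arbitrary choices for illustration. It can only find things that actually write to the journal, so a timer inside HBA or backplane firmware would not show up.

```python
import json
from collections import Counter

PERIOD_S = 5400  # 90 minutes
WINDOW_S = 60    # assumed tolerance: how soon after a multiple counts as a hit

def sources_near_period(json_lines, period=PERIOD_S, window=WINDOW_S):
    """Map each log source to (entries just after a multiple, total entries)."""
    total, near = Counter(), Counter()
    for raw in json_lines:
        entry = json.loads(raw)
        mono = entry.get("__MONOTONIC_TIMESTAMP")
        source = entry.get("_SYSTEMD_UNIT") or entry.get("SYSLOG_IDENTIFIER")
        if not isinstance(mono, str) or not isinstance(source, str):
            continue  # field missing or not a plain string
        uptime = int(mono) / 1_000_000  # the journal stores microseconds
        if uptime < period:
            continue  # skip boot-time logging before the first 90-minute mark
        total[source] += 1
        if uptime % period <= window:
            near[source] += 1
    # Only report sources that logged near at least one multiple.
    return {s: (near[s], total[s]) for s in total if near[s]}

# Made-up entries with hypothetical unit names, just to show the shape.
sample = [
    '{"__MONOTONIC_TIMESTAMP": "5401000000", "_SYSTEMD_UNIT": "example-a.service"}',
    '{"__MONOTONIC_TIMESTAMP": "10805000000", "_SYSTEMD_UNIT": "example-a.service"}',
    '{"__MONOTONIC_TIMESTAMP": "7000000000", "_SYSTEMD_UNIT": "example-b.service"}',
]
print(sources_near_period(sample))
```

To try it on a real system, the input could come from something like `journalctl -b -o json > journal.json`, with the open file passed in as `json_lines`. A source whose entries almost all fall inside the window would be the interesting one; chatty services will show up with a low hit ratio and can be ignored.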
Okay, onto some specifics of my setup. It’s an AMD Epyc 7302p in a Supermicro H11SSL-i motherboard with 64GB RAM in a Gooxi 36-bay chassis that has two backplanes, a 24-port front one and a 12-port rear one. I have an LSI 9400-16i HBA (it’s legit, not a ‘knock-off’, with latest firmware/BIOS) that I have connected with a single cable to the front backplane. 8x 20TB EXOS recertified drives in RAIDZ2 in the front of the chassis. The cable is brand new; I bought it specifically to rule out cabling, and I have tried many different ones, so I’m confident it’s not a cable issue.
I’m running Proxmox on that with TrueNAS SCALE (25.10) virtualized. The HBA is passed through as a PCIe device so I have access to the raw drives from the TrueNAS VM. It all works GREAT except for these recurring errors.
The folks over at ServeTheHome suggested a number of things, including removing Proxmox from the equation, so I did. I installed TrueNAS bare-metal on a spare SSD and the results are the same. So I’m reasonably confident it’s not a Proxmox/virtualization issue.
I’ve also tried all combinations of HBA ports and backplane ports - it makes no difference. The only thing that DID make a difference was that I could not reproduce the problem if I put all the drives in the rear backplane. If they were there, the problem disappeared. Put them in the front one, problem returns. That led me down the path of it being the backplane that was faulty. Well, I got a second of these chassis, swapped out the backplanes and the problem is identical. That reminds me that I also initially thought that the problem was 100% reproducible during boot, but I learned the other day that this isn’t always true - I have been able to boot without any errors but then have them show up later (on a multiple of 90 minutes from boot).
At this point I’ve narrowed it down to:
- Bad HBA. I have a 9300-8i on the way to test with to eliminate that as the problem.
- Backplane firmware is the problem (but only the front one, rear backplane is fine)
- Power issue? I haven’t yet swapped power supplies, but plan to.
If the 90 minute pattern didn’t exist I would say any of these could be likely. But how on earth could a power issue only be an issue exactly on 90 minute intervals? Seems unlikely to me. Same goes for HBA and backplane firmware though - unless they have some sort of 90 minute timer in them that does something? Or is TrueNAS doing something to trigger this?
Does anyone have any ideas for what could be causing this? Thanks for your help in advance.