Drives start timing out every 90 minutes

Hi there,

I have an issue that has been plaguing me for weeks and that I’ve been trying to solve with the help of the ServeTheHome forums. I’m hoping someone here who is more familiar with TrueNAS can help figure out what’s going on. I apologize in advance for the wall of text; I hope it isn’t too rambly.

My problem presents itself like this: every so often, I start getting timeouts on my disks that cause the entire array to basically grind to a halt for a few minutes. Then it recovers and all seems fine until it happens again. It usually looks something like this (a small snippet taken from dmesg from the latest occurrence), and it affects all 8 of my drives at random.

[86899.065541] sd 2:0:4:0: attempting task abort!scmd(0x00000000f412cd61), outstanding for 31192 ms & timeout 30000 ms
[86899.065550] sd 2:0:4:0: [sdg] tag#2548 CDB: Read(16) 88 00 00 00 00 00 46 4e 8a 18 00 00 07 e8 00 00
[86899.065553] scsi target2:0:4: handle(0x001c), sas_address(0x3474f4f584b3480c), phy(12)
[86899.065557] scsi target2:0:4: enclosure logical id(0x3474f4f584b3483e), slot(12)
[86899.065560] scsi target2:0:4: enclosure level(0x0000), connector name( C0  )
[86899.491231] sd 2:0:4:0: task abort: SUCCESS scmd(0x00000000f412cd61)
[86899.826771] sd 2:0:4:0: Power-on or device reset occurred
[86931.840247] sd 2:0:4:0: attempting task abort!scmd(0x000000007be636df), outstanding for 32000 ms & timeout 30000 ms
[86931.840261] sd 2:0:4:0: [sdg] tag#2499 CDB: Read(16) 88 00 00 00 00 00 46 4e 92 00 00 00 07 e8 00 00
[86931.840268] scsi target2:0:4: handle(0x001c), sas_address(0x3474f4f584b3480c), phy(12)
[86931.840275] scsi target2:0:4: enclosure logical id(0x3474f4f584b3483e), slot(12)
[86931.840281] scsi target2:0:4: enclosure level(0x0000), connector name( C0  )
[86932.239236] sd 2:0:4:0: task abort: SUCCESS scmd(0x000000007be636df)
[86932.572170] sd 2:0:4:0: Power-on or device reset occurred
[86964.606651] sd 2:0:0:0: attempting task abort!scmd(0x00000000f0b15c9f), outstanding for 31188 ms & timeout 30000 ms
[86964.606665] sd 2:0:0:0: [sdc] tag#2558 CDB: Write(16) 8a 00 00 00 00 06 db d6 86 98 00 00 00 20 00 00
[86964.606670] scsi target2:0:0: handle(0x0018), sas_address(0x3474f4f584b34808), phy(8)
[86964.606677] scsi target2:0:0: enclosure logical id(0x3474f4f584b3483e), slot(8)
[86964.606682] scsi target2:0:0: enclosure level(0x0000), connector name( C0  )
[86964.987181] sd 2:0:0:0: task abort: SUCCESS scmd(0x00000000f0b15c9f)
[86964.987193] sd 2:0:0:0: attempting task abort!scmd(0x000000005dc3fd26), outstanding for 31604 ms & timeout 30000 ms
[86964.987198] sd 2:0:0:0: [sdc] tag#2497 CDB: Write(16) 8a 00 00 00 00 06 db d6 86 c0 00 00 00 50 00 00
[86964.987200] scsi target2:0:0: handle(0x0018), sas_address(0x3474f4f584b34808), phy(8)
[86964.987204] scsi target2:0:0: enclosure logical id(0x3474f4f584b3483e), slot(8)
[86964.987207] scsi target2:0:0: enclosure level(0x0000), connector name( C0  )
[86964.987210] sd 2:0:0:0: No reference found at driver, assuming scmd(0x000000005dc3fd26) might have completed
[86964.987212] sd 2:0:0:0: task abort: SUCCESS scmd(0x000000005dc3fd26)
[86965.446238] sd 2:0:0:0: Power-on or device reset occurred

The big weird thing is the pattern to this issue. I can go a day or two completely error-free, or it can happen a few times a day. The kicker is that if you look at the time each episode starts, it falls EXACTLY (and I do mean exactly) on an interval of 90 minutes (5400 seconds) from when the system was last booted. For example, it might happen at 4.5 hrs, 12 hrs, or 25.5 hrs since the last boot, but literally NEVER at a time that isn’t a multiple of 90 minutes. Not every multiple, but always a multiple.
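
To convince myself I wasn’t imagining the pattern, I’ve been pulling the abort events out of dmesg and checking each one’s offset inside its 90-minute window. This is just a rough sketch of what I run - it assumes dmesg’s default seconds-since-boot timestamps and the “task abort” lines like the ones above:

# List each abort event's seconds-since-boot, which 90-minute interval it falls in,
# and its offset within that interval.
dmesg | grep -i 'attempting task abort' \
  | sed -E 's/^\[ *([0-9]+)\..*/\1/' \
  | awk '{ printf "t=%ss  interval=%d  offset=%ds\n", $1, int($1/5400), $1%5400 }'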

Yesterday I actually had a drive fail (I dropped one on concrete while moving drives around and it lasted a few days before giving up, RIP, and not a good time to be needing to shop for a new spare …), so I swapped it out for a replacement and started the resilvering. During the resilver, the errors kicked off on EVERY 90-minute interval from when I booted the system. Each episode lasts about 5-10 minutes and then everything is completely error-free for the next 80-85 minutes. But it was resilvering the entire time, so it’s not like the load was higher at some points than others - the load was high for 24 hours straight and the issue still only happened on the 90-minute intervals.

My fundamental question for you all is: is there some scheduled task in TrueNAS that runs on a 90-minute interval? I have tried to use journalctl to find something, but I can’t find anything kicking off every 90 minutes, and I’m not sure where else to look - or even what could cause something like this.
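
For reference, this is roughly what I’ve been running to hunt for anything periodic (nothing on a 90-minute schedule jumps out, but maybe I’m looking in the wrong places - the times in the journalctl line are just placeholders for one of the episodes):

# Anything scheduled by systemd or cron?
systemctl list-timers --all
cat /etc/crontab
ls /etc/cron.d/ /etc/cron.hourly/ 2>/dev/null
# What was the system doing right around one of the 90-minute marks?
journalctl --since "18:30" --until "18:45" --no-pager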

Okay, onto some specifics of my setup. It’s an AMD EPYC 7302P on a Supermicro H11SSL-i motherboard with 64GB of RAM, in a Gooxi 36-bay chassis that has two backplanes: a 24-bay front one and a 12-bay rear one. I have an LSI 9400-16i HBA (it’s legit, not a ‘knock-off’, with the latest firmware/BIOS) connected with a single cable to the front backplane, and 8x 20TB recertified EXOS drives in RAIDZ2 in the front of the chassis. The cable is brand new, purchased specifically to rule it out, so I know it’s not a cable issue (I’ve tried many different ones).

I’m running Proxmox on that with TrueNAS SCALE (25.10) virtualized. The HBA is passed through as a PCIe device, so I have access to the raw drives from the TrueNAS VM. It all works GREAT except for these recurring errors.

The folks over at ServeTheHome suggested a number of things, including removing Proxmox from the equation, so I did: I installed bare-metal onto a spare SSD and the results are the same. So I’m reasonably confident it’s not a Proxmox/virtualization issue.

I’ve also tried all combinations of HBA ports and backplane ports - it makes no difference. The only thing that DID make a difference was that I could not reproduce the problem if I put all the drives in the rear backplane. With the drives there, the problem disappeared; put them back in the front one and the problem returns. That led me down the path of the front backplane being faulty. Well, I got a second one of these chassis, swapped the backplanes between them, and the problem is identical. That reminds me: I also initially thought the problem was 100% reproducible during boot, but I learned the other day that this isn’t always true - I have been able to boot without any errors and then have them show up later (on a multiple of 90 minutes from boot).

At this point I’ve narrowed it down to:

  1. Bad HBA. I have a 9300-8i on the way to test with to eliminate that as the problem.
  2. Backplane firmware (but only the front backplane; the rear one is fine).
  3. Power issue? I haven’t yet swapped power supplies, but plan to.

If the 90-minute pattern didn’t exist I would say any of these could be likely. But how on earth could a power issue show up only at exact 90-minute intervals? Seems unlikely to me. The same goes for the HBA and backplane firmware - unless they have some sort of 90-minute timer in them that does something? Or is TrueNAS doing something to trigger this?

Does anyone have any ideas for what could be causing this? Thanks for your help in advance.

Which power supply (and what model) is driving the motherboard and backplane?

Hi Mike,

It’s a Gooxi server chassis with redundant 1300W power supplies (hot-swappable type). I normally use only one of the supplies and leave the second bay empty, but I’ve tried using both, and swapping them, and it doesn’t make a difference. Last night I also swapped in the complete PSU assembly from the second chassis so that all of the PSU wiring was replaced - also no difference. My peak load is only on the order of 400W, and the issue doesn’t seem to correlate with actual high system load; for example, while resilvering the load was constant for 24 hours, yet the problem only occurred on the 90-minute intervals.

I think you have summarised the issue very well. This is definitely what I would try first.

Yeah, I’m hoping that’s all it is - the replacement HBA should be here next week (I actually ordered two of them), so I’m crossing my fingers that it resolves the issue and I don’t have to think about this anymore. Lol.

At least it doesn’t seem to be affecting the integrity of my pool or data.

Yeah, you should be fine on PSU consumption, thanks for validating.

Nothing obvious in the X11 assertion logs? The HBA swap is a good idea. Hey, now you can split ports even more! :upside_down_face:

Forgive me, I’m new to TrueNAS. What exactly are X11 assertion logs and where would I find them?

You would access your X11SSL motherboard’s IPMI management interface via the dedicated RJ45 network port. How did you initially set up your server hardware before the TrueNAS install?

Oh, that X11. Yes, I have poked around in the IPMI. I’m not seeing anything called an assertion log, although the health log mentions assertion events. It’s all old entries though, from when I was messing with my fans (I swapped them out for some Noctuas and had to override the speed thresholds).

But the motherboard BIOS wasn’t something I had thought about, so maybe there’s something there to look at. I’m currently on 2.6a from 2023, but there is a much newer 3.4 from last July. I’ll add that to the list of things to try.

Alrighty, I have some updates:

  1. I updated the H11SSL-i BIOS to the latest version (3.4). No change to the problem.
  2. Moved my production drives to the rear backplane, added 8x 14TB drives to the front backplane, and made a new pool for testing. The problem now happens with the test pool and not the production one.
  3. I got the new HBA - it’s a Supermicro AOC-SAS3008-L8e that came with firmware 12.00.02.00-IT - and connected the front backplane to it (leaving the rear backplane connected to the 9400-16i). The problem remains the same.
  4. Updated the SAS3008 to the latest Supermicro firmware (16.00.14.00-IT) - no change to the issue.
  5. Cross-flashed it to the special 9300-8i firmware for TrueNAS (16.00.12.00-IT) - no change to the issue.

At this point the system regularly boots without the issue appearing. To reproduce it, I just start a large copy from my production pool to the test pool and let it run; it copies at about 550MB/s. Without fail, the error pops up almost exactly 5620 seconds after boot (as reported by dmesg timestamps) and then repeats every 90 minutes (5400 seconds) while the copy is happening. There are NO errors being reported for the rear backplane, its drives/pool, or the 9400-16i they’re attached to. The problem seems entirely isolated to the front backplane, currently connected to the SAS3008 (flashed to 9300-8i at the moment).
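
In case anyone wants to watch it happen, this is roughly how I catch the episodes while the copy runs (just a sketch; the grep pattern simply matches the “task abort”/reset lines shown earlier):

# Follow the kernel log with wall-clock timestamps and flag the aborts/resets as they happen.
dmesg -Tw | grep --line-buffered -Ei 'task abort|device reset'
# In a second shell, keep the seconds-since-boot counter visible so the 5400 s spacing is obvious.
watch -n 10 'cut -d" " -f1 /proc/uptime'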

I think I’ve now ruled out everything other than the front backplane itself (something fundamentally incompatible) and TrueNAS. I guess the thing to try next is a different operating system, and if the problem continues then it has to be the backplane?


I had a similar but different issue with a Supermicro 45-bay JBOD many years ago. Drives on the front worked great, but when I went to replace a drive round the back the system would throw a wobbly.

The original wiring method was HBA input into Pri J0 and Sec J0, with the cascade cable to the rear going out from Pri J1 and Sec J1. After moving the cascade cable from J1 to J2 on both Pri and Sec, everything worked great.

I’ve already tried all the different combinations of connectors on the backplanes and HBA. Currently I’m not cascading the backplanes at all - I just have a single cable from backplane port 0 to the HBA. I’ve tried all the different ports, with new cables, and it makes no difference. The rear backplane is connected the same way (directly to the HBA) and doesn’t have any issues. Oddly enough, it also didn’t seem to have any issues when cascaded through the problematic backplane, but I’m not confident I tested that thoroughly enough to say so for sure.

The seller of this chassis has the HBA he was using with it previously (an LSI 9311-8i) and he’s going to let me borrow it to test with. It’s not clear whether he was using TrueNAS or ZFS, but he claims he never had any issues (and he was running 100+ of these systems with the same configuration).

I’m really close to just scrapping the chassis and picking up a Supermicro 847. Or maybe I could retrofit a Supermicro backplane into this chassis … maybe that would be worth attempting before spending a ton on shipping a chassis to me.

The thing I just can’t wrap my head around is the 90-minute interval, and how everything runs flawlessly in between. Is it possible the backplane has some sort of timer in it that is causing this? If not, then it must be TrueNAS doing something every 90 minutes to trigger it - but I can’t find anything in TrueNAS that runs on that kind of schedule.

Hey Johnny,

I have a related question for you based on your system specs. I’m setting up a 24-disk JBOD array in a NetApp DS4246 and planned on using the LSI 9300-8e SAS HBA (flashed to 16.00.12), but I was reading that the 16.00.12 firmware has issues with large-capacity drives (particularly the Seagate Exos), at least in the 9300-8i variant (for which H3C has released a special 16.00.16 firmware). My drives are all WD, but I nonetheless wanted to use the latest and greatest firmware, and I couldn’t find the 16.00.16 firmware for the 9300-8e (only for the 9300-8i). Can I ask what firmware you’re running on your LSI 9300-8e and whether you’ve had any issues with the 24TB drives in your pool?

Thanks so much in advance for your help.

Hi, I’m running the FW 16.00.12.00 mentioned here: LSI 9300-xx Firmware Update | TrueNAS Community

Where did you hear this FW had issues with larger drives?

I’ve had a lot of issues with these drives, to be honest, but it seems to be a bad batch and not HBA-related.

Hey everyone, I’ve made some progress on this. No clear solution yet but I’ve figured out how to reproduce the issue and have isolated the problem.

TL;DR: The smartctl -l scterc /dev/sdx and -l scttemp commands trigger the issue, and TrueNAS 25.10 seems to be calling them (or a command that includes these parameters) every 90 minutes. This shouldn’t normally be a problem, but something about my front backplane is not happy with these commands while the drives are under load.

  • Since the last update, I contacted the seller of this chassis again and asked what HBA they used with it. Turns out they used a pair of LSI 9311-8i and they let me borrow them for some testing. Unfortunately the problem is the same with them. All signs keep pointing to this being a backplane and/or software issue so I decided to dig into that a bit more.
  • I started by moving my test pool to XigmaNAS and was pleasantly surprised when the issue disappeared entirely. I was able to write to the test pool continuously for more than 24 hours without a single timeout or error.
  • Then I tried TrueNAS 25.04 and was stoked to see the problem also vanish there (another 24 hours and 30TB of data written without any timeouts or errors).
  • Then I upgraded that 25.04 install to 25.10 and the problem returned immediately (during boot and every 90 minutes after). My conclusion at this point is that the OS is definitely what triggers the problem, but the underlying fault might still be the backplane.
  • Changing the boot environment back to 25.04 caused the problem to go away again.
  • I decided to try OpenMediaVault and also had no issues there, initially.
  • While I was poking around in OMV, I came upon the SMART section where you can view the SMART status of the drives, and I discovered that as soon as I tried to view the SMART status it would trigger the timeout - every time, for the drive I was querying, as long as writes were taking place. If I paused my testing and let the drives go idle, I could query the SMART status with no errors or timeouts.
  • I dug into this a bit and learned that the SMART status page just calls smartctl -x /dev/sdx, and sure enough, if I run that same command from the terminal it triggers the problem almost every single time (maybe 1 in 10 times it completes without errors or timeouts).
  • If I call smartctl -a /dev/sdx it does not trigger the timeouts.
  • Looking into the difference between -a and -x and querying the individual components of the command, I learned that it’s the -l scttemp and -l scterc parameters that are actually causing the issue (a rough sketch of how I bisected it is just below, followed by idle vs. loaded output).
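
This is roughly how I narrowed it down - running some of the individual queries that -x bundles, one at a time, against a loaded front-backplane drive and watching which ones stall (just a sketch; /dev/sdx is a placeholder for whichever drive is being written to):

# Run each sub-query individually and time it.
# For me, only the -l scttemp and -l scterc ones hang while the drive is loaded.
for opt in "-i" "-H" "-A" "-l error" "-l selftest" "-l scttemp" "-l scterc"; do
  echo "== smartctl $opt =="
  time smartctl $opt /dev/sdx > /dev/null
done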

IDLE drives (commands complete instantly and no timeouts in dmesg):

root@nas04-omv:~# smartctl -l scttemp /dev/sdb
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.17.13-2-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    46 Celsius
Power Cycle Min/Max Temperature:     35/47 Celsius
Lifetime    Min/Max Temperature:     17/50 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)
Minimum supported ERC Time Limit:    65 (6.5 seconds)

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (16)

Index    Estimated Time   Temperature Celsius
  17    2026-04-03 18:38    46  ***************************
...    ..(126 skipped).    ..  ***************************
  16    2026-04-03 20:45    46  ***************************

root@nas04-omv:~# smartctl -l scterc /dev/sdb
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.17.13-2-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Loaded drives (commands stall for a while and trigger timeouts in dmesg):

root@nas04-omv:~# smartctl -l scttemp /dev/sdb
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.17.13-2-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    45 Celsius
Power Cycle Min/Max Temperature:     35/47 Celsius
Lifetime    Min/Max Temperature:     17/50 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)
Minimum supported ERC Time Limit:    65 (6.5 seconds)

Read SCT Data Table failed: Input/output error
Read SCT Temperature History failed

root@nas04-omv:~# smartctl -l scterc /dev/sdb
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.17.13-2-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Read SCT Status failed: Input/output error
SCT (Get) Error Recovery Control command failed

  • At this point I wanted to reproduce this on TrueNAS, so I switched back to TN 25.10 and tried the same commands. As you would expect, they have the same effect there, triggering the issue. So in TN 25.10 I can manually trigger the issue with these commands, and it also does it automatically every 90 minutes (except worse, because I think it’s running some sort of health check on all drives, so they all time out around the same time and cause chaos).
  • Then I switched to 25.04 and found that the command also triggers the timeouts there when I run it manually - it just doesn’t happen automatically every 90 minutes like it does on 25.10. I understand there were a lot of changes in 25.10 relating to SMART, so I guess that makes sense.
  • I then tried to reproduce the issue on the rear backplane (which has been running my production pool on 25.10 this entire time without a single issue in weeks) and was unable to do so. Under the same test conditions (long sustained writes at maximum throughput, about 400MB/s, to both an 8-drive RAIDZ2 pool and a single-drive pool), I called the problematic commands over and over and could not get them to fail once. So the rear backplane is somehow immune to the issue.
  • I also tried with less of a load, writing over the network from a slow USB drive, and observed that the system seems to cache the writes for a while (no writes to disk) and then flush them in a short, quick burst. If I run the command while it’s caching there are no timeouts, but if I time it for when it’s actually writing, it times out. So I think this is why, under normal everyday conditions, it was only happening occasionally - it also needs an active load on the drives to occur. (The small repro loop I’ve been using is below.)
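
For anyone who wants to reproduce this, here’s the rough loop I run while a big copy to the test pool is in flight (sketch only; /dev/sdb happens to be my test drive, adjust as needed):

# Issue the two SCT queries against one front-backplane drive every 60 s and log
# how long each takes plus its exit status. Under load, the scttemp/scterc calls
# stall for ~30 s and line up with the task aborts in dmesg.
while true; do
  for q in scttemp scterc; do
    start=$(date +%s)
    smartctl -l $q /dev/sdb > /dev/null 2>&1
    rc=$?
    echo "$(date '+%F %T')  -l $q  rc=$rc  took $(( $(date +%s) - start ))s"
  done
  sleep 60
done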

Conclusions:

  • It is the SMART/SCT health checking that triggers the problem, and 25.10 seems to run it automatically every 90 minutes.
  • For some reason I don’t understand, my front backplane has the issue and the rear one does not.

Options to solve:

  • Maybe there is a firmware update for the backplane that solves this issue? I have been unable to get in touch with their support yet though so I am not too optimistic about that.
  • Buy 2x more of these chassis for $200 ($100 each), extract the rear backplanes from them and retrofit them into the front backplane location.
  • Buy a Supermicro backplane (similar cost to previous option) and try to retrofit it into the case - might not fit right though and I’m not sure if drive sleds and LEDs will align properly.
  • Buy a full Supermicro replacement chassis (~$1k CAD to get one here to me in Canada though)
  • Stick with 25.04 for now and hope that I can figure out how to fix or disable the problem in 25.10 (or later)? I did see some data corruption at one point, though, so I don’t want to settle on a solution that carries that kind of risk.

At this point I think I’m going to continue to try to get a firmware update, and if that fails then I’ll buy 2x more of the chassis and retrofit in the smaller backplanes.

It was a Reddit post (I can’t post links, but the title is: “Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop”).

It’s a report related to the 9300-8i (for which there is allegedly a 16.00.16 OOB firmware fix), but I can’t find confirmation that the issue also exists for the 9300-8e, and I can’t find any firmware updates beyond the 16.00.12 you referenced.


Thanks for raising this. I’ll certainly take a look, but at the moment I haven’t linked any issues with 16.00.12.00 to the 24TB SAS Seagate EXOS drives. However, based on recent experience at scale, I’d suggest you consider WD instead.

Awesome, and glad to hear it. I’m about to spin up a pool of 20TB WDs using the 16.00.12 FW on the 9300-8e, and I’ll report back if I run into any issues. If you wouldn’t mind letting me know if you end up hitting any issues yourself, I’d greatly appreciate it - I’ll do the same.
