ZFS Pool Degrading with LSI 9305_16e - Need ideas

Updated: After upgrading to RC2 AND putting what I think is a faulty disk into my pool, I immediately got multiple disks in hdd1pool degrading (a 6x16TB RaidZ2 pool) and nearly simultaneously the same on hdd2pool (also a 6x16TB RaidZ2 pool, attached to the same LSI card).

I think the fault is either my controller, or there is a software bug with the sas3 driver. I’m assuming the former, given the large number of people on here presumably using the sas3 driver seemingly without issues. But I must be very unlucky, because I purchased that card to replace a sas2 card that was exhibiting the exact same behaviour, and the assumption that the new card was good has led me down a long journey of painful fault finding.

Something else that could be a possibility is some incompatibility with the Threadripper CPU, I suppose, but I’m clutching at straws with that.

I HAVE been running the Electric Eel Beta for a while without major issue, skipped RC1, and updated to RC2 in the last few days. Given that for the most part this issue was present prior to Electric Eel, it’s probably not that, except it did seem to be ‘more’ in RC2, and the new drives don’t seem to show up in RC2 at all unless they’re already in the pool. Downgrading to Beta 1 seemed to let me rebuild my pools. Perhaps coincidence, and perhaps I need to retry that again sometime.

Hardware Installed
Previous system that exhibited the fault: Lenovo P700 with dual Xeon / 96GB, Dell Perc H310 card, external 8-bay SAS disk case fully populated, plus internally fully populated with 4 HDDs and 10 SSDs.

New system exhibiting the fault: Threadripper 2950X on an Asus X399 Prime-A motherboard with 128GB RAM, 9305 SAS card, 10 internal SSDs, 16 external HDDs (12 in use).

Fault finding so far:

  • Replaced SAS cables three times to cover different brands
  • Purchased new disk cabinets
  • Changed power supplies
  • Purchased all new disks (brand new IronWolf drives purchased yesterday are failing the replace procedure, and it seems to be specific to the slot they’re in and/or random)
  • Swapped whole computer as per above
  • Changed the controller card from a 9200_16e to the 9305_16e requiring all new cables
  • New case to get better cooling. The 9200 was getting hot, in this case the 9305 seems to be quite cool now and that’s the reason I got the 9305, newer design, lower wattage, and this model has a large heatsink
  • Updated the firmware on all the hard drives
  • Updated the firmware on the 9305 card (was 13, now 16, and both were the IT version)
  • Ran a memory test yesterday, which completed and passed
  • Swapped SAS cables to the internal SSD drives yesterday, as those also decided to spuriously fault, causing the entire pool to become unavailable (note this is not via the LSI controller). Generally though, the system has performed very well regarding all disks connected to its internal SATA ports (some Intel Datacentre SSDs, basically), other than some annoying “Enabling discard_zeros_data” messages and “hard resetting link” events dropping them to 1.5Gbps, which I can’t seem to get rid of (I thought it was TRIM - perhaps there’s still an auto trim turned on somewhere; see the quick check after this list). Nevertheless, that doesn’t seem to be causing my pools to degrade, and the SSDs never had an issue until yesterday.
  • Added extra case fans to push airflow through yesterday
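
(For reference, the auto-trim check I mean above is just the pool property - a quick sketch, with SSDpool being my internal SSD pool name:)

# show whether ZFS is issuing TRIM automatically on the SSD pool
zpool get autotrim SSDpool
# turn it off and see whether the discard_zeros_data / link-reset noise stops
zpool set autotrim=off SSDpool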

Closing Questions and Thoughts
Has anyone seen or had any experience with plugging in a faulty drive that is perhaps flooding error messages back to the card and causing other errors like this? Pure guesswork, but I could see a scenario like that happening.

Of course, for anyone to provide any real help, I should supply some logs from Beta 1 / RC2 to compare, but I do need to let the pools rebuild overnight before I’d be willing to give that a shot at present.

I do know I had LSI overheating issues in the past, but I don’t think I’m getting those with the new card and new case airflow, plus I’ve now added an extra fan at the front pushing air over the cards.

This is painful and troublesome and making me wish I just had a QNAP. It would have been cheaper, which is not what rolling my own is supposed to be like!

LSI overheating, perhaps? The unfortunate aspect of these HBAs is an airflow requirement that’s somewhat above that of typical desktop components…

That, or a faulty PSU in the drive shelf.

Thanks for your replies. I have updated the first post to be clearer and cover off the journey to date a bit better.

I definitely did have heating problems with the 9200 card and its tiny heatsink; however, the new card I purchased is a 9305, particularly chosen as its newer design is meant to use fewer watts and put out less heat. Plus this model actually has a decent heatsink on it. See picture below.

I’ve also put my finger on the heatsink a few times to check temps, and it’s warm rather than hot, unlike the 9200 card which I could barely touch and which even has a burn mark on the other side of the card! It could still be heat under load; however, even just adding a single disk this morning (one of the new ones as of yesterday), it starts the replace procedure and then gets pulled from the array. This is after being idle overnight, and the card isn’t hot at this point.

Here’s a picture to show you the heat sink size:

Is there any possibility there’s a compatibility issue between the driver and the card?
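
For what it’s worth, here’s roughly how I’m comparing the driver and the card’s firmware versions (modinfo and dmesg are standard; sas3flash is Broadcom’s utility and only applies if you have it installed):

# version of the mpt3sas driver the kernel loaded
modinfo mpt3sas | grep -i '^version'
# firmware / BIOS versions the controller reported at boot
dmesg | grep -i mpt3sas | grep -i -E 'fwversion|bios'
# list adapters with Broadcom's tool, if present
sas3flash -list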

Here are two messages I got this morning (I get them all the time); one of these is from a new drive:

I got this error on a brand new drive this morning: “Unable to register SCSI device /dev/sdq at line 23 of file /etc/smartd.conf”, and I get quite a few of these on various drives. SMART is meant to work on these cards, isn’t it?
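
(If it helps, I’m checking SMART on the HBA-attached drives with smartctl directly - mpt3sas normally passes SATA drives straight through; /dev/sdq is just the device from the error above:)

# identity and overall SMART health of the drive behind the HBA
smartctl -i -H /dev/sdq
# if smartctl complains about the device type, force SAT translation
smartctl -d sat -a /dev/sdq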

Regarding the power supply of the drive shelf, one of the things I’ve done is get a completely new drive cabinet, complete with a new power supply, which is what I’m on now. So again, I’m either very unlucky, or it’s not that. :frowning:

Another thing, perhaps related, perhaps unrelated: the SSDpool had a huge fault and went offline yesterday evening. This is not connected to the LSI controller, but rather to the internal ports. I replaced all cables with new ones, as I was not sure I liked the older, thinner ones. There is one drive in it that is still going offline/online. Is there something other than controllers that could cause this? This is madness.

BTW, the Threadripper was a known-good box, as I used it as my desktop for many years. I don’t think it suddenly became bad just because I put TrueNAS on it.

Thanks.

There is definitely a card-related faulty port, in that swapping cables doesn’t make the drive slot work, but swapping card ports moves the faulty slot to another slot. This does make me think again it’s controller related - or something between the controller and the OS. Does anyone know of a known-good LSI card I should buy, or have one they want to sell that is known good? I live in New Zealand, and as such I have to rely on eBay, which takes quite a while to deliver.

My thought is a known good card would rule out the card.

In the meantime, does anyone know which logs I should submit? Here’s a snippet from /var/log/messages:

Thanks.

Oct 13 10:03:17 Skywalker kernel: mpt3sas_cm0: handle(0x1f) sas_address(0x300062b20a744802) port_type(0x1)

Oct 13 10:03:18 Skywalker kernel: scsi 15:0:60:0: Direct-Access ATA ST18000NT001-3LU EN01 PQ: 0 ANSI: 6

Oct 13 10:03:18 Skywalker kernel: scsi 15:0:60:0: SATA: handle(0x001f), sas_addr(0x300062b20a744802), phy(2), device_name(0x0000000000000000)

Oct 13 10:03:18 Skywalker kernel: scsi 15:0:60:0: enclosure logical id (0x500062b20a744800), slot(8)

Oct 13 10:03:18 Skywalker kernel: scsi 15:0:60:0: enclosure level(0x0000), connector name( )

Oct 13 10:03:18 Skywalker kernel: scsi 15:0:60:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)

Oct 13 10:03:18 Skywalker kernel: scsi 15:0:60:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: Attached scsi generic sg17 type 0

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: Power-on or device reset occurred

Oct 13 10:03:18 Skywalker kernel: end_device-15:60: add: handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: [sdt] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: [sdt] 4096-byte physical blocks

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: [sdt] Write Protect is off

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: [sdt] Write cache: enabled, read cache: enabled, supports DPO and FUA

Oct 13 10:03:18 Skywalker kernel: sdt: sdt1

Oct 13 10:03:18 Skywalker kernel: sd 15:0:60:0: [sdt] Attached SCSI disk

Oct 13 10:03:21 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:03:21 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:03:21 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:03:22 Skywalker kernel: sd 15:0:53:0: device_block, handle(0x001d)

Oct 13 10:03:25 Skywalker kernel: sd 15:0:53:0: device_unblock and setting to running, handle(0x001d)

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=1 offset=2156594257920 size=1048576 flags=1074267304

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=1 offset=2156593209344 size=1048576 flags=1074267304

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=1 offset=2156596355072 size=1048576 flags=1074267304

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=1 offset=2156595306496 size=1048576 flags=1074267304

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=1 offset=2156597403648 size=1048576 flags=1074267304

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:25 Skywalker kernel: sd 15:0:53:0: [sdr] Synchronizing SCSI cache

Oct 13 10:03:25 Skywalker kernel: sd 15:0:53:0: [sdr] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK

Oct 13 10:03:25 Skywalker kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x300062b20a744800)

Oct 13 10:03:25 Skywalker kernel: mpt3sas_cm0: removing handle(0x001d), sas_addr(0x300062b20a744800)

Oct 13 10:03:25 Skywalker kernel: mpt3sas_cm0: enclosure logical id(0x500062b20a744800), slot(11)

Oct 13 10:03:25 Skywalker kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:25 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:28 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:28 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:31 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:33 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/0840a677-c089-4baa-af1c-3aa14652219b error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:34 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:03:34 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:03:34 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:03:34 Skywalker kernel: sd 15:0:60:0: device_block, handle(0x001f)

Oct 13 10:03:37 Skywalker kernel: sd 15:0:60:0: device_unblock and setting to running, handle(0x001f)

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2159787999232 size=65536 flags=1074267304

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2159788064768 size=32768 flags=1573032

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2159788097536 size=1048576 flags=1074267304

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2159787933696 size=65536 flags=1074267304

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2159789146112 size=1048576 flags=1074267304

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:03:37 Skywalker kernel: sd 15:0:60:0: [sdt] Synchronizing SCSI cache

Oct 13 10:03:37 Skywalker kernel: sd 15:0:60:0: [sdt] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK

Oct 13 10:03:37 Skywalker kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x300062b20a744802)

Oct 13 10:03:37 Skywalker kernel: mpt3sas_cm0: removing handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:03:37 Skywalker kernel: mpt3sas_cm0: enclosure logical id(0x500062b20a744800), slot(8)

Oct 13 10:03:37 Skywalker kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )

Oct 13 10:03:43 Skywalker netdata[1976565]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.

Oct 13 10:03:45 Skywalker kernel: mpt3sas_cm0: handle(0x1d) sas_address(0x300062b20a744800) port_type(0x1)

Oct 13 10:03:46 Skywalker kernel: scsi 15:0:61:0: Direct-Access ATA ST18000NT001-3LU EN01 PQ: 0 ANSI: 6

Oct 13 10:03:46 Skywalker kernel: scsi 15:0:61:0: SATA: handle(0x001d), sas_addr(0x300062b20a744800), phy(0), device_name(0x0000000000000000)

Oct 13 10:03:46 Skywalker kernel: scsi 15:0:61:0: enclosure logical id (0x500062b20a744800), slot(11)

Oct 13 10:03:46 Skywalker kernel: scsi 15:0:61:0: enclosure level(0x0000), connector name( )

Oct 13 10:03:46 Skywalker kernel: scsi 15:0:61:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)

Oct 13 10:03:46 Skywalker kernel: scsi 15:0:61:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: Attached scsi generic sg15 type 0

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: Power-on or device reset occurred

Oct 13 10:03:46 Skywalker kernel: end_device-15:61: add: handle(0x001d), sas_addr(0x300062b20a744800)

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: [sdr] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: [sdr] 4096-byte physical blocks

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: [sdr] Write Protect is off

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: [sdr] Write cache: enabled, read cache: enabled, supports DPO and FUA

Oct 13 10:03:46 Skywalker kernel: sdr: sdr1

Oct 13 10:03:46 Skywalker kernel: sd 15:0:61:0: [sdr] Attached SCSI disk

Oct 13 10:03:57 Skywalker kernel: mpt3sas_cm0: handle(0x1f) sas_address(0x300062b20a744802) port_type(0x1)

Oct 13 10:03:58 Skywalker kernel: scsi 15:0:62:0: Direct-Access ATA ST18000NT001-3LU EN01 PQ: 0 ANSI: 6

Oct 13 10:03:58 Skywalker kernel: scsi 15:0:62:0: SATA: handle(0x001f), sas_addr(0x300062b20a744802), phy(2), device_name(0x0000000000000000)

Oct 13 10:03:58 Skywalker kernel: scsi 15:0:62:0: enclosure logical id (0x500062b20a744800), slot(8)

Oct 13 10:03:58 Skywalker kernel: scsi 15:0:62:0: enclosure level(0x0000), connector name( )

Oct 13 10:03:58 Skywalker kernel: scsi 15:0:62:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)

Oct 13 10:03:58 Skywalker kernel: scsi 15:0:62:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: Attached scsi generic sg17 type 0

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: Power-on or device reset occurred

Oct 13 10:03:58 Skywalker kernel: end_device-15:62: add: handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: [sdt] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: [sdt] 4096-byte physical blocks

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: [sdt] Write Protect is off

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: [sdt] Write cache: enabled, read cache: enabled, supports DPO and FUA

Oct 13 10:03:58 Skywalker kernel: sdt: sdt1

Oct 13 10:03:58 Skywalker kernel: sd 15:0:62:0: [sdt] Attached SCSI disk

Oct 13 10:04:08 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:04:08 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:04:08 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:04:09 Skywalker kernel: sd 15:0:62:0: device_block, handle(0x001f)

Oct 13 10:04:12 Skywalker kernel: sd 15:0:62:0: device_unblock and setting to running, handle(0x001f)

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2174122418176 size=32768 flags=1573032

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2174122385408 size=32768 flags=1573032

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2174122450944 size=1048576 flags=1074267304

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2174122352640 size=32768 flags=1573032

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2174123499520 size=1048576 flags=1074267304

Oct 13 10:04:12 Skywalker kernel: sd 15:0:62:0: [sdt] Synchronizing SCSI cache

Oct 13 10:04:12 Skywalker kernel: sd 15:0:62:0: [sdt] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK

Oct 13 10:04:12 Skywalker kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x300062b20a744802)

Oct 13 10:04:12 Skywalker kernel: mpt3sas_cm0: removing handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:04:12 Skywalker kernel: mpt3sas_cm0: enclosure logical id(0x500062b20a744800), slot(8)

Oct 13 10:04:12 Skywalker kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:12 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:15 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:15 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:21 Skywalker netdata[1979229]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.

Oct 13 10:04:32 Skywalker kernel: mpt3sas_cm0: handle(0x1f) sas_address(0x300062b20a744802) port_type(0x1)

Oct 13 10:04:33 Skywalker kernel: scsi 15:0:63:0: Direct-Access ATA ST18000NT001-3LU EN01 PQ: 0 ANSI: 6

Oct 13 10:04:33 Skywalker kernel: scsi 15:0:63:0: SATA: handle(0x001f), sas_addr(0x300062b20a744802), phy(2), device_name(0x0000000000000000)

Oct 13 10:04:33 Skywalker kernel: scsi 15:0:63:0: enclosure logical id (0x500062b20a744800), slot(8)

Oct 13 10:04:33 Skywalker kernel: scsi 15:0:63:0: enclosure level(0x0000), connector name( )

Oct 13 10:04:33 Skywalker kernel: scsi 15:0:63:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)

Oct 13 10:04:33 Skywalker kernel: scsi 15:0:63:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: Attached scsi generic sg17 type 0

Oct 13 10:04:33 Skywalker kernel: end_device-15:63: add: handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: Power-on or device reset occurred

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: [sdt] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: [sdt] 4096-byte physical blocks

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: [sdt] Write Protect is off

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: [sdt] Write cache: enabled, read cache: enabled, supports DPO and FUA

Oct 13 10:04:33 Skywalker kernel: sdt: sdt1

Oct 13 10:04:33 Skywalker kernel: sd 15:0:63:0: [sdt] Attached SCSI disk

Oct 13 10:04:33 Skywalker kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

Oct 13 10:04:34 Skywalker kernel: sd 15:0:63:0: device_block, handle(0x001f)

Oct 13 10:04:37 Skywalker kernel: sd 15:0:63:0: device_unblock and setting to running, handle(0x001f)

Oct 13 10:04:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2183223898112 size=1048576 flags=1074267304

Oct 13 10:04:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=1 offset=2183224946688 size=1048576 flags=1074267304

Oct 13 10:04:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:37 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:37 Skywalker kernel: sd 15:0:63:0: [sdt] Synchronizing SCSI cache

Oct 13 10:04:37 Skywalker kernel: sd 15:0:63:0: [sdt] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK

Oct 13 10:04:37 Skywalker kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x300062b20a744802)

Oct 13 10:04:37 Skywalker kernel: mpt3sas_cm0: removing handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:04:37 Skywalker kernel: mpt3sas_cm0: enclosure logical id(0x500062b20a744800), slot(8)

Oct 13 10:04:37 Skywalker kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )

Oct 13 10:04:39 Skywalker netdata[1980864]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.

Oct 13 10:04:40 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:40 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:40 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:40 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:43 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:43 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:46 Skywalker kernel: zio pool=hdd1pool vdev=/dev/disk/by-partuuid/074df82f-bf92-4bd7-8bea-f8981dd3c9c4 error=5 type=5 offset=0 size=0 flags=1049728

Oct 13 10:04:57 Skywalker kernel: mpt3sas_cm0: handle(0x1f) sas_address(0x300062b20a744802) port_type(0x1)

Oct 13 10:04:57 Skywalker kernel: scsi 15:0:64:0: Direct-Access ATA ST18000NT001-3LU EN01 PQ: 0 ANSI: 6

Oct 13 10:04:57 Skywalker kernel: scsi 15:0:64:0: SATA: handle(0x001f), sas_addr(0x300062b20a744802), phy(2), device_name(0x0000000000000000)

Oct 13 10:04:57 Skywalker kernel: scsi 15:0:64:0: enclosure logical id (0x500062b20a744800), slot(8)

Oct 13 10:04:57 Skywalker kernel: scsi 15:0:64:0: enclosure level(0x0000), connector name( )

Oct 13 10:04:57 Skywalker kernel: scsi 15:0:64:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)

Oct 13 10:04:57 Skywalker kernel: scsi 15:0:64:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: Attached scsi generic sg17 type 0

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: Power-on or device reset occurred

Oct 13 10:04:57 Skywalker kernel: end_device-15:64: add: handle(0x001f), sas_addr(0x300062b20a744802)

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: [sdt] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: [sdt] 4096-byte physical blocks

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: [sdt] Write Protect is off

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: [sdt] Write cache: enabled, read cache: enabled, supports DPO and FUA

Oct 13 10:04:57 Skywalker kernel: sdt: sdt1

Oct 13 10:04:57 Skywalker kernel: sd 15:0:64:0: [sdt] Attached SCSI disk

Oct 13 10:04:57 Skywalker netdata[1982379]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.

As you can see, things are not happy.
log.txt (19.0 KB)
dmesg.txt (386.3 KB)

So today I purchased a cheap and dinky ASMedia-based SATA card (6P6G-PCIE-SATA-CARD) that people online have said works with TrueNAS. There are a lot of people saying not to use ASMedia cards, but from what I can see those people don’t actually have this particular card - and given the recommended pathway isn’t working…

Here is a picture for a laugh; it doesn’t look like much, does it?

Anyway, since adding in that card I am now able to actually replace the older drives with the newer ones. Something I couldn’t even do for these last few drives on the LSI card.

Early days though, will see how it goes overnight.

Also, I caught that my SSDPool (6x480G Intel DC drives) had one of my new 18TB drives assigned to it. I’m not sure how and when that happened - either as an accidental fix for the pool failure yesterday, or possibly the system assigned it when swapping between beta and RC and that killed the pool; not sure. I did notice downgrading to the beta brought back pools that had been deleted, so it’s possible something got wonky. Not a very nice side effect, that.

Anyway, the embarrassing moment of the day might just be that that wonky little card beats the LSI 9305, at least in my system. The rebuild speed is only 600MB/s whereas the LSI was usually a little over 1GB/s; put another way, it’s going to take 9 hours instead of the 5 that the LSI took. A bit slower, but slower is better than killing all my pools. At least I can stabilise things while I figure out what to do.

So that controller has worked a treat. There were 8 checksum errors in total across the whole pool’s resilver and the replacement of various drives, and potentially there’s a solution for that: adding pci=nommconf to the kernel startup parameters. Which got me thinking: is there a possibility that something like PCI power management is playing up with the LSI card? Unknown.
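
(For anyone wondering what adding pci=nommconf actually looks like: below is the plain-Debian way of adding a kernel parameter. I haven’t confirmed how TrueNAS SCALE manages its bootloader config across updates, so treat this as a sketch rather than the supported method.)

# /etc/default/grub - append the parameter to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=nommconf"
# regenerate grub.cfg, then reboot
update-grub
# after the reboot, confirm the parameter took effect
cat /proc/cmdline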

Now I’ve taken that pool offline again and swapped my second pool onto this card, which is now also rebuilding nicely. As a bit of a test I first tried to repair the pool on the LSI card, but was again met with dropping disks etc. Swapped to this card and it’s all stable again.

So I’ll let this sort itself out (79 data errors apparently), but that’s not too bad considering what has been happening. With any other filesystem I would have been dead in the water on the first day.
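
(The data error count comes from zpool status; once the resilver finishes, my plan - a standard ZFS sequence, with the pool name from this thread - is to list the affected files, clear the counters, and re-scrub to confirm the errors were transient:)

# list the individual files/objects flagged by the data errors
zpool status -v hdd2pool
# once resilvering is done, reset error counters and verify again
zpool clear hdd2pool
zpool scrub hdd2pool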

I think the next troubleshooting step will be to get an older 9200-series card onto the same external bays to isolate the LSI card.

From what I can see, it can only be one of four things:

  1. LSI Card
  2. The external cabinet
  3. The driver
  4. Some BIOS / card setting / incompatibility

Writing this all here for the next person that comes along with a similar issue obviously.

Regarding pci=nommconf, that is to remove the errors below, which seem to have shown up since adding this cheap card:

Oct 13 19:09:50 Skywalker kernel: pcieport 0000:00:01.1: AER: Corrected error message received from 0000:00:00.0
Oct 13 19:09:50 Skywalker kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 19:09:50 Skywalker kernel: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000

Early-gen Ryzens had stability issues unless steps were taken to avoid low power modes.

Might want to look into that.

It’s concerning that you saw checksum errors on disks that weren’t even connected to the HBA.

Sounds like PSU issues, even though I know you said you replaced/swapped.

LSI cards are prone to counterfeiting, and these sort of issues can be the result.

And finally, unless you have a fan directly blowing on the heat sink I would assume it is overheating.

This is now solved I think.

Yes, the early Ryzens did. In fact, the Threadripper 1950X had that too, but the 2950X doesn’t seem to have that problem, and the problem seems to have gone away (fixed, I assume) for the 1950X as well.

I found out what caused the checksum errors on the different HBA: I had (or TrueNAS had) assigned a spinning 16TB disk to a 6-disk SSD striped mirror. It didn’t seem to like having one spinning disk formatted to 480G assigned to it, go figure ;). Anyway, I replaced that with the original SSD that it was meant to have and the problem went away. What was far more concerning to me is how this happened. I noted when I downgraded from RC2 to Beta 1 that the previous pool configuration came back (i.e. whole pools I had deleted were now back). Perhaps it was due to that. Luckily not too much had changed, but perhaps this was part of it. Or perhaps I did it. Either way, problem solved.

I also isolated the external bays by plugging them into two additional controllers: a known-good LSI card that I’ve used for years, and the StarTech card above. The known-good LSI card didn’t cut the mustard, which is weird because it always worked before - there does seem to be something different about the extent to which the issue is occurring since Electric Eel. More likely it’s the new motherboard and case, but I don’t recall the timing of both of those changes well enough to make more than a subjective judgement. The StarTech card has been amazing though. It passed all tests with no issues.

The StarTech card is a real champ. I was mistaken earlier in thinking it had errors; those were ZFS checksum errors caused by the LSI card. After clearing them I have resilvered 14 disks and done multiple scrubs (these are large 16TBx6 RaidZ2 pools which take between 5 and 14 hours to scrub, depending on which one I am connected to). There hasn’t actually been a single error on these pools while using this card. It also supports hot swap. I am surprisingly finding myself deciding to just keep that card for the internal drives, as it doesn’t get hot, doesn’t require extra cooling, and the resilver and scrub performance (just noting the speed in zpool status) has been comparable to the LSI - slightly slower at times, slightly faster at others. If someone is looking for a cheaper option that doesn’t require fancy cables or stuffing around adding fans and so on, that seems to be a great option.

I don’t think I have a counterfeit card; I didn’t purchase it from China, but from eBay, from someone who seems to sell a lot of them without negative reviews. But I could be wrong. If the issues come back again I’ll look at it again.

And yes, what seems to have fixed it for me is cable-tying a fan directly to the LSI card. So you were spot on with that. So far so good - I managed to complete a scrub with 0 errors (that has sort of become a de facto test for me, as previously a scrub would trigger the issues).

So thanks for everyone’s help; at present I think I’m happy! :smiley:

Well, it’s happening again. But only on the LSI card. Again.

And it’s very annoying, because one of the pools won’t mount properly, and that’s a whole problem in itself that’s painful at boot. The TrueNAS alerting is now giving me false information, and after this it won’t shut down nicely either, even though it appears to be running and resilvering. I have half a mind to just go and buy another StarTech card and be done with this LSI nightmare. That one is still rock solid.

How does one go from a perfectly functioning pool to 20,317 data errors in an instant?

  pool: hdd2pool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Oct 26 19:40:00 2024
	9.03T / 64.8T scanned at 2.24G/s, 30.5G / 64.8T issued at 7.57M/s
	5.48G resilvered, 0.05% done, no estimated completion time
config:

	NAME                                        STATE     READ WRITE CKSUM
	hdd2pool                                    DEGRADED     0     0     0
	  raidz2-0                                  DEGRADED 1.33K     2     0
	    a0e78a38-08cd-4eca-a1d2-accc481be172    ONLINE       0     0 38.1K  (resilvering)
	    c7114aa0-b137-4524-8ed3-d58861fd69dc    ONLINE     550   474 1.17K  (resilvering)
	    replacing-2                             DEGRADED     0     0 39.0K
	      4c839316-35a7-4c92-9649-9057ecdb0fdc  OFFLINE      0     0     0
	      5d221999-98e4-4ffa-b32b-ca7d4ee9301a  ONLINE       0     0     0  (resilvering)
	    6c3edee1-6714-4bd4-a5b0-e86fe15cd890    ONLINE       0     0 38.1K  (resilvering)
	    b8199126-52c2-46cf-997a-0d63c6a58a0a    ONLINE   4.47K   209 36.9K  (resilvering)
	    3b689aa5-2575-473a-8bb7-a25b545bce2e    ONLINE     554   641     0  (resilvering)

errors: 20317 data errors, use '-v' for a list

One extra bit: this call trace pops up, I think at the point that the pool first gets suspended. I assume it’s just complaining because the pool got ripped out from under it, but I thought I’d post it here in case.

[ 3746.657463] INFO: task txg_sync:5307 blocked for more than 120 seconds.
[ 3746.657747]       Tainted: P           OE      6.6.44-production+truenas #1
[ 3746.658033] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.658341] task:txg_sync        state:D stack:0     pid:5307  ppid:2      flags:0x00004000
[ 3746.658349] Call Trace:
[ 3746.658353]  <TASK>
[ 3746.658359]  __schedule+0x349/0x950
[ 3746.658372]  schedule+0x5b/0xa0
[ 3746.658378]  schedule_timeout+0x98/0x160
[ 3746.658385]  ? __pfx_process_timeout+0x10/0x10
[ 3746.658393]  io_schedule_timeout+0x50/0x80
[ 3746.658400]  __cv_timedwait_common+0x12a/0x160 [spl]
[ 3746.658424]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3746.658434]  __cv_timedwait_io+0x19/0x20 [spl]
[ 3746.658457]  zio_wait+0x124/0x240 [zfs]
[ 3746.658795]  dsl_pool_sync_mos+0x37/0xa0 [zfs]
[ 3746.659149]  dsl_pool_sync+0x3b9/0x410 [zfs]
[ 3746.659518]  spa_sync_iterate_to_convergence+0xd8/0x200 [zfs]
[ 3746.659874]  spa_sync+0x30a/0x600 [zfs]
[ 3746.660227]  txg_sync_thread+0x1ec/0x270 [zfs]
[ 3746.660629]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 3746.661016]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 3746.661043]  thread_generic_wrapper+0x5e/0x70 [spl]
[ 3746.661070]  kthread+0xe8/0x120
[ 3746.661077]  ? __pfx_kthread+0x10/0x10
[ 3746.661082]  ret_from_fork+0x34/0x50
[ 3746.661089]  ? __pfx_kthread+0x10/0x10
[ 3746.661095]  ret_from_fork_asm+0x1b/0x30
[ 3746.661109]  </TASK>
[ 3746.661147] INFO: task agents:9338 blocked for more than 120 seconds.
[ 3746.661497]       Tainted: P           OE      6.6.44-production+truenas #1
[ 3746.661813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.662138] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 3746.662146] Call Trace:
[ 3746.662149]  <TASK>
[ 3746.662154]  __schedule+0x349/0x950
[ 3746.662165]  schedule+0x5b/0xa0
[ 3746.662169]  io_schedule+0x46/0x70
[ 3746.662174]  cv_wait_common+0xaa/0x130 [spl]
[ 3746.662199]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3746.662208]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 3746.662563]  txg_wait_synced+0x10/0x40 [zfs]
[ 3746.662903]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 3746.663263]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 3746.663592]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 3746.663921]  ? __kmalloc_node+0xc6/0x150
[ 3746.663932]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 3746.664260]  __x64_sys_ioctl+0x97/0xd0
[ 3746.664269]  do_syscall_64+0x59/0xb0
[ 3746.664281]  ? srso_return_thunk+0x5/0x5f
[ 3746.664290]  ? sysvec_call_function_single+0xe/0x90
[ 3746.664298]  ? srso_return_thunk+0x5/0x5f
[ 3746.664303]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 3746.664316]  ? srso_return_thunk+0x5/0x5f
[ 3746.664320]  ? flush_tlb_func+0x1b6/0x1f0
[ 3746.664328]  ? srso_return_thunk+0x5/0x5f
[ 3746.664332]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 3746.664339]  ? __pfx_flush_tlb_func+0x10/0x10
[ 3746.664348]  ? srso_return_thunk+0x5/0x5f
[ 3746.664353]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 3746.664361]  ? srso_return_thunk+0x5/0x5f
[ 3746.664365]  ? __mod_lruvec_page_state+0x97/0x130
[ 3746.664373]  ? srso_return_thunk+0x5/0x5f
[ 3746.664378]  ? do_wp_page+0x6db/0xb80
[ 3746.664389]  ? srso_return_thunk+0x5/0x5f
[ 3746.664395]  ? __handle_mm_fault+0xa8f/0xd90
[ 3746.664402]  ? __x64_sys_futex+0x92/0x1d0
[ 3746.664414]  ? srso_return_thunk+0x5/0x5f
[ 3746.664420]  ? __count_memcg_events+0x4d/0x90
[ 3746.664424]  ? srso_return_thunk+0x5/0x5f
[ 3746.664429]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 3746.664437]  ? srso_return_thunk+0x5/0x5f
[ 3746.664441]  ? flush_tlb_func+0x1b6/0x1f0
[ 3746.664449]  ? __pfx_flush_tlb_func+0x10/0x10
[ 3746.664455]  ? srso_return_thunk+0x5/0x5f
[ 3746.664459]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 3746.664466]  ? srso_return_thunk+0x5/0x5f
[ 3746.664473]  ? __irq_exit_rcu+0x3b/0xc0
[ 3746.664482]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 3746.664489] RIP: 0033:0x7fdcfe3fdc5b
[ 3746.664495] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3746.664502] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 3746.664507] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 3746.664511] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 3746.664514] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 3746.664517] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 3746.664527]  </TASK>
[ 3867.489537] INFO: task txg_sync:5307 blocked for more than 241 seconds.
[ 3867.489819]       Tainted: P           OE      6.6.44-production+truenas #1
[ 3867.490098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.490414] task:txg_sync        state:D stack:0     pid:5307  ppid:2      flags:0x00004000
[ 3867.490423] Call Trace:
[ 3867.490426]  <TASK>
[ 3867.490432]  __schedule+0x349/0x950
[ 3867.490445]  schedule+0x5b/0xa0
[ 3867.490450]  schedule_timeout+0x98/0x160
[ 3867.490456]  ? __pfx_process_timeout+0x10/0x10
[ 3867.490464]  io_schedule_timeout+0x50/0x80
[ 3867.490471]  __cv_timedwait_common+0x12a/0x160 [spl]
[ 3867.490494]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3867.490503]  __cv_timedwait_io+0x19/0x20 [spl]
[ 3867.490525]  zio_wait+0x124/0x240 [zfs]
[ 3867.490842]  dsl_pool_sync_mos+0x37/0xa0 [zfs]
[ 3867.491174]  dsl_pool_sync+0x3b9/0x410 [zfs]
[ 3867.491552]  spa_sync_iterate_to_convergence+0xd8/0x200 [zfs]
[ 3867.491884]  spa_sync+0x30a/0x600 [zfs]
[ 3867.492227]  txg_sync_thread+0x1ec/0x270 [zfs]
[ 3867.492569]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 3867.492886]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 3867.492909]  thread_generic_wrapper+0x5e/0x70 [spl]
[ 3867.492932]  kthread+0xe8/0x120
[ 3867.492938]  ? __pfx_kthread+0x10/0x10
[ 3867.492943]  ret_from_fork+0x34/0x50
[ 3867.492950]  ? __pfx_kthread+0x10/0x10
[ 3867.492954]  ret_from_fork_asm+0x1b/0x30
[ 3867.492966]  </TASK>
[ 3867.493003] INFO: task agents:9338 blocked for more than 241 seconds.
[ 3867.493329]       Tainted: P           OE      6.6.44-production+truenas #1
[ 3867.493657] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.493973] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 3867.493980] Call Trace:
[ 3867.493982]  <TASK>
[ 3867.493987]  __schedule+0x349/0x950
[ 3867.493997]  schedule+0x5b/0xa0
[ 3867.494001]  io_schedule+0x46/0x70
[ 3867.494006]  cv_wait_common+0xaa/0x130 [spl]
[ 3867.494027]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3867.494035]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 3867.494392]  txg_wait_synced+0x10/0x40 [zfs]
[ 3867.494709]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 3867.495033]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 3867.495396]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 3867.495703]  ? __kmalloc_node+0xc6/0x150
[ 3867.495713]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 3867.496009]  __x64_sys_ioctl+0x97/0xd0
[ 3867.496017]  do_syscall_64+0x59/0xb0
[ 3867.496023]  ? srso_return_thunk+0x5/0x5f
[ 3867.496028]  ? sysvec_call_function_single+0xe/0x90
[ 3867.496033]  ? srso_return_thunk+0x5/0x5f
[ 3867.496037]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 3867.496048]  ? srso_return_thunk+0x5/0x5f
[ 3867.496052]  ? flush_tlb_func+0x1b6/0x1f0
[ 3867.496060]  ? srso_return_thunk+0x5/0x5f
[ 3867.496064]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 3867.496069]  ? __pfx_flush_tlb_func+0x10/0x10
[ 3867.496076]  ? srso_return_thunk+0x5/0x5f
[ 3867.496080]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 3867.496086]  ? srso_return_thunk+0x5/0x5f
[ 3867.496090]  ? __mod_lruvec_page_state+0x97/0x130
[ 3867.496097]  ? srso_return_thunk+0x5/0x5f
[ 3867.496101]  ? do_wp_page+0x6db/0xb80
[ 3867.496110]  ? srso_return_thunk+0x5/0x5f
[ 3867.496114]  ? __handle_mm_fault+0xa8f/0xd90
[ 3867.496120]  ? __x64_sys_futex+0x92/0x1d0
[ 3867.496130]  ? srso_return_thunk+0x5/0x5f
[ 3867.496134]  ? __count_memcg_events+0x4d/0x90
[ 3867.496137]  ? srso_return_thunk+0x5/0x5f
[ 3867.496141]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 3867.496147]  ? srso_return_thunk+0x5/0x5f
[ 3867.496151]  ? flush_tlb_func+0x1b6/0x1f0
[ 3867.496157]  ? __pfx_flush_tlb_func+0x10/0x10
[ 3867.496162]  ? srso_return_thunk+0x5/0x5f
[ 3867.496166]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 3867.496171]  ? srso_return_thunk+0x5/0x5f
[ 3867.496175]  ? __irq_exit_rcu+0x3b/0xc0
[ 3867.496183]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 3867.496190] RIP: 0033:0x7fdcfe3fdc5b
[ 3867.496196] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3867.496202] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 3867.496206] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 3867.496210] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 3867.496213] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 3867.496217] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 3867.496226]  </TASK>
[ 3988.321646] INFO: task agents:9338 blocked for more than 362 seconds.
[ 3988.321935]       Tainted: P           OE      6.6.44-production+truenas #1
[ 3988.322224] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.322539] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 3988.322547] Call Trace:
[ 3988.322550]  <TASK>
[ 3988.322556]  __schedule+0x349/0x950
[ 3988.322569]  schedule+0x5b/0xa0
[ 3988.322574]  io_schedule+0x46/0x70
[ 3988.322580]  cv_wait_common+0xaa/0x130 [spl]
[ 3988.322603]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3988.322613]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 3988.322944]  txg_wait_synced+0x10/0x40 [zfs]
[ 3988.323277]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 3988.323602]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 3988.323910]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 3988.324219]  ? __kmalloc_node+0xc6/0x150
[ 3988.324232]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 3988.324533]  __x64_sys_ioctl+0x97/0xd0
[ 3988.324540]  do_syscall_64+0x59/0xb0
[ 3988.324546]  ? srso_return_thunk+0x5/0x5f
[ 3988.324552]  ? sysvec_call_function_single+0xe/0x90
[ 3988.324557]  ? srso_return_thunk+0x5/0x5f
[ 3988.324561]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 3988.324571]  ? srso_return_thunk+0x5/0x5f
[ 3988.324576]  ? flush_tlb_func+0x1b6/0x1f0
[ 3988.324583]  ? srso_return_thunk+0x5/0x5f
[ 3988.324587]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 3988.324593]  ? __pfx_flush_tlb_func+0x10/0x10
[ 3988.324599]  ? srso_return_thunk+0x5/0x5f
[ 3988.324603]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 3988.324610]  ? srso_return_thunk+0x5/0x5f
[ 3988.324614]  ? __mod_lruvec_page_state+0x97/0x130
[ 3988.324620]  ? srso_return_thunk+0x5/0x5f
[ 3988.324624]  ? do_wp_page+0x6db/0xb80
[ 3988.324633]  ? srso_return_thunk+0x5/0x5f
[ 3988.324637]  ? __handle_mm_fault+0xa8f/0xd90
[ 3988.324643]  ? __x64_sys_futex+0x92/0x1d0
[ 3988.324653]  ? srso_return_thunk+0x5/0x5f
[ 3988.324657]  ? __count_memcg_events+0x4d/0x90
[ 3988.324661]  ? srso_return_thunk+0x5/0x5f
[ 3988.324665]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 3988.324671]  ? srso_return_thunk+0x5/0x5f
[ 3988.324675]  ? flush_tlb_func+0x1b6/0x1f0
[ 3988.324681]  ? __pfx_flush_tlb_func+0x10/0x10
[ 3988.324685]  ? srso_return_thunk+0x5/0x5f
[ 3988.324689]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 3988.324694]  ? srso_return_thunk+0x5/0x5f
[ 3988.324698]  ? __irq_exit_rcu+0x3b/0xc0
[ 3988.324705]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 3988.324711] RIP: 0033:0x7fdcfe3fdc5b
[ 3988.324716] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3988.324721] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 3988.324724] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 3988.324727] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 3988.324730] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 3988.324733] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 3988.324742]  </TASK>
[ 4109.153718] INFO: task agents:9338 blocked for more than 483 seconds.
[ 4109.153946]       Tainted: P           OE      6.6.44-production+truenas #1
[ 4109.154169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.154402] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 4109.154408] Call Trace:
[ 4109.154411]  <TASK>
[ 4109.154416]  __schedule+0x349/0x950
[ 4109.154427]  schedule+0x5b/0xa0
[ 4109.154430]  io_schedule+0x46/0x70
[ 4109.154435]  cv_wait_common+0xaa/0x130 [spl]
[ 4109.154454]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 4109.154463]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 4109.154757]  txg_wait_synced+0x10/0x40 [zfs]
[ 4109.155025]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 4109.155288]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 4109.155564]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 4109.155819]  ? __kmalloc_node+0xc6/0x150
[ 4109.155827]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 4109.156073]  __x64_sys_ioctl+0x97/0xd0
[ 4109.156079]  do_syscall_64+0x59/0xb0
[ 4109.156084]  ? srso_return_thunk+0x5/0x5f
[ 4109.156088]  ? sysvec_call_function_single+0xe/0x90
[ 4109.156092]  ? srso_return_thunk+0x5/0x5f
[ 4109.156095]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 4109.156103]  ? srso_return_thunk+0x5/0x5f
[ 4109.156105]  ? flush_tlb_func+0x1b6/0x1f0
[ 4109.156111]  ? srso_return_thunk+0x5/0x5f
[ 4109.156113]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 4109.156117]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4109.156122]  ? srso_return_thunk+0x5/0x5f
[ 4109.156124]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 4109.156129]  ? srso_return_thunk+0x5/0x5f
[ 4109.156132]  ? __mod_lruvec_page_state+0x97/0x130
[ 4109.156136]  ? srso_return_thunk+0x5/0x5f
[ 4109.156139]  ? do_wp_page+0x6db/0xb80
[ 4109.156145]  ? srso_return_thunk+0x5/0x5f
[ 4109.156148]  ? __handle_mm_fault+0xa8f/0xd90
[ 4109.156152]  ? __x64_sys_futex+0x92/0x1d0
[ 4109.156158]  ? srso_return_thunk+0x5/0x5f
[ 4109.156161]  ? __count_memcg_events+0x4d/0x90
[ 4109.156164]  ? srso_return_thunk+0x5/0x5f
[ 4109.156166]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 4109.156170]  ? srso_return_thunk+0x5/0x5f
[ 4109.156173]  ? flush_tlb_func+0x1b6/0x1f0
[ 4109.156177]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4109.156179]  ? srso_return_thunk+0x5/0x5f
[ 4109.156182]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 4109.156185]  ? srso_return_thunk+0x5/0x5f
[ 4109.156188]  ? __irq_exit_rcu+0x3b/0xc0
[ 4109.156193]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 4109.156197] RIP: 0033:0x7fdcfe3fdc5b
[ 4109.156201] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4109.156205] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 4109.156208] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 4109.156210] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 4109.156211] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 4109.156213] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 4109.156219]  </TASK>
[ 4229.985814] INFO: task agents:9338 blocked for more than 604 seconds.
[ 4229.986100]       Tainted: P           OE      6.6.44-production+truenas #1
[ 4229.986398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4229.986705] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 4229.986712] Call Trace:
[ 4229.986715]  <TASK>
[ 4229.986722]  __schedule+0x349/0x950
[ 4229.986735]  schedule+0x5b/0xa0
[ 4229.986740]  io_schedule+0x46/0x70
[ 4229.986746]  cv_wait_common+0xaa/0x130 [spl]
[ 4229.986769]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 4229.986780]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 4229.987110]  txg_wait_synced+0x10/0x40 [zfs]
[ 4229.987442]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 4229.987766]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 4229.988074]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 4229.988391]  ? __kmalloc_node+0xc6/0x150
[ 4229.988401]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 4229.988698]  __x64_sys_ioctl+0x97/0xd0
[ 4229.988705]  do_syscall_64+0x59/0xb0
[ 4229.988712]  ? srso_return_thunk+0x5/0x5f
[ 4229.988717]  ? sysvec_call_function_single+0xe/0x90
[ 4229.988722]  ? srso_return_thunk+0x5/0x5f
[ 4229.988726]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 4229.988737]  ? srso_return_thunk+0x5/0x5f
[ 4229.988741]  ? flush_tlb_func+0x1b6/0x1f0
[ 4229.988749]  ? srso_return_thunk+0x5/0x5f
[ 4229.988753]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 4229.988758]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4229.988765]  ? srso_return_thunk+0x5/0x5f
[ 4229.988769]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 4229.988775]  ? srso_return_thunk+0x5/0x5f
[ 4229.988780]  ? __mod_lruvec_page_state+0x97/0x130
[ 4229.988786]  ? srso_return_thunk+0x5/0x5f
[ 4229.988790]  ? do_wp_page+0x6db/0xb80
[ 4229.988799]  ? srso_return_thunk+0x5/0x5f
[ 4229.988803]  ? __handle_mm_fault+0xa8f/0xd90
[ 4229.988809]  ? __x64_sys_futex+0x92/0x1d0
[ 4229.988819]  ? srso_return_thunk+0x5/0x5f
[ 4229.988823]  ? __count_memcg_events+0x4d/0x90
[ 4229.988826]  ? srso_return_thunk+0x5/0x5f
[ 4229.988830]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 4229.988836]  ? srso_return_thunk+0x5/0x5f
[ 4229.988840]  ? flush_tlb_func+0x1b6/0x1f0
[ 4229.988846]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4229.988851]  ? srso_return_thunk+0x5/0x5f
[ 4229.988855]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 4229.988860]  ? srso_return_thunk+0x5/0x5f
[ 4229.988864]  ? __irq_exit_rcu+0x3b/0xc0
[ 4229.988872]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 4229.988877] RIP: 0033:0x7fdcfe3fdc5b
[ 4229.988882] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4229.988887] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 4229.988890] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 4229.988893] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 4229.988896] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 4229.988899] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 4229.988908]  </TASK>
[ 4350.817888] INFO: task agents:9338 blocked for more than 724 seconds.
[ 4350.818167]       Tainted: P           OE      6.6.44-production+truenas #1
[ 4350.818455] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4350.818744] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 4350.818751] Call Trace:
[ 4350.818754]  <TASK>
[ 4350.818760]  __schedule+0x349/0x950
[ 4350.818773]  schedule+0x5b/0xa0
[ 4350.818778]  io_schedule+0x46/0x70
[ 4350.818784]  cv_wait_common+0xaa/0x130 [spl]
[ 4350.818807]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 4350.818819]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 4350.819150]  txg_wait_synced+0x10/0x40 [zfs]
[ 4350.819480]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 4350.819803]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 4350.820111]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 4350.820428]  ? __kmalloc_node+0xc6/0x150
[ 4350.820438]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 4350.820735]  __x64_sys_ioctl+0x97/0xd0
[ 4350.820742]  do_syscall_64+0x59/0xb0
[ 4350.820748]  ? srso_return_thunk+0x5/0x5f
[ 4350.820754]  ? sysvec_call_function_single+0xe/0x90
[ 4350.820759]  ? srso_return_thunk+0x5/0x5f
[ 4350.820763]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 4350.820774]  ? srso_return_thunk+0x5/0x5f
[ 4350.820778]  ? flush_tlb_func+0x1b6/0x1f0
[ 4350.820786]  ? srso_return_thunk+0x5/0x5f
[ 4350.820790]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 4350.820795]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4350.820802]  ? srso_return_thunk+0x5/0x5f
[ 4350.820806]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 4350.820812]  ? srso_return_thunk+0x5/0x5f
[ 4350.820816]  ? __mod_lruvec_page_state+0x97/0x130
[ 4350.820823]  ? srso_return_thunk+0x5/0x5f
[ 4350.820827]  ? do_wp_page+0x6db/0xb80
[ 4350.820836]  ? srso_return_thunk+0x5/0x5f
[ 4350.820840]  ? __handle_mm_fault+0xa8f/0xd90
[ 4350.820846]  ? __x64_sys_futex+0x92/0x1d0
[ 4350.820855]  ? srso_return_thunk+0x5/0x5f
[ 4350.820859]  ? __count_memcg_events+0x4d/0x90
[ 4350.820863]  ? srso_return_thunk+0x5/0x5f
[ 4350.820867]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 4350.820873]  ? srso_return_thunk+0x5/0x5f
[ 4350.820877]  ? flush_tlb_func+0x1b6/0x1f0
[ 4350.820883]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4350.820888]  ? srso_return_thunk+0x5/0x5f
[ 4350.820892]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 4350.820897]  ? srso_return_thunk+0x5/0x5f
[ 4350.820901]  ? __irq_exit_rcu+0x3b/0xc0
[ 4350.820908]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 4350.820914] RIP: 0033:0x7fdcfe3fdc5b
[ 4350.820919] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4350.820924] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 4350.820927] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 4350.820930] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 4350.820933] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 4350.820935] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 4350.820944]  </TASK>
[ 4471.653945] INFO: task agents:9338 blocked for more than 845 seconds.
[ 4471.654225]       Tainted: P           OE      6.6.44-production+truenas #1
[ 4471.654522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4471.654812] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 4471.654819] Call Trace:
[ 4471.654822]  <TASK>
[ 4471.654829]  __schedule+0x349/0x950
[ 4471.654842]  schedule+0x5b/0xa0
[ 4471.654847]  io_schedule+0x46/0x70
[ 4471.654853]  cv_wait_common+0xaa/0x130 [spl]
[ 4471.654876]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 4471.654886]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 4471.655220]  txg_wait_synced+0x10/0x40 [zfs]
[ 4471.655549]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 4471.655873]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 4471.656181]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 4471.656502]  ? __kmalloc_node+0xc6/0x150
[ 4471.656512]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 4471.656808]  __x64_sys_ioctl+0x97/0xd0
[ 4471.656815]  do_syscall_64+0x59/0xb0
[ 4471.656821]  ? srso_return_thunk+0x5/0x5f
[ 4471.656827]  ? sysvec_call_function_single+0xe/0x90
[ 4471.656832]  ? srso_return_thunk+0x5/0x5f
[ 4471.656836]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 4471.656847]  ? srso_return_thunk+0x5/0x5f
[ 4471.656851]  ? flush_tlb_func+0x1b6/0x1f0
[ 4471.656858]  ? srso_return_thunk+0x5/0x5f
[ 4471.656862]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 4471.656868]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4471.656875]  ? srso_return_thunk+0x5/0x5f
[ 4471.656879]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 4471.656885]  ? srso_return_thunk+0x5/0x5f
[ 4471.656889]  ? __mod_lruvec_page_state+0x97/0x130
[ 4471.656895]  ? srso_return_thunk+0x5/0x5f
[ 4471.656899]  ? do_wp_page+0x6db/0xb80
[ 4471.656908]  ? srso_return_thunk+0x5/0x5f
[ 4471.656912]  ? __handle_mm_fault+0xa8f/0xd90
[ 4471.656918]  ? __x64_sys_futex+0x92/0x1d0
[ 4471.656928]  ? srso_return_thunk+0x5/0x5f
[ 4471.656932]  ? __count_memcg_events+0x4d/0x90
[ 4471.656936]  ? srso_return_thunk+0x5/0x5f
[ 4471.656940]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 4471.656946]  ? srso_return_thunk+0x5/0x5f
[ 4471.656950]  ? flush_tlb_func+0x1b6/0x1f0
[ 4471.656956]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4471.656960]  ? srso_return_thunk+0x5/0x5f
[ 4471.656964]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 4471.656970]  ? srso_return_thunk+0x5/0x5f
[ 4471.656974]  ? __irq_exit_rcu+0x3b/0xc0
[ 4471.656981]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 4471.656987] RIP: 0033:0x7fdcfe3fdc5b
[ 4471.656991] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4471.656997] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 4471.657000] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 4471.657003] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 4471.657005] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 4471.657008] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 4471.657017]  </TASK>
[ 4592.482003] INFO: task agents:9338 blocked for more than 966 seconds.
[ 4592.482309]       Tainted: P           OE      6.6.44-production+truenas #1
[ 4592.482606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4592.482912] task:agents          state:D stack:0     pid:9338  ppid:1      flags:0x00004002
[ 4592.482920] Call Trace:
[ 4592.482923]  <TASK>
[ 4592.482930]  __schedule+0x349/0x950
[ 4592.482943]  schedule+0x5b/0xa0
[ 4592.482948]  io_schedule+0x46/0x70
[ 4592.482954]  cv_wait_common+0xaa/0x130 [spl]
[ 4592.482978]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 4592.482990]  txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 4592.483353]  txg_wait_synced+0x10/0x40 [zfs]
[ 4592.483695]  spa_vdev_state_exit+0x95/0x150 [zfs]
[ 4592.484042]  zfs_ioc_vdev_set_state+0xea/0x1c0 [zfs]
[ 4592.484381]  zfsdev_ioctl_common+0x680/0x790 [zfs]
[ 4592.484709]  ? __kmalloc_node+0xc6/0x150
[ 4592.484719]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 4592.485037]  __x64_sys_ioctl+0x97/0xd0
[ 4592.485045]  do_syscall_64+0x59/0xb0
[ 4592.485051]  ? srso_return_thunk+0x5/0x5f
[ 4592.485057]  ? sysvec_call_function_single+0xe/0x90
[ 4592.485062]  ? srso_return_thunk+0x5/0x5f
[ 4592.485067]  ? asm_sysvec_call_function_single+0x1a/0x20
[ 4592.485078]  ? srso_return_thunk+0x5/0x5f
[ 4592.485082]  ? flush_tlb_func+0x1b6/0x1f0
[ 4592.485090]  ? srso_return_thunk+0x5/0x5f
[ 4592.485095]  ? smp_call_function_many_cond+0xfe/0x4f0
[ 4592.485101]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4592.485108]  ? srso_return_thunk+0x5/0x5f
[ 4592.485112]  ? __mod_memcg_lruvec_state+0x4e/0xa0
[ 4592.485119]  ? srso_return_thunk+0x5/0x5f
[ 4592.485124]  ? __mod_lruvec_page_state+0x97/0x130
[ 4592.485131]  ? srso_return_thunk+0x5/0x5f
[ 4592.485135]  ? do_wp_page+0x6db/0xb80
[ 4592.485144]  ? srso_return_thunk+0x5/0x5f
[ 4592.485149]  ? __handle_mm_fault+0xa8f/0xd90
[ 4592.485155]  ? __x64_sys_futex+0x92/0x1d0
[ 4592.485165]  ? srso_return_thunk+0x5/0x5f
[ 4592.485170]  ? __count_memcg_events+0x4d/0x90
[ 4592.485174]  ? srso_return_thunk+0x5/0x5f
[ 4592.485178]  ? count_memcg_events.constprop.0+0x1a/0x30
[ 4592.485185]  ? srso_return_thunk+0x5/0x5f
[ 4592.485189]  ? flush_tlb_func+0x1b6/0x1f0
[ 4592.485195]  ? __pfx_flush_tlb_func+0x10/0x10
[ 4592.485200]  ? srso_return_thunk+0x5/0x5f
[ 4592.485205]  ? __flush_smp_call_function_queue+0x9e/0x420
[ 4592.485210]  ? srso_return_thunk+0x5/0x5f
[ 4592.485214]  ? __irq_exit_rcu+0x3b/0xc0
[ 4592.485222]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 4592.485229] RIP: 0033:0x7fdcfe3fdc5b
[ 4592.485235] RSP: 002b:00007fdcfd30da00 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4592.485242] RAX: ffffffffffffffda RBX: 00007fdcf01e6370 RCX: 00007fdcfe3fdc5b
[ 4592.485247] RDX: 00007fdcfd30da70 RSI: 0000000000005a0d RDI: 000000000000000a
[ 4592.485251] RBP: 00007fdcfd311460 R08: 0000000000000001 R09: 0000000000000000
[ 4592.485254] R10: 5003314ccf4313c0 R11: 0000000000000246 R12: 00007fdcfd311020
[ 4592.485258] R13: 000055f3224047b0 R14: 00007fdcf0032fd0 R15: 00007fdcfd30da70
[ 4592.485268]  </TASK>
[ 4592.485270] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

Is the LSI card running the most current firmware possible? ZFS can be picky about firmware at times, and many older LSI firmware versions have issues.
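
If you want to double-check what the card is actually running, the mpt3sas driver usually exposes it in sysfs; something like the sketch below should print it (assuming your kernel's driver ships the version_fw / board_name attributes, which can differ between driver versions):

#!/usr/bin/env python3
# Hypothetical helper: print the firmware version each SAS HBA reports,
# assuming the mpt3sas driver exposes version_fw / board_name in sysfs
# (attribute names can differ between driver versions).
import glob, os

for host in sorted(glob.glob("/sys/class/scsi_host/host*")):
    fw_path = os.path.join(host, "version_fw")
    if not os.path.exists(fw_path):
        continue  # not a host managed by mpt3sas
    board_path = os.path.join(host, "board_name")
    board = open(board_path).read().strip() if os.path.exists(board_path) else "unknown board"
    fw = open(fw_path).read().strip()
    print(f"{os.path.basename(host)}: {board}, firmware {fw}")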

Yes, I updated to the latest firmware as part of the previous investigation. It is a 16e card, and after flashing it I loaded it up with copies in all directions to test it out. It sustained that for many hours without a problem, so I thought it was fine. The best idea I can come up with is that the fan isn’t quite enough and the card very slowly got hot. I’ve got some thermal probes coming and I hope to attach them to the LSI heatsink somehow, which should help me set the fan speed. The card actually has a really nice large heatsink, and I fitted a slightly larger-than-normal fan to it too, so I think it should be enough. I wish I knew how to find the first, root-cause issue in the logs, because the disk errors are unlikely to be the start of it; I suspect it should be some kind of PCIe bus error.
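
Roughly what I have in mind is something like the sketch below: scan the kernel log from this boot for the earliest PCIe/AER, mpt3sas or I/O-error lines, so the first fault shows up before all the ZFS and hung-task noise. Just a rough sketch, and the pattern list is only my guess at which messages matter:

#!/usr/bin/env python3
# Rough sketch: print the earliest kernel-log lines from this boot that match
# patterns likely to be the *first* fault (PCIe/AER, mpt3sas, I/O errors),
# rather than the later ZFS / hung-task fallout.
import re, subprocess

PATTERNS = [r"AER", r"PCIe Bus Error", r"mpt3sas", r"I/O error",
            r"blocked for more than", r"hard resetting link"]
rx = re.compile("|".join(PATTERNS), re.IGNORECASE)

log = subprocess.run(["journalctl", "-k", "-b", "--no-pager"],
                     capture_output=True, text=True, check=True).stdout
hits = [line for line in log.splitlines() if rx.search(line)]
for line in hits[:20]:   # the first matches are usually the interesting ones
    print(line)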

I have the same 16e card and was having weird errors even with a fan. I got an 80x15 high static pressure fan and thought this would be enough. My mistake was using a 4-pin motherboard connector: the fan was never spinning fast enough to keep the card cool. I switched to a separate fan controller, cranked it up, and the card has been solid ever since, running 3 KTL JBODs with 31 drives.
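
If anyone wants to try the software route before buying a controller, you can usually pin a motherboard header to full speed through the hwmon interface. A minimal sketch, run as root; the hwmon2/pwm1 path is only an example, so first work out which hwmonN/pwmN actually drives your HBA fan (e.g. with sensors):

#!/usr/bin/env python3
# Minimal sketch (run as root): force a motherboard fan header to 100% via the
# hwmon sysfs PWM interface. hwmon2/pwm1 is just an example path -- check
# which hwmonN/pwmN drives the HBA fan on your board first.
from pathlib import Path

pwm = Path("/sys/class/hwmon/hwmon2/pwm1")          # hypothetical channel
enable = pwm.with_name(pwm.name + "_enable")        # .../pwm1_enable

enable.write_text("1")    # 1 = manual control (2 = automatic/thermal control)
pwm.write_text("255")     # duty cycle 0-255; 255 = full speed
print(f"{pwm} pinned to full speed")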

Thanks so much for sharing that. It aligns with the theory that the fan isn’t enough, which was only a theory, but you’ve now given it some real substance. I will see if I can speed the fan up somehow. The heatsink is probably big enough to take two fans, so perhaps I will try that. Awesome information, thank you! I now have 2.4M checksum errors on those drives, so I think it’s time to format the pool. Some files are even showing garbage characters where the file attributes should be, and rsync can’t copy many of them, reporting input/output errors. Surprisingly, the checksum errors are still only showing up on the drives and not on the pool, which I think means the data itself is intact, but it’s no longer data I can trust.
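
A quick way to see exactly where those counters sit (the UI rounds the big numbers) is just to tabulate the READ/WRITE/CKSUM columns out of zpool status. Rough sketch below; the pool name is obviously mine:

#!/usr/bin/env python3
# Rough sketch: tabulate per-device error counters from `zpool status -p`
# (-p prints exact numbers) so it's clear whether errors stop at the disks
# or have propagated up to the raidz vdev / pool rows.
import subprocess

out = subprocess.run(["zpool", "status", "-p", "hdd1pool"],   # pool name is mine
                     capture_output=True, text=True, check=True).stdout

in_table = False
for line in out.splitlines():
    if line.split()[:1] == ["NAME"]:
        in_table = True
        continue
    if in_table:
        parts = line.split()
        if len(parts) < 5:            # blank line / "errors:" footer ends the table
            break
        name, state, rd, wr, ck = parts[:5]
        print(f"{name:30} {state:10} read={rd:>10} write={wr:>10} cksum={ck:>10}")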

For anyone recommending these cards in the future, I think the recommendation should always come with the disclaimer “only with a fan”, which I don’t ever recall seeing.

For internal drives, I’d only ever bother with the StarTech card now; the LSI just isn’t worth the expense and hassle. For external drives, especially if you’ve got a lot of them, it’s a bit more tricky.

I run an LSI with internal drives and no fan, and it runs fine. It depends on many factors: the case, the case fans, what else is in the case and how much heat that generates, and so on. I don’t think you can just say “use a fan”.

That’s a lot of errors! Definitely time to start over!!

I also had issues with my 9305 (random read errors, and occasionally an HDD disappeared and only reappeared after a reboot) - all of that was fixed by putting a small Noctua fan on the heatsink.

Cards like these are meant for enterprise servers, which have fans designed to push massive amounts of air through the case and so cool all the components.

When you put server-grade hardware in a “normal” case, you also have to take care of cooling these components. :slight_smile:
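
If you want numbers rather than guesswork, the kernel already exposes whatever temperature sensors it can see under /sys/class/hwmon. A small sketch that prints them once a minute, so slow heat creep during long copies becomes visible; note the 9305 itself probably won’t show up there, so an external probe on the heatsink is still worthwhile:

#!/usr/bin/env python3
# Small sketch: once a minute, print every temperature sensor the kernel
# exposes under /sys/class/hwmon, so slow heat creep during long copies is
# visible. The LSI HBA itself probably won't appear here; this covers the
# CPU / motherboard / NVMe sensors the kernel already knows about.
import glob, os, time

def read(path):
    with open(path) as f:
        return f.read().strip()

while True:
    stamp = time.strftime("%H:%M:%S")
    for temp in sorted(glob.glob("/sys/class/hwmon/hwmon*/temp*_input")):
        chip = read(os.path.join(os.path.dirname(temp), "name"))
        label_path = temp.replace("_input", "_label")
        label = read(label_path) if os.path.exists(label_path) else os.path.basename(temp)
        print(f"{stamp} {chip}/{label}: {int(read(temp)) / 1000:.1f} C")
    time.sleep(60)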
