TrueNAS Core: Uncorrectable I/O failure on storage pool

pdxjustin · July 9, 2025, 9:53pm

Whenever I try a non-read-only import, it immediately fails with an I/O failure.

I tried running zpool import -f -F -n WD40, and it also gave an I/O failure.
This time, though, the messages log also had:

messages after zpool import -f -F -n WD40

Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 07 be 40 24 c8 00 00 01 00 00 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 07 be 40 23 80 00 00 00 48 00 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 07 11 16 a4 20 00 00 00 08 00 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(10). CDB: 28 00 06 40 20 e0 00 00 08 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 02 82 9c c9 20 00 00 00 08 00 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(10). CDB: 28 00 c8 ee c5 b0 00 00 08 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 02 82 9c c1 28 00 00 00 08 00 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(10). CDB: 28 00 c4 48 75 c8 00 00 08 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command, 3 more tries remain
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(10). CDB: 28 00 c4 48 75 c8 00 00 08 00
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: SCSI Status Error
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): SCSI status: Check Condition
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  9 14:42:17 truenas (da3:mps0:0:3:0): Retrying command (per sense data)
Jul  9 14:42:18 truenas mps0: Controller reported scsi ioc terminated tgt 5 SMID 1093 loginfo 31120308
Jul  9 14:42:18 truenas mps0: Controller reported scsi ioc terminated tgt 1 SMID 1090 loginfo 31120308
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 04 54 40 28 80 00 00 00 08 00 00
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): Retrying command, 3 more tries remain
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 04 54 40 28 80 00 00 00 08 00 00
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): Retrying command, 3 more tries remain
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 04 54 40 28 80 00 00 00 08 00 00
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): CAM status: SCSI Status Error
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): SCSI status: Check Condition
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  9 14:42:18 truenas (da5:mps0:0:5:0): Retrying command (per sense data)
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 04 54 40 28 80 00 00 00 08 00 00
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): CAM status: SCSI Status Error
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): SCSI status: Check Condition
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  9 14:42:18 truenas (da1:mps0:0:1:0): Retrying command (per sense data)
Jul  9 14:42:18 truenas (da3:mps0:0:3:0): WRITE(10). CDB: 2a 00 80 40 20 d8 00 00 08 00
Jul  9 14:42:18 truenas (da3:mps0:0:3:0): CAM status: SCSI Status Error
Jul  9 14:42:18 truenas (da3:mps0:0:3:0): SCSI status: Check Condition
Jul  9 14:42:18 truenas (da3:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  9 14:42:18 truenas (da3:mps0:0:3:0): Retrying command (per sense data)
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 04 54 40 28 b8 00 00 00 58 00 00
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): CAM status: SCSI Status Error
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): SCSI status: Check Condition
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): Retrying command (per sense data)
Jul  9 14:42:19 truenas Solaris: WARNING: Pool 'WD40' has encountered an uncorrectable I/O failure and has been suspended.

It lists half the drives, but I don’t know how to map which drive is on which breakout cable from the SAS controller.

sysctl kstat.zfs.misc.dbgmsg didn’t show anything different from the last time I tried a R/W import.

pdxjustin · July 9, 2025, 10:58pm

The man page says the -F in zpool import -f -F -n WD40 is ignored if the pool is importable. Since I can import it read-only, is it considered “importable”, meaning -F has no effect?

NickF1227 · July 10, 2025, 12:08am

I think in this case, the syntax/format indicates rData is a ZVOL not a Dataset.

pdxjustin:

Jul  9 14:42:19 truenas (da1:mps0:0:1:0): CAM status: SCSI Status Error
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): SCSI status: Check Condition
Jul  9 14:42:19 truenas (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

We see the above for da1, da3, and da5.
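
To double-check which devices are affected, and with which errors, the "CAM status" lines can be tallied per device. This is only a rough sketch in Python: the function name and regex are my own, and the sample lines are copied from the log posted above. On a real system you would feed it the lines of the messages log instead.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
#   Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error
CAM_STATUS = re.compile(r"\((da\d+):[^)]*\): CAM status: (.+)$")


def tally_cam_status(lines):
    """Return {device: Counter({cam_status: count})} for the given log lines."""
    per_device = defaultdict(Counter)
    for line in lines:
        match = CAM_STATUS.search(line)
        if match:
            device, status = match.groups()
            per_device[device][status.strip()] += 1
    return per_device


# A few lines copied from the log in this thread.
SAMPLE = [
    "Jul  9 14:42:17 truenas (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 07 be 40 24 c8 00 00 01 00 00 00",
    "Jul  9 14:42:17 truenas (da3:mps0:0:3:0): CAM status: Data Overrun error",
    "Jul  9 14:42:18 truenas (da5:mps0:0:5:0): CAM status: CCB request completed with an error",
    "Jul  9 14:42:18 truenas (da1:mps0:0:1:0): CAM status: SCSI Status Error",
    "Jul  9 14:42:19 truenas Solaris: WARNING: Pool 'WD40' has encountered an uncorrectable I/O failure and has been suspended.",
]

if __name__ == "__main__":
    for device, statuses in sorted(tally_cam_status(SAMPLE).items()):
        for status, count in statuses.most_common():
            print(f"{device}: {count} x {status}")
```

Grouping by device makes it easier to see whether one drive (da3 is the only one showing "Data Overrun error" in the posted log) behaves differently from the others.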

Since you have access to the data, I guess next steps would be to determine what went wrong so you can re-seed the data back here after you have it backed up off someplace else.

How are these drives physically connected? Do you have an 8087->sata breakout cable, or is there an actual server SAS backplane in play? This looks like it could be caused, at least in part, by some sort of physical-layer/hardware problem, and that's likely for two reasons:

Seeing “Power on, reset, or bus device reset occurred” for multiple drives, but importantly, not all drives, when you attempt to import read/write is not normal.
The pool got corrupted somehow, which is also not normal, especially considering the relative age of this hardware.

pdxjustin · July 10, 2025, 1:07am

There are 2 8087->sata breakout cables connected.
I agree that initially the problem might have been caused by cables. While I have new cables on the way, I’ve used both different controllers and different cables, and they didn’t help with recovery.

My notes don’t show that the drives in question are connected to the same breakout cable, but my plan is to match the dmesg serial numbers to devices, then shut down the server, pull the drives, and check them against my notes. If they are all on the same breakout cable, I’ll feel much better, being fairly sure that either one drive is somehow affecting the others or (more likely) there is a problem with the cable.
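
The serial-to-device matching could be scripted roughly like this. It is a sketch that assumes dmesg prints one "daN: Serial Number ..." line per disk, as FreeBSD normally does at probe time. The function name is made up, and the model and serial strings in the sample are placeholders, not values from this system.

```python
import re

# Assumed shape of the FreeBSD probe line, e.g.:
#   da1: Serial Number EXAMPLE-SERIAL-1
SERIAL_LINE = re.compile(r"^(da\d+): Serial Number (\S+)\s*$")


def serials_by_device(dmesg_text):
    """Return {device: serial} for every 'daN: Serial Number X' line found."""
    serials = {}
    for line in dmesg_text.splitlines():
        match = SERIAL_LINE.match(line.strip())
        if match:
            serials[match.group(1)] = match.group(2)
    return serials


# Placeholder text shaped like dmesg output; the serials are NOT real.
SAMPLE_DMESG = """\
da1 at mps0 bus 0 scbus0 target 1 lun 0
da1: <ATA EXAMPLE-MODEL 0001> Fixed Direct Access SPC-4 SCSI device
da1: Serial Number EXAMPLE-SERIAL-1
da3 at mps0 bus 0 scbus0 target 3 lun 0
da3: Serial Number EXAMPLE-SERIAL-3
"""

# Devices that logged errors during the read/write import attempt.
SUSPECT = ("da1", "da3", "da5")

if __name__ == "__main__":
    found = serials_by_device(SAMPLE_DMESG)
    for device in SUSPECT:
        print(device, found.get(device, "serial not found in this text"))
```

With the real dmesg text in place of the sample, the printed serials can be checked against the drive labels and the cable notes once the drives are pulled.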