Hello!
I’m getting some ZFS pools marked as unhealthy after connecting them to LSI SAS 9300-16i boards.
Previously I was using NVMe-to-SATA adapters connected to a PCIe x16 board, using PCIe bifurcation. I wanted to upgrade, since my motherboard (ASRock X570D4U-2L2T) has limitations on bifurcation settings when using a CPU with an integrated GPU. I needed more drives connected, and the only way was to use these boards.
FYI: I’m using a setup where each drive (a classic 3.5" SATA drive of about 20 TB) belongs to its own ZFS pool. I don’t care if a drive breaks, so let’s not get into how a stripe setup is a bad idea… I have 24 drives connected to the 9300-16i boards.
The workload is read-only: once I fill up a drive (which I do once), I only need to read from it (and almost all drives are already filled). On a normal day, about 3-4 TB of data is read across all the drives connected to the 9300-16i cards.
I bought 2 used LSI SAS 9300-16i cards. They look to be in good condition, and so do the cables. Still, I don’t know if there is a test I can run to determine whether the cards or cables are faulty… One card is installed in the PCIe x16 slot and the other in the PCIe x8 slot. I also placed a fan right in front of both cards, and by touching them I can confirm the temperature is under control: they are just a bit warm. I used them for about a day before installing the fan, and back then you couldn’t touch them, they were so hot.
Initially they seemed to run fine, but after some days I noticed a lot of pools going offline or getting errors. After running zpool clear I was able to get the device back. ZFS sometimes reported errors on some files, but the data was not actually corrupted: I store SHA-512 hashes of all files on a separate pool (a RAID-Z2 pool), and the hashes matched. A scrub was able to mark the device healthy again (even though the pool has no redundancy at all, simply because the data itself was not corrupted).
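For context, the integrity check is nothing fancy, just standard sha512sum manifests kept on the redundant pool. A minimal sketch with a throwaway file (the real paths are under /mnt and differ from these):

```shell
# Demo of the manifest workflow with a throwaway file; on the real system
# the manifest lives on the Z2 pool and covers a whole data pool.
mkdir -p /tmp/pool_demo
echo "some archived data" > /tmp/pool_demo/file.bin
# Done once, right after filling the drive:
sha512sum /tmp/pool_demo/file.bin > /tmp/manifest.sha512
# Re-run after a ZFS read error to confirm the data is actually intact:
sha512sum -c /tmp/manifest.sha512
```

If every file reports OK, the "checksum errors" ZFS counted were transport-level read failures, not corrupted data on the platters.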
I started looking at the logs in /var/log/messages and noticed tons of:
Dec 7 19:36:29 truenas kernel: mpt3sas_cm1: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
Dec 7 19:36:29 truenas kernel: mpt3sas_cm1: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
Dec 7 19:36:29 truenas kernel: mpt3sas_cm1: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
Dec 7 19:36:29 truenas kernel: sd 37:0:1:0: Power-on or device reset occurred
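For anyone who wants to tally the same thing in their own logs, something like this works (the sample lines below stand in for /var/log/messages; adjust the path for your system):

```shell
# Which disks are actually resetting? Tally "Power-on or device reset"
# events per SCSI target. Sample lines stand in for /var/log/messages.
cat > /tmp/messages.sample2 <<'EOF'
Dec 7 19:36:29 truenas kernel: sd 37:0:1:0: Power-on or device reset occurred
Dec 7 19:40:02 truenas kernel: sd 37:0:1:0: Power-on or device reset occurred
Dec 8 05:25:10 truenas kernel: sd 39:0:3:0: Power-on or device reset occurred
EOF
grep 'Power-on or device reset' /tmp/messages.sample2 \
  | grep -o 'sd [0-9:]*' | sort | uniq -c | sort -rn
```

A few targets with high counts points at specific drives or cables; resets spread evenly across everything points more at the HBA, firmware, or power.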
Doing some research, I found out that these HBAs need a specific firmware to work properly with TrueNAS: https://www.truenas.com/community/resources/lsi-9300-xx-firmware-update.145/
I was running an old firmware, so I updated as described in that post. Here is the output from my 2 boards:
root@truenas[/home/admin]# sas3flash -listall
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
Adapter Selected is a Avago SAS: SAS3008(C0)
Num Ctlr FW Ver NVDATA x86-BIOS PCI Addr
----------------------------------------------------------------------------
0 SAS3008(C0) 16.00.12.00 0e.01.00.03 08.15.00.00 00:12:00:00
1 SAS3008(C0) 16.00.12.00 0e.01.00.03 08.15.00.00 00:14:00:00
2 SAS3008(C0) 16.00.12.00 0e.01.00.03 08.15.00.00 00:2f:00:00
3 SAS3008(C0) 16.00.12.00 0e.01.00.03 08.15.00.00 00:31:00:00
Finished Processing Commands Successfully.
Exiting SAS3Flash.
root@truenas[/home/admin]# sas3flash -c 0 -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
Adapter Selected is a Avago SAS: SAS3008(C0)
Controller Number : 0
Controller : SAS3008(C0)
PCI Address : 00:12:00:00
SAS Address : 500062b-2-015d-03c0
NVDATA Version (Default) : 0e.01.00.03
NVDATA Version (Persistent) : 0e.01.00.03
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.12.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9300-16i
BIOS Version : 08.15.00.00
UEFI BSD Version : 06.00.00.00
FCODE Version : N/A
Board Name : SAS9300-16i
Board Assembly : 03-25600-01B
Board Tracer Number : SP62102950
Finished Processing Commands Successfully.
Exiting SAS3Flash.
root@truenas[/home/admin]# sas3flash -c 1 -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
Adapter Selected is a Avago SAS: SAS3008(C0)
Controller Number : 1
Controller : SAS3008(C0)
PCI Address : 00:14:00:00
SAS Address : 500062b-2-015d-2940
NVDATA Version (Default) : 0e.01.00.03
NVDATA Version (Persistent) : 0e.01.00.03
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.12.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9300-16i
BIOS Version : 08.15.00.00
UEFI BSD Version : 06.00.00.00
FCODE Version : N/A
Board Name : SAS9300-16i
Board Assembly : 03-25600-01B
Board Tracer Number : SP62102950
Finished Processing Commands Successfully.
Exiting SAS3Flash.
root@truenas[/home/admin]# sas3flash -c 2 -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
Adapter Selected is a Avago SAS: SAS3008(C0)
Controller Number : 2
Controller : SAS3008(C0)
PCI Address : 00:2f:00:00
SAS Address : 500062b-2-015d-00c0
NVDATA Version (Default) : 0e.01.00.03
NVDATA Version (Persistent) : 0e.01.00.03
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.12.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9300-16i
BIOS Version : 08.15.00.00
UEFI BSD Version : 06.00.00.00
FCODE Version : N/A
Board Name : SAS9300-16i
Board Assembly : 03-25600-01B
Board Tracer Number : SP62102749
Finished Processing Commands Successfully.
Exiting SAS3Flash.
root@truenas[/home/admin]# sas3flash -c 3 -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
Adapter Selected is a Avago SAS: SAS3008(C0)
Controller Number : 3
Controller : SAS3008(C0)
PCI Address : 00:31:00:00
SAS Address : 500062b-2-015d-2640
NVDATA Version (Default) : 0e.01.00.03
NVDATA Version (Persistent) : 0e.01.00.03
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.12.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9300-16i
BIOS Version : 08.15.00.00
UEFI BSD Version : 06.00.00.00
FCODE Version : N/A
Board Name : SAS9300-16i
Board Assembly : 03-25600-01B
Board Tracer Number : SP62102749
Finished Processing Commands Successfully.
Exiting SAS3Flash.
root@truenas[/home/admin]#
BIOS Version and UEFI BSD Version look old, but I think we only care about Firmware Version, right?
Then I rebooted TrueNAS and things got better…
I did the firmware update before going to bed, and the next morning I noticed a pool being suspended (as was happening with more drives before). It was a pool under scrub (trying to clear the false errors from the earlier faulty reads):
Dec 8 05:25:04 truenas kernel: mpt3sas_cm3: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7214 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7214 Sense Key : Not Ready [current]
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7214 Add. Sense: Logical unit not ready, cause not reportable
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7214 CDB: Read(16) 88 00 00 00 00 01 96 5c ec 60 00 00 01 00 00 00
Dec 8 05:25:04 truenas kernel: zio pool=my_pool_12 vdev=/dev/disk/by-partuuid/169f3144-7ec4-4ecf-9642-af8fced28480 error=5 type=1 offset=3490629337088 size=131072 flags=1572992
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7215 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7215 Sense Key : Not Ready [current]
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7215 Add. Sense: Logical unit not ready, cause not reportable
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7215 CDB: Read(16) 88 00 00 00 00 00 00 00 12 10 00 00 00 10 00 00
Dec 8 05:25:04 truenas kernel: zio pool=my_pool_12 vdev=/dev/disk/by-partuuid/169f3144-7ec4-4ecf-9642-af8fced28480 error=5 type=1 offset=270336 size=8192 flags=721089
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7216 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7216 Sense Key : Not Ready [current]
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7216 Add. Sense: Logical unit not ready, cause not reportable
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7216 CDB: Read(16) 88 00 00 00 00 08 2f 7f f4 10 00 00 00 10 00 00
Dec 8 05:25:04 truenas kernel: zio pool=my_pool_12 vdev=/dev/disk/by-partuuid/169f3144-7ec4-4ecf-9642-af8fced28480 error=5 type=1 offset=18000204275712 size=8192 flags=721089
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7217 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7217 Sense Key : Not Ready [current]
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7217 Add. Sense: Logical unit not ready, cause not reportable
Dec 8 05:25:04 truenas kernel: sd 39:0:3:0: [sdo] tag#7217 CDB: Read(16) 88 00 00 00 00 08 2f 7f f6 10 00 00 00 10 00 00
Dec 8 05:25:04 truenas kernel: zio pool=my_pool_12 vdev=/dev/disk/by-partuuid/169f3144-7ec4-4ecf-9642-af8fced28480 error=5 type=1 offset=18000204537856 size=8192 flags=721089
Dec 8 05:25:07 truenas kernel: WARNING: Pool 'my_pool_12' has encountered an uncorrectable I/O failure and has been suspended.
It’s certainly happening less than before, but if it was only the firmware, why did it happen again?
Now I have run zpool clear on the pool again and the scrub is continuing… No more errors at the time of writing, but it has only been a few hours.
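So far the recovery loop I keep repeating is just this (my_pool_12 is an example name; shown for reference, since it obviously needs the real pools):

```shell
# Recovery loop after an I/O failure (my_pool_12 is an example name):
zpool clear my_pool_12        # clear errors / resume a suspended pool
zpool scrub my_pool_12        # re-read everything and verify checksums
zpool status -v my_pool_12    # check scrub progress and per-file errors
```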
I found this post, all-disks-in-vdev-faulted-all-at-once-but-no-other-drives-on-backplane.110077, which links to this Reddit post: scale_drive_resets_with_lsi_93008i_looking_for. It seems I have the same issue, do you agree?
As per the comments, the fix should be to upgrade the firmware to the specific version (which I did) and to blacklist mpt3sas. I have not tried that last step and wanted to hear people’s opinions first… It doesn’t seem like something you’d want to do on TrueNAS, since you would be changing settings in the OS…
Do you have any suggestions on how to proceed?
If one (or both) of the HBA boards is faulty, how can I detect that?
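In the meantime, one way I’m trying to narrow it down is checking which controller instance the errors cluster on: each 9300-16i shows up as two SAS3008 controllers (mpt3sas_cm0..cm3 here), so if all the errors come from one card’s pair, that card or its cables become the prime suspect. A rough sketch (sample lines stand in for /var/log/messages):

```shell
# Count mpt3sas log_info events per controller instance.
# Sample lines stand in for /var/log/messages; adjust the path on your system.
cat > /tmp/messages.sample <<'EOF'
Dec 7 19:36:29 truenas kernel: mpt3sas_cm1: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
Dec 7 19:36:29 truenas kernel: mpt3sas_cm1: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
Dec 8 05:25:04 truenas kernel: mpt3sas_cm3: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
EOF
grep -o 'mpt3sas_cm[0-9]' /tmp/messages.sample | sort | uniq -c | sort -rn
```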
I may also buy a brand-new card (an HBA 9600-24i, for example)… The goal is to connect as many drives as I can, where each drive needs, I would say, about 220 MB/s, since they are 3.5" mechanical drives.
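As a quick bandwidth sanity check (my rough numbers: 24 drives split across 2 cards, and each 9300-16i is a PCIe 3.0 x8 device, good for roughly 7.9 GB/s):

```shell
# 12 drives per card at ~220 MB/s sequential each:
echo "$(( 12 * 220 )) MB/s needed per card"   # well under ~7900 MB/s for PCIe 3.0 x8
```

So the PCIe link should not be the bottleneck; whatever is resetting the drives, it is not raw bandwidth.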
Any help troubleshooting this problem is very much appreciated!
Thanks!