Scrub task stuck

Hey all :slight_smile:

My scrub got stuck the last time it ran.

  • This is on a 5-wide RAIDZ2.
  • The scrub started as scheduled at midnight.
  • I noticed about 20 hours later when it was stuck at 5.98%.
  • I issued a scrub stop (via the UI) about 90 minutes ago as of writing, but the scrub still hasn’t stopped (or progressed), and the output of zpool status -v hasn’t changed either:
zpool status -v
  pool: kea
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub in progress since Sun Apr 21 00:00:02 2024
        19.8T / 19.8T scanned, 1.18T / 19.8T issued at 15.7M/s
        0B repaired, 5.98% done, 14 days 08:37:05 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        kea                                       ONLINE       0     0     0
          raidz2-0                                ONLINE   23.7K    25     0
            4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed  ONLINE     406   171     0
            39d9d498-4434-4de2-8561-fb77b95bf4f0  ONLINE     262    27     0
            e82e6527-35ac-4c6b-b174-6042b5f18543  ONLINE     158    31     0
            fb7756a4-bd7c-4acf-8d62-c0de10a82cdb  ONLINE     358   128     0
            d2c76d13-6db2-497a-a2fa-a7db4cd28009  ONLINE       0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
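
For a quick overview, the per-device counters in output like the above can be filtered with a short awk script. This is only a sketch: `devices_with_errors` is a made-up helper name, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout of the config section.

```shell
# Print every device line from `zpool status` output (read on stdin)
# whose READ, WRITE or CKSUM counter is not zero.
# devices_with_errors is a hypothetical helper name, not a ZFS command.
devices_with_errors() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # header row of the config table
        /^errors:/                     { in_config = 0 }         # config table ends here
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage (needs a live pool):
#   zpool status -v kea | devices_with_errors
```

Fed the status output above, it would list raidz2-0 and four of the five member disks, leaving out only d2c76d13-…, which still shows all zeros.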

All drives are connected. multi_report.sh was running on these drives daily and did not identify errors previously. Running smartctl -a now also does not show any issues with the drives in SMART (items 5, 197, 198, 199 are all 0).

smartctl -a
root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdi | grep -P "5 Re|197 Cu|198 Off|199 UD"
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdj | grep -P "5 Re|197 Cu|198 Off|199 UD"
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdk | grep -P "5 Re|197 Cu|198 Off|199 UD"
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdl | grep -P "5 Re|197 Cu|198 Off|199 UD"
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
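
The same four-attribute check can be run over several drives without repeating the grep by hand. A minimal sketch, assuming the attribute table layout shown above (raw value in the tenth column); `smart_key_attrs` is an illustrative name, and the device names in the usage line are placeholders rather than the actual pool members.

```shell
# Read `smartctl -A` (or `smartctl -a`) output on stdin and print ID, name
# and raw value of attributes 5, 197, 198 and 199.
# smart_key_attrs is a hypothetical helper name.
# NF >= 10 skips the selective self-test table, whose rows also start with 1..5.
smart_key_attrs() {
    awk 'NF >= 10 && ($1 == "5" || $1 == "197" || $1 == "198" || $1 == "199") { print $1, $2, $10 }'
}

# Usage (placeholder device names):
#   for dev in /dev/sdX /dev/sdY; do
#       echo "== $dev"
#       smartctl -A "$dev" | smart_key_attrs
#   done
```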

I noticed the same behavior a month ago, when the scrub was last scheduled to run: it got stuck at 27%, but I more or less dismissed it as a one-off and forced a reboot. I did not see issues before that, but I also don’t know whether scrubs ever actually ran before.

I’d appreciate input on

  1. How to proceed
    Wait more? Run zpool clear? Try to force a reboot of the whole system?

  2. What might have caused this.
    It looks to me like the drives are fine (?). Is it possible they somehow ‘disconnected’ or otherwise faulted during the scrub due to the high system load? I’m running this with a consumer PSU (be quiet! Dark Power Pro P7 650W ATX). Before setting up the system I ran extensive tests at full CPU and RAM load, as well as extended disk tests on all drives simultaneously, without issues. However, I don’t think I ever tested all-drive I/O together with high CPU load. I guess it could be the HBA as well. Could I rule out either option by any means other than getting new hardware and testing with that? Full hardware list below in case you see something else that could be a potential culprit.

  • Edit: The PSU label says it has 20A on each of the four 12V rails, but 54A max over all of them. I don’t have access to the server right now and will check tomorrow how the drives are actually connected, but even in the worst case of four drives all on one rail, this should work, no?
  • Edit2: I went over the PSU sizing guide and, based on that, would probably buy something pretty similar again, actually.
Part | Model
Mobo | Supermicro X12SCZ-TLN4F
CPU | Xeon W-1270E
RAM | 3x Kingston Server Premier DIMM 32GB, DDR4-3200 (KSM32ED8/32HC) + 1x Micron 32GB, DDR4-3200 (MTA18ADF4G72AZ-3G2F1R)
HBA | AOC-S3008L-L8e
PSU | be quiet! Dark Power Pro P7 650W ATX
Boot Drives | Mirror of Crucial M4-CT128M4SSD2 / Samsung MS5SPA128HMCD
Affected pool’s drives | Toshiba MG08 16TB (MG08ACA16TE)

Thanks a lot :slight_smile:

Things I could think of:

  1. sas2flash -list ← make sure HBA is flashed to IT mode

  2. zpool scrub -s PoolName ← stop the scrub without having to reboot

  3. In System Settings - General, enable “Show Console Messages” and scroll to when the scrub started and to approximately when you think it failed, and see if anything stands out. Using ‘dmesg’ in the CLI also works. The only annoying part there is that the timestamp is in seconds since system boot.

#Edit

  4. Always a good idea to include the output of smartctl -a /dev/daX for all the drives on the pool; just because they pass doesn’t mean nothing is suspicious.
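
On the dmesg timestamps: `dmesg -T` prints human-readable times directly. Done by hand, the conversion is just boot time plus the bracketed seconds. A rough sketch, assuming GNU `date` (as on a Linux-based TrueNAS install); the function name is made up for illustration, and the result can be off if the system was ever suspended, since the kernel clock behind these stamps is not adjusted for that.

```shell
# Convert a dmesg stamp (seconds since boot, e.g. 2993073.575212) to
# local wall-clock time. dmesg_to_wallclock is a hypothetical helper name.
#   $1 = boot time as Unix epoch seconds
#   $2 = dmesg stamp without the square brackets
dmesg_to_wallclock() {
    boot_epoch=$1
    secs=${2%%.*}                       # drop the fractional part
    date -d "@$((boot_epoch + secs))" '+%Y-%m-%d %H:%M:%S'
}

# Usage on the affected machine (uptime -s prints the boot time):
#   dmesg_to_wallclock "$(date -d "$(uptime -s)" +%s)" 2993073.575212
```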

Thanks a lot for your input!

  1. sas2flash doesn’t seem to be recognizing the card. I’ve never used the tool before, so I don’t know if this is because of the current issue or has always been like this. Also, I might be wrong here, but I’m not sure if this card even has anything other than IT mode?
root@truenas[/mnt/kea/home/jonas]# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        No LSI SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.
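
A possible explanation for the empty result: the AOC-S3008L-L8e is built around a SAS3008 controller, and sas2flash only talks to SAS2-generation adapters. The SAS3 counterpart is sas3flash, so `sas3flash -list` is probably the command to try here. As a sketch, the choice can be made from the `lspci` description; `pick_flash_tool` is an invented name, and the matched strings are an assumption about how lspci labels these chips.

```shell
# Pick the LSI/Broadcom flash utility from lspci text read on stdin.
# pick_flash_tool is a hypothetical helper; the patterns below are
# assumptions about the lspci device description.
pick_flash_tool() {
    awk '
        /SAS3[0-9][0-9][0-9]|SAS-3/ { print "sas3flash"; found = 1; exit }
        /SAS2[0-9][0-9][0-9]|SAS-2/ { print "sas2flash"; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Usage:
#   lspci | pick_flash_tool
```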
  2. My assumption was that zpool scrub -s #PoolName is exactly what the button in the UI does, but just in case I issued it again (many hours ago), with no change. This is the same behavior I saw a month ago, when the issue first occurred and I tried zpool scrub -s directly on the command line.

  3. The console looked jumbled (not sorted by timestamp?), but I found a relevant section in dmesg:

dmesg
[2993073.575212] I/O error, dev sdc, sector 1120566224 op 0x0:(READ) flags 0x0 phys_seg 103 prio class 2
[2993073.575226] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=571582357504 size=1015808 flags=1074267312
[2993073.575293] sd 0:0:1:0: [sdc] tag#265 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[2993073.575299] sd 0:0:1:0: [sdc] tag#265 CDB: Read(16) 88 00 00 00 00 02 9b 19 41 b8 00 00 02 a8 00 00
[2993073.575301] I/O error, dev sdc, sector 11192058296 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[2993073.575309] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=5728186298368 size=348160 flags=1572992
[2993073.575408] sd 0:0:1:0: [sdc] tag#73 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[2993073.575409] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575419] sd 0:0:1:0: [sdc] tag#73 CDB: Read(16) 88 00 00 00 00 00 00 40 02 90 00 00 00 10 00 00
[2993073.575417] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575426] I/O error, dev sdc, sector 4194960 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[2993073.575428] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575438] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=270336 size=8192 flags=721089
[2993073.575448] sd 0:0:5:0: [sdg] tag#358 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=4s
[2993073.575459] sd 0:0:5:0: [sdg] tag#358 CDB: Read(16) 88 00 00 00 00 00 42 ca b2 28 00 00 07 c0 00 00
[2993073.575458] sd 0:0:5:0: [sdg] tag#64 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=4s
[2993073.575465] sd 0:0:1:0: [sdc] tag#74 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[2993073.575466] sd 0:0:5:0: [sdg] tag#258 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=4s
[2993073.575464] I/O error, dev sdg, sector 1120580136 op 0x0:(READ) flags 0x0 phys_seg 115 prio class 2
[2993073.575470] sd 0:0:5:0: [sdg] tag#64 CDB: Read(16) 88 00 00 00 00 00 42 ca b9 e8 00 00 02 60 00 00
[2993073.575473] sd 0:0:1:0: [sdc] tag#74 CDB: Read(16) 88 00 00 00 00 07 46 bf fa 90 00 00 00 10 00 00
[2993073.575478] sd 0:0:5:0: [sdg] tag#258 CDB: Read(16) 88 00 00 00 00 00 42 ca aa 60 00 00 07 c8 00 00
[2993073.575476] I/O error, dev sdg, sector 1120582120 op 0x0:(READ) flags 0x0 phys_seg 27 prio class 2
[2993073.575478] I/O error, dev sdc, sector 31251757712 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[2993073.575479] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571589480448 size=1015808 flags=1074267312
[2993073.575484] I/O error, dev sdg, sector 1120578144 op 0x0:(READ) flags 0x0 phys_seg 113 prio class 2
[2993073.575488] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=15998752399360 size=8192 flags=721089
[2993073.575491] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571590496256 size=311296 flags=1074267312
[2993073.575496] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571588460544 size=1019904 flags=1074267312
[2993073.575505] sd 0:0:1:0: [sdc] tag#75 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[2993073.575512] sd 0:0:1:0: [sdc] tag#75 CDB: Read(16) 88 00 00 00 00 07 46 bf fc 90 00 00 00 10 00 00
[2993073.575517] I/O error, dev sdc, sector 31251758224 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[2993073.575523] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=15998752661504 size=8192 flags=721089
[2993073.575704] sd 0:0:5:0: [sdg] tag#76 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[2993073.575702] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575705] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575717] sd 0:0:5:0: [sdg] tag#76 CDB: Read(16) 88 00 00 00 00 00 42 ca bc 90 00 00 07 a8 00 00
[2993073.575723] I/O error, dev sdg, sector 1120582800 op 0x0:(READ) flags 0x4000 phys_seg 128 prio class 2
[2993073.575734] sd 0:0:3:0: [sdb] tag#260 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=4s
[2993073.575732] I/O error, dev sdc, sector 1120568208 op 0x0:(READ) flags 0x0 phys_seg 118 prio class 2
[2993073.575742] sd 0:0:3:0: [sdb] tag#260 CDB: Read(16) 88 00 00 00 00 00 42 ca d3 e0 00 00 07 c0 00 00
[2993073.575748] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571591864320 size=1019904 flags=1074267312
[2993073.575750] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=571593900032 size=1015808 flags=1074267312
[2993073.575751] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=571583373312 size=1019904 flags=1074267312
[2993073.575757] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=270336 size=8192 flags=721089
[2993073.575771] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571590844416 size=1019904 flags=1074267312
[2993073.575785] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571592884224 size=1015808 flags=1074267312
[2993073.575804] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=15998752399360 size=8192 flags=721089
[2993073.575816] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=15998752661504 size=8192 flags=721089
[2993073.575992] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575992] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.575997] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.576021] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=571595935744 size=1015808 flags=1074267312
[2993073.576024] zio pool=kea vdev=/dev/disk/by-partuuid/fb7756a4-bd7c-4acf-8d62-c0de10a82cdb error=5 type=1 offset=571584389120 size=1019904 flags=1074267312
[2993073.576030] zio pool=kea vdev=/dev/disk/by-partuuid/fb7756a4-bd7c-4acf-8d62-c0de10a82cdb error=5 type=1 offset=571586424832 size=1015808 flags=1074267312
[2993073.576035] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=270336 size=8192 flags=721089
[2993073.576045] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=15998752399360 size=8192 flags=721089
[2993073.576051] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=15998752661504 size=8192 flags=721089
[2993073.576059] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571593900032 size=1015808 flags=1074267312
[2993073.576064] zio pool=kea vdev=/dev/disk/by-partuuid/fb7756a4-bd7c-4acf-8d62-c0de10a82cdb error=5 type=1 offset=5728186294272 size=352256 flags=1572992
[2993073.576072] zio pool=kea vdev=/dev/disk/by-partuuid/fb7756a4-bd7c-4acf-8d62-c0de10a82cdb error=5 type=1 offset=571585409024 size=1015808 flags=1074267312
[2993073.576272] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[2993073.576910] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=571594915840 size=1019904 flags=1074267312
[2993073.577268] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=571584393216 size=1015808 flags=1074267312
[2993073.604096] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=571592880128 size=1019904 flags=1074267312
[2993073.604185] zio pool=kea vdev=/dev/disk/by-partuuid/4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed error=5 type=1 offset=571585409024 size=1019904 flags=1074267312
[2993073.604863] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=571596951552 size=1015808 flags=1074267312
[2993073.605119] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=571594915840 size=1019904 flags=1074267312
[2993073.617914] sd 0:0:4:0: [sde] Synchronizing SCSI cache
[2993073.618661] sd 0:0:4:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[2993073.619358] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[2993073.620036] mpt3sas_cm0: removing handle(0x000d), sas_addr(0x4433221106000000)
[2993073.620521] mpt3sas_cm0: enclosure logical id(0x500304802415a001), slot(6)
[2993073.620884] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[2993073.693875] sd 0:0:1:0: [sdc] Synchronizing SCSI cache
[2993073.694320] sd 0:0:1:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[2993073.695214] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221100000000)
[2993073.695784] mpt3sas_cm0: removing handle(0x0009), sas_addr(0x4433221100000000)
[2993073.696135] mpt3sas_cm0: enclosure logical id(0x500304802415a001), slot(0)
[2993073.696458] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[2993073.703650] WARNING: Pool 'kea' has encountered an uncorrectable I/O failure and has been suspended.

[2993073.731782] systemd-journald[717]: Data hash table of /var/log/journal/10d9c37795904ec3a977d52bd6eb6a2a/system.journal has a fill level at 75.0 (8535 of 11377 items, 6553600 file size, 767 bytes per hash table item), suggesting rotation.
[2993073.733448] systemd-journald[717]: /var/log/journal/10d9c37795904ec3a977d52bd6eb6a2a/system.journal: Journal header limits reached or header out-of-date, rotating.
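
To see at a glance which disks were hit, the FAILED lines in a log like this can be tallied per device. A small sketch, assuming the `sd H:C:T:L: [sdX] tag#N FAILED Result: hostbyte=...` line shape seen above; `failed_by_device` is just an illustrative name.

```shell
# Read dmesg text on stdin and count "FAILED Result" lines per block
# device and hostbyte code. failed_by_device is a hypothetical helper name.
failed_by_device() {
    awk '
        / FAILED Result: / {
            dev = ""; host = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\[sd[a-z]+\]$/) dev = $i     # e.g. [sdc]
                if ($i ~ /^hostbyte=/)     host = $i    # e.g. hostbyte=DID_NO_CONNECT
            }
            if (dev != "") count[dev " " host]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k2
}

# Usage:
#   dmesg | failed_by_device
```

On the excerpt above this counts one failure for sdb and four each for sdc and sdg, all with hostbyte=DID_NO_CONNECT and all logged within the same second.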
  4. Good point on smartctl; here’s the full output for all five drives in the pool:
smartctl -a
root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG08ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG08ACA16TE
Serial Number:    Y270A02ZFVGG
LU WWN Device Id: 5 000039 c28c91865
Firmware Version: 0103
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 22 13:35:44 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1434) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7950
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       10081
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       26 (Min/Max 15/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       34340871
222 Loaded_Hours            0x0032   075   075   000    Old_age   Always       -       10061
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       531
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10046         -
# 2  Short offline       Completed without error       00%     10022         -
# 3  Short offline       Completed without error       00%      9998         -
# 4  Short offline       Completed without error       00%      9974         -
# 5  Short offline       Completed without error       00%      9950         -
# 6  Short offline       Completed without error       00%      9926         -
# 7  Short offline       Completed without error       00%      9902         -
# 8  Short offline       Completed without error       00%      9878         -
# 9  Short offline       Completed without error       00%      9854         -
#10  Short offline       Completed without error       00%      9830         -
#11  Short offline       Completed without error       00%      9806         -
#12  Short offline       Completed without error       00%      9782         -
#13  Short offline       Completed without error       00%      9758         -
#14  Short offline       Completed without error       00%      9734         -
#15  Short offline       Completed without error       00%      9710         -
#16  Short offline       Completed without error       00%      9686         -
#17  Short offline       Completed without error       00%      9662         -
#18  Short offline       Completed without error       00%      9638         -
#19  Short offline       Completed without error       00%      9614         -
#20  Short offline       Completed without error       00%      9590         -
#21  Short offline       Completed without error       00%      9566         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdi
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG08ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG08ACA16TE
Serial Number:    Y250A2RCFVGG
LU WWN Device Id: 5 000039 c28c8bd90
Firmware Version: 0103
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 22 13:36:52 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1482) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       2610
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       10081
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       41
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       25 (Min/Max 14/37)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       917507
222 Loaded_Hours            0x0032   075   075   000    Old_age   Always       -       10029
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       528
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10046         -
# 2  Short offline       Completed without error       00%     10022         -
# 3  Short offline       Completed without error       00%      9998         -
# 4  Short offline       Completed without error       00%      9974         -
# 5  Short offline       Completed without error       00%      9950         -
# 6  Short offline       Completed without error       00%      9926         -
# 7  Short offline       Completed without error       00%      9902         -
# 8  Short offline       Completed without error       00%      9878         -
# 9  Short offline       Completed without error       00%      9854         -
#10  Short offline       Completed without error       00%      9830         -
#11  Short offline       Completed without error       00%      9806         -
#12  Short offline       Completed without error       00%      9782         -
#13  Short offline       Completed without error       00%      9758         -
#14  Short offline       Completed without error       00%      9734         -
#15  Short offline       Completed without error       00%      9710         -
#16  Short offline       Completed without error       00%      9686         -
#17  Short offline       Completed without error       00%      9662         -
#18  Short offline       Completed without error       00%      9638         -
#19  Short offline       Completed without error       00%      9614         -
#20  Short offline       Completed without error       00%      9590         -
#21  Short offline       Completed without error       00%      9566         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdj
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG08ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG08ACA16TE
Serial Number:    92T0A0M2FVGG
LU WWN Device Id: 5 000039 c08d88b30
Firmware Version: 0103
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 22 13:36:55 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1462) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       2644
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       10081
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       41
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       26 (Min/Max 16/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       50724869
222 Loaded_Hours            0x0032   075   075   000    Old_age   Always       -       10041
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       528
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10046         -
# 2  Short offline       Completed without error       00%     10022         -
# 3  Short offline       Completed without error       00%      9998         -
# 4  Short offline       Completed without error       00%      9974         -
# 5  Short offline       Completed without error       00%      9950         -
# 6  Short offline       Completed without error       00%      9926         -
# 7  Short offline       Completed without error       00%      9902         -
# 8  Short offline       Completed without error       00%      9878         -
# 9  Short offline       Completed without error       00%      9854         -
#10  Short offline       Completed without error       00%      9830         -
#11  Short offline       Completed without error       00%      9806         -
#12  Short offline       Completed without error       00%      9782         -
#13  Short offline       Completed without error       00%      9758         -
#14  Short offline       Completed without error       00%      9734         -
#15  Short offline       Completed without error       00%      9710         -
#16  Short offline       Completed without error       00%      9686         -
#17  Short offline       Completed without error       00%      9662         -
#18  Short offline       Completed without error       00%      9638         -
#19  Short offline       Completed without error       00%      9614         -
#20  Short offline       Completed without error       00%      9590         -
#21  Short offline       Completed without error       00%      9566         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdk
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG08ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG08ACA16TE
Serial Number:    92T0A0KSFVGG
LU WWN Device Id: 5 000039 c08d88ae7
Firmware Version: 0103
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 22 13:36:58 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1475) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       2772
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       10081
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       44
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       26 (Min/Max 15/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       236584964
222 Loaded_Hours            0x0032   075   075   000    Old_age   Always       -       10029
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       528
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10046         -
# 2  Short offline       Completed without error       00%     10022         -
# 3  Short offline       Completed without error       00%      9998         -
# 4  Short offline       Completed without error       00%      9974         -
# 5  Short offline       Completed without error       00%      9950         -
# 6  Short offline       Completed without error       00%      9926         -
# 7  Short offline       Completed without error       00%      9902         -
# 8  Short offline       Completed without error       00%      9878         -
# 9  Short offline       Completed without error       00%      9854         -
#10  Short offline       Completed without error       00%      9830         -
#11  Short offline       Completed without error       00%      9806         -
#12  Short offline       Completed without error       00%      9782         -
#13  Short offline       Completed without error       00%      9758         -
#14  Short offline       Completed without error       00%      9734         -
#15  Short offline       Completed without error       00%      9710         -
#16  Short offline       Completed without error       00%      9686         -
#17  Short offline       Completed without error       00%      9662         -
#18  Short offline       Completed without error       00%      9638         -
#19  Short offline       Completed without error       00%      9614         -
#20  Short offline       Completed without error       00%      9590         -
#21  Short offline       Completed without error       00%      9566         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdl
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG08ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG08ACA16TE
Serial Number:    92T0A0M1FVGG
LU WWN Device Id: 5 000039 c08d88b2e
Firmware Version: 0103
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 22 13:37:01 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1450) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       2815
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       10081
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       28
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       27
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       43
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       26 (Min/Max 15/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   001   000    Old_age   Always       -       51118096
222 Loaded_Hours            0x0032   075   075   000    Old_age   Always       -       10030
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       531
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10046         -
# 2  Short offline       Completed without error       00%     10022         -
# 3  Short offline       Completed without error       00%      9998         -
# 4  Short offline       Completed without error       00%      9974         -
# 5  Short offline       Completed without error       00%      9950         -
# 6  Short offline       Completed without error       00%      9926         -
# 7  Short offline       Completed without error       00%      9902         -
# 8  Short offline       Completed without error       00%      9878         -
# 9  Short offline       Completed without error       00%      9854         -
#10  Short offline       Completed without error       00%      9830         -
#11  Short offline       Completed without error       00%      9806         -
#12  Short offline       Completed without error       00%      9782         -
#13  Short offline       Completed without error       00%      9758         -
#14  Short offline       Completed without error       00%      9734         -
#15  Short offline       Completed without error       00%      9710         -
#16  Short offline       Completed without error       00%      9686         -
#17  Short offline       Completed without error       00%      9662         -
#18  Short offline       Completed without error       00%      9638         -
#19  Short offline       Completed without error       00%      9614         -
#20  Short offline       Completed without error       00%      9590         -
#21  Short offline       Completed without error       00%      9566         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

I have now asked the system to shut down. Last time that took hours before it actually happened.

Once done, I will check the power and data cabling. I also thought that the HBA might be running too hot, since there isn’t a huge amount of airflow in that section of the case. Right now it’s ‘touchable’, but I don’t know how it is under load, and I once had a RAID controller that wouldn’t have survived without an 80mm fan strapped to it. I might try the same here.

From what I briefly read it seems that there are two versions of that card & AOC-S3008L-L8e is indeed the one by default flashed to IT mode, but they can cross flash between the two modes… sadly I got nothing else of use to offer to confirm if yours is in IT mode.

Yeah, dmesg is a pain to read since the timestamps are seconds from when the system originally booted… but looking at the fact that you have multiple drives reporting IO errors, I think we can assume that (hopefully) you don’t have multiple drives failing at the same time, and that the HBA and/or wiring is of more interest.
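On the timestamp problem: the util-linux dmesg has a -T (--ctime) option that prints wall-clock times instead of seconds since boot. Below is a minimal sketch for pulling out the storage-related lines. The function name hba_errors and the search patterns are assumptions (mpt3sas is the Linux driver used for SAS3008-based HBAs), so adjust them to whatever the log actually contains.

```shell
# Filter kernel log lines that look related to the HBA or to failed disk I/O.
# The pattern list is a guess at useful keywords, not an exhaustive set.
hba_errors() {
    grep -iE 'mpt3sas|i/o error'
}
```

Usage would be something like `dmesg -T | hba_errors`. Note that the -T timestamps can be inaccurate if the machine has been suspended and resumed since boot.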

smartctl seems to show no obvious failure other than the fact you’ve neglected to set up smartctl long tests. Oh, also disk sn#92T0A0KSFVGG is complaining that you dropped it physically (g-sense error). However, considering we have multiple IO errors, I don’t think that is the cause of your problems.
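To spot things like that G-Sense count without reading every table by eye, a small filter over the attribute table can help. This is only a sketch: the function name smart_nonzero and the choice of watched IDs (5, 191, 196, 197, 198, 199) are assumptions, and it relies on the raw value being the 10th column, as in the smartctl output posted above.

```shell
# Read `smartctl -A` (or -a) output on stdin and print the watched
# attributes whose raw value (10th column) is not zero.
smart_nonzero() {
    awk '$1 ~ /^(5|191|196|197|198|199)$/ && $10 + 0 != 0 { print $1, $2, $10 }'
}
```

Usage would be something like `smartctl -A /dev/sdk | smart_nonzero`. An empty result means all watched raw values are 0.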

Overheating of the HBA is a strong possibility - I keep forgetting not everyone just dumps fans into every nook of their case.

I’d investigate the HBA & try to find a way to confirm it is in IT mode & that it isn’t overheating; start with whichever sounds easier. Also set up some smartctl -t long tests once you’ve got everything working.

Hopefully that either works or smarter minds than mine on the forums can give better advice.

You can use zpool clear kea and then zpool scrub kea in order to confirm the issue.
Does sas3flash -list work?

Not surprising because you’d need sas3flash for SAS3 cards.

I guess that makes sense… :slight_smile: and sas3flash indeed works

sas3flash
root@truenas[/mnt/kea/home/jonas]# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5003048-0-2415-a001
        NVDATA Version (Default)       : 0e.01.30.28
        NVDATA Version (Persistent)    : 0e.01.30.28
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.01.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : LSI3008-IT
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : LSI3008-IT
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

I hooked up a small 80mm fan directly in front of the HBA. It’s not powerful but should help somewhat.

After booting again, the errors no longer displayed in zpool status. I ran zpool clear and have now re-started the scrub. Let’s see what happens…

I am happy to report the scrub finished through the night. Therefore the HBA getting too hot is likely the culprit of the issues I saw.

I did get the following messages while the scrub was running about one disk being removed. Not sure what happened there, but I’ve seen this before with the same disk.

device fault and resilver
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 21
class: statechange
state: REMOVED
host: truenas
time: 2024-04-22 22:33:05+0200
vpath: /dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0
vguid: 0x13FFADB518C74464
pool: kea (0x0E5F1A98AC2F920F)

--- 
ZFS has finished a resilver:

eid: 33
class: resilver_finish
host: truenas
time: 2024-04-22 22:33:57+0200
pool: kea
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: resilvered 505M in 00:00:25 with 0 errors on Mon Apr 22 22:33:57 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        kea                                       ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            4cfc32d2-5950-49a9-b5fa-f1b1cb6ac9ed  ONLINE       0     0     0
            39d9d498-4434-4de2-8561-fb77b95bf4f0  ONLINE       0     0     0
            e82e6527-35ac-4c6b-b174-6042b5f18543  ONLINE       0     0     0
            fb7756a4-bd7c-4acf-8d62-c0de10a82cdb  ONLINE       0     0     0
            d2c76d13-6db2-497a-a2fa-a7db4cd28009  ONLINE       0     0     0

errors: No known data errors

I’ve now also started SMART long tests on all disks in the pool and will schedule them to run regularly (thanks for the hint Fleshmauler).
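For anyone following along, starting the long tests from the shell can be sketched roughly as below. The function name start_long_tests and the SMARTCTL override are assumptions made for illustration; the device names have to be supplied by you, since they can change between boots. On TrueNAS, scheduling the tests through the UI is probably the more durable option.

```shell
# SMARTCTL can be set to `echo` for a dry run that only prints the arguments.
SMARTCTL="${SMARTCTL:-smartctl}"

# Start an extended (long) self-test on each device name given as an argument.
start_long_tests() {
    for dev in "$@"; do
        "$SMARTCTL" -t long "/dev/$dev"
    done
}
```

Usage would be something like `start_long_tests sda sdi sdj sdk sdl`. Going by the "Extended self-test routine recommended polling time" in the output above (roughly 1450 to 1475 minutes), each test takes about a day on these drives.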

For the amusement/condemnation of the crowd, here’s what I jerry-rigged together last night to cool the HBA down. Will work on something better :wink:

zip ties ftw

Wow! This old firmware may not be the cause of your issues, but you want to upgrade it to 16.00.12.00.
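A quick way to check whether a card is below that version is to compare the "Firmware Version" field from sas3flash -list against the target. The sketch below assumes GNU sort with -V (version sort) is available; the helper names fw_version and version_lt are made up for this example.

```shell
# Extract the "Firmware Version" field from `sas3flash -list` output on stdin.
fw_version() {
    awk -F': *' '/Firmware Version/ { sub(/[ \t\r]+$/, "", $2); print $2; exit }'
}

# Succeed if version $1 sorts strictly before version $2 (GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

Usage would be something like `version_lt "$(sas3flash -list | fw_version)" 16.00.12.00 && echo "firmware is older than 16.00.12.00"`.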

It is indeed. To be honest, it wasn’t on my radar at all. I plugged it in, it worked, and I never thought about it. Upgrading that is next on the list after the long tests have finished. Appreciate the hint! :+1:

Server got into errors again while the SMART long tests were still at 10%. I was (re-)starting some apps as well at the time, so I can’t say for sure what caused it this time around but I doubt it was the apps.

zpool status
root@truenas[/mnt/kea/home/jonas]# zpool status -vL kea
  pool: kea
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: resilvered 505M in 00:00:25 with 0 errors on Mon Apr 22 22:33:57 2024
config:

        NAME        STATE     READ WRITE CKSUM
        kea         ONLINE       0     0     0
          raidz2-0  ONLINE      21    16     0
            sdk2    ONLINE       3    22     0
            sdl2    ONLINE       3    21     0
            sdi2    ONLINE       3    18     0
            sdj2    ONLINE       3    18     0
            sda2    ONLINE       0     0     0

However, again /sda is the only drive without errors. The long test on /sda also kept running along, while those for the other drives stopped.
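To compare which devices carry errors from one incident to the next, the config table can be reduced to just the rows with non-zero counters. This is a rough sketch that assumes the column layout shown above (NAME, STATE, READ, WRITE, CKSUM); the function name zpool_error_devs is invented for illustration.

```shell
# Read `zpool status` output on stdin and print name, READ, WRITE and CKSUM
# for every row of the config table that has at least one non-zero counter.
zpool_error_devs() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }  # table header
        $1 == "errors:"               { in_cfg = 0 }        # end of table
        in_cfg && NF >= 5 && ($3 + $4 + $5) > 0 { print $1, $3, $4, $5 }
    '
}
```

Usage would be something like `zpool status -vL kea | zpool_error_devs`. Abbreviated counters such as 23.7K are printed as they appear.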

This made me think that maybe it’s a power issue after all and /sda happens to have its own 12V rail while the others are all on one. This indeed happened to be the case. I’ve now distributed power to the HDDs as 2+2+1 (with the SSDs on the remaining fourth rail).

I then started another scrub task and within 15 minutes or so got the same issue again with UUID 39d9d498-4434-4de2-8561-fb77b95bf4f0 supposedly having been removed and re-appearing on its own seconds later. This also cancelled the scrub task, so the one last night didn’t actually complete either.

This is the same drive that has the G-Sense_Error_Rate of 1, maybe it really has an issue? A SMART short test on it completes fine, I now started another long one just on this drive.
I could also try another cable to the HBA but my trust in this drive is fading.

What I don’t understand is why I always get errors on (the same) four drives. Shouldn’t it be just this single one if it were the sole culprit?


Independent of that, is my understanding correct that this is how I can update my HBA’s firmware? Do I need the BIOS part as well?

wget https://www.supermicro.com/wdl/driver/SAS/Broadcom/3008/Firmware/3008_FW_PH16.00.14.00.rar
7z t 3008_FW_PH16.00.14.00.rar
7z x 3008_FW_PH16.00.14.00.rar
cd IT/UEFI
sas3flash -l 2024-04-23_firmware_update.txt -o -f SAS9300_8i_IT.bin

I wouldn’t place a bet on that. I was right.

I don’t believe it’s the PSU either… Which cables to connect the drives to the HBA? Are all drives connected to it?

SMART long on that one drive and that one only is still running and made it to 30%. That’s further than ‘ever before’.

I should maybe note that I ran disk_burnin on all of these drives, followed by SMART long tests, before even setting up the pool. Those tests also ran in parallel, so I’m not sure what has broken since.

Which cables to connect the drives to the HBA? Are all drives connected to it?

All five drives are hooked up to the same AOC-S3008L-L8E using two 1x SFF-8643 to 4x SATA cables (Supermicro CBL-SAST-0556). Two drives are on one cable, three on the other.

You did what?

Sorry, ran this thing

SMART long on the single disk (that always gets ‘removed’ and has G-Sense_Error_Rate of 1) completed fine this morning.

smartctl -a /dev/sdf2

root@truenas[/mnt/kea/home/jonas]# smartctl -a /dev/sdf2                           
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG08ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG08ACA16TE
Serial Number:    92T0A0KSFVGG
LU WWN Device Id: 5 000039 c08d88ae7
Firmware Version: 0103
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 24 12:56:52 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1475) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8340
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       10126
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       56
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       23 (Min/Max 15/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       34471945
222 Loaded_Hours            0x0032   075   075   000    Old_age   Always       -       10068
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       595
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     10125         -
# 2  Short offline       Completed without error       00%     10103         -
# 3  Short offline       Completed without error       00%     10046         -
# 4  Short offline       Completed without error       00%     10022         -
# 5  Short offline       Completed without error       00%      9998         -
# 6  Short offline       Completed without error       00%      9974         -
# 7  Short offline       Completed without error       00%      9950         -
# 8  Short offline       Completed without error       00%      9926         -
# 9  Short offline       Completed without error       00%      9902         -
#10  Short offline       Completed without error       00%      9878         -
#11  Short offline       Completed without error       00%      9854         -
#12  Short offline       Completed without error       00%      9830         -
#13  Short offline       Completed without error       00%      9806         -
#14  Short offline       Completed without error       00%      9782         -
#15  Short offline       Completed without error       00%      9758         -
#16  Short offline       Completed without error       00%      9734         -
#17  Short offline       Completed without error       00%      9710         -
#18  Short offline       Completed without error       00%      9686         -
#19  Short offline       Completed without error       00%      9662         -
#20  Short offline       Completed without error       00%      9638         -
#21  Short offline       Completed without error       00%      9614         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
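If you want to check the same attributes across several drives without reading each table by hand, a rough Python sketch along these lines could help. It only parses the text of the attribute table as printed above. The watched IDs (5, 197, 198, 199) are the ones checked earlier in this thread, the function names are invented for illustration, and taking the first token of RAW_VALUE as an integer is an assumption that fits this Toshiba output but may not fit every vendor.

```python
# A few rows copied from the `smartctl -a /dev/sdf2` attribute table above.
SMART_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       23 (Min/Max 15/38)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Attribute IDs this thread looks at; adjust to taste.
WATCHED_IDS = (5, 197, 198, 199)


def parse_attributes(text):
    """Return {id: (name, raw_value)} for every attribute row in the text.

    Assumption: the first token of the RAW_VALUE column is an integer,
    e.g. '23 (Min/Max 15/38)' is read as 23.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # the 10th field is the whole raw value
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # not an attribute row
        raw = int(fields[9].split()[0])
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def nonzero_watched(text, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is not zero."""
    attrs = parse_attributes(text)
    return {i: attrs[i] for i in watched if i in attrs and attrs[i][1] != 0}


if __name__ == "__main__":
    flagged = nonzero_watched(SMART_SAMPLE)
    print("watched attributes with non-zero raw value:", flagged or "none")
```

For the output above this flags nothing, since the only non-zero counter of interest (G-Sense_Error_Rate = 1) is not in the watched set.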

I’ll do long tests on all other disks as well but first…

I have now upgraded the HBA’s firmware to 16.00.14.00, opting for the safer route of doing this from the UEFI shell rather than on the live system. Only the firmware was updated, not the BIOS.

Details if anyone finds this via Google

If Google sends anyone else here first: it took me a while to come across this guide for flashing the firmware.

I also only now came across the fun that was in the past with TrueNAS (Core) pretty much requiring the ‘special’ 16.00.12.00 firmware to work properly.

What I did specifically

  • Created a FAT32 formatted USB Stick
  • (Downloaded UEFI shell ISO from here. Extracted ISO and copied UEFI-Shell-2.2-23H2-RELEASE\efi\boot\bootx64.efi to USB stick in /efi/boot/bootx64.efi (via here, but this would not have been necessary, there already is a built-in UEFI shell I could’ve used))
  • Downloaded HBA firmware from here
  • Extracted archive contents and copied contents of 3008_FW_PH16.00.14.00\IT\UEFI to root of USB stick
  • Plugged in the USB stick, entered the boot menu, selected the UEFI shell. In there:
  • fs0:
  • sas3flash.efi -listall
  • sas3flash.efi -c 0 -list
  • sas3flash.efi -c 0 -l 2024-04-24_firmware-upgrade.txt -o -f 3008IT16.ROM
  • sas3flash.efi -c 0 -list

With the updated firmware I then triggered another scrub, which made it further than before but still crashed eventually. Again /dev/sdf was removed and, unlike in previous rounds, this time /dev/sda as well. Both came back, a resilver ran, and the pool is now back without errors.

dmesg
[11012.707687] sd 0:0:2:0: device_block, handle(0x000a)
[11012.708302] sd 0:0:3:0: device_block, handle(0x000b)
[11014.707107] sd 0:0:2:0: device_unblock and setting to running, handle(0x000a)
[11014.707620] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[11014.707629] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[11014.707641] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[11014.707663] sd 0:0:2:0: [sdf] tag#866 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[11014.707671] sd 0:0:2:0: [sdf] tag#866 CDB: Read(16) 88 00 00 00 00 02 0a c5 43 98 00 00 07 b0 00 00
[11014.707674] I/O error, dev sdf, sector 8770634648 op 0x0:(READ) flags 0x0 phys_seg 111 prio class 2
[11014.707686] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=4488417390592 size=1007616 flags=1074267312
[11014.707713] sd 0:0:2:0: [sdf] tag#876 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[11014.707717] sd 0:0:2:0: [sdf] tag#876 CDB: Read(16) 88 00 00 00 00 02 0a c5 4b 48 00 00 07 c0 00 00
[11014.707720] I/O error, dev sdf, sector 8770636616 op 0x0:(READ) flags 0x0 phys_seg 95 prio class 2
[11014.707726] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=4488418398208 size=1015808 flags=1074267312
[11014.707758] sd 0:0:3:0: device_unblock and setting to running, handle(0x000b)
[11014.707912] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[11014.707916] sd 0:0:2:0: [sdf] tag#902 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[11014.707922] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[11014.707925] sd 0:0:2:0: [sdf] tag#902 CDB: Read(16) 88 00 00 00 00 02 0a c5 5b 08 00 00 07 c0 00 00
[11014.707929] I/O error, dev sdf, sector 8770640648 op 0x0:(READ) flags 0x0 phys_seg 113 prio class 2
[11014.707939] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=4488420462592 size=1015808 flags=1074267312
[11014.707959] sd 0:0:2:0: [sdf] tag#903 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[11014.707960] sd 0:0:3:0: [sda] tag#894 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[11014.707963] sd 0:0:2:0: [sdf] tag#903 CDB: Read(16) 88 00 00 00 00 00 00 40 02 90 00 00 00 10 00 00
[11014.707966] I/O error, dev sdf, sector 4194960 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[11014.707970] sd 0:0:3:0: [sda] tag#894 CDB: Read(16) 88 00 00 00 00 02 0a c5 5c 60 00 00 00 58 00 00
[11014.707972] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=270336 size=8192 flags=721089
[11014.707976] I/O error, dev sda, sector 8770640992 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[11014.707988] sd 0:0:2:0: [sdf] tag#904 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[11014.707990] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488420638720 size=45056 flags=1573040
[11014.707995] sd 0:0:2:0: [sdf] tag#904 CDB: Read(16) 88 00 00 00 00 07 46 bf fa 90 00 00 00 10 00 00
[11014.707999] I/O error, dev sdf, sector 31251757712 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[11014.708007] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=15998752399360 size=8192 flags=721089
[11014.708016] sd 0:0:3:0: [sda] tag#892 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[11014.708021] sd 0:0:2:0: [sdf] tag#905 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[11014.708027] sd 0:0:3:0: [sda] tag#892 CDB: Read(16) 88 00 00 00 00 02 0a c5 5c 08 00 00 00 58 00 00
[11014.708028] sd 0:0:2:0: [sdf] tag#905 CDB: Read(16) 88 00 00 00 00 07 46 bf fc 90 00 00 00 10 00 00
[11014.708032] I/O error, dev sdf, sector 31251758224 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[11014.708034] I/O error, dev sda, sector 8770640904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[11014.708039] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=15998752661504 size=8192 flags=721089
[11014.708048] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488420593664 size=45056 flags=1573040
[11014.708711] sd 0:0:2:0: [sdf] tag#886 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[11014.708983] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[11014.709201] sd 0:0:2:0: [sdf] tag#886 CDB: Read(16) 88 00 00 00 00 02 0a c5 53 08 00 00 08 00 00 00
[11014.709422] sd 0:0:3:0: [sda] tag#833 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
[11014.709540] I/O error, dev sdf, sector 8770638600 op 0x0:(READ) flags 0x0 phys_seg 104 prio class 2
[11014.709737] sd 0:0:3:0: [sda] tag#833 CDB: Read(16) 88 00 00 00 00 02 0a c5 5c b8 00 00 06 c0 00 00
[11014.710017] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=4488419414016 size=1048576 flags=1074267312
[11014.721807] I/O error, dev sda, sector 8770641080 op 0x0:(READ) flags 0x0 phys_seg 95 prio class 2
[11014.722115] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488420683776 size=884736 flags=1074267312
[11014.722760] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488421568512 size=1015808 flags=1074267312
[11014.723334] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488422584320 size=1015808 flags=1074267312
[11014.723931] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=15998752399360 size=8192 flags=721089
[11014.724003] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488423600128 size=1019904 flags=1074267312
[11014.724745] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=15998752661504 size=8192 flags=721089
[11014.725391] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488424620032 size=1028096 flags=1074267312
[11014.726001] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=270336 size=8192 flags=721089
[11014.726700] zio pool=kea vdev=/dev/disk/by-partuuid/e82e6527-35ac-4c6b-b174-6042b5f18543 error=5 type=1 offset=4488425648128 size=1015808 flags=1074267312
[11014.732047] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=270336 size=8192 flags=721601
[11014.733520] zio pool=kea vdev=/dev/disk/by-partuuid/39d9d498-4434-4de2-8561-fb77b95bf4f0 error=5 type=1 offset=15998752399360 size=8192 flags=721601
[11014.779694] sd 0:0:2:0: [sdf] Synchronizing SCSI cache
[11014.788483] sd 0:0:2:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[11014.789018] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
[11014.789636] mpt3sas_cm0: removing handle(0x000a), sas_addr(0x4433221101000000)
[11014.789986] mpt3sas_cm0: enclosure logical id(0x500304802415a001), slot(1)
[11014.790295] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[11014.839578] sd 0:0:3:0: [sda] Synchronizing SCSI cache
[11014.840040] sd 0:0:3:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[11014.840670] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
[11014.841022] mpt3sas_cm0: removing handle(0x000b), sas_addr(0x4433221102000000)
[11014.841345] mpt3sas_cm0: enclosure logical id(0x500304802415a001), slot(2)
[11014.841657] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[11028.707918] mpt3sas_cm0: handle(0xb) sas_address(0x4433221102000000) port_type(0x1)
[11028.967024] scsi 0:0:5:0: Direct-Access     ATA      TOSHIBA MG08ACA1 0103 PQ: 0 ANSI: 6
[11028.967302] scsi 0:0:5:0: SATA: handle(0x000b), sas_addr(0x4433221102000000), phy(2), device_name(0x0000000000000000)
[11028.967602] scsi 0:0:5:0: enclosure logical id (0x500304802415a001), slot(2) 
[11028.967855] scsi 0:0:5:0: enclosure level(0x0000), connector name(     )
[11028.968285] scsi 0:0:5:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[11028.968562] scsi 0:0:5:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[11029.045669] sd 0:0:5:0: Attached scsi generic sg3 type 0
[11029.045682] sd 0:0:5:0: Power-on or device reset occurred
[11029.047241]  end_device-0:5: add: handle(0x000b), sas_addr(0x4433221102000000)
[11029.053780] sd 0:0:5:0: [sda] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
[11029.055078] sd 0:0:5:0: [sda] 4096-byte physical blocks
[11029.062986] sd 0:0:5:0: [sda] Write Protect is off
[11029.064300] sd 0:0:5:0: [sda] Mode Sense: 9b 00 10 08
[11029.073834] sd 0:0:5:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[11029.132835]  sda: sda1 sda2
[11029.134327] sd 0:0:5:0: [sda] Attached SCSI disk
[11029.207921] mpt3sas_cm0: handle(0xa) sas_address(0x4433221101000000) port_type(0x1)
[11029.466692] scsi 0:0:6:0: Direct-Access     ATA      TOSHIBA MG08ACA1 0103 PQ: 0 ANSI: 6
[11029.467936] scsi 0:0:6:0: SATA: handle(0x000a), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)
[11029.469069] scsi 0:0:6:0: enclosure logical id (0x500304802415a001), slot(1) 
[11029.469837] scsi 0:0:6:0: enclosure level(0x0000), connector name(     )
[11029.470199] scsi 0:0:6:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[11029.471057] scsi 0:0:6:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[11029.547242] sd 0:0:6:0: Attached scsi generic sg4 type 0
[11029.547298] sd 0:0:6:0: Power-on or device reset occurred
[11029.547942]  end_device-0:6: add: handle(0x000a), sas_addr(0x4433221101000000)
[11029.553276] sd 0:0:6:0: [sdf] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
[11029.553870] sd 0:0:6:0: [sdf] 4096-byte physical blocks
[11029.560662] sd 0:0:6:0: [sdf] Write Protect is off
[11029.560890] sd 0:0:6:0: [sdf] Mode Sense: 9b 00 10 08
[11029.570301] sd 0:0:6:0: [sdf] Write cache: enabled, read cache: enabled, supports DPO and FUA
[11029.635543]  sdf: sdf1 sdf2
[11029.635887] sd 0:0:6:0: [sdf] Attached SCSI disk
[11030.688277] block device autoloading is deprecated and will be removed.
[11031.115261] md: md126 stopped.
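To make a log like the one above easier to digest, here is a small Python sketch that counts the block-layer I/O errors per device, lists the SAS addresses the mpt3sas driver reported as removed, and decodes a READ(16) CDB into its start LBA and length. The regular expressions are written against the message formats in this excerpt only, and the function names are invented for illustration. The CDB layout (opcode 0x88, big-endian LBA in bytes 2 to 9, transfer length in bytes 10 to 13) is the standard SCSI one.

```python
import re
from collections import Counter

# A handful of lines copied from the dmesg excerpt above.
DMESG_SAMPLE = """\
[11014.707671] sd 0:0:2:0: [sdf] tag#866 CDB: Read(16) 88 00 00 00 00 02 0a c5 43 98 00 00 07 b0 00 00
[11014.707674] I/O error, dev sdf, sector 8770634648 op 0x0:(READ) flags 0x0 phys_seg 111 prio class 2
[11014.707720] I/O error, dev sdf, sector 8770636616 op 0x0:(READ) flags 0x0 phys_seg 95 prio class 2
[11014.707976] I/O error, dev sda, sector 8770640992 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[11014.789018] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
[11014.840670] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
"""

IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
PORT_REMOVED = re.compile(
    r"mpt3sas_transport_port_remove: removed: sas_addr\((0x[0-9a-f]+)\)"
)


def io_errors_per_device(log_text):
    """Count 'I/O error, dev X' lines per block device."""
    return Counter(m.group(1) for m in IO_ERROR.finditer(log_text))


def removed_sas_addresses(log_text):
    """SAS addresses reported as removed, in log order."""
    return PORT_REMOVED.findall(log_text)


def decode_read16(cdb_hex):
    """Decode a READ(16) CDB given as hex bytes into (start LBA, block count)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, blocks


if __name__ == "__main__":
    print(dict(io_errors_per_device(DMESG_SAMPLE)))
    print(removed_sas_addresses(DMESG_SAMPLE))
    lba, blocks = decode_read16("88 00 00 00 00 02 0a c5 43 98 00 00 07 b0 00 00")
    # 512-byte logical sectors, as reported for these drives.
    print(lba, blocks, blocks * 512)
```

For tag#866 the decoded LBA is 8770634648, the same sector the kernel reports in the following `I/O error` line, and 1968 blocks of 512 bytes give 1007616 bytes, which matches the `size=` of the corresponding zio message. So the kernel and ZFS messages are describing the same failed reads, not separate problems.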

Guess I’ll try switching connections to the HBA around next.

Smart long looking good - other stupid things I can think of:

  1. Confirm HBA is connected to the x16 slot (other two slots are x4 & HBA wants x8)
  2. I think you have two SATA ports available on the motherboard still; toss the most problematic drives directly to motherboard?
  1. I think you have two SATA ports available on the motherboard still; toss the most problematic drives directly to motherboard?

I spent the time since my last post changing the cables around on the HBA. I hooked the disk that always gets removed from the pool up to the three remaining connectors I have on the HBA. Every time it will eventually get removed when I trigger a scrub. Progress ranged from 65% to less than 5%.

Just this morning, I also tried hooking up the weird disk directly to the remaining SATA port on the mainboard (I omitted one SSD above in my system overview). That scrub is still running, but hasn’t been for long yet. (*Edit: It died like the others shortly after)

  1. Confirm HBA is connected to the x16 slot (other two slots are x4 & HBA wants x8)

What can I say. It is in the lower slot, a physical x8 that is only wired as x4. It’s right there, even in the quick reference, but I completely overlooked it. The only reason it’s down there and not in the x16 slot is that the large CPU cooler blocks the upper slot. Guess I’ll get a different cooler and move the HBA.

I was just about out of ideas, this gives hope - thank you!

That’s unusual.

That’s unusual.

That’s what I get for trying to re-purpose my old consumer hardware…

Thermalright True Spirit 140 BW