Multiple 'hung' offline smart tests, can't cancel

I’m having an issue stopping multiple offline SMART tests ‘running’ on several disks. I’m not sure how they were initiated.

For example, I have one drive that has 21 offline tests; smartctl indicates 100% remaining on all of them.
smartctl -X returns “self-testing aborted”, but that’s not the case.

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Self-test routine in progress 100%      1450         -
# 2  Offline             Self-test routine in progress 100%      1450         -
# 3  Offline             Self-test routine in progress 100%      1450         -
# 4  Offline             Self-test routine in progress 100%      1450         -
# 5  Offline             Self-test routine in progress 100%      1450         -
# 6  Offline             Self-test routine in progress 100%      1450         -
# 7  Offline             Self-test routine in progress 100%      1450         -
# 8  Offline             Self-test routine in progress 100%      1450         -
# 9  Offline             Self-test routine in progress 100%      1450         -
#10  Offline             Self-test routine in progress 100%      1450         -
#11  Offline             Self-test routine in progress 100%      1450         -
#12  Offline             Self-test routine in progress 100%      1450         -
#13  Offline             Self-test routine in progress 100%      1450         -
#14  Offline             Self-test routine in progress 100%      1450         -
#15  Offline             Self-test routine in progress 100%      1450         -
#16  Offline             Self-test routine in progress 100%      1450         -
#17  Offline             Self-test routine in progress 100%      1450         -
#18  Offline             Self-test routine in progress 100%      1450         -
#19  Offline             Self-test routine in progress 100%      1450         -
#20  Offline             Self-test routine in progress 100%      1450         -
#21  Offline             Self-test routine in progress 100%      1450         -

This is present on 9 of 24 disks, to varying degrees. All the tests are of type ‘offline’, all show 100% remaining, and none respond to a cancel.
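For reference, the abort attempt and re-check were done roughly like this (a sketch; /dev/sda stands in for whichever disk is affected):

    # Attempt to abort any running self-test (upper-case X)
    sudo smartctl -X /dev/sda
    # Re-read the self-test log to see whether the stuck entries cleared
    sudo smartctl -l selftest /dev/sda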

Any ideas?

We need to start by understanding your hardware configuration.

Please post full details of your hardware, including motherboard, SATA controllers, disks (incl. detailed model numbers), devices (lsblk -bo NAME,PTTYPE,TYPE,START,SIZE,PARTTYPENAME), pools (sudo zpool status -v, sudo zpool import -v), etc.

Since this pool is offline, please describe how it is configured.

Continuing the discussion from Multiple 'hung' offline smart tests, can't cancel:

Info on the disk in question is as follows:

Device Model:     TEAM T2532TB
Serial Number:    TPBF*********************
LU WWN Device Id: 0 000000 000000000
Firmware Version: HP3618C8
User Capacity:    2,048,408,248,320 bytes [2.04 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Oct  8 16:54:06 2024 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

System -

Intel I5 12400
ASUS H610-PLUS D4
64GB RAM
24 2TB SSDs
LSI 9305-24i, latest firmware/BIOS I could find

lsblk -bo NAME,PTTYPE,TYPE,START,SIZE,PARTTYPENAME
NAME        PTTYPE TYPE    START          SIZE PARTTYPENAME
sda         gpt    disk          2048408248320 
└─sda1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdb         gpt    disk          2048408248320 
└─sdb1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdc         gpt    disk          2048408248320 
└─sdc1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdd         gpt    disk          2048408248320 
└─sdd1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sde         gpt    disk          2048408248320 
└─sde1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdf         gpt    disk          2048408248320 
└─sdf1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdg         gpt    disk          2048408248320 
└─sdg1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdh         gpt    disk          2048408248320 
└─sdh1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdi         gpt    disk          2048408248320 
└─sdi1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdj         gpt    disk          2048408248320 
└─sdj1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdk         gpt    disk          2048408248320 
└─sdk1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdl         gpt    disk          2048408248320 
└─sdl1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdm         gpt    disk          2048408248320 
└─sdm1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdn         gpt    disk          2048408248320 
└─sdn1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdo         gpt    disk          2048408248320 
└─sdo1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdp         gpt    disk          2048408248320 
└─sdp1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdq         gpt    disk          2048408248320 
└─sdq1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdr         gpt    disk          2048408248320 
└─sdr1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sds         gpt    disk          2048408248320 
└─sds1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdt         gpt    disk          2048408248320 
└─sdt1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdu         gpt    disk          2048408248320 
└─sdu1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdv         gpt    disk          2048408248320 
└─sdv1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdw         gpt    disk          2048408248320 
└─sdw1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
sdx         gpt    disk          2048408248320 
└─sdx1      gpt    part     4096 2048405799424 Solaris /usr & Apple ZFS
nvme0n1     gpt    disk           128035676160 
├─nvme0n1p1 gpt    part     4096       1048576 BIOS boot
├─nvme0n1p2 gpt    part     6144     536870912 EFI System
├─nvme0n1p3 gpt    part 34609152  110315773440 Solaris /usr & Apple ZFS
└─nvme0n1p4 gpt    part  1054720   17179869184 Linux swap

sudo zpool status -v

  pool: Datastore
 state: ONLINE
  scan: scrub repaired 0B in 00:17:19 with 0 errors on Sun Sep  8 02:17:20 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Datastore                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            a87924ba-ab59-4faa-b4ab-08c4f5176bc7  ONLINE       0     0     0
            177cec98-134e-4811-8f76-bcb4c1aff10b  ONLINE       0     0     0
            b4bfa063-ab5c-4d6f-ab4c-2091eea0ec1a  ONLINE       0     0     0
            9ea85a97-c1c9-4c6d-8fd8-330495f50c00  ONLINE       0     0     0
            198be667-6ff1-4d9d-89d1-421a45ac4692  ONLINE       0     0     0
            0dd7a353-bdcc-4a9a-b8e9-8853839661db  ONLINE       0     0     0
            7c7e5b63-816e-4fa5-a58a-e9b617398e6b  ONLINE       0     0     0
            89fc77fd-63ef-4cff-a1b6-8a28a182d7b9  ONLINE       0     0     0
            9debc364-7057-4d9c-a3af-b72d6a2aa6ea  ONLINE       0     0     0
            99954aaa-56ec-4dbe-90e6-35215c54ac65  ONLINE       0     0     0
            05af729b-40c3-46ed-b18c-3a432f1d1cd7  ONLINE       0     0     0
            f2c0f8cc-379f-4e2c-8b19-e5e526fdf160  ONLINE       0     0     0
            02ea63b2-b9bf-47b1-85e1-e269d2676490  ONLINE       0     0     0
            0af9ef11-a00b-4096-a69c-b1a8648f9296  ONLINE       0     0     0
            49db4976-4d64-413d-92ee-3504ca501c9f  ONLINE       0     0     0
            d5e1162a-693a-491f-8853-d20412717eb4  ONLINE       0     0     0
            a6cde78c-b18b-43af-9aea-0b06544212d5  ONLINE       0     0     0
            1dbc9345-f2a3-4b01-9c5b-e29adce5073e  ONLINE       0     0     0
            bc50f56c-8f84-40f3-9477-6947e4b37f99  ONLINE       0     0     0
            867e2d02-3ce0-4259-b1af-fcaa839c1111  ONLINE       0     0     0
            1912395a-4896-4180-bbd9-502542cc1321  ONLINE       0     0     0
            fb3428e8-a431-454d-8e6f-8f4198dc8056  ONLINE       0     0     0
            7dc42c54-fd8d-470e-be27-02db2c048b9d  ONLINE       0     0     0
            9edb5cee-e3ae-4a66-bfd5-1439d6017a52  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:16 with 0 errors on Mon Oct  7 03:45:17 2024
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

sudo zpool import

no pools available to import

Since this pool is offline, please describe how it is configured.

The pool is not offline; the type of the queued tests is ‘offline’. I have no idea how they got there.

Of note, all the drives of the same model have the tests queued. Might be something to take up with the manufacturer.
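For context, ‘offline’ entries like these are normally created by an offline data-collection run, which can be started manually or scheduled by smartd. A sketch of both, assuming stock smartmontools (the device name is a placeholder, and note this drive later reports no auto-offline support):

    # Start offline data collection on one drive immediately
    sudo smartctl -t offline /dev/sda
    # smartd can also enable automatic offline testing via /etc/smartd.conf:
    #   /dev/sda -o on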

Edit -

Odd: I just ran smartctl -x on it again and it shows -

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Completed without error       00%      1578         -
# 2  Offline             Self-test routine in progress 10%      1578         -
# 3  Offline             Self-test routine in progress 10%      1578         -
# 4  Offline             Self-test routine in progress 10%      1578         -
# 5  Offline             Self-test routine in progress 10%      1578         -
# 6  Offline             Self-test routine in progress 10%      1578         -
# 7  Offline             Self-test routine in progress 10%      1578         -
# 8  Offline             Self-test routine in progress 10%      1578         -
# 9  Offline             Self-test routine in progress 10%      1578         -
#10  Offline             Self-test routine in progress 10%      1578         -
#11  Offline             Self-test routine in progress 10%      1578         -
#12  Offline             Self-test routine in progress 10%      1578         -
#13  Offline             Self-test routine in progress 10%      1578         -
#14  Offline             Self-test routine in progress 10%      1578         -
#15  Offline             Self-test routine in progress 10%      1578         -
#16  Offline             Self-test routine in progress 10%      1578         -
#17  Offline             Self-test routine in progress 10%      1578         -
#18  Offline             Self-test routine in progress 10%      1578         -
#19  Offline             Self-test routine in progress 10%      1578         -

Will monitor it and see if it gets through them now that it’s progressing; it has been hung with 21 in the queue for at least a week.

Disregard - smartctl -a still shows the output from the initial post, while -x shows the above… not sure of the difference.

smartctl -x /dev/sda (lower case “x”) shows the full extended data, a little bit more than -a.

Some drives do not behave as they should.

And you tried smartctl -X /dev/sda (upper case “X”) to abort the self-test?

I think you must be a bit confused or not communicating your problem well. You have 24 drives, all 24 drives are ONLINE in your pool. What other pool would you expect to import?

Other than the SMART testing saying the test is still in progress and listed a lot of times, the NAS should be running.

I recommend that you shut down the machine via the GUI or console, then unplug it for 30 seconds. Plug it back in, wait for the BMC (if you have one) to start beating again, power up, and see what the results are then.

If you need more help, I suggest you provide more information about your system so we can minimize the guessing. See the links in my signature.

smartctl -x /dev/sda (lower case “x”) shows the full extended data, a little bit more than -a.

I understand that part, as I read in the man page. The difference I was referring to is that they seem to show different results regarding the queued tests - one showing 21 queued tests with 100% remaining, the other showing 18 queued tests with 10% remaining, on the same drive.
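If it helps explain the mismatch: -a prints the standard SMART self-test log, while -x prints the Extended self-test log, and those are separate log pages on the drive, so flaky firmware can report them inconsistently. The two logs can also be queried directly (a sketch; the device name is a placeholder):

    # Standard self-test log (what -a shows)
    sudo smartctl -l selftest /dev/sda
    # Extended self-test log (what -x shows)
    sudo smartctl -l xselftest /dev/sda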

I think you must be a bit confused or not communicating your problem well. You have 24 drives, all 24 drives are ONLINE in your pool. What other pool would you expect to import? Other than the SMART testing saying the test is still in progress and listed a lot of times, the NAS should be running.

I’ve reread my original report; perhaps the last line was causing the confusion. The drives are not offline; the type of the queued tests is ‘offline’. I’ve edited it to be less confusing. There is no pool to be imported.

The system is running fine; the only problem is the queued tests on 9 drives that do not progress and, thus far, cannot be cancelled.

That in and of itself is not really a problem for me, except that other SMART tests cannot run on those nine drives while they are stuck with the other tests queued.

Shut down / power off the NAS and all the drives - that should discard the tests

I did so and drained the power; it seemed to fix one of them (/dev/sdh), but there are still 8 other drives with the same condition.

I tried -X on another (/dev/sdj), power cycled it, and there are still 8 affected drives.

I turned off smartd, just in case it was re-queuing jobs or something, repeated the whole process, and saw no change.

8 affected drives is better than 9, I suppose; not sure how /dev/sdh got fixed.
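In case it’s useful to anyone else, a quick survey of which drives still report stuck tests can be done with something like this (a sketch; adjust the device glob for your system):

    # Count 'in progress' self-test log entries on every SATA drive
    for d in /dev/sd[a-x]; do
      echo "== $d =="
      sudo smartctl -l selftest "$d" | grep -c 'in progress'
    done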

Please post the entirety of the output of smartctl -x /dev/sd?, where the question mark is one of the affected drives. Do not post just the sections you feel we need to see, like you did above; it just doesn’t paint the entire picture.

And powering off should kill any tests the drives are running.

While we are at it, explain this and provide the version you are running. If you are lucky, you are just running the wrong firmware but I don’t want to speculate.


Please post the entirety of the command output smartctl

 smartctl -x /dev/sdj
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TEAM T2532TB
Serial Number:    TPBF************************
LU WWN Device Id: 0 000000 000000000
Firmware Version: HP3618C8
User Capacity:    2,048,408,248,320 bytes [2.04 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Oct  9 09:11:57 2024 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Disabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  30) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   050    -    0
  5 Reallocated_Sector_Ct   -O--CK   100   100   050    -    0
  9 Power_On_Hours          -O--CK   100   100   050    -    1595
 12 Power_Cycle_Count       -O--CK   100   100   050    -    35
160 Unknown_Attribute       -O--CK   100   100   050    -    0
161 Unknown_Attribute       -O--CK   100   100   050    -    6500
163 Unknown_Attribute       -O--CK   100   100   050    -    584
164 Unknown_Attribute       -O--CK   100   100   050    -    0
165 Unknown_Attribute       -O--CK   100   100   050    -    0
166 Unknown_Attribute       -O--CK   100   100   050    -    0
167 Unknown_Attribute       -O--CK   100   100   050    -    0
168 Unknown_Attribute       -O--CK   100   100   050    -    0
169 Unknown_Attribute       -O--CK   100   100   050    -    100
175 Program_Fail_Count_Chip -O--CK   100   100   050    -    0
176 Erase_Fail_Count_Chip   -O--CK   100   100   050    -    0
177 Wear_Leveling_Count     -O--CK   100   100   050    -    9383
178 Used_Rsvd_Blk_Cnt_Chip  -O--CK   100   100   050    -    2
181 Program_Fail_Cnt_Total  -O--CK   100   100   050    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   050    -    0
192 Power-Off_Retract_Count -O--CK   100   100   050    -    31
194 Temperature_Celsius     -O--CK   100   100   050    -    40
195 Hardware_ECC_Recovered  -O--CK   100   100   050    -    0
196 Reallocated_Event_Count -O--CK   100   100   050    -    0
197 Current_Pending_Sector  -O--CK   100   100   050    -    0
198 Offline_Uncorrectable   -O--CK   100   100   050    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    0
232 Available_Reservd_Space -O--CK   100   100   050    -    100
241 Total_LBAs_Written      -O--CK   100   100   050    -    16856
242 Total_LBAs_Read         -O--CK   100   100   050    -    17901
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       GPL,SL  R/O      1  Summary SMART error log
0x02       GPL,SL  R/O      1  Comprehensive SMART error log
0x03       GPL,SL  R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06       GPL,SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09       GPL,SL  R/W      1  Selective self-test log
0x10       GPL,SL  R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0       GPL,SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 0 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Completed without error       00%      1591         -
# 2  Offline             Self-test routine in progress 10%      1591         -
# 3  Offline             Self-test routine in progress 10%      1591         -
# 4  Offline             Self-test routine in progress 10%      1591         -
# 5  Offline             Self-test routine in progress 10%      1591         -
# 6  Offline             Self-test routine in progress 10%      1591         -
# 7  Offline             Self-test routine in progress 10%      1591         -
# 8  Offline             Self-test routine in progress 10%      1591         -
# 9  Offline             Self-test routine in progress 10%      1591         -
#10  Offline             Self-test routine in progress 10%      1591         -
#11  Offline             Self-test routine in progress 10%      1591         -
#12  Offline             Self-test routine in progress 10%      1591         -
#13  Offline             Self-test routine in progress 10%      1591         -
#14  Offline             Self-test routine in progress 10%      1591         -
#15  Offline             Self-test routine in progress 10%      1591         -
#16  Offline             Self-test routine in progress 10%      1591         -
#17  Offline             Self-test routine in progress 10%      1591         -
#18  Offline             Self-test routine in progress 10%      1591         -
#19  Offline             Self-test routine in progress 10%      1591         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              35  ---  Lifetime Power-On Resets
0x01  0x010  4            1595  ---  Power-on Hours
0x01  0x018  6      1104674816  ---  Logical Sectors Written
0x01  0x020  6        16132843  ---  Number of Write Commands
0x01  0x028  6      1173159936  ---  Logical Sectors Read
0x01  0x030  6         9859684  ---  Number of Read Commands
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0009  2           57  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

While we are at it, explain this and provide the version you are running. If you are lucky, you are just running the wrong firmware but I don’t want to speculate.

[    1.090950] mpt3sas_cm0: FW Package Ver(16.00.12.00)
[    1.091401] mpt3sas_cm0: LSISAS3224: FWVersion(16.00.12.00), ChipRevision(0x01)
[    1.091406] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
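(That came from the kernel log; something like the following reproduces it, assuming the boot messages are still in the ring buffer:)

    sudo dmesg | grep -i mpt3sas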

Post output of: sas3flash -list

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

    Adapter Selected is a Avago SAS: SAS3224(A1)

    Controller Number              : 0
    Controller                     : SAS3224(A1)
    PCI Address                    : 00:01:00:00
    SAS Address                    : 500062b-2-03c4-82c0
    NVDATA Version (Default)       : 10.00.00.03
    NVDATA Version (Persistent)    : 10.00.00.03
    Firmware Product ID            : 0x2228 (IT)
    Firmware Version               : 16.00.12.00
    NVDATA Vendor                  : LSI
    NVDATA Product ID              : SAS9305-24i
    BIOS Version                   : 08.37.02.00
    UEFI BSD Version               : 17.00.00.00
    FCODE Version                  : N/A
    Board Name                     : SAS9305-24i
    Board Assembly                 : 03-25699-02004
    Board Tracer Number            : SP81724392

    Finished Processing Commands Successfully.
    Exiting SAS3Flash.

Looks like the latest firmware and IT mode. I’ve had hard drives before that refused to complete SMART tests; I think it was more the drives. Maybe look up your drive model and see if anyone else has trouble with it running SMART tests.

I had a little time to look into this issue and here are my results (you may not be pleased).

No doubt these are crappy drives. Why do I say that?

  1. The output of smartctl says it all. Crazy duplication of data.
  2. Self-test not running to completion.
  3. The firmware installed (it is the most current for that drive model) does not allow for a firmware upgrade, if one even existed.

It is possible that smartmontools is just decoding the SMART data incorrectly, but I suspect the data is not in a standardized format.
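One long shot: the drive reports “Not in smartctl database”, so smartctl is decoding its vendor attributes generically. smartmontools ships a drive-database updater; whether it is present or persistent on a TrueNAS appliance is another matter (a sketch, assuming a stock smartmontools install):

    # Show which preset, if any, smartctl has for this drive
    sudo smartctl -P show /dev/sdj
    # Fetch the latest drive database, if the script is installed and writable
    sudo update-smart-drivedb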

What can you do?
Two things:

  1. Do not test your drives and live with it. I do not recommend that at all.
  2. Troubleshoot until the problem can be isolated, or determine the drive is just poor quality.

First of all, choose one or two drives which are acting up and focus only on those right now. If we identify the problem, you can then go back and fix those.

  1. In the TrueNAS GUI (I’m using Electric Eel so if there is a difference, sorry for that), select Storage
  2. Next select Manage Disks for your Datastore pool.
  3. You should now be able to see all your drives. On the right side are some down-arrows. Start with the two drives you previously selected and click the down-arrow. This opens up the drive management screen.
  4. Verify (ensure) that HDD Standby: Always On and Adv. Power Management: Disabled, which are the typical default values as I recall. If they are not, click the Edit button and make the changes. If you want to screw around with other settings, please wait until we have hopefully solved the current problem.
  5. If you did not have to make any changes, please take a screen capture of the data and post it for one of the drives that is having this problem.
  6. Make sure that you saved any changes.
  7. Now we should start a SMART “Short” test (a CLI equivalent is sketched after this list). A short test takes no more than 2 minutes, so it is a quick test. Click on Manual Test and select Short from the drop-down box. A progress box should open and there should be an Expected Finished Time. Wait at least that long before continuing.
  8. This box will not close on its own; select Close.
  9. In the upper right corner of the screen there is an icon that looks like a clipboard, and there should be a red circle over it, which indicates the test is running. Once that red circle goes away, or 2 minutes have passed, click on SMART Test Results.
  10. Scroll to the top of the data. You “should” see that the disk completed a Short offline test, status is SUCCESS, remaining is 0%, lifetime is the power-on hours of the drive when the test completed, and hopefully there are no errors.
  11. IF THAT WORKS, and you made changes, make those changes for all the drives.
  12. IF THAT FAILS, post that screen shot I requested. And I’m almost out of ideas.
Let’s hope the steps above solve the problem.
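For reference, a CLI equivalent of steps 7 through 10 is roughly this (a sketch; the device name is a placeholder):

    # Start a short self-test
    sudo smartctl -t short /dev/sdj
    # After the estimated completion time (~2 minutes here), check the result
    sudo smartctl -l selftest /dev/sdj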

Let us know how it goes.


I’ve done the steps you outlined as part of my own troubleshooting.

I’m going to seek help elsewhere, as this seems not to be a TrueNAS issue.