SMART Tests Haven't Been Running?

Woke up this morning to two alerts: one saying my pool is degraded, a device having faulted due to persistent errors. The other saying that the ATA error count increased from 0 to 1.

I had a look at the drive in question, and it showed 18 ZFS errors, but looking at its SMART results, they’re all showing SUCCESS as the status…but they also all have “Short Offline” as the description. Does this mean the Long SMART tests, which I have scheduled for midnight every Tuesday and thus should have run a little over 12 hours ago, have not been running?

I ran a manual long test of the problem drive, and that doesn’t seem to be running either—there’s currently a scrub running, though, so is it just waiting for that to complete? I can’t see any sort of queue or anything though.

I’d like to help, but you have provided no information about your hardware or which version of TrueNAS you are running. Just the basics will do.

And I’m certain I can help if I know these little details.

Well, I don’t really know what to say in terms of hardware, there are a lot of components of varying levels of relevance. The disks are 20TB WD Red Pros. The OS is version 25.04.0.

Do you know what your motherboard is? How much RAM you have? The CPU make/model? How are the drives physically connected to the computer? Do you have an HBA? Make/Model. Is TrueNAS running on bare metal?

I am not asking for board level components.

ZFS errors and drive errors are not always the same thing. ZFS errors come from the file system, whilst drive errors are a form of physical damage. Drive errors can cause ZFS errors, but not always.

Take a look at the link in my signature called Drive Troubleshooting Flowcharts and it will guide you through some steps.

As for SMART not working, please be very descriptive in what you did to configure SMART testing in the GUI. Screenshots may be helpful here. Sometimes if you use a custom time, it may not work, but if you select a predetermined time, it should work.

Do you know what your motherboard is? How much RAM you have? The CPU make/model?

ASUS B660M-A, Intel Core i3-12100 (Alder Lake, 4 cores / 8 threads), 16GB of RAM.

How are the drives physically connected to the computer? Do you have an HBA? Make/Model. Is TrueNAS running on bare metal?

I don’t understand what this means.

As for SMART not working, please be very descriptive in what you did to configure SMART testing in the GUI.

I don’t know, it was a long time ago.


The extended test I ran manually does seem to have worked though it’s reporting failure with no errors which doesn’t make sense to me:
[Screenshot 2025-08-20 14.35.13]

But I have this alert as well:

HBA is a Host Bus Adapter, typically a PCIe card that connects to the drives.
Bare metal means TrueNAS is running directly on the computer, not as a virtual machine under Proxmox or ESXi, for example.

You have a few things going on that I would need information about in order to help out.

First I want to explain that my troubleshooting style with people on the internet is to assume nothing. I will treat you as if you know nothing at all, so that we do not assume something and then end up chasing a rabbit down a hole. This actually makes troubleshooting faster.

You do have a custom date/time set for the SMART test. This could be the issue, but I will not assume it is until we prove it.

You apparently have many drives since you have drive sdg.

Instructions:

  • lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
  • lspci
  • sas2flash -list (This or the next command may or may not work)
  • sas3flash -list
  • sudo zpool status -v
  • smartctl -x /dev/sdg (to see the error data for the drive)
  • smartctl -x /dev/sdc (to check the drive status; I may ask for all the drives later for completeness)
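
If pasting seven separate outputs is tedious, the commands listed above could be gathered into one block of text. The following is only a rough sketch in Python, not anything TrueNAS provides: the names COMMANDS, run_command and collect are made up for illustration, and it assumes the commands are on the PATH and that the script is run with enough privileges (smartctl normally needs root).

```python
import shlex
import subprocess

# The commands requested in the post above, copied verbatim.
COMMANDS = [
    "lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID",
    "lspci",
    "sas2flash -list",
    "sas3flash -list",
    "sudo zpool status -v",
    "smartctl -x /dev/sdg",
    "smartctl -x /dev/sdc",
]


def run_command(command):
    """Run one command and return stdout plus stderr as text.

    A missing binary (e.g. sas2flash on a system without that HBA tool)
    is reported in the output instead of raising.
    """
    try:
        result = subprocess.run(
            shlex.split(command), capture_output=True, text=True, timeout=300
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run: {exc}]"
    return result.stdout + result.stderr


def collect(commands, runner=run_command):
    """Return one text block with each command followed by its output."""
    sections = [f"$ {command}\n{runner(command)}" for command in commands]
    return "\n\n".join(sections)
```

Calling print(collect(COMMANDS)) would then produce a single paste-ready report. The runner argument exists only so the formatting can be tried without touching real hardware.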

That is enough to start with.

I need to ask you this and please do not take offence, this is not my objective, but you sound like you don’t know computer hardware much at all, and possibly not much about TrueNAS. Is that true? If true, as the system ages it will become more problematic for you to maintain, and you do need to know the hardware and the software it is running. TrueNAS is made for the commercial industry where IT people live and breathe this stuff. And while it is fairly easy to use with the GUI, a person must have some basic knowledge to configure and maintain it. Again, do not take offence, that is not my intention. I need to know who I’m working with to help you out.

The machine running TrueNAS is dedicated solely to TrueNAS, it’s not running anything else.

The drives are connected via…something I don’t remember the name of. I have two of them, they’re something like PCIe cards that have several hard drive connectors coming from them? I always forget what they’re called, and I got them on eBay so they’re harder to pull up the order details for than the hardware I got at the computer store.

I know a reasonable amount, I guess (compared to average, rather than this community), but it’s not something I actively keep on top of so I find myself having to refresh my memory every time something comes up.



smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red Pro
Device Model:     WDC WD201KFGX-68BKJN0
Serial Number:    2LGBTSTK
LU WWN Device Id: 5 000cca 2b3c55d0e
Firmware Version: 83.00A83
User Capacity:    20,000,588,955,648 bytes [20.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5816
ATA Version is:   ACS-5 (minor revision not indicated)
SATA Version is:  SATA 3.5, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Aug 21 03:51:32 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (2319) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
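
As a side note on reading the values above: the number smartctl prints in parentheses for "Self-test execution status" is a single byte, with the result code in the high four bits and the remaining work (in tens of percent) in the low four bits. The sketch below decodes it and also converts the recommended polling time to hours. The function names and the shortened wording in STATUS_MEANINGS are illustrative paraphrases, not smartctl's exact strings.

```python
# Result codes from the high nibble of the self-test execution status byte.
# Wording is paraphrased; codes not listed here are treated as reserved.
STATUS_MEANINGS = {
    0: "completed without error or never run",
    1: "aborted by host",
    2: "interrupted by host reset",
    3: "fatal or unknown error",
    4: "failed: unknown test element",
    5: "failed: electrical element",
    6: "failed: servo/seek element",
    7: "failed: read element",
    8: "failed: handling damage suspected",
    15: "in progress",
}


def decode_self_test_status(status_byte):
    """Split the status byte into (meaning, percent remaining)."""
    code = status_byte >> 4
    remaining_percent = (status_byte & 0x0F) * 10
    return STATUS_MEANINGS.get(code, "reserved"), remaining_percent


def minutes_to_hours(minutes):
    """Convert a polling time in minutes to hours."""
    return minutes / 60


print(decode_self_test_status(121))  # value reported for this drive
print(minutes_to_hours(2319))        # extended test polling time
```

For this drive, 121 decodes to a read-element failure with 90% of the test remaining, and 2319 minutes is a little under 39 hours, so an extended test on a 20 TB disk is not expected to finish overnight.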

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   001    -    0
  2 Throughput_Performance  --S---   148   148   054    -    49
  3 Spin_Up_Time            POS---   083   083   001    -    368 (Average 368)
  4 Start_Stop_Count        -O--C-   100   100   000    -    64
  5 Reallocated_Sector_Ct   PO--CK   100   100   001    -    0
  7 Seek_Error_Rate         -O-R--   100   100   001    -    0
  8 Seek_Time_Performance   --S---   140   140   020    -    15
  9 Power_On_Hours          -O--C-   098   098   000    -    25254
 10 Spin_Retry_Count        -O--C-   100   100   001    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    64
 22 Helium_Level            PO---K   100   100   025    -    6553700
 90 NAND_Master             P---CK   100   100   001    -    0x003c00000000
192 Power-Off_Retract_Count -O--CK   100   100   000    -    1115
193 Load_Cycle_Count        -O--C-   100   100   000    -    1115
194 Temperature_Celsius     -O----   041   041   000    -    51 (Min/Max 18/65)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    8
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O  17579  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x2f       GPL     R/O      1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb7           SL  VS       1  Device vendor specific log
0xd8-0xd9  GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 1
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 25213 hours (1050 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 43 00 00 00 00 00 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 02 f8 00 e0 00 08 8c d6 03 30 40 08 25d+22:16:31.088  READ FPDMA QUEUED
  60 00 30 00 20 00 08 8c d6 03 00 40 08 25d+22:16:28.694  READ FPDMA QUEUED
  60 04 38 00 f8 00 08 8c d5 fe c8 40 08 25d+22:16:28.688  READ FPDMA QUEUED
  60 03 58 00 68 00 08 8c d5 fb 48 40 08 25d+22:16:28.685  READ FPDMA QUEUED
  60 00 28 00 98 00 08 8c d5 fe a0 40 08 25d+22:16:28.685  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     25250         34527525448
# 2  Extended offline    Completed: read failure       20%     25238         34527525448
# 3  Short offline       Completed without error       00%     25205         -
# 4  Short offline       Completed without error       00%     25178         -
# 5  Short offline       Completed without error       00%     25154         -
# 6  Short offline       Completed without error       00%     25130         -
# 7  Short offline       Completed without error       00%     25106         -
# 8  Short offline       Completed without error       00%     25082         -
# 9  Short offline       Completed without error       00%     25058         -
#10  Short offline       Completed without error       00%     25034         -
#11  Short offline       Completed without error       00%     25010         -
#12  Short offline       Completed without error       00%     24986         -
#13  Short offline       Completed without error       00%     24962         -
#14  Short offline       Completed without error       00%     24938         -
#15  Short offline       Completed without error       00%     24914         -
#16  Short offline       Completed without error       00%     24890         -
#17  Short offline       Completed without error       00%     24866         -
#18  Short offline       Completed without error       00%     24842         -
#19  Short offline       Completed without error       00%     24818         -
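
To answer the "have the long tests been running?" question from a log like the one above, the entries can be tallied by test type. This is a minimal sketch that assumes the column layout shown here (fields separated by two or more spaces). SAMPLE_LOG holds only the first four rows copied from the log above, and the helper names are made up for illustration.

```python
import re
from collections import Counter

# First four rows of the self-test log above, copied verbatim.
SAMPLE_LOG = """\
# 1  Short offline       Completed: read failure       90%     25250         34527525448
# 2  Extended offline    Completed: read failure       20%     25238         34527525448
# 3  Short offline       Completed without error       00%     25205         -
# 4  Short offline       Completed without error       00%     25178         -
"""

# One log row: number, test type, status, remaining %, lifetime hours, first-error LBA.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s{2,}(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_self_test_log(text):
    """Return a list of dicts, one per recognised log row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


def summarize(rows):
    """Count rows per test type and collect rows that did not pass."""
    by_type = Counter(row["test"] for row in rows)
    failures = [row for row in rows if "without error" not in row["status"]]
    return by_type, failures


by_type, failures = summarize(parse_self_test_log(SAMPLE_LOG))
print(dict(by_type))
print([(row["test"], row["hours"], row["lba"]) for row in failures])
```

Run against the full 19-entry log, the same tally gives 18 short tests and a single extended test, the one at hour 25238.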

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    51 Celsius
Power Cycle Min/Max Temperature:     20/57 Celsius
Lifetime    Min/Max Temperature:     18/65 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)
Minimum supported ERC Time Limit:    70 (7.0 seconds)

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (125)

Index    Estimated Time   Temperature Celsius
 126    2025-08-21 01:44    50  *******************************
 ...    ..( 78 skipped).    ..  *******************************
  77    2025-08-21 03:03    50  *******************************
  78    2025-08-21 03:04    51  ********************************
  79    2025-08-21 03:05    50  *******************************
  80    2025-08-21 03:06    50  *******************************
  81    2025-08-21 03:07    50  *******************************
  82    2025-08-21 03:08    51  ********************************
  83    2025-08-21 03:09    50  *******************************
  84    2025-08-21 03:10    50  *******************************
  85    2025-08-21 03:11    51  ********************************
 ...    ..( 27 skipped).    ..  ********************************
 113    2025-08-21 03:39    51  ********************************
 114    2025-08-21 03:40    50  *******************************
 115    2025-08-21 03:41    51  ********************************
 ...    ..(  8 skipped).    ..  ********************************
 124    2025-08-21 03:50    51  ********************************
 125    2025-08-21 03:51    50  *******************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              64  ---  Lifetime Power-On Resets
0x01  0x010  4           25254  ---  Power-on Hours
0x01  0x018  6    203607966594  ---  Logical Sectors Written
0x01  0x020  6      1068247986  ---  Number of Write Commands
0x01  0x028  6    494633226928  ---  Logical Sectors Read
0x01  0x030  6      1494903124  ---  Number of Read Commands
0x01  0x038  6     90915209700  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           25242  ---  Spindle Motor Power-on Hours
0x03  0x010  4           25242  ---  Head Flying Hours
0x03  0x018  4            1115  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               9  ---  Read Recovery Attempts
0x03  0x030  4              29  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               1  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x04  0x018  4               0  ---  Physical Element Status Changed
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              51  ---  Current Temperature
0x05  0x010  1              52  N--  Average Short Term Temperature
0x05  0x018  1              49  N--  Average Long Term Temperature
0x05  0x020  1              65  ---  Highest Temperature
0x05  0x028  1              18  ---  Lowest Temperature
0x05  0x030  1              62  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              55  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4            8900  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             168  ---  Number of Hardware Resets
0x06  0x010  4              22  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x040  7               0  ---  Vendor Specific
0xff  0x048  7               0  ---  Vendor Specific
0xff  0x050  7               0  ---  Vendor Specific
0xff  0x058  7               0  ---  Vendor Specific
0xff  0x060  7               0  ---  Vendor Specific
0xff  0x068  7          501943  ---  Vendor Specific
0xff  0x070  7               0  ---  Vendor Specific
0xff  0x078  7               0  ---  Vendor Specific
0xff  0x080  7              45  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
Index                LBA    Hours
    0        34527525448    25238
    1        34527525449    25238
    2        34527525450    25238
    3        34527525451    25238
    4        34527525452    25238
    5        34527525453    25238
    6        34527525454    25238
    7        34527525455    25238
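
The eight pending LBAs above are consecutive, and the drive reports 512-byte logical sectors on 4096-byte physical sectors, so it is worth checking whether they all belong to one physical sector. A small sketch of that arithmetic follows; the function name is illustrative, and the sector sizes are taken from the "Sector Sizes" line earlier in this output.

```python
LOGICAL_SECTOR_BYTES = 512    # from "Sector Sizes" in the output above
PHYSICAL_SECTOR_BYTES = 4096

# The eight LBAs listed in the Pending Defects log above.
PENDING_LBAS = [34527525448 + i for i in range(8)]


def physical_sectors(lbas, logical=LOGICAL_SECTOR_BYTES, physical=PHYSICAL_SECTOR_BYTES):
    """Map logical LBAs to the distinct physical sector indexes they fall in."""
    per_physical = physical // logical  # 8 logical sectors per physical sector
    return sorted({lba // per_physical for lba in lbas})


print(physical_sectors(PENDING_LBAS))
```

All eight LBAs map to the same physical sector index, which would be consistent with Current_Pending_Sector showing 8 while only one 4K region on the platter is actually unreadable.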

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red Pro
Device Model:     WDC WD201KFGX-68BKJN0
Serial Number:    2LG3Y0NF
LU WWN Device Id: 5 000cca 2b3c1ca0e
Firmware Version: 83.00A83
User Capacity:    20,000,588,955,648 bytes [20.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5816
ATA Version is:   ACS-5 (minor revision not indicated)
SATA Version is:  SATA 3.5, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Aug 21 03:55:03 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (2110) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   001    -    0
  2 Throughput_Performance  --S---   148   148   054    -    50
  3 Spin_Up_Time            POS---   084   084   001    -    351 (Average 352)
  4 Start_Stop_Count        -O--C-   100   100   000    -    45
  5 Reallocated_Sector_Ct   PO--CK   100   100   001    -    0
  7 Seek_Error_Rate         -O-R--   100   100   001    -    0
  8 Seek_Time_Performance   --S---   140   140   020    -    15
  9 Power_On_Hours          -O--C-   098   098   000    -    25152
 10 Spin_Retry_Count        -O--C-   100   100   001    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    45
 22 Helium_Level            PO---K   100   100   025    -    6553700
 90 NAND_Master             P---CK   100   100   001    -    0x004800000000
192 Power-Off_Retract_Count -O--CK   100   100   000    -    1090
193 Load_Cycle_Count        -O--C-   100   100   000    -    1090
194 Temperature_Celsius     -O----   041   041   000    -    51 (Min/Max 18/65)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O  17579  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x2f       GPL     R/O      1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb7           SL  VS       1  Device vendor specific log
0xd8-0xd9  GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25148         -
# 2  Short offline       Completed without error       00%     25126         -
# 3  Short offline       Completed without error       00%     25103         -
# 4  Short offline       Completed without error       00%     25076         -
# 5  Short offline       Completed without error       00%     25052         -
# 6  Short offline       Completed without error       00%     25028         -
# 7  Short offline       Completed without error       00%     25004         -
# 8  Short offline       Completed without error       00%     24980         -
# 9  Short offline       Completed without error       00%     24956         -
#10  Short offline       Completed without error       00%     24932         -
#11  Short offline       Completed without error       00%     24908         -
#12  Short offline       Completed without error       00%     24884         -
#13  Short offline       Completed without error       00%     24860         -
#14  Short offline       Completed without error       00%     24836         -
#15  Short offline       Completed without error       00%     24812         -
#16  Short offline       Completed without error       00%     24788         -
#17  Short offline       Completed without error       00%     24764         -
#18  Short offline       Completed without error       00%     24740         -
#19  Short offline       Completed without error       00%     24716         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    51 Celsius
Power Cycle Min/Max Temperature:     19/56 Celsius
Lifetime    Min/Max Temperature:     18/65 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)
Minimum supported ERC Time Limit:    70 (7.0 seconds)

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (58)

Index    Estimated Time   Temperature Celsius
  59    2025-08-21 01:48    50  *******************************
 ...    ..( 66 skipped).    ..  *******************************
 126    2025-08-21 02:55    50  *******************************
 127    2025-08-21 02:56    51  ********************************
   0    2025-08-21 02:57    50  *******************************
   1    2025-08-21 02:58    50  *******************************
   2    2025-08-21 02:59    51  ********************************
 ...    ..( 15 skipped).    ..  ********************************
  18    2025-08-21 03:15    51  ********************************
  19    2025-08-21 03:16    50  *******************************
  20    2025-08-21 03:17    51  ********************************
 ...    ..( 36 skipped).    ..  ********************************
  57    2025-08-21 03:54    51  ********************************
  58    2025-08-21 03:55    50  *******************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              45  ---  Lifetime Power-On Resets
0x01  0x010  4           25152  ---  Power-on Hours
0x01  0x018  6    209550552323  ---  Logical Sectors Written
0x01  0x020  6      1081076182  ---  Number of Write Commands
0x01  0x028  6    511758302322  ---  Logical Sectors Read
0x01  0x030  6      1565184998  ---  Number of Read Commands
0x01  0x038  6     90548020450  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           25141  ---  Spindle Motor Power-on Hours
0x03  0x010  4           25140  ---  Head Flying Hours
0x03  0x018  4            1090  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4            2408  ---  Read Recovery Attempts
0x03  0x030  4              29  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x04  0x018  4               0  ---  Physical Element Status Changed
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              51  ---  Current Temperature
0x05  0x010  1              51  N--  Average Short Term Temperature
0x05  0x018  1              49  N--  Average Long Term Temperature
0x05  0x020  1              65  ---  Highest Temperature
0x05  0x028  1              18  ---  Lowest Temperature
0x05  0x030  1              62  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              56  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4            9460  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             131  ---  Number of Hardware Resets
0x06  0x010  4              19  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x040  7               0  ---  Vendor Specific
0xff  0x048  7               0  ---  Vendor Specific
0xff  0x050  7               0  ---  Vendor Specific
0xff  0x058  7               0  ---  Vendor Specific
0xff  0x060  7               0  ---  Vendor Specific
0xff  0x068  7          434983  ---  Vendor Specific
0xff  0x070  7             309  ---  Vendor Specific
0xff  0x078  7               0  ---  Vendor Specific
0xff  0x080  7             110  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

Well, you DO have two HBAs, which is one more than seems necessary, and the one that responded to sas2flash is running obsolete firmware. So upgrading from 15.00.00.00 to 20.00.07.00 should be high on the priority list.

It’s only one more than is necessary if one alone would support a sufficient number of drives, which it doesn’t.

The firmware issue has come up before and my recollection is that the upshot of the whole thing is that I do have the latest firmware for those devices.

If I’m not mistaken, you don’t:

The user in that post also had an H3-25113-03A (IBM 6Gb Perf HBA) and successfully updated to 20.00.07.00:

There were only 3 commands I actually needed. The firmware erased fine, and updated fine to version 20.00.07.00, and now shows up as an “LSI SAS9211-8i” unlike whatever ambiguous “IBM 6Gb Perf HBA” that it said before.

@koberulz
Understand that the folks responding to you know their stuff.

Can you explain why you think one HBA is not enough to run all your drives?

Realize that an HBA needs a lot of airflow to keep it cool. If you could remove one, you would reduce heat, power consumption, and your air conditioning bill.

And while this is not your immediate problem, it is something you should consider after the problem has been resolved.

According to the data you provided, it appears that SMART short tests are happening daily.
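If you want to check this yourself rather than eyeball it, here is a rough sketch that tallies the test types in a smartctl self-test log and shows the power-on-hour gap between consecutive entries. The two sample rows are copied from the output posted above; the regex and the function name are just my own assumptions about the column layout, so adjust them if your output looks different.

```python
import re
from collections import Counter

# Matches rows of the smartctl self-test log table, e.g.
# "#18  Short offline       Completed without error       00%     24740         -"
# (column layout assumed from the output posted in this thread)
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<kind>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)

def parse_selftest_log(text):
    """Return a list of (test_kind, status, lifetime_hours) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m["kind"], m["status"], int(m["hours"])))
    return rows

sample = """\
#18  Short offline       Completed without error       00%     24740         -
#19  Short offline       Completed without error       00%     24716         -
"""

rows = parse_selftest_log(sample)
# How many tests of each kind appear in the log
kinds = Counter(kind for kind, _, _ in rows)
# Power-on hours between each entry and the next (older) one
gaps = [newer[2] - older[2] for newer, older in zip(rows, rows[1:])]

print(kinds)
print(gaps)
```

A gap of 24 hours between entries is what a daily Short test looks like. A Long test would show up in the same table as "Extended offline", so a count of zero for that label means none have been logged.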

Holy Cow! 35.2 hours to run a SMART Long test! That is about 1.5 days.

We already know the drive has failed (Serial Number: 2LGBTSTK) and it may be under warranty so I’d start the RMA process.

Using the TrueNAS User Guide, “Replace” the failed drive.
Once the drive has been resilvered, run zpool clear MainStorage and then check zpool status MainStorage to verify that no error messages remain.

Now to the title of this thread, your SMART tests… As I said, the Short tests are running; however, the Long tests are not.

I suspect the cron job is not working for the custom configuration you have.

You can use one of the predefined settings to test this out if you desire.
Also, you probably should not run a SMART Long test on all the drives at once; space them out across the week.
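To plan the spacing, something like this works as a starting point. It only deals drives out to weekdays round-robin so you can see which ones to put in each scheduled task; it does not touch TrueNAS at all. The device names sda through sdh are hypothetical stand-ins for your eight drives.

```python
from itertools import cycle

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def stagger(drives, days=DAYS):
    """Assign drives to days round-robin so no day gets more than its share."""
    plan = {day: [] for day in days}
    for drive, day in zip(drives, cycle(days)):
        plan[day].append(drive)
    return plan

# Hypothetical device names for an 8-drive pool
drives = [f"sd{letter}" for letter in "abcdefgh"]
plan = stagger(drives)

for day, assigned in plan.items():
    print(day, ", ".join(assigned) or "-")
```

With eight drives and seven days, one day ends up with two drives. Since a Long test on these disks can run for more than a day, tests on adjacent days will still overlap somewhat, but that is far gentler than starting all eight at once.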

Your ToDo List:

  1. RMA your drive.
  2. If you have a spare drive, use the TrueNAS User Docs to “Replace” the failed drive.
  3. Reconfigure the Long tests to space the drives out over the week and to use a canned schedule, or at least try it again. Mine work fine for this version of TrueNAS.
  4. Consider reflashing the HBA to the current firmware.
  5. Explore why one HBA isn’t enough.

That is all for now. I will check back tomorrow.

I mean… I’m also here - so I wouldn’t go that far. I do however agree with the things in your post.

But you do not post terrible/bad or flat-out wrong advice. You are one of the good ones. Well, we have a lot of great people here. I certainly do not know it all. @etorix may know it all :joy:

I have room in the case for 20 drives. One HBA will not run 20 drives.

Not sure where you’re getting 35 hours to run the SMART test? I was seeing the results the same day it started.

WD is refusing to tell me the warranty status of the drive, so that’s fun.

2110 minutes is about 35.17 hours, and that assumes the drive is not busy reading or writing data, which would slow the SMART test down further.
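The arithmetic, in case anyone wants to plug in the figure for their own drive (the function name is just something I made up for the example):

```python
def minutes_to_hours_and_days(minutes):
    """Convert a test duration in minutes to (hours, days)."""
    hours = minutes / 60
    return hours, hours / 24

hours, days = minutes_to_hours_and_days(2110)
print(f"{hours:.2f} hours, about {days:.2f} days")
```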

The data you provided shows 8 drives in one pool and one NVMe drive. This is far from 20 drives. I can only act on what data you provide. If there are other drives, where are they? If you are only using 8 drives, and this is all you have been using for a year or more, my advice is to remove one of the HBAs for now. You can always add it back when you need to. But you do not have to remove it; we are just offering you friendly advice.

Here it is. Not sure what you were doing wrong, but it came up immediately for me. Maybe the server was down. It is in warranty, so submit an RMA. Not sure if they still do an Advanced RMA, where they ship you a drive first, then you replace your failed drive and ship the failed drive back in the same box. They require a credit card just in case you decide not to ship the failed drive back.

2LGBTSTK IN LIMITED WARRANTY WD201KFGX-68BKJN0 WDPARISD 7200 512M SATA3 6GB/S 20TB 18HD NAS 18-Aug-2027

Take your time doing this stuff. Read up on the replacement procedure; it is actually very easy. Even if you have hot-swap drive bays, power down to replace the drive; it is significantly safer this way.

Well as I said I had the results of the extended SMART test the same day I started it, so I’m not sure what’s going on there.

I have eight installed at the moment, but I built the system with expansion in mind. I initially started with just four.

I’m in Australia, and any time I reach out to WD they tell me they can’t help me, I need to speak to Australian support, they’ll send me a link to contact them. Then they send me a link to their APAC support line, where I get told they can’t help me, I need to speak to Australian support, they’ll send me a link. Round and round we go. Yes, their tool says “in limited warranty,” but I don’t know what those limits are and it’s no help with trying to RMA it. I’ve done it before, but I can’t for the life of me remember how I got in touch with them (and that one was DOA, so a very clear warranty case).

The other issue is figuring out which drive it is. I really should have noted serial numbers when I installed them.

Not sure what you mean about having hot-swap drive bays. You mean like on a DiskStation? It’s just a Fractal Design Define 7 XL.

Possibly for the failed result. For the other drives that report a pass, it will take that long. A Short test takes approximately 2 minutes; it is fast because it only performs a basic functional test.

I used US as the country code. That could be an issue.

This means the drive is covered until the date it expires, unless you did anything to void the warranty, like opening it up or exceeding the maximum temperature.

You should be able to do this RMA all online, no need to speak to anyone, or so that has been my experience.

You are not the first nor the last to have this issue. Power down and then pull each drive. Make a chart or spreadsheet with the serial numbers on them. Some people add a label to a location which can easily be seen. In TrueNAS you can add a comment for each drive, I use this to note the physical location of the drive.
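For the software half of that chart, a small sketch along these lines can turn the JSON output of `lsblk -d -J -o NAME,SERIAL,MODEL` into a device-to-serial table. The sample JSON is made up for illustration: the serial and model number are the ones from this thread, but the device name sda is only an assumption. You still need to read the labels on the drives to tie each serial to a physical slot.

```python
import json

def serial_chart(lsblk_json):
    """Map device name -> serial number from `lsblk -d -J -o NAME,SERIAL,MODEL` output."""
    data = json.loads(lsblk_json)
    return {dev["name"]: dev.get("serial") for dev in data["blockdevices"]}

# Illustrative sample only; the device name "sda" is hypothetical
sample = """
{"blockdevices": [
    {"name": "sda", "serial": "2LGBTSTK", "model": "WDC WD201KFGX-68BKJN0"}
]}
"""

chart = serial_chart(sample)
for name, serial in chart.items():
    print(f"/dev/{name}: {serial}")
```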

You would know if you have them. I do not trust them, but some places must remain online all the time, so having a high-reliability hot-swap bay is important. A lot of consumer ones are not that good. Always power down if you can, which is the point I was trying to make.

Maybe it finishes faster than the SMART report suggests.
Maybe it was already running the test when you tried to start one.

Are you planning on expanding in the near future? Otherwise you are paying electricity for something you’re not using, and you are exposing it to unnecessary wear and tear.

This report suggests your drives are running hot, and a lifetime maximum of 65 °C is worrisome.
You only have 8/20 drives in there so the temps are only going to go up.
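If you want to check the other drives the same way, here is a minimal sketch that pulls the temperature rows out of the Device Statistics section and compares the highest recorded temperature with the drive's own specified maximum. The sample rows are copied from the report above; the regex is my assumption about the column layout.

```python
import re

# Matches value rows such as
# "0x05  0x020  1              65  ---  Highest Temperature"
# (page, offset, size, value, 3-character flags, description)
LINE = re.compile(
    r"^0x[0-9a-f]{2}\s+0x[0-9a-f]{3}\s+\d+\s+(-?\d+)\s+\S{3}\s+(.+)$"
)

def device_stats(text):
    """Return {description: value} for each Device Statistics value row."""
    stats = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            stats[m.group(2).strip()] = int(m.group(1))
    return stats

sample = """\
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              51  ---  Current Temperature
0x05  0x020  1              65  ---  Highest Temperature
0x05  0x050  4            9460  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
"""

stats = device_stats(sample)
over_by = stats["Highest Temperature"] - stats["Specified Maximum Operating Temperature"]
print(f"Highest recorded is {over_by} degrees above the specified maximum")
print("Time in Over-Temperature counter:", stats["Time in Over-Temperature"])
```

A positive difference together with a non-zero Time in Over-Temperature counter means the drive has spent time above its rated limit, which is the concern here.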

What does your current cooling solution look like and are you planning on adding cooling as you add drives?

The two HBAs typically expect good airflow, how hot are they running?
Update the HBA firmware as per the previous recommendation, yours is very out of date.

A 9305-24i will.
Any HBA will if paired with an expander.

But the current results already show that the case, or the cooling solution, is not up to the task for eight drives; it’s not going to take twenty and be fine.

:confused: I certainly don’t, I just post too much for the sake of my own sanity.
Arwen or HoneyBadger know a lot more about ZFS innards. And “all”, if that’s even possible, is for the most skilled ZFS developers like ‘mav’.

No idea how to find that out.

It’s not something I’ve thought about…looks like I have two fans at the front of the case and one at the back, but only one of the two front fans is actually plugged into the motherboard for some reason.

Well that’s…expensive. And I’m not sure how it works? What I have has cables coming off it to connect the hard drives but that seems to lack those.

The manual explicitly depicts a 20-drive setup, so one assumes the case is capable of it. May just be as simple as plugging that other fan in?

It’s winter here right now, so the room is as cool as it’s ever going to get.

EDIT: Why does quoting not work properly?