Intermittent SMART errors?

I’ve noticed some odd intermittent SMART errors and was wondering whether anyone can tell if this drive is nearing death, or whether I have perhaps been too aggressive with short SMART tests (hourly)?

I’ve seen 6 of these read failures so far:

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     22756         11619635632
# 2  Short offline       Completed without error       00%     22755         -
# 3  Short offline       Completed: read failure       90%     22754         11619635632
# 4  Short offline       Completed: read failure       90%     22753         11619635632
# 5  Short offline       Completed without error       00%     22752         -
# 6  Short offline       Completed: read failure       90%     22751         11619635632
# 7  Short offline       Completed without error       00%     22750         -
# 8  Short offline       Completed: read failure       90%     22749         11619635632
# 9  Short offline       Completed without error       00%     22748         -
#10  Short offline       Completed without error       00%     22747         -
#11  Short offline       Completed without error       00%     22746         -
#12  Short offline       Completed without error       00%     22745         -
#13  Short offline       Completed without error       00%     22744         -
#14  Short offline       Completed without error       00%     22743         -
#15  Short offline       Completed without error       00%     22742         -
#16  Short offline       Completed without error       00%     22741         -
#17  Short offline       Completed without error       00%     22740         -
#18  Short offline       Completed: read failure       90%     22739         11619635632
#19  Short offline       Completed without error       00%     22738         -
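
For anyone who wants to tally a log like this rather than count by eye, here is a minimal Python sketch. The regular expression and the function name `parse_selftest_log` are my own assumptions about the column layout shown above, not part of smartmontools. It counts the failed tests and checks whether they all point at the same LBA.

```python
import re
from collections import Counter

# Rows copied verbatim from the self-test log in this post.
LOG = """\
# 1  Short offline       Completed: read failure       90%     22756         11619635632
# 2  Short offline       Completed without error       00%     22755         -
# 3  Short offline       Completed: read failure       90%     22754         11619635632
# 4  Short offline       Completed: read failure       90%     22753         11619635632
# 5  Short offline       Completed without error       00%     22752         -
# 6  Short offline       Completed: read failure       90%     22751         11619635632
# 7  Short offline       Completed without error       00%     22750         -
# 8  Short offline       Completed: read failure       90%     22749         11619635632
# 9  Short offline       Completed without error       00%     22748         -
#10  Short offline       Completed without error       00%     22747         -
#11  Short offline       Completed without error       00%     22746         -
#12  Short offline       Completed without error       00%     22745         -
#13  Short offline       Completed without error       00%     22744         -
#14  Short offline       Completed without error       00%     22743         -
#15  Short offline       Completed without error       00%     22742         -
#16  Short offline       Completed without error       00%     22741         -
#17  Short offline       Completed without error       00%     22740         -
#18  Short offline       Completed: read failure       90%     22739         11619635632
#19  Short offline       Completed without error       00%     22738         -
"""

# Assumed layout: "#NN  <test>  <status>  <remaining>%  <hours>  <LBA or ->"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def parse_selftest_log(text):
    """Turn self-test log rows into dicts; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            entries.append({
                "num": int(m["num"]),
                "test": m["test"],
                "status": m["status"],
                "hours": int(m["hours"]),
                "lba": None if m["lba"] == "-" else int(m["lba"]),
            })
    return entries


entries = parse_selftest_log(LOG)
failures = [e for e in entries if "failure" in e["status"]]
lbas = Counter(e["lba"] for e in failures)

print(f"{len(failures)} of {len(entries)} logged tests failed")
for lba, count in lbas.items():
    print(f"LBA {lba}: {count} failure(s)")
```

Every failure in this log reports the same LBA, which suggests one stubborn spot rather than damage spread across the surface, though the log alone cannot prove that.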

But looking at the SMART attributes, I don’t see any immediate concerns apart from Raw_Read_Error_Rate, which isn’t great (but is still above the threshold).

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   100   064   044    -    17920
  3 Spin_Up_Time            PO----   084   084   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    151
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    64
  7 Seek_Error_Rate         POSR--   087   060   045    -    553426427
  9 Power_On_Hours          -O--CK   075   075   000    -    22757
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    151
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   067   052   040    -    33 (Min/Max 25/36)
191 G-Sense_Error_Rate      -O--CK   095   095   000    -    10483
192 Power-Off_Retract_Count -O--CK   100   100   000    -    75
193 Load_Cycle_Count        -O--CK   048   048   000    -    104149
194 Temperature_Celsius     -O---K   033   048   000    -    33 (0 16 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   100   001   000    -    17920
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    11575h+07m+14.853s
241 Total_LBAs_Written      ------   100   253   000    -    224939508952
242 Total_LBAs_Read         ------   100   253   000    -    367211063284
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online

Any thoughts?

EDIT: I should add that none of the recent disk scrubs are noting any errors. The disk is roughly half full.
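
As a rough aid for relating the failing LBA to how full the disk is, the arithmetic below converts the LBA into a byte offset. The sector size and capacity are taken from the full smartctl -x output posted further down the thread; the variable names are mine. Treat this as a sketch: partition offsets are ignored, and ZFS does not necessarily fill a disk from the start, so the result cannot say whether the sector currently holds data.

```python
# Values from the thread: LBA from the self-test log, sizes from smartctl -x.
LBA_OF_FIRST_ERROR = 11619635632
LOGICAL_SECTOR_BYTES = 512                 # "Sector Sizes: 512 bytes logical"
USER_CAPACITY_BYTES = 8_001_563_222_016    # "User Capacity" line

# LBAs in the self-test log are counted in logical sectors.
byte_offset = LBA_OF_FIRST_ERROR * LOGICAL_SECTOR_BYTES
fraction = byte_offset / USER_CAPACITY_BYTES

print(f"byte offset: {byte_offset} ({byte_offset / 1e12:.2f} TB)")
print(f"position on disk: {fraction:.1%} of user capacity")
```

The failing sector sits roughly three quarters of the way through the LBA range. If it is in space the pool has not allocated, that would be one possible reason scrubs stay clean while the self-test trips over it, but that is a guess rather than a conclusion.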

Based on this, along with the regular SMART test failures, I’d replace it.

Despite the SMART parameter still being so far above the threshold?

Yeah, that was my thinking as well. Once those start to creep up, the drive is on its death bed. But in this situation, I am not sure. This is a backup server, so unless I am somehow risking subtle data corruption prior to outright failure (which ZFS should protect me from), I’d be OK with a wait-and-see approach?

IMO yes. The ideal value is 0. If your drive weren’t failing SMART tests and had this value with no other issues, I’d say leave it and be ready to replace it, but now that SMART tests often can’t complete, I’d replace it.

Why is Current_Pending_Sector worse than Reallocated_Sector_Ct?
As far as I know, pending sectors are not yet confirmed bad sectors.

Did you try a long test too?
In your situation, and if you have redundancy, you can probably keep using it until the disk dies for good.

You are not being aggressive enough with respect to long tests. The short test does not do much, so a failing short test should be a major cause for concern, and valid grounds for an RMA.

As indicated by #195, the drive has been working hard to correct its own failings before ZFS had to step in.

You don’t see the wood (SMART test failing) for the trees (parameters) here. And failing to even read data is worse than failing to write.

The help above is spot on!

If you cannot pass a SMART long test, it is time to replace the drive, and a short test is barely a small portion of the long test. Don’t wait on any other values, they do not matter. A failure of a Short or Long test is solid proof the drive is failing.

I suspect you have a Seagate drive, but it would have been helpful if you had provided the entire output of smartctl -x /dev/xxx. This time the solution is an easy one: replace the drive before it fails completely.

I always recommend a daily SMART short test and a weekly SMART long test, with some exceptions: if you have a high drive count (50 or 200, for example), you may want to run a monthly long test and spread the drives out across that month. The point is to run a long test periodically. You may have significantly more errors than you know.
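
The "spread the drives out across that month" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `spread_long_tests`, the 28-day cycle, and the drive names are all hypothetical, and the result still has to be entered into whatever scheduler you use.

```python
def spread_long_tests(drives, days_in_cycle=28):
    """Assign each drive a 1-based day of the cycle for its long test,
    spacing the drives as evenly as integer days allow."""
    if not drives:
        return {}
    schedule = {}
    for i, drive in enumerate(drives):
        # Scale the drive's index onto the cycle length, rounding down.
        schedule[drive] = (i * days_in_cycle) // len(drives) + 1
    return schedule


drives = [f"disk{n:02d}" for n in range(50)]  # hypothetical drive names
schedule = spread_long_tests(drives)

# Group drives by day to see how heavy the busiest day is.
per_day = {}
for drive, day in schedule.items():
    per_day.setdefault(day, []).append(drive)
busiest = max(len(group) for group in per_day.values())

print(f"{len(drives)} drives over {len(per_day)} days, "
      f"at most {busiest} long tests on any one day")
```

A 28-day cycle is used so that every calendar month contains each day number; with 50 drives, no day carries more than two long tests.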

If this is increasing that suggests to me that you have an issue with vibrations in your chassis.

It’s also possible that something external is exerting physical force on the NAS in a way that interferes with drive reliability.

This is a small 5-bay custom-built backup server. It’s racked in a 42U enclosure sitting in the basement, so I can’t see it being exposed to vibrations, @neofusion. I’ll keep an eye out for this, though - good catch.

@etorix @joeschmuck I have kicked off a LONG test manually right now and I have kicked off a scrub run on the same pool to stress the disk. For some reason, the LONG smart events were never kicked off despite the schedule. I’ve also updated the schedule to run both SHORT and LONG tests on all drives daily, separated by 12 hours.

I’ve always been wary of placing undue load on disks with LONG tests, given that they last for so long (1+ hour). Am I imagining things here? Would a daily SHORT/LONG test schedule as above be too aggressive?

Lastly, I am attaching the full smartctl output, in case anyone is interested. This is a Seagate 8 TB IronWolf drive. I didn’t want to spam the thread initially, but I’ll post the whole thing in the future (though I really wish I didn’t have to :slight_smile: ).

admin@filer[~]$ sudo smartctl -x /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST8000VN0022-2EL112
Serial Number:    ZA19DEVC
LU WWN Device Id: 5 000c50 0a54621bc
Firmware Version: SC61
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Sep 15 09:38:17 2024 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 798) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   083   064   044    -    215921632
  3 Spin_Up_Time            PO----   084   084   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    151
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    64
  7 Seek_Error_Rate         POSR--   087   060   045    -    553761926
  9 Power_On_Hours          -O--CK   075   075   000    -    22769
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    151
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   066   052   040    -    34 (Min/Max 25/36)
191 G-Sense_Error_Rate      -O--CK   095   095   000    -    10483
192 Power-Off_Retract_Count -O--CK   100   100   000    -    75
193 Load_Cycle_Count        -O--CK   048   048   000    -    104175
194 Temperature_Celsius     -O---K   034   048   000    -    34 (0 16 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   006   001   000    -    215921632
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    11586h+17m+27.766s
241 Total_LBAs_Written      ------   100   253   000    -    224939566960
242 Total_LBAs_Read         ------   100   253   000    -    367426908988
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    512  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS    8160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xb0       GPL     VS    9048  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS      16  Device vendor specific log
0xc3       GPL,SL  VS       8  Device vendor specific log
0xd1       GPL     VS     264  Device vendor specific log
0xd2       GPL     VS   10000  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xda       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%     22769         -
# 2  Short offline       Completed without error       00%     22759         -
# 3  Short offline       Completed: read failure       90%     22756         11619635632
# 4  Short offline       Completed without error       00%     22755         -
# 5  Short offline       Completed: read failure       90%     22754         11619635632
# 6  Short offline       Completed: read failure       90%     22753         11619635632
# 7  Short offline       Completed without error       00%     22752         -
# 8  Short offline       Completed: read failure       90%     22751         11619635632
# 9  Short offline       Completed without error       00%     22750         -
#10  Short offline       Completed: read failure       90%     22749         11619635632
#11  Short offline       Completed without error       00%     22748         -
#12  Short offline       Completed without error       00%     22747         -
#13  Short offline       Completed without error       00%     22746         -
#14  Short offline       Completed without error       00%     22745         -
#15  Short offline       Completed without error       00%     22744         -
#16  Short offline       Completed without error       00%     22743         -
#17  Short offline       Completed without error       00%     22742         -
#18  Short offline       Completed without error       00%     22741         -
#19  Short offline       Completed without error       00%     22740         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     25/36 Celsius
Lifetime    Min/Max Temperature:     16/48 Celsius
Under/Over Temperature Limit Count:   0/34

SCT Temperature History Version:     2
Temperature Sampling Period:         3 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     10/25 Celsius
Min/Max Temperature Limit:            0/70 Celsius
Temperature History Size (Index):    128 (38)

Index    Estimated Time   Temperature Celsius
  39    2024-09-10 04:00    37  ******************
  40    2024-09-10 04:59    37  ******************
  41    2024-09-10 05:58    36  *****************
 ...    ..( 10 skipped).    ..  *****************
  52    2024-09-10 16:47    36  *****************
  53    2024-09-10 17:46    33  **************
 ...    ..(  6 skipped).    ..  **************
  60    2024-09-11 00:39    33  **************
  61    2024-09-11 01:38    37  ******************
  62    2024-09-11 02:37    38  *******************
  63    2024-09-11 03:36    38  *******************
  64    2024-09-11 04:35    38  *******************
  65    2024-09-11 05:34    37  ******************
 ...    ..(  4 skipped).    ..  ******************
  70    2024-09-11 10:29    37  ******************
  71    2024-09-11 11:28    36  *****************
 ...    ..(  3 skipped).    ..  *****************
  75    2024-09-11 15:24    36  *****************
  76    2024-09-11 16:23    37  ******************
 ...    ..(  4 skipped).    ..  ******************
  81    2024-09-11 21:18    37  ******************
  82    2024-09-11 22:17    36  *****************
  83    2024-09-11 23:16    36  *****************
  84    2024-09-12 00:15    38  *******************
  85    2024-09-12 01:14    38  *******************
  86    2024-09-12 02:13    37  ******************
 ...    ..(  2 skipped).    ..  ******************
  89    2024-09-12 05:10    37  ******************
  90    2024-09-12 06:09    36  *****************
  91    2024-09-12 07:08    37  ******************
  92    2024-09-12 08:07    37  ******************
  93    2024-09-12 09:06     ?  -
  94    2024-09-12 10:05    25  ******
  95    2024-09-12 11:04    33  **************
  96    2024-09-12 12:03    35  ****************
  97    2024-09-12 13:02    36  *****************
  98    2024-09-12 14:01     ?  -
  99    2024-09-12 15:00    27  ********
 100    2024-09-12 15:59    36  *****************
 101    2024-09-12 16:58    37  ******************
 102    2024-09-12 17:57    36  *****************
 103    2024-09-12 18:56    37  ******************
 104    2024-09-12 19:55    36  *****************
 105    2024-09-12 20:54    37  ******************
 ...    ..(  2 skipped).    ..  ******************
 108    2024-09-12 23:51    37  ******************
 109    2024-09-13 00:50    36  *****************
 110    2024-09-13 01:49    36  *****************
 111    2024-09-13 02:48    37  ******************
 ...    ..(  3 skipped).    ..  ******************
 115    2024-09-13 06:44    37  ******************
 116    2024-09-13 07:43    34  ***************
 117    2024-09-13 08:42    38  *******************
 118    2024-09-13 09:41    38  *******************
 119    2024-09-13 10:40    38  *******************
 120    2024-09-13 11:39    37  ******************
 121    2024-09-13 12:38    35  ****************
 122    2024-09-13 13:37    36  *****************
 123    2024-09-13 14:36    37  ******************
 124    2024-09-13 15:35    37  ******************
 125    2024-09-13 16:34    34  ***************
 ...    ..(  2 skipped).    ..  ***************
   0    2024-09-13 19:31    34  ***************
   1    2024-09-13 20:30     ?  -
   2    2024-09-13 21:29    25  ******
   3    2024-09-13 22:28     ?  -
   4    2024-09-13 23:27    25  ******
   5    2024-09-14 00:26    36  *****************
   6    2024-09-14 01:25    36  *****************
   7    2024-09-14 02:24    35  ****************
   8    2024-09-14 03:23    35  ****************
   9    2024-09-14 04:22    35  ****************
  10    2024-09-14 05:21    36  *****************
 ...    ..(  6 skipped).    ..  *****************
  17    2024-09-14 12:14    36  *****************
  18    2024-09-14 13:13    35  ****************
  19    2024-09-14 14:12    35  ****************
  20    2024-09-14 15:11    35  ****************
  21    2024-09-14 16:10    33  **************
 ...    ..( 16 skipped).    ..  **************
  38    2024-09-15 08:53    33  **************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             151  ---  Lifetime Power-On Resets
0x01  0x010  4           22769  ---  Power-on Hours
0x01  0x018  6    224940935430  ---  Logical Sectors Written
0x01  0x020  6       709005039  ---  Number of Write Commands
0x01  0x028  6    347475320127  ---  Logical Sectors Read
0x01  0x030  6       938853105  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4      2069993317  N--  Spindle Motor Power-on Hours
0x03  0x010  4      2069975166  N--  Head Flying Hours
0x03  0x018  4          104175  ---  Head Load Events
0x03  0x020  4              64  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              74  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              34  ---  Average Short Term Temperature
0x05  0x018  1              36  ---  Average Long Term Temperature
0x05  0x020  1              48  ---  Highest Temperature
0x05  0x028  1               0  ---  Lowest Temperature
0x05  0x030  1              43  ---  Highest Average Short Term Temperature
0x05  0x038  1              28  ---  Lowest Average Short Term Temperature
0x05  0x040  1              41  ---  Highest Average Long Term Temperature
0x05  0x048  1              29  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             357  ---  Number of Hardware Resets
0x06  0x010  4             142  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

For this drive a Long test would last 13.3 hours, a bit longer than an hour for certain.
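
The 13.3 hours figure follows from the "Extended self-test routine recommended polling time" of 798 minutes in the output above. Here is a minimal Python sketch of that conversion; the regular expression and the function name `polling_minutes` are my own assumptions about the text layout, not a smartmontools interface.

```python
import re

# Excerpt copied from the smartctl -x output earlier in the thread.
SMARTCTL_EXCERPT = """\
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 798) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
"""


def polling_minutes(text):
    """Map each self-test type to its recommended polling time in minutes."""
    pattern = re.compile(
        r"^(\w+) self-test routine\s*\n"
        r"recommended polling time:\s*\(\s*(\d+)\) minutes\.",
        re.MULTILINE,
    )
    return {name: int(minutes) for name, minutes in pattern.findall(text)}


times = polling_minutes(SMARTCTL_EXCERPT)
extended_hours = times["Extended"] / 60

print(f"Extended test: {times['Extended']} min = {extended_hours:.1f} h")
```

The same dictionary also gives the short and conveyance test times, which is handy when deciding how far apart to schedule tests.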

If your drive were actually failing, I would not recommend running SMART tests or a scrub until after you had made a copy of your data in a safe place. Running those tests “could” cause complete failure. BUT, yes, you are imagining things for a normal, good drive. HDDs are designed to run continuously, meaning the drive spins and the heads float above the platters. Driving the heads across the platters also promotes even wear patterns, although with ZFS that should never be an issue.

Daily Long tests would be too aggressive in my opinion.

A Short test takes no more than 2 minutes, so you can schedule a Long test 5 minutes after you start the Short test. However, daily Long tests are just a bit much. The Long test also generates more heat, and you want to keep things cool. It sounds counter-intuitive, but the testing is needed so you find out about a possible failure before the drive dies. And SMART is not all-knowing: some failures it cannot predict, but it is the best we have right now.

Daily short tests are quite possible. Due to their length, long tests can be weekly, bi-weekly, or monthly, avoiding overlap with scrubs.

This is excellent info @joeschmuck - I assume you got the 13.3 hours of LONG test runtime from the “Extended self-test routine recommended polling time” value? I always understood this as a “recommendation” or an “estimate”. So the long test will actually take this long?

Yes, and these values are fairly accurate, provided the drive is not doing any other work (reading, writing, or both) and there are no failures to slow it down. If you take two drives of the same model from different batches/lots, odds are the completion times will differ; the drives are tested and the time adjusted accordingly, or that has been my experience. The drive gives priority to data requests over the SMART self-test, so if you are transferring 1 TB to your system during a SMART Long test, you can expect the test to finish a few seconds (possibly a minute or two) later.

Yes

I’ve had the scrub and SMART long test start at roughly the same time and run concurrently. One data point is that scrub finished successfully in about the time I’d expect it. Another one is that the long test is reported to be about 20 % done.

It looks like this behaviour is in line with what we’d expect.

I’ll report back with the SMART long test results - I fully expect the test to fail, and I am not sure how I’d explain a passing test.

Yes, SCRUB would have priority over the SMART test. I would however say that it would be best to not schedule them at the same time.

Well, the LONG test has finished and was successful which isn’t what I had expected. Moreover, it was done during scrub.

However, comparing the previous smartctl output to the current, it’s clear that the drive is not well. Several additional sectors have been re-allocated (6472). It looks as if the drive is erroring out less (improvement in ECC Recovered metric but unsure if I am interpreting this correctly) but even if that’s the case, it probably is just a transient state.
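
Comparing two dumps by eye is error-prone, so here is a minimal Python sketch of an attribute diff. As sample data it uses a few rows from the two attribute tables that appear earlier in this thread (the first post and the first full dump), not the dump below, so the reallocation figure mentioned above is not reproduced here. The function names `parse_attributes` and `diff_attributes` are hypothetical helpers, and the column positions are an assumption based on the table layout shown.

```python
BEFORE = """\
  1 Raw_Read_Error_Rate     POSR--   100   064   044    -    17920
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    64
193 Load_Cycle_Count        -O--CK   048   048   000    -    104149
195 Hardware_ECC_Recovered  -O-RC-   100   001   000    -    17920
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
"""

AFTER = """\
  1 Raw_Read_Error_Rate     POSR--   083   064   044    -    215921632
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    64
193 Load_Cycle_Count        -O--CK   048   048   000    -    104175
195 Hardware_ECC_Recovered  -O-RC-   006   001   000    -    215921632
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
"""


def parse_attributes(table):
    """Return {name: (normalized_value, raw_value)} for each attribute row.

    Assumed columns: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE.
    The raw value is kept as text because some rows carry non-numeric raws.
    """
    attrs = {}
    for line in table.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), fields[7])
    return attrs


def diff_attributes(before, after):
    """Return {name: (old, new)} for attributes whose value or raw changed."""
    return {
        name: (before[name], new)
        for name, new in after.items()
        if name in before and before[name] != new
    }


changes = diff_attributes(parse_attributes(BEFORE), parse_attributes(AFTER))
for name, (old, new) in changes.items():
    print(f"{name}: value {old[0]} -> {new[0]}, raw {old[1]} -> {new[1]}")
```

Attributes that did not move, such as Reallocated_Sector_Ct in this sample, are left out of the result, which keeps the focus on what changed between runs.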

I am again surprised that the test passed - probably the drive is teetering on the metric boundaries and only occasionally fails the tests. I’ll keep monitoring out of curiosity. The drive is long out of warranty (purchased 2017), so I’ll keep it around a while longer.

Here is the smartctl full dump.

admin@filer[~]$ sudo smartctl -x /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST8000VN0022-2EL112
Serial Number:    ZA19DEVC
LU WWN Device Id: 5 000c50 0a54621bc
Firmware Version: SC61
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 16 08:41:10 2024 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 798) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   083   064   044    -    183133688
  3 Spin_Up_Time            PO----   084   084   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    151
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    72
  7 Seek_Error_Rate         POSR--   087   060   045    -    561948479
  9 Power_On_Hours          -O--CK   074   074   000    -    22792
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    151
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   066   052   040    -    34 (Min/Max 25/37)
191 G-Sense_Error_Rate      -O--CK   095   095   000    -    10483
192 Power-Off_Retract_Count -O--CK   100   100   000    -    75
193 Load_Cycle_Count        -O--CK   048   048   000    -    104184
194 Temperature_Celsius     -O---K   034   048   000    -    34 (0 16 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   033   001   000    -    183133688
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    11609h+15m+31.029s
241 Total_LBAs_Written      ------   100   253   000    -    224942721456
242 Total_LBAs_Read         ------   100   253   000    -    378367049884
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    512  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS    8160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xb0       GPL     VS    9048  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS      16  Device vendor specific log
0xc3       GPL,SL  VS       8  Device vendor specific log
0xd1       GPL     VS     264  Device vendor specific log
0xd2       GPL     VS   10000  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xda       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22789         -
# 2  Short offline       Completed without error       00%     22759         -
# 3  Short offline       Completed: read failure       90%     22756         11619635632
# 4  Short offline       Completed without error       00%     22755         -
# 5  Short offline       Completed: read failure       90%     22754         11619635632
# 6  Short offline       Completed: read failure       90%     22753         11619635632
# 7  Short offline       Completed without error       00%     22752         -
# 8  Short offline       Completed: read failure       90%     22751         11619635632
# 9  Short offline       Completed without error       00%     22750         -
#10  Short offline       Completed: read failure       90%     22749         11619635632
#11  Short offline       Completed without error       00%     22748         -
#12  Short offline       Completed without error       00%     22747         -
#13  Short offline       Completed without error       00%     22746         -
#14  Short offline       Completed without error       00%     22745         -
#15  Short offline       Completed without error       00%     22744         -
#16  Short offline       Completed without error       00%     22743         -
#17  Short offline       Completed without error       00%     22742         -
#18  Short offline       Completed without error       00%     22741         -
#19  Short offline       Completed without error       00%     22740         -
5 of 5 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     25/37 Celsius
Lifetime    Min/Max Temperature:     16/48 Celsius
Under/Over Temperature Limit Count:   0/57

SCT Temperature History Version:     2
Temperature Sampling Period:         3 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     10/25 Celsius
Min/Max Temperature Limit:            0/70 Celsius
Temperature History Size (Index):    128 (61)

Index    Estimated Time   Temperature Celsius
  62    2024-09-11 03:36    38  *******************
  63    2024-09-11 04:35    38  *******************
  64    2024-09-11 05:34    38  *******************
  65    2024-09-11 06:33    37  ******************
 ...    ..(  4 skipped).    ..  ******************
  70    2024-09-11 11:28    37  ******************
  71    2024-09-11 12:27    36  *****************
 ...    ..(  3 skipped).    ..  *****************
  75    2024-09-11 16:23    36  *****************
  76    2024-09-11 17:22    37  ******************
 ...    ..(  4 skipped).    ..  ******************
  81    2024-09-11 22:17    37  ******************
  82    2024-09-11 23:16    36  *****************
  83    2024-09-12 00:15    36  *****************
  84    2024-09-12 01:14    38  *******************
  85    2024-09-12 02:13    38  *******************
  86    2024-09-12 03:12    37  ******************
 ...    ..(  2 skipped).    ..  ******************
  89    2024-09-12 06:09    37  ******************
  90    2024-09-12 07:08    36  *****************
  91    2024-09-12 08:07    37  ******************
  92    2024-09-12 09:06    37  ******************
  93    2024-09-12 10:05     ?  -
  94    2024-09-12 11:04    25  ******
  95    2024-09-12 12:03    33  **************
  96    2024-09-12 13:02    35  ****************
  97    2024-09-12 14:01    36  *****************
  98    2024-09-12 15:00     ?  -
  99    2024-09-12 15:59    27  ********
 100    2024-09-12 16:58    36  *****************
 101    2024-09-12 17:57    37  ******************
 102    2024-09-12 18:56    36  *****************
 103    2024-09-12 19:55    37  ******************
 104    2024-09-12 20:54    36  *****************
 105    2024-09-12 21:53    37  ******************
 ...    ..(  2 skipped).    ..  ******************
 108    2024-09-13 00:50    37  ******************
 109    2024-09-13 01:49    36  *****************
 110    2024-09-13 02:48    36  *****************
 111    2024-09-13 03:47    37  ******************
 ...    ..(  3 skipped).    ..  ******************
 115    2024-09-13 07:43    37  ******************
 116    2024-09-13 08:42    34  ***************
 117    2024-09-13 09:41    38  *******************
 118    2024-09-13 10:40    38  *******************
 119    2024-09-13 11:39    38  *******************
 120    2024-09-13 12:38    37  ******************
 121    2024-09-13 13:37    35  ****************
 122    2024-09-13 14:36    36  *****************
 123    2024-09-13 15:35    37  ******************
 124    2024-09-13 16:34    37  ******************
 125    2024-09-13 17:33    34  ***************
 ...    ..(  2 skipped).    ..  ***************
   0    2024-09-13 20:30    34  ***************
   1    2024-09-13 21:29     ?  -
   2    2024-09-13 22:28    25  ******
   3    2024-09-13 23:27     ?  -
   4    2024-09-14 00:26    25  ******
   5    2024-09-14 01:25    36  *****************
   6    2024-09-14 02:24    36  *****************
   7    2024-09-14 03:23    35  ****************
   8    2024-09-14 04:22    35  ****************
   9    2024-09-14 05:21    35  ****************
  10    2024-09-14 06:20    36  *****************
 ...    ..(  6 skipped).    ..  *****************
  17    2024-09-14 13:13    36  *****************
  18    2024-09-14 14:12    35  ****************
  19    2024-09-14 15:11    35  ****************
  20    2024-09-14 16:10    35  ****************
  21    2024-09-14 17:09    33  **************
 ...    ..( 16 skipped).    ..  **************
  38    2024-09-15 09:52    33  **************
  39    2024-09-15 10:51    36  *****************
  40    2024-09-15 11:50    36  *****************
  41    2024-09-15 12:49    37  ******************
 ...    ..(  2 skipped).    ..  ******************
  44    2024-09-15 15:46    37  ******************
  45    2024-09-15 16:45    36  *****************
  46    2024-09-15 17:44    37  ******************
 ...    ..(  2 skipped).    ..  ******************
  49    2024-09-15 20:41    37  ******************
  50    2024-09-15 21:40    36  *****************
 ...    ..(  2 skipped).    ..  *****************
  53    2024-09-16 00:37    36  *****************
  54    2024-09-16 01:36    37  ******************
  55    2024-09-16 02:35    37  ******************
  56    2024-09-16 03:34    37  ******************
  57    2024-09-16 04:33    36  *****************
  58    2024-09-16 05:32    36  *****************
  59    2024-09-16 06:31    34  ***************
  60    2024-09-16 07:30    34  ***************
  61    2024-09-16 08:29    34  ***************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             151  ---  Lifetime Power-On Resets
0x01  0x010  4           22792  ---  Power-on Hours
0x01  0x018  6    224944089894  ---  Logical Sectors Written
0x01  0x020  6       709156434  ---  Number of Write Commands
0x01  0x028  6    358415293087  ---  Logical Sectors Read
0x01  0x030  6       944226049  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4      2069993341  N--  Spindle Motor Power-on Hours
0x03  0x010  4      2069975186  N--  Head Flying Hours
0x03  0x018  4          104184  ---  Head Load Events
0x03  0x020  4              72  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              74  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              35  ---  Average Short Term Temperature
0x05  0x018  1              36  ---  Average Long Term Temperature
0x05  0x020  1              48  ---  Highest Temperature
0x05  0x028  1               0  ---  Lowest Temperature
0x05  0x030  1              43  ---  Highest Average Short Term Temperature
0x05  0x038  1              28  ---  Lowest Average Short Term Temperature
0x05  0x040  1              41  ---  Highest Average Long Term Temperature
0x05  0x048  1              29  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             357  ---  Number of Hardware Resets
0x06  0x010  4             142  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

Correct. The SMART test is run on the drive, by the drive itself, so it does not know about any activities the host OS is running concurrently. The host could, in theory, manually tell the drive to abort the SMART test.

Surprising indeed. I suppose that the long test caused the drive to “solve” the issue by reallocating the problematic sectors (#5) without bumping #197 and #198 (write errors only?).
Still I would find it difficult to trust this drive. Monitor carefully…
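
For anyone monitoring a drive like this, one thing worth watching is whether future failures keep landing on the same LBA. Below is a rough Python sketch that tallies the LBA_of_first_error column and converts it to a physical sector index and byte offset. It assumes the self-test log layout and the 512/4096-byte sector sizes shown in the dump; the function name is made up.

```python
from collections import Counter

# Rows copied from the self-test log in the dump (a subset of the entries).
LOG = """\
# 1  Extended offline    Completed without error       00%     22789         -
# 2  Short offline       Completed without error       00%     22759         -
# 3  Short offline       Completed: read failure       90%     22756         11619635632
# 5  Short offline       Completed: read failure       90%     22754         11619635632
# 6  Short offline       Completed: read failure       90%     22753         11619635632
# 8  Short offline       Completed: read failure       90%     22751         11619635632
#10  Short offline       Completed: read failure       90%     22749         11619635632
"""

LOGICAL_SECTOR = 512    # bytes, from "Sector Sizes" in the dump
PHYSICAL_SECTOR = 4096  # bytes, from "Sector Sizes" in the dump


def failing_lbas(log_text):
    """Count how often each LBA_of_first_error appears in self-test log rows."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # Log rows start with '#'; passing rows end in '-', failing rows in an LBA.
        if line.startswith("#") and fields and fields[-1].isdigit():
            counts[int(fields[-1])] += 1
    return counts


for lba, hits in failing_lbas(LOG).items():
    byte_offset = lba * LOGICAL_SECTOR
    physical_index = byte_offset // PHYSICAL_SECTOR
    print(f"LBA {lba}: {hits} failures, "
          f"physical sector {physical_index}, byte offset {byte_offset}")
```

If a new LBA shows up in later logs, that would suggest the damage is spreading rather than confined to one spot.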

The number of read errors is up 18%; coincidentally, so is the number of hardware ECC corrected errors.

Actually, that might be a red herring if this tool is to be believed:
https://s.i.wtf/#00000CDEB3E0
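
Some context on why it might be a red herring: on Seagate drives, the 48-bit raw value of attributes such as Raw_Read_Error_Rate is commonly described as packing two counters, with the upper 16 bits holding the error count and the lower 32 bits holding the number of operations. That layout is a widely reported convention, not something confirmed in this thread, so treat it as an assumption. Here is a minimal Python sketch (the function name is made up) applied to the value in the dump and to the hex value in the link, which appears to be a later reading:

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (errors, operations).

    Assumes the commonly described Seagate layout: upper 16 bits are the
    error count, lower 32 bits are the operation count. Not an official API.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


dump_value = 183133688       # Raw_Read_Error_Rate raw value in the dump
link_value = 0x00000CDEB3E0  # hex value taken from the linked URL

for label, raw in (("dump", dump_value), ("link", link_value)):
    errors, operations = split_seagate_raw(raw)
    print(f"{label}: errors={errors}, operations={operations}")

# Relative growth between the two readings.
growth = link_value / dump_value - 1
print(f"growth between the two readings: {growth:.0%}")
```

Under that assumption both readings decode to zero errors, and the 18% increase is simply the operation counter ticking up.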
