Alert stating disk error but not seeing it

I am getting the following alert but cannot see the problem anywhere other than in the alert email.

New alerts:

Device: /dev/sdg [SAT], 12 Offline uncorrectable sectors.

But I see this in the storage area.

Is there some other place I need to look? Note the Dashboard does not show any errors either.

Try running a Long SMART test on that drive and seeing if anything comes up.

There’s no reason to expect that a SMART error (which is what you’re seeing) would correlate with a pool error, which is what would show in the areas you’re looking at. I thought there was a way in the GUI to show SMART status, but I’m not seeing it now. But from the command line, run smartctl -x /dev/sdg and post the complete output.
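If you would rather pull the relevant numbers out of that output than read it by eye, here is a minimal Python sketch. It assumes the attribute table layout that smartctl -x prints (ID, name, flags, value, worst, thresh, fail, raw). The function name and the set of watched attribute IDs are my own choices for illustration, not an official list. The sample rows are copied from the output posted in this thread.

```python
import re

# Attribute IDs worth a look for media or link trouble (my own pick, not an official list).
WATCHED_IDS = {5, 197, 198, 199}

# One attribute row of `smartctl -x`: ID, name, flags, value, worst, thresh, fail, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)


def nonzero_watched(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    found = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(8))
        if attr_id in WATCHED_IDS and raw != 0:
            found[name] = raw
    return found


# Rows copied from the smartctl output in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    12
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
"""

print(nonzero_watched(SAMPLE))  # {'Offline_Uncorrectable': 12}
```

The 12 it reports is the same number the alert email quotes, which is why the alert shows up even though the pool itself reports no errors.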


Go to Storage > Disks. Expand a disk row and click S.M.A.R.T. Test Results.

But that only shows the results of SMART self-tests, not all the attributes, right?

I believe that on a failed test you should be able to click the Status icon to view the logs, which will give further details.

=== START OF INFORMATION SECTION ===
Device Model:     HUH721010ALE601
Serial Number:    7JG62ARC
LU WWN Device Id: 5 000cca 266c2c308
Firmware Version: LHGL0003
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  3 12:42:48 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1], Master PW ID: 0xfffd
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline 
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   134   134   054    -    96
  3 Spin_Up_Time            POS---   150   150   024    -    418 (Average 458)
  4 Start_Stop_Count        -O--C-   100   100   000    -    271
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   093   093   000    -    50901
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    82
 22 Unknown_Attribute       PO---K   100   100   025    -    100
 45 Unknown_Attribute       PO---K   100   100   001    -    1095233372415
192 Power-Off_Retract_Count -O--CK   099   099   000    -    1201
193 Load_Cycle_Count        -O--C-   099   099   000    -    1201
194 Temperature_Celsius     -O----   139   139   000    -    43 (Min/Max 18/55)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    12
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
231 Temperature_Celsius     -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--C-   100   100   000    -    5743100120586
242 Total_LBAs_Read         -O--C-   100   100   000    -    4945634658774
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   5501  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80       GPL     R/W    688  Host vendor specific log
0x81-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb2       GPL     VS     688  Device vendor specific log
0xc8       GPL     VS      12  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 8 (device log contains only the most recent 4 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 [3] occurred at disk power-on lifetime: 9509 hours (396 days + 5 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 80 00 00 25 9f b3 78 40 00  Error: UNC 128 sectors at LBA = 0x259fb378 = 631223160

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 08 00 00 00 25 9f ab f8 e0 08  2d+14:19:01.638  READ DMA EXT
  27 00 00 00 00 00 00 00 00 00 00 e0 08  2d+14:19:01.593  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  2d+14:19:01.593  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  2d+14:19:01.549  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  2d+14:19:01.533  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 7 [2] occurred at disk power-on lifetime: 9509 hours (396 days + 5 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 80 00 00 25 9f b3 78 40 00  Error: UNC 128 sectors at LBA = 0x259fb378 = 631223160

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 08 00 00 00 25 9f ab f8 e0 08  2d+14:19:01.438  READ DMA EXT
  25 00 00 04 00 00 04 2e e4 e5 b0 e0 08  2d+14:19:01.406  READ DMA EXT
  35 00 00 00 08 00 02 fd 0a 17 b8 e0 08  2d+14:19:01.406  WRITE DMA EXT
  35 00 00 00 18 00 02 fd 57 04 28 e0 08  2d+14:19:01.405  WRITE DMA EXT
  35 00 00 08 00 00 02 fd 56 fc 28 e0 08  2d+14:19:01.401  WRITE DMA EXT

Error 6 [1] occurred at disk power-on lifetime: 9509 hours (396 days + 5 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 80 00 00 25 9f b3 78 40 00  Error: UNC 128 sectors at LBA = 0x259fb378 = 631223160

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 08 00 00 00 25 9f ab f8 e0 08  2d+14:19:01.146  READ DMA EXT
  27 00 00 00 00 00 00 00 00 00 00 e0 08  2d+14:19:01.067  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  2d+14:19:01.067  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  2d+14:19:00.879  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  2d+14:19:00.858  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 5 [0] occurred at disk power-on lifetime: 9509 hours (396 days + 5 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 80 00 00 25 9f b3 78 40 00  Error: UNC 128 sectors at LBA = 0x259fb378 = 631223160

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 08 00 00 00 25 9f ab f8 e0 08  2d+14:19:00.821  READ DMA EXT
  35 00 00 00 18 00 02 fd 54 89 b0 e0 08  2d+14:19:00.795  WRITE DMA EXT
  35 00 00 08 00 00 02 fd 54 81 b0 e0 08  2d+14:19:00.790  WRITE DMA EXT
  35 00 00 08 00 00 02 fd 54 79 b0 e0 08  2d+14:19:00.786  WRITE DMA EXT
  35 00 00 08 00 00 02 fd 54 71 b0 e0 08  2d+14:19:00.772  WRITE DMA EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     50888         -
# 2  Short offline       Completed without error       00%     50864         -
# 3  Short offline       Completed without error       00%     50840         -
# 4  Short offline       Completed without error       00%     50816         -
# 5  Short offline       Completed without error       00%     50792         -
# 6  Short offline       Completed without error       00%     50768         -
# 7  Short offline       Completed without error       00%     50744         -
# 8  Short offline       Completed without error       00%     50720         -
# 9  Short offline       Completed without error       00%     50696         -
#10  Short offline       Completed without error       00%     50672         -
#11  Short offline       Completed without error       00%     50648         -
#12  Short offline       Completed without error       00%     50624         -
#13  Short offline       Completed without error       00%     50600         -
#14  Short offline       Completed without error       00%     50576         -
#15  Short offline       Completed without error       00%     50552         -
#16  Short offline       Completed without error       00%     50529         -
#17  Short offline       Completed without error       00%     50505         -
#18  Short offline       Completed without error       00%     50481         -
#19  Short offline       Completed without error       00%     50457         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        DST executing in background (3)
Current Temperature:                    43 Celsius
Power Cycle Min/Max Temperature:     37/43 Celsius
Lifetime    Min/Max Temperature:     18/55 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (78)

Index    Estimated Time   Temperature Celsius
  79    2024-12-03 10:35    40  *********************
 ...    ..( 87 skipped).    ..  *********************
  39    2024-12-03 12:03    40  *********************
  40    2024-12-03 12:04    41  **********************
 ...    ..(  7 skipped).    ..  **********************
  48    2024-12-03 12:12    41  **********************
  49    2024-12-03 12:13    42  ***********************
 ...    ..( 21 skipped).    ..  ***********************
  71    2024-12-03 12:35    42  ***********************
  72    2024-12-03 12:36    43  ************************
 ...    ..(  4 skipped).    ..  ************************
  77    2024-12-03 12:41    43  ************************
  78    2024-12-03 12:42    40  *********************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              82  ---  Lifetime Power-On Resets
0x01  0x010  4           50901  ---  Power-on Hours
0x01  0x018  6   5743100120586  ---  Logical Sectors Written
0x01  0x020  6      7842513647  ---  Number of Write Commands
0x01  0x028  6   4945634658774  ---  Logical Sectors Read
0x01  0x030  6      7712474360  ---  Number of Read Commands
0x01  0x038  6    183245216550  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           49729  ---  Spindle Motor Power-on Hours
0x03  0x010  4           49729  ---  Head Flying Hours
0x03  0x018  4            1201  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4            3731  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               8  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               1  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              43  ---  Current Temperature
0x05  0x010  1              39  N--  Average Short Term Temperature
0x05  0x018  1              38  N--  Average Long Term Temperature
0x05  0x020  1              55  ---  Highest Temperature
0x05  0x028  1              18  ---  Lowest Temperature
0x05  0x030  1              53  N--  Highest Average Short Term Temperature
0x05  0x038  1              23  N--  Lowest Average Short Term Temperature
0x05  0x040  1              51  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             206  ---  Number of Hardware Resets
0x06  0x010  4             294  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           17  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           12  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

The SMART test results all show success with no errors.

Though they don’t show that you’ve ever run a long test, which I’d recommend. Other than that, and that the drive’s running a little warm, I don’t see major red flags.
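To put the logged errors in context, a rough Python sketch can compare the power-on hour of each logged error with the drive's current Power_On_Hours. It assumes the "Error N [i] occurred at disk power-on lifetime: H hours" and "Error: UNC ... at LBA = ..." line formats seen in the output above. The function name is hypothetical, and the sample lines and the 50901 figure are taken from the posted output.

```python
import re

# Line formats as they appear in the extended error log section of `smartctl -x`.
ERROR_LINE = re.compile(
    r"^Error (\d+) \[\d+\] occurred at disk power-on lifetime: (\d+) hours"
)
LBA_LINE = re.compile(r"Error: UNC \d+ sectors at LBA = 0x[0-9a-f]+ = (\d+)")


def summarize_errors(log_text, power_on_hours):
    """Return (hours since the newest logged error, set of failing LBAs)."""
    error_hours = []
    lbas = set()
    for line in log_text.splitlines():
        m = ERROR_LINE.match(line.strip())
        if m:
            error_hours.append(int(m.group(2)))
        m = LBA_LINE.search(line)
        if m:
            lbas.add(int(m.group(1)))
    if not error_hours:
        return None, lbas
    return power_on_hours - max(error_hours), lbas


# Two of the four logged entries, copied from the output above.
SAMPLE = """\
Error 8 [3] occurred at disk power-on lifetime: 9509 hours (396 days + 5 hours)
  40 -- 51 00 80 00 00 25 9f b3 78 40 00  Error: UNC 128 sectors at LBA = 0x259fb378 = 631223160
Error 7 [2] occurred at disk power-on lifetime: 9509 hours (396 days + 5 hours)
  40 -- 51 00 80 00 00 25 9f b3 78 40 00  Error: UNC 128 sectors at LBA = 0x259fb378 = 631223160
"""

age, lbas = summarize_errors(SAMPLE, power_on_hours=50901)
print(f"newest error is {age} hours (~{age / 24 / 365:.1f} years) old; LBAs: {sorted(lbas)}")
# newest error is 41392 hours (~4.7 years) old; LBAs: [631223160]
```

On these numbers the logged errors all point at the same LBA and date from roughly 41,000 power-on hours ago, so they are not new events.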

We see this all the time, and the opposite as well: a pool error being perceived as a drive failure.

I believe it is the overall status reported by SMART (PASS/FAIL), the one status that I have never seen listed as anything other than PASSED. It is for a gross failure, not the ones we see daily. I’m curious whether, if I took an old hard drive and placed a drop of oil on a platter, the “PASSED” would change. I’m skeptical. I have a drive I can use.
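Since the overall PASSED verdict and the individual counters can disagree, a script can check both. The sketch below reads smartctl's JSON output (the -j/--json option, available since smartmontools 7.0). The key names used here (smart_status.passed, ata_smart_attributes.table, raw.value) are what I understand that JSON to contain for ATA drives, so treat them as an assumption and verify against your own output. The sample JSON is hand-trimmed for illustration and is not real output from this drive, and the function name is hypothetical.

```python
import json


def health_summary(report):
    """Split a smartctl JSON report into the overall verdict and nonzero raw counters."""
    passed = report.get("smart_status", {}).get("passed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    nonzero = {
        row["name"]: row["raw"]["value"]
        for row in table
        # 5 = reallocated, 197 = pending, 198 = offline uncorrectable
        if row["id"] in (5, 197, 198) and row["raw"]["value"] != 0
    }
    return passed, nonzero


# Hand-written, trimmed sample mirroring the values posted in this thread.
SAMPLE_JSON = """
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 0}},
      {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 0}},
      {"id": 198, "name": "Offline_Uncorrectable",  "raw": {"value": 12}}
    ]
  }
}
"""

passed, nonzero = health_summary(json.loads(SAMPLE_JSON))
print(passed, nonzero)  # True {'Offline_Uncorrectable': 12}
```

To feed it real data you could capture the output of smartctl -j -H -A /dev/sdg and pass it through json.loads. Note that smartctl's exit status is a bitmask, so a nonzero exit code does not by itself mean the command failed.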
Retired = Lots of time on my hands.

@Donald_Dodier
Please run smartctl -t long /dev/sdg to start a Long test. You do not have any Long tests showing as completed. I highly recommend you run a Long test periodically (weekly is what I recommend for a Home system).
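To confirm whether a Long test has ever completed, you can scan the self-test log, where long tests are listed as "Extended offline". Here is a minimal Python sketch that does so and also works out the spacing between logged tests. It assumes the column layout of the "SMART Extended Self-test Log" section shown above. The function name is hypothetical, and the sample rows are the three newest entries from the posted log.

```python
import re

# "# 1  Short offline   Completed without error   00%   50888   -"
TEST_LINE = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)"
)


def parse_selftests(log_text):
    """Return a list of dicts (num, type, status, hours) for each self-test log row."""
    tests = []
    for line in log_text.splitlines():
        m = TEST_LINE.match(line)
        if m:
            tests.append({
                "num": int(m.group(1)),
                "type": m.group(2),
                "status": m.group(3),
                "hours": int(m.group(5)),  # power-on hours when the test ran
            })
    return tests


# The three newest rows from the log posted above.
SAMPLE = """\
# 1  Short offline       Completed without error       00%     50888         -
# 2  Short offline       Completed without error       00%     50864         -
# 3  Short offline       Completed without error       00%     50840         -
"""

tests = parse_selftests(SAMPLE)
has_long = any(t["type"].startswith("Extended") for t in tests)
gaps = [a["hours"] - b["hours"] for a, b in zip(tests, tests[1:])]
print(f"extended test logged: {has_long}; hours between tests: {gaps}")
# extended test logged: False; hours between tests: [24, 24]
```

On the posted log this shows Short tests running about once a day and no Extended test at all, which matches the advice to start one with smartctl -t long.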

A test hasn’t failed, a value is “just” getting worse.

Reporting these events in alerts without giving the user a way to get more information in the UI is not a great user experience. I hope you consider expanding the functionality here.

Sounds like one for Feature Requests on the TrueNAS Community Forums ;)