SMART Errors

mooseca · June 4, 2024, 6:06am

Hi guys.

I’ve been getting SMART errors for a couple of months, and the number of unreadable sectors has increased from 24 to 72. Is this fixable, or is it time for a new hard drive?

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (SMR)
Device Model:     ST4000DM005-2DP166
Serial Number:    ZDH183T6
LU WWN Device Id: 5 000c50 0a293d962
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun  4 16:03:49 2024 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (  581) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 624) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   082   064   006    -    144417880
  3 Spin_Up_Time            PO----   094   093   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    615
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    8
  7 Seek_Error_Rate         POSR--   093   060   045    -    1900605876
  9 Power_On_Hours          -O--CK   036   036   000    -    56647h+22m+55.006s
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    488
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   092   092   000    -    8
188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   074   055   040    -    26 (Min/Max 21/31)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   099   099   000    -    2543
193 Load_Cycle_Count        -O--CK   093   093   000    -    15813
194 Temperature_Celsius     -O---K   026   045   000    -    26 (0 7 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    72
198 Offline_Uncorrectable   ----C-   100   100   000    -    72
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    55457h+52m+51.897s
241 Total_LBAs_Written      ------   100   253   000    -    107687467030
242 Total_LBAs_Read         ------   100   253   000    -    219100461370
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    512  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS    8160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    9048  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      16  Device vendor specific log
0xd1       GPL     VS     136  Device vendor specific log
0xd2       GPL     VS   10000  Device vendor specific log
0xd3       GPL     VS    1920  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 8
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 [7] log entry is empty
Error 7 [6] log entry is empty
Error 6 [5] log entry is empty
Error 5 [4] log entry is empty
Error 4 [3] occurred at disk power-on lifetime: 53534 hours (2230 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b0 b9 b0 b0 00 00  Error: UNC at LBA = 0x1b0b9b0b0 = 7259926704

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 08 00 00 f9 97 95 88 40 00  5d+09:51:34.714  READ FPDMA QUEUED
  60 00 00 01 00 00 01 b0 b9 b5 10 40 00  5d+09:51:31.223  READ FPDMA QUEUED
  60 00 00 02 18 00 01 b0 b9 b2 60 40 00  5d+09:51:31.223  READ FPDMA QUEUED
  60 00 00 02 00 00 01 b0 b9 af 28 40 00  5d+09:51:31.223  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 00 00  5d+09:51:31.122  READ LOG EXT

Error 3 [2] occurred at disk power-on lifetime: 53534 hours (2230 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b0 b9 b0 b8 00 00  Error: UNC at LBA = 0x1b0b9b0b8 = 7259926712

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 01 00 00 01 b0 b9 b5 10 40 00  5d+09:51:27.934  READ FPDMA QUEUED
  60 00 00 02 18 00 01 b0 b9 b2 60 40 00  5d+09:51:27.934  READ FPDMA QUEUED
  60 00 00 01 00 00 01 b0 b6 f9 48 40 00  5d+09:51:26.849  READ FPDMA QUEUED
  60 00 00 01 00 00 01 b0 b6 f7 08 40 00  5d+09:51:26.845  READ FPDMA QUEUED
  60 00 00 08 00 00 01 b0 b6 ec d8 40 00  5d+09:51:26.836  READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 53534 hours (2230 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 6f 10 1a 08 00 00  Error: UNC at LBA = 0x16f101a08 = 6158293512

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 08 00 00 f9 96 d3 a8 40 00  5d+09:26:21.466  READ FPDMA QUEUED
  60 00 00 06 00 00 01 6f 10 24 88 40 00  5d+09:26:15.517  READ FPDMA QUEUED
  60 00 00 08 00 00 01 6f 10 1c 88 40 00  5d+09:26:15.517  READ FPDMA QUEUED
  60 00 00 08 00 00 01 6f 10 14 88 40 00  5d+09:26:15.517  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 00 00  5d+09:26:15.426  READ LOG EXT

Error 1 [0] occurred at disk power-on lifetime: 53534 hours (2230 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 6f 10 1a 30 00 00  Error: UNC at LBA = 0x16f101a30 = 6158293552

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 06 00 00 01 6f 10 24 88 40 00  5d+09:26:08.623  READ FPDMA QUEUED
  60 00 00 08 00 00 01 6f 10 1c 88 40 00  5d+09:26:08.622  READ FPDMA QUEUED
  60 00 00 01 00 00 01 6f 0b 13 88 40 00  5d+09:26:07.115  READ FPDMA QUEUED
  60 00 00 08 00 00 01 6f 0b 0b 88 40 00  5d+09:26:07.113  READ FPDMA QUEUED
  60 00 00 05 00 00 01 6f 0b 03 88 40 00  5d+09:26:06.981  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     56563         4880373056
# 2  Extended offline    Completed: read failure       50%     56459         4880373048
# 3  Extended offline    Completed: read failure       10%     55835         7658382072
# 4  Extended offline    Completed: read failure       10%     55138         7658382072
# 5  Extended offline    Completed without error       00%     54394         -
# 6  Extended offline    Completed without error       00%     54079         -
# 7  Extended offline    Interrupted (host reset)      90%     54020         -
# 8  Extended offline    Completed: read failure       10%     53842         7658382080
# 9  Extended offline    Completed without error       00%     53779         -
#10  Extended offline    Completed without error       00%     53056         -
#11  Extended offline    Interrupted (host reset)      00%     52385         -
#12  Extended offline    Completed without error       00%     51652         -
#13  Extended offline    Completed without error       00%     50929         -
#14  Extended offline    Completed without error       00%     50260         -
#15  Extended offline    Completed without error       00%     49605         -
#16  Extended offline    Completed without error       00%     48860         -
#17  Extended offline    Completed without error       00%     48161         -
#18  Extended offline    Completed without error       00%     47417         -
#19  Extended offline    Completed without error       00%     46696         -
1 of 5 failed self-tests are outdated by newer successful extended offline self-test # 5

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    26 Celsius
Power Cycle Min/Max Temperature:     21/31 Celsius
Lifetime    Min/Max Temperature:      7/45 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         3 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     14/55 Celsius
Min/Max Temperature Limit:           10/60 Celsius
Temperature History Size (Index):    128 (97)

Index    Estimated Time   Temperature Celsius
  98    2024-05-30 11:03    25  ******
  99    2024-05-30 12:02    26  *******
 100    2024-05-30 13:01    28  *********
 101    2024-05-30 14:00    30  ***********
 ...    ..(  2 skipped).    ..  ***********
 104    2024-05-30 16:57    30  ***********
 105    2024-05-30 17:56    29  **********
 106    2024-05-30 18:55    29  **********
 107    2024-05-30 19:54    28  *********
 ...    ..(  2 skipped).    ..  *********
 110    2024-05-30 22:51    28  *********
 111    2024-05-30 23:50    27  ********
 112    2024-05-31 00:49    28  *********
 ...    ..(  2 skipped).    ..  *********
 115    2024-05-31 03:46    28  *********
 116    2024-05-31 04:45    29  **********
 ...    ..(  2 skipped).    ..  **********
 119    2024-05-31 07:42    29  **********
 120    2024-05-31 08:41    28  *********
 ...    ..(  5 skipped).    ..  *********
 126    2024-05-31 14:35    28  *********
 127    2024-05-31 15:34    29  **********
   0    2024-05-31 16:33    29  **********
   1    2024-05-31 17:32    29  **********
   2    2024-05-31 18:31    28  *********
   3    2024-05-31 19:30    27  ********
   4    2024-05-31 20:29    27  ********
   5    2024-05-31 21:28    27  ********
   6    2024-05-31 22:27    26  *******
 ...    ..(  2 skipped).    ..  *******
   9    2024-06-01 01:24    26  *******
  10    2024-06-01 02:23    27  ********
  11    2024-06-01 03:22    27  ********
  12    2024-06-01 04:21    26  *******
  13    2024-06-01 05:20    26  *******
  14    2024-06-01 06:19    23  ****
 ...    ..(  2 skipped).    ..  ****
  17    2024-06-01 09:16    23  ****
  18    2024-06-01 10:15    24  *****
  19    2024-06-01 11:14    24  *****
  20    2024-06-01 12:13    25  ******
 ...    ..(  6 skipped).    ..  ******
  27    2024-06-01 19:06    25  ******
  28    2024-06-01 20:05    24  *****
  29    2024-06-01 21:04    24  *****
  30    2024-06-01 22:03    24  *****
  31    2024-06-01 23:02    23  ****
 ...    ..(  2 skipped).    ..  ****
  34    2024-06-02 01:59    23  ****
  35    2024-06-02 02:58    24  *****
  36    2024-06-02 03:57    23  ****
 ...    ..(  4 skipped).    ..  ****
  41    2024-06-02 08:52    23  ****
  42    2024-06-02 09:51    24  *****
  43    2024-06-02 10:50    25  ******
  44    2024-06-02 11:49    26  *******
  45    2024-06-02 12:48    27  ********
  46    2024-06-02 13:47    27  ********
  47    2024-06-02 14:46    28  *********
  48    2024-06-02 15:45    28  *********
  49    2024-06-02 16:44    27  ********
  50    2024-06-02 17:43    26  *******
  51    2024-06-02 18:42    26  *******
  52    2024-06-02 19:41    25  ******
 ...    ..(  3 skipped).    ..  ******
  56    2024-06-02 23:37    25  ******
  57    2024-06-03 00:36    24  *****
 ...    ..( 13 skipped).    ..  *****
  71    2024-06-03 14:22    24  *****
  72    2024-06-03 15:21    25  ******
  73    2024-06-03 16:20    25  ******
  74    2024-06-03 17:19    24  *****
 ...    ..(  5 skipped).    ..  *****
  80    2024-06-03 23:13    24  *****
  81    2024-06-04 00:12    23  ****
  82    2024-06-04 01:11    23  ****
  83    2024-06-04 02:10    22  ***
  84    2024-06-04 03:09    22  ***
  85    2024-06-04 04:08    22  ***
  86    2024-06-04 05:07    21  **
 ...    ..(  3 skipped).    ..  **
  90    2024-06-04 09:03    21  **
  91    2024-06-04 10:02    22  ***
  92    2024-06-04 11:01    23  ****
  93    2024-06-04 12:00    24  *****
  94    2024-06-04 12:59    25  ******
  95    2024-06-04 13:58    27  ********
  96    2024-06-04 14:57    27  ********
  97    2024-06-04 15:56    26  *******

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             488  ---  Lifetime Power-On Resets
0x01  0x010  4           56647  ---  Power-on Hours
0x01  0x018  6    107048466604  ---  Logical Sectors Written
0x01  0x020  6       868989331  ---  Number of Write Commands
0x01  0x028  6    218156754173  ---  Logical Sectors Read
0x01  0x030  6      1012947587  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4       107753387  N--  Spindle Motor Power-on Hours
0x03  0x010  4       107700933  N--  Head Flying Hours
0x03  0x018  4           15813  ---  Head Load Events
0x03  0x020  4               8  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4              72  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4            2544  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4              18  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              26  ---  Current Temperature
0x05  0x010  1              23  ---  Average Short Term Temperature
0x05  0x018  1              25  ---  Average Long Term Temperature
0x05  0x020  1              44  ---  Highest Temperature
0x05  0x028  1               0  ---  Lowest Temperature
0x05  0x030  1              36  ---  Highest Average Short Term Temperature
0x05  0x038  1              15  ---  Lowest Average Short Term Temperature
0x05  0x040  1              33  ---  Highest Average Long Term Temperature
0x05  0x048  1              19  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4            1650  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            1127  ---  Number of Hardware Resets
0x06  0x010  4             643  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

etorix · June 4, 2024, 6:26am

Hiding the relevant data means that you actually do NOT want help, right?

Stux · June 4, 2024, 6:37am

I think you meant “details” instead of “spoiler”, right?

but wrapping in triple backticks ``` would be best.

8 reallocated sectors, 72 pending/unreadable.

It’s beginning to fail, and it’s done nearly 60,000 hours… time to put it out of its misery.
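For anyone who wants to pull those counters out of a report automatically, here is a minimal Python sketch. It assumes the attribute table layout that `smartctl -x` printed above; the `SAMPLE` rows are copied from that report, and `raw_values` is a hypothetical helper name, not part of smartmontools.

```python
import re

# Three rows copied from the attribute table in the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    8
197 Current_Pending_Sector  -O--C-   100   100   000    -    72
198 Offline_Uncorrectable   ----C-   100   100   000    -    72
"""

# Columns: ID, name, flags, value, worst, thresh, fail, raw value.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+(\d+)")


def raw_values(text):
    """Map attribute name to its integer raw value (leading digits only)."""
    found = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            found[match.group(2)] = int(match.group(3))
    return found


values = raw_values(SAMPLE)
for name in ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"):
    print(name, values[name])
```

Logging these values over time is one way to see the trend the original post describes, where the pending count grew from 24 to 72.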

mooseca · June 4, 2024, 6:42am

Lol my bad, yeah just wanted to see someone’s opinion on what all the info meant. Thanks for the responses

mooseca · June 4, 2024, 6:43am

Is it likely it has corrupted data at this stage and how could I check for that? It really only has movies/shows on it.

Stux · June 4, 2024, 6:48am

Well, in a way. It’s refusing to allow those 72 sectors to be read unless you first write to them.

They will generate read errors, but it won’t return the wrong data… it will just refuse to return data.

So technically, it’s not “corrupt”.

Meanwhile, I just had a drive roll over after 65K hours (the power-on hour counter wraps at 65,536 hours), and 1,000 hours later it returned an uncorrectable sector… i.e. it KNOWS it corrupted the sector.

ZFS caught this as a checksum error.

ZFS will catch your pending sectors as read errors.

AND if you have redundancy, it will rewrite the sectors from the redundant data, and the HD will be happy again…

but in effect, those sectors have no redundancy if you only have one drive of redundancy, say if you have mirrors.

BUT those sectors may not be in use by ZFS.

Make sure you have regular scrubs scheduled.
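As a rough way to check how long ago the last scrub finished, here is a small Python sketch. It assumes the `scan:` line format that `zpool status` reports for the Server pool later in this thread; `last_scrub` is a hypothetical helper name, and the "now" timestamp is the local time shown in the smartctl report.

```python
import re
from datetime import datetime

# scan line as reported for the Server pool in this thread.
SCAN = "scan: scrub repaired 0B in 12:43:47 with 0 errors on Sun May 5 12:43:49 2024"

PATTERN = re.compile(r"scrub repaired (\S+) in (\S+) with (\d+) errors on (.+)$")


def last_scrub(scan_line):
    """Return (finish time, error count) from a scrub scan line, or None."""
    match = PATTERN.search(scan_line)
    if match is None:
        return None
    # Day and month names are parsed with the current locale (English here).
    finished = datetime.strptime(match.group(4), "%a %b %d %H:%M:%S %Y")
    return finished, int(match.group(3))


finished, errors = last_scrub(SCAN)
# Local time from the smartctl report: Tue Jun 4 16:03:49 2024.
age = datetime(2024, 6, 4, 16, 3, 49) - finished
print(f"last scrub finished {age.days} days ago with {errors} errors")
```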

SO, you don’t have to replace the disk. You can keep running it, and it will most likely get worse and worse as it reallocates more sectors… and fails to read more sectors… which will begin impacting you more often.

BUT it’s dying. Just replace it.

And since it’s not completely dead, you can zero it before discarding.

essinghigh · June 4, 2024, 7:48am

definitely time for a replacement, preferably with a CMR drive

dan · June 4, 2024, 8:42am

In addition to the bad sectors, it’s consistently failing SMART self-tests. Definitely time to replace it, with a CMR drive.
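To make that concrete, here is a minimal Python sketch that lists the read failures in a self-test log. It assumes the column layout of the "SMART Extended Self-test Log" in the report above; the `LOG` lines are the five most recent entries copied from it, and `read_failures` is a hypothetical helper name.

```python
import re

# The five most recent entries from the self-test log in the report above.
LOG = """\
# 1  Extended offline    Completed: read failure       50%     56563         4880373056
# 2  Extended offline    Completed: read failure       50%     56459         4880373048
# 3  Extended offline    Completed: read failure       10%     55835         7658382072
# 4  Extended offline    Completed: read failure       10%     55138         7658382072
# 5  Extended offline    Completed without error       00%     54394         -
"""

# Test number, remaining percent, lifetime hours, LBA of first error.
FAIL = re.compile(r"^#\s*(\d+)\s+.*read failure\s+(\d+)%\s+(\d+)\s+(\d+)\s*$")


def read_failures(log):
    """Return (test number, lifetime hours, first failing LBA) per read failure."""
    hits = []
    for line in log.splitlines():
        match = FAIL.match(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(3)), int(match.group(4))))
    return hits


failures = read_failures(LOG)
for number, hours, lba in failures:
    print(f"test #{number} failed at {hours} h, first bad LBA {lba}")
```

The four most recent extended tests all failed, and the first failing LBA moved between runs, which fits a drive whose surface is degrading rather than one isolated bad spot.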

mooseca · June 4, 2024, 9:57am

Thanks guys, I just ordered a 12 TB Exos drive to replace it. It was an old faithful drive from my first PC build and has been in a few PCs now, so it’s had a good life. Thanks for all the help!

mooseca · June 4, 2024, 9:59am

Is it safe to use the GUI to remove this disk from the pool now, or is there a command I should be using to do this? I have another 12 TB disk installed that can fit everything on it.

essinghigh · June 4, 2024, 10:02am

Without knowing your pool layout it is hard to say.
Could you show us the output of `zpool status -v`?

mooseca · June 4, 2024, 10:04am

Here is the requested output

pool: Server
state: ONLINE
scan: scrub repaired 0B in 12:43:47 with 0 errors on Sun May 5 12:43:49 2024
config:

    NAME        STATE     READ WRITE CKSUM
    Server      ONLINE       0     0     0
      sdc2      ONLINE       0     0     0
      sda2      ONLINE       0     0     0

errors: No known data errors

pool: boot-pool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: Message ID: ZFS-8000-9P — OpenZFS documentation
scan: scrub repaired 0B in 00:01:38 with 0 errors on Tue Jun 4 03:46:41 2024
config:

    NAME                                       STATE     READ WRITE CKSUM
    boot-pool                                  ONLINE       0     0     0
      ata-SSD_60GB_AA000000000000000897-part2  ONLINE       0     2     0

errors: No known data errors

essinghigh · June 4, 2024, 10:09am

Oh, a stripe. Fun!
This is very much not a recommended way of setting up a pool.
If you lose one of those disks, the entire pool is gone.

Here’s what I’d do.
Add your 12 TB disk to the pool as a mirror of the disk that is not having issues. This is the “extend” button in the WebUI.

After this has completed you can use the “remove” button to remove the faulty disk.
You will now have a mirror, though a little smaller than you might like (as the mirror you’ve created will only have the available space of the smallest drive).

Once that other 12 TB disk arrives, replace the smaller disk with it. You should then be able to expand the pool to use the extra available disk space.

Of course, you could also just add the 12 TB disk you have to the stripe, remove the faulty disk, and then either add that other 12 TB disk to the stripe or convert to a mirror once it arrives, but I’m not going to recommend doing anything that leaves you with no fault tolerance.
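The capacity trade-off between the two layouts can be sketched with a little arithmetic. This is a simplified Python illustration that ignores ZFS overhead; the function names are hypothetical, and the 4 TB size used for the healthy disk is an assumption for illustration only, since its real size is not stated in this thread.

```python
def stripe_capacity(sizes_tb):
    """A stripe adds the disks together, but losing any one disk loses the pool."""
    return sum(sizes_tb)


def mirror_capacity(sizes_tb):
    """A mirror vdev is limited to its smallest member."""
    return min(sizes_tb)


# ASSUMPTION: the healthy disk is taken to be 4 TB purely for illustration.
healthy_tb = 4
new_tb = 12

print("mirror now:", mirror_capacity([healthy_tb, new_tb]), "TB")
print("stripe now:", stripe_capacity([healthy_tb, new_tb]), "TB")
# After the smaller disk is replaced by the second 12 TB disk:
print("mirror later:", mirror_capacity([new_tb, new_tb]), "TB")
```

The mirror stays at the size of the smaller disk until that disk is swapped for the second 12 TB drive, which is why the pool can be expanded at that point.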

mooseca · June 4, 2024, 10:15am

Thank you, I will do what you have suggested. I will definitely set it up with redundancy this time. I didn’t think I would care about losing data at the time, but it is quite annoying.