Suddenly getting errors on two identical hard drives

I have 2x Ironwolf Pro 18TB drives set up as a ZFS mirror. They’ve been running fine for about 18 months but yesterday I got this alert:

Device: /dev/sdb [SAT], ATA error count increased from 0 to 8.

Then today I got this alert:

Device: /dev/sdc [SAT], ATA error count increased from 8 to 11.

This suggests that both disks may have a problem. I kicked off a ZFS scrub which completed successfully. I’ve also just kicked off a long SMART test on both drives - it looks like these will take ~28 hours to complete.

What I wanted to ask is: is there any way I can figure out whether the problem isn’t the drives but something else? I’m running TrueNAS SCALE 24.10.2.1 as a VM inside Proxmox 8.4. Both drives are connected to a Broadcom LSI SAS2008 card, which is passed through to the TrueNAS VM, using a SAS to SATA spider cable.

Please could someone help me understand how to progress? I’ve included the output from smartctl -a for both drives:

=== START OF INFORMATION SECTION ===
Device Model:     ST18000NM000J-2TV103
Serial Number:    ZR528H4Z
LU WWN Device Id: 5 000c50 0db80c932
Firmware Version: SC02
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 10 17:28:02 2025 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (  559) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1534) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       109340048
  3 Spin_Up_Time            0x0003   091   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   045    Pre-fail  Always       -       444618829
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16808
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       57
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   047   000    Old_age   Always       -       39 (Min/Max 29/39)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   084   084   000    Old_age   Always       -       33903
194 Temperature_Celsius     0x0022   039   053   000    Old_age   Always       -       39 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13606 (178 5 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       96106968202
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       14974284770743

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%     16808         -
# 2  Extended offline    Aborted by host               90%     16800         -
# 3  Short offline       Completed without error       00%     15290         -
# 4  Extended offline    Interrupted (host reset)      00%     15239         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

and

=== START OF INFORMATION SECTION ===
Device Model:     ST18000NM000J-2TV103
Serial Number:    WR507D32
LU WWN Device Id: 5 000c50 0ecfa285d
Firmware Version: SN02
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 10 17:31:23 2025 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1551) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   062   062   044    Pre-fail  Always       -       97533848
  3 Spin_Up_Time            0x0003   091   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       127
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       79
  7 Seek_Error_Rate         0x000f   087   060   045    Pre-fail  Always       -       549447125
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       23155
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       77
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   089   089   000    Old_age   Always       -       11
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   064   049   000    Old_age   Always       -       36 (Min/Max 27/37)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   083   083   000    Old_age   Always       -       34247
194 Temperature_Celsius     0x0022   036   051   000    Old_age   Always       -       36 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19852 (249 196 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       125879898110
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       18497958011507

SMART Error Log Version: 1
ATA Error Count: 11 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 11 occurred at disk power-on lifetime: 23153 hours (964 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  15d+11:54:22.749  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:54:22.749  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:54:21.449  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:54:19.375  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:54:19.361  READ FPDMA QUEUED

Error 10 occurred at disk power-on lifetime: 23153 hours (964 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  15d+11:53:02.725  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:53:02.725  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:53:02.717  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:53:02.486  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:53:02.486  READ FPDMA QUEUED

Error 9 occurred at disk power-on lifetime: 23153 hours (964 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  15d+11:52:55.794  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:52:53.855  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:52:51.977  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:52:51.970  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  15d+11:52:51.950  READ FPDMA QUEUED

Error 8 occurred at disk power-on lifetime: 23132 hours (963 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 00 ff ff ff 4f 00  14d+15:12:39.043  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:39.042  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:39.042  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:39.042  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:39.041  WRITE FPDMA QUEUED

Error 7 occurred at disk power-on lifetime: 23132 hours (963 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 00 ff ff ff 4f 00  14d+15:12:36.691  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:36.689  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  14d+15:12:36.689  READ FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:36.689  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  14d+15:12:36.669  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%     23155         -
# 2  Short offline       Completed without error       00%     21638         -
# 3  Extended offline    Interrupted (host reset)      00%     21587         -
# 4  Short offline       Completed without error       00%     14831         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Many thanks!

Something just occurred to me: TrueNAS would have rebooted after installing the latest minor update this morning. I’m wondering if it doesn’t always assign the drives the same sdX name, i.e. what was sdb yesterday is now sdc after the reboot…
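
A rough way to check this, as a sketch only: it assumes a Linux system with the usual udev-managed /dev/disk/by-id directory, and the function name is made up for illustration. It lists which stable by-id names each sdX currently resolves to. Those names usually embed the model plus serial number or the WWN, so they can be compared against the serial numbers in the smartctl output above.

```python
import os


def map_devices_to_stable_ids(by_id_dir="/dev/disk/by-id"):
    """Return {kernel device name: [stable by-id link names]}.

    Hypothetical helper: it only follows the symlinks in by_id_dir,
    so it does not need root and does not touch the disks.
    """
    mapping = {}
    for link in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, link)
        if not os.path.islink(path):
            continue  # skip anything that is not a symlink
        # Resolve the link and keep only the final component, e.g. "sdb"
        kernel_name = os.path.basename(os.path.realpath(path))
        mapping.setdefault(kernel_name, []).append(link)
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for dev, ids in sorted(map_devices_to_stable_ids().items()):
            print(dev, "->", ", ".join(ids))
```

Running this before and after a reboot and comparing the two listings would show whether the sdX names moved between the two serial numbers.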

Yes, this is likely what happened.

You probably only have one drive that is going bad. Double-check the SMART reports from all of them to make sure.
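
As a minimal sketch of that double-check, assuming you have each drive's `smartctl -a` output as text: the snippet below pulls the raw values of a handful of attributes and reports the ones that are nonzero. The list of watched IDs (5, 187, 188, 197, 198, 199) is my own assumption about what is worth looking at, not an official list, and the function name is hypothetical.

```python
# Attribute IDs to watch; this selection is an assumption, adjust to taste.
WATCHED_IDS = {5, 187, 188, 197, 198, 199}


def nonzero_watched_attributes(smartctl_text, watched=WATCHED_IDS):
    """Return {attribute name: raw value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        attr_id = int(fields[0])
        raw = fields[9]  # first token of RAW_VALUE; any trailing "(Min/Max ...)" is ignored
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            flagged[fields[1]] = int(raw)
    return flagged


# A few rows copied from the second report in the original post (serial WR507D32).
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       79
187 Reported_Uncorrect      0x0032   089   089   000    Old_age   Always       -       11
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
print(nonzero_watched_attributes(sample))
```

In the two reports posted above, the drive with serial ZR528H4Z shows zero for all of these, while WR507D32 shows 79 reallocated sectors, 11 reported uncorrectable errors and 1 command timeout, which fits the idea that only one drive is affected.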
