ATA error count increased from 0 to 1

Hello, I'm using TrueNAS Core 13.0-U6.1. I have a raidz1 pool of 5 x 10TB WD Red drives, about 4 years old. Last Sunday I received an email alert with the text in the title.
Checking the logs, I saw: smartd 1616 - - Device: /dev/ada1, ATA error count increased from 0 to 1. Then, the next day, two more errors.
Since Monday, nothing. This is the result of smartctl -a on the affected drive:

root@truenas:~ # smartctl -a /dev/ada1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    VCH45HHP
LU WWN Device Id: 5 000cca 0b0cffda2
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Feb 22 18:05:48 2025 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1002) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   141   141   024    Pre-fail  Always       -       587 (Average 567)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       36544
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1539
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1539
194 Temperature_Celsius     0x0002   118   118   000    Old_age   Always       -       55 (Min/Max 20/61)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 36409 hours (1517 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 40 70 8c ad 40 08  21d+05:41:23.254  READ FPDMA QUEUED
  61 08 f8 b0 e0 40 40 08  21d+05:41:16.002  WRITE FPDMA QUEUED
  61 18 f0 b8 2f 6f 40 08  21d+05:41:16.002  WRITE FPDMA QUEUED
  61 08 e8 78 a3 5f 40 08  21d+05:41:16.002  WRITE FPDMA QUEUED
  61 10 e0 88 b9 be 40 08  21d+05:41:16.001  WRITE FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 36399 hours (1516 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 b8 50 b3 06 40 08  20d+20:16:29.588  READ FPDMA QUEUED
  60 00 c0 50 bb 06 40 08  20d+20:16:22.643  READ FPDMA QUEUED
  60 00 b0 50 ab 06 40 08  20d+20:16:22.642  READ FPDMA QUEUED
  60 00 a8 50 a3 06 40 08  20d+20:16:21.714  READ FPDMA QUEUED
  60 00 a0 50 9b 06 40 08  20d+20:16:21.714  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 36389 hours (1516 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 f0 48 68 15 61 40 08  20d+09:31:35.104  READ FPDMA QUEUED
  60 c8 58 30 25 61 40 08  20d+09:31:28.140  READ FPDMA QUEUED
  60 d8 50 58 1d 61 40 08  20d+09:31:28.140  READ FPDMA QUEUED
  60 78 40 f0 11 61 40 08  20d+09:31:27.429  READ FPDMA QUEUED
  60 40 38 b0 11 61 40 08  20d+09:31:27.429  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Does the drive need replacing? Do I need to test anything else?
Thank you for the answers.

I don’t know why it’s throwing those errors.

The only thing that stands out to me is that the drive appears to be running pretty hot, at 55°C with a max recorded temp of 61°C.

I didn’t find the datasheet for your specific model, but it’s possible it’s formally within the allowed operating temperature. Had it been one of my Toshibas, it would not have been.
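To put numbers on the temperature concern, here is a minimal sketch that pulls the current, minimum and maximum readings out of the attribute 194 line posted above and compares the recorded maximum against a limit. The 60°C limit is only a placeholder assumption, since the real figure has to come from the drive's datasheet, and `parse_temperature` is a made-up helper name.

```python
import re

# Line copied from the smartctl output in the first post.
ATTR_194 = (
    "194 Temperature_Celsius     0x0002   118   118   000    "
    "Old_age   Always       -       55 (Min/Max 20/61)"
)

# Placeholder limit for illustration only; not taken from any datasheet.
ASSUMED_MAX_OPERATING_C = 60


def parse_temperature(line):
    """Return (current, minimum, maximum) in Celsius from an attribute 194 row."""
    match = re.search(r"-\s+(\d+) \(Min/Max (\d+)/(\d+)\)\s*$", line)
    if match is None:
        raise ValueError("unrecognised attribute 194 format")
    current, minimum, maximum = (int(group) for group in match.groups())
    return current, minimum, maximum


current, minimum, maximum = parse_temperature(ATTR_194)
print(f"current={current}C min={minimum}C max={maximum}C")
if maximum > ASSUMED_MAX_OPERATING_C:
    print(f"recorded maximum is above the assumed {ASSUMED_MAX_OPERATING_C}C limit")
```

The raw-value layout differs between vendors, so the regular expression here only fits rows formatted like this one.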

Oh, and over the 1500+ days it’s been operating, not a single SMART test has been recorded. I recommend you run a long one ASAP and schedule them to run at least once or twice a month. Expect each to take around 16 to 17 hours, going by the 1002-minute polling time in your output. Short tests only take 2 minutes but also barely test anything at all. I’ve scheduled one to run daily on all my drives, but I don’t expect much out of it.
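The duration estimate comes from the "recommended polling time" lines in the smartctl output. A small sketch of that arithmetic, using the lines as posted (`polling_minutes` is just an illustrative helper name):

```python
import re

# Lines copied from the smartctl output in the first post.
SNIPPET = """\
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1002) minutes.
"""


def polling_minutes(text, kind):
    """Return the recommended polling time in minutes for 'Short' or 'Extended'."""
    pattern = (
        rf"{kind} self-test routine\s*\n"
        r"recommended polling time:\s*\(\s*(\d+)\) minutes"
    )
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"no polling time found for {kind!r}")
    return int(match.group(1))


extended = polling_minutes(SNIPPET, "Extended")
print(f"extended test: {extended} min = {extended / 60:.1f} h")
```

The drive reports this as a minimum on an otherwise idle disk, so a test on a busy pool can take noticeably longer.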

Looking at your smartctl output, I see nothing that points to immediate failure of the disk. In my experience, ATA errors have always been a wiring or controller fault.

How is the disk connected to the system? Consider reseating connections and/or swapping wires.
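When weighing a disk fault against a link fault, the counters usually looked at first are attributes 5, 197 and 198 (reallocated, pending and offline-uncorrectable sectors) and attribute 199, which conventionally counts interface CRC errors. Here is a rough sketch that extracts their raw values from rows like the ones posted above. The grouping into "media" and "link" is my own labelling for illustration, not something smartctl reports.

```python
# Rows copied from the smartctl attribute table in the first post.
ATTRIBUTE_LINES = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

# Illustrative grouping (an assumption, not smartctl terminology).
MEDIA_ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
LINK_ATTRS = ("UDMA_CRC_Error_Count",)


def raw_values(text):
    """Map attribute name to the leading integer of its RAW_VALUE column."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = int(fields[9])
    return values


values = raw_values(ATTRIBUTE_LINES)
for label, names in (("media", MEDIA_ATTRS), ("link", LINK_ATTRS)):
    for name in names:
        print(f"{label:5s} {name:24s} raw={values[name]}")
```

All four counters are zero in the posted output, so on their own they do not single out either cause.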

Hello,

Thanks for the answers! I ran a long SMART test (it took about a day, I think), and here’s the output now:

root@truenas:~ # smartctl -a /dev/ada1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    VCH45HHP
LU WWN Device Id: 5 000cca 0b0cffda2
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 24 07:01:26 2025 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1002) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   141   141   024    Pre-fail  Always       -       587 (Average 567)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       36581
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1541
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1541
194 Temperature_Celsius     0x0002   120   120   000    Old_age   Always       -       54 (Min/Max 20/61)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 36409 hours (1517 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 40 70 8c ad 40 08  21d+05:41:23.254  READ FPDMA QUEUED
  61 08 f8 b0 e0 40 40 08  21d+05:41:16.002  WRITE FPDMA QUEUED
  61 18 f0 b8 2f 6f 40 08  21d+05:41:16.002  WRITE FPDMA QUEUED
  61 08 e8 78 a3 5f 40 08  21d+05:41:16.002  WRITE FPDMA QUEUED
  61 10 e0 88 b9 be 40 08  21d+05:41:16.001  WRITE FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 36399 hours (1516 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 b8 50 b3 06 40 08  20d+20:16:29.588  READ FPDMA QUEUED
  60 00 c0 50 bb 06 40 08  20d+20:16:22.643  READ FPDMA QUEUED
  60 00 b0 50 ab 06 40 08  20d+20:16:22.642  READ FPDMA QUEUED
  60 00 a8 50 a3 06 40 08  20d+20:16:21.714  READ FPDMA QUEUED
  60 00 a0 50 9b 06 40 08  20d+20:16:21.714  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 36389 hours (1516 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 f0 48 68 15 61 40 08  20d+09:31:35.104  READ FPDMA QUEUED
  60 c8 58 30 25 61 40 08  20d+09:31:28.140  READ FPDMA QUEUED
  60 d8 50 58 1d 61 40 08  20d+09:31:28.140  READ FPDMA QUEUED
  60 78 40 f0 11 61 40 08  20d+09:31:27.429  READ FPDMA QUEUED
  60 40 38 b0 11 61 40 08  20d+09:31:27.429  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     36572         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It seems it completed without any errors.
The disks are connected using normal SATA cables to the motherboard. It’s HIGHLY unlikely it’s a connection issue since the server has been sitting in its place since it was built, never moved, but I will check just to be sure.
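One thing worth confirming in that output is that the extended test ran after the last logged error, so a clean result actually covers the period in question. A minimal sketch of that check, using the relevant lines as posted (the helper names are made up for illustration):

```python
import re

# Lines copied from the second smartctl output above.
LOG_EXCERPT = """\
Error 3 occurred at disk power-on lifetime: 36409 hours (1517 days + 1 hours)
Error 2 occurred at disk power-on lifetime: 36399 hours (1516 days + 15 hours)
Error 1 occurred at disk power-on lifetime: 36389 hours (1516 days + 5 hours)
# 1  Extended offline    Completed without error       00%     36572         -
"""


def error_hours(text):
    """Power-on hours at which each logged ATA error occurred."""
    pattern = r"^Error \d+ occurred at disk power-on lifetime: (\d+) hours"
    return [int(hours) for hours in re.findall(pattern, text, re.M)]


def selftest_entries(text):
    """Return (description, status, lifetime_hours) for each self-test log row."""
    pattern = r"^#\s*\d+\s{2,}(.+?)\s{2,}(.+?)\s{2,}\d+%\s+(\d+)\s"
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in re.finditer(pattern, text, re.M)
    ]


latest_error = max(error_hours(LOG_EXCERPT))
for description, status, hours in selftest_entries(LOG_EXCERPT):
    gap = hours - latest_error
    print(f"{description}: {status}, {gap} power-on hours after the last error")
```

The lifetime counter has whole-hour resolution, so the gap is only accurate to within about an hour.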

IMO that output looks perfectly fine. A bit toasty on the temps, but nothing indicating imminent failure. What’s the motherboard? Any chance the chipset is also running hot? I had to slap a fan onto mine, or else it would sit at >70°C at idle…

Once again, the only times I’ve had those errors, it was either cheap wires, a wire that needed reseating, or a controller getting toasty.

If it is in the budget, you can grab yourself a cold spare & burn it in. Not critical, but a spare is generally a good thing to have on hand with raidz1.

Considering you have 4 years on these drives & this is your first SMART test, it would be a very good idea to start running regular tests & scrubs: a short test once or twice a week, a long test every 2 weeks to a month, and a scrub once or twice a month (don’t let the long tests and scrubs overlap).
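A quick way to sanity-check a schedule for overlaps before entering it in the UI is sketched below. The days of the month are hypothetical examples, not taken from this thread, and the sketch assumes a long test can spill into the following day while a scrub finishes within one day. Adjust both to what you actually observe.

```python
# Hypothetical schedule (days of the month); not taken from the thread.
LONG_TEST_DAYS = {1, 15}
SCRUB_DAYS = {8, 22}

# Assumption: a ~17 h long test may run into the next calendar day.
LONG_TEST_SPAN_DAYS = 2
# Assumption: a scrub completes on the day it starts.
SCRUB_SPAN_DAYS = 1


def busy_days(start_days, span):
    """Days of the month occupied by tasks starting on start_days.

    Month wrap-around is ignored to keep the sketch short.
    """
    return {day + offset for day in start_days for offset in range(span)}


def overlapping_days(long_days, scrub_days):
    """Sorted days on which a long test and a scrub would both be running."""
    long_busy = busy_days(long_days, LONG_TEST_SPAN_DAYS)
    scrub_busy = busy_days(scrub_days, SCRUB_SPAN_DAYS)
    return sorted(long_busy & scrub_busy)


clashes = overlapping_days(LONG_TEST_DAYS, SCRUB_DAYS)
print("no overlap" if not clashes else f"overlap on days: {clashes}")
```

With these example values nothing collides; moving the scrub to day 2 would clash with a long test started on day 1.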