Drive with just a few R/W errors

StrandedCamel · November 17, 2024, 2:33pm

I recently had a drive in my main data array fail. It was a non-NAS drive (one of those that WD markets as being for surveillance systems), so I was neither surprised nor upset. I replaced the failed drive with a WD Red and started resilvering.

Shortly after that was done, my NEW drive was failed due to 31 read errors and 89 write errors. Doesn’t seem like much, but on the other hand, I have 6+-year-old drives that all have 0 errors, so it ain’t normal nor good. But I chalked it up to bad sectors being discovered early in new drives’ lifecycles, and proceeded to replace the new drive with itself and run a scrub.

About 60% of the way through the 10 TB scrub, the read error count jumped from 31 to 47 and the write error count jumped from 89 to 232. Out of some 6 billion, that seems like peanuts, but of course it’s not – both numbers should be 0.

The thing is, when I’ve had drives fail before, they’ve shown millions of errors, not tens or hundreds. Furthermore, this drive has passed a SMART test with no errors or issues.

For what it’s worth, the drive is a WD101EFBX-68B0AN0.

Where should I go from here?

prez02 · November 17, 2024, 4:05pm

Short or long tests?

And aren’t those ZFS errors, which can also be caused by bad cabling, overheating etc?

winnielinnie · November 17, 2024, 4:16pm

You already were met with I/O errors the first time you tried. Why would you try to force it to resilver again without at least running a short (and preferably long) SMART test?

Did it pass a short SMART test?

Stux · November 17, 2024, 4:22pm

Either you have a cabling-type issue, or the drive is sick and should be replaced.

Best to run a long test (shorts are almost useless) and then check the smart results.

StrandedCamel · November 17, 2024, 4:22pm

Short tests. I’m going to run a long test as soon as the scrub is done.

As to the types of errors, I really don’t know what type they are! Hence my question here.

StrandedCamel · November 17, 2024, 4:25pm

[quote=“Stux, post:4, topic:24656, full:true”]
Either you have a cabling-type issue, or the drive is sick and should be replaced.[/quote]

Well, that narrows it down a bit. Thanks.

I’ll be doing just that on your advice. Much obliged.

StrandedCamel · November 17, 2024, 4:26pm

[quote=“winnielinnie, post:3, topic:24656, full:true”]

You already were met with I/O errors the first time you tried. Why would you try to force it to resilver again without at least running a short (and preferably long) SMART test? [/quote]

I did, and it passed the short SMART test. Now on to a long one…

winnielinnie · November 17, 2024, 5:11pm

Before or after the first resilver attempt?

And what about smartctl -l error <drive>

joeschmuck · November 17, 2024, 5:28pm

So these are ZFS errors, not necessarily a drive failure.

As the others have stated, you have an issue, but the drive is likely not at fault.

Post the output of smartctl -x /dev/adaX where adaX is the drive name.

Also post the outputs of zpool status and zpool list. I want to see how full things are and the actual configuration.

Next, as has been recommended, run a SMART Long test (smartctl -t long /dev/adaX) and let it run. You can check the status of the test by running smartctl -a /dev/adaX, but with a 10TB drive, it will likely run for 18 hours unless it fails before then.
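For anyone following along, here is a minimal sketch of checking on the Long test without reading the whole report each time. The helper name selftest_remaining and the device name ada5 are assumptions for illustration; it only relies on the "NN% of test remaining." line that smartctl -a prints while a self-test is running.

```shell
# Print the "% of test remaining" figure from `smartctl -a` output on stdin.
# Prints nothing when no self-test is in progress.
selftest_remaining() {
    awk '/of test remaining/ { sub(/%.*/, "", $1); print $1 }'
}

# Usage (ada5 is an assumed device name; substitute your own):
#   smartctl -t long /dev/ada5                  # start the extended self-test
#   smartctl -a /dev/ada5 | selftest_remaining  # poll progress later
```

If the helper prints nothing, the test has either finished or was aborted, and the self-test log in the full smartctl output will say which.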

Once you have posted the requested data (the smartctl -x stuff, in code brackets) then we can see if there is any obvious issue immediately while waiting for the Long test to pass. Given what you have said, I doubt the drive is bad. It sounds more like a data cable, HBA, or power supply.

If nothing is obvious after the Long test (by the way, run smartctl -x again after the Long test has completed and post it), then there are a few tests you should probably do to, at a minimum, rule out some hardware, if not flat-out identify the culprit.

Inspect all the fans, including the one in the power supply to make sure they are all working.
CPU Stress Test (Prime95 or similar for several hours, maybe 4 hours would be fine)
RAM Test (I prefer 4 complete passes)
You didn’t say if the drives are in a carrier or hard wired, but consider using a different carrier if this applies.

The goal is to rule out items. If you are lucky, the problem will show itself sooner rather than later.

StrandedCamel · November 18, 2024, 1:29am

Thanks so much for your help – I greatly appreciate it!

Here’s the smartctl output:

$ smartctl -x /dev/ada5
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFBX-68B0AN0
Serial Number:    VH1TPJVM
LU WWN Device Id: 5 000cca 0d8d95148
Firmware Version: 85.00A85
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 17 22:19:53 2024 -03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: (   87) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: (   2) minutes.
Extended self-test routine
recommended polling time: (1031) minutes.
SCT capabilities:       (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   129   129   054    -    104
  3 Spin_Up_Time            POS---   100   100   024    -    0
  4 Start_Stop_Count        -O--C-   100   100   000    -    3
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    99
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    3
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
193 Load_Cycle_Count        -O--C-   100   100   000    -    6
194 Temperature_Celsius     -O----   147   147   000    -    44 (Min/Max 25/47)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   5501  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x2f       GPL     -        1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        89         -
# 2  Short offline       Completed without error       00%        78         -
# 3  Short offline       Completed without error       00%        53         -
# 4  Short offline       Completed without error       00%         6         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        DST executing in background (3)
Current Temperature:                    44 Celsius
Power Cycle Min/Max Temperature:     38/47 Celsius
Lifetime    Min/Max Temperature:     25/47 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (109)

Index    Estimated Time   Temperature Celsius
 110    2024-11-17 20:12    44  *************************
 ...    ..(108 skipped).    ..  *************************
  91    2024-11-17 22:01    44  *************************
  92    2024-11-17 22:02    43  ************************
  93    2024-11-17 22:03    44  *************************
  94    2024-11-17 22:04    43  ************************
 ...    ..(  7 skipped).    ..  ************************
 102    2024-11-17 22:12    43  ************************
 103    2024-11-17 22:13    44  *************************
 ...    ..(  5 skipped).    ..  *************************
 109    2024-11-17 22:19    44  *************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               3  ---  Lifetime Power-On Resets
0x01  0x010  4              99  ---  Power-on Hours
0x01  0x018  6     18222005812  ---  Logical Sectors Written
0x01  0x020  6       111561677  ---  Number of Write Commands
0x01  0x028  6      9937009636  ---  Logical Sectors Read
0x01  0x030  6        14822300  ---  Number of Read Commands
0x01  0x038  6       359641100  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              99  ---  Spindle Motor Power-on Hours
0x03  0x010  4              99  ---  Head Flying Hours
0x03  0x018  4               6  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              44  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              44  ---  Current Temperature
0x05  0x010  1              40  N--  Average Short Term Temperature
0x05  0x018  1               -  N--  Average Long Term Temperature
0x05  0x020  1              47  ---  Highest Temperature
0x05  0x028  1              25  ---  Lowest Temperature
0x05  0x030  1              44  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1               -  N--  Highest Average Long Term Temperature
0x05  0x048  1               -  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             135  ---  Number of Hardware Resets
0x06  0x010  4               1  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x040  7              25  ---  Vendor Specific
0xff  0x048  7              14  ---  Vendor Specific
0xff  0x050  7               0  ---  Vendor Specific
0xff  0x058  7               0  ---  Vendor Specific
0xff  0x060  7               0  ---  Vendor Specific
0xff  0x068  7               0  ---  Vendor Specific
0xff  0x070  7               0  ---  Vendor Specific
0xff  0x078  7               0  ---  Vendor Specific
0xff  0x080  7              20  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2           41  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2           41  R_ERR response for host-to-device data FIS
0x0005  2           47  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2           47  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2          131  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2          130  Device-to-host register FISes sent due to a COMRESET
0x000b  2           76  CRC errors within host-to-device FIS
0x000d  2           12  Non-CRC errors within host-to-device FIS
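Several of the SATA Phy event counters above are non-zero (for example, 76 CRC errors within host-to-device FIS and 131 PhyRdy-to-PhyNRdy transitions). These are link-level counters, so they may be worth watching alongside the cabling/HBA theory raised earlier in the thread. A minimal sketch for listing only the non-zero ones follows; the helper name nonzero_phy_counters and the device name ada5 are assumptions for illustration.

```shell
# Read smartctl output on stdin and print every SATA Phy event counter
# whose value is non-zero, as: ID <tab> value <tab> description.
nonzero_phy_counters() {
    awk '
        /^SATA Phy Event Counters/ { in_section = 1; next }
        in_section && NF == 0      { in_section = 0 }   # blank line ends the section
        in_section && /^0x[0-9a-f]+/ && $3 + 0 > 0 {
            desc = $4
            for (i = 5; i <= NF; i++) desc = desc " " $i
            printf "%s\t%s\t%s\n", $1, $3, desc
        }
    '
}

# Usage (ada5 is an assumed device name):
#   smartctl -l sataphy /dev/ada5 | nonzero_phy_counters
```

Comparing the list before and after a scrub, or before and after reseating a cable, would show whether the counters are still climbing.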

Sure. Here they are:

$ zpool status
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:02:30 with 0 errors on Sun Nov 17 03:47:30 2024
config:

NAME        STATE     READ WRITE CKSUM
boot-pool   ONLINE       0     0     0
 mirror-0  ONLINE       0     0     0
   ada4p2  ONLINE       0     0     0
   ada0p2  ONLINE       0     0     0

errors: No known data errors

  pool: ssd2
 state: ONLINE
  scan: scrub repaired 0B in 00:05:54 with 0 errors on Sun Nov 17 00:05:54 2024
config:

NAME                                            STATE     READ WRITE CKSUM
ssd2                                            ONLINE       0     0     0
 mirror-0                                      ONLINE       0     0     0
   gptid/bd519b48-985a-11ef-b55f-6805cac2af59  ONLINE       0     0     0
   gptid/bd55fb93-985a-11ef-b55f-6805cac2af59  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
  scan: resilvered 390M in 00:00:34 with 0 errors on Sun Nov 17 22:13:03 2024
config:

NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0     0
 raidz3-0                                      ONLINE       0     0     0
   gptid/762911e4-cf54-11eb-99b1-6805cac2af59  ONLINE       0     0     0
   gptid/fd4e845a-a206-11ef-81de-6805cac2af59  ONLINE       0     0     0
   gptid/75327973-d177-11eb-b3ea-6805cac2af59  ONLINE       0     0     0
   gptid/4e7ac0fa-d02e-11eb-8116-6805cac2af59  ONLINE       0     0     0
   gptid/a80ef512-0112-11eb-85ef-244bfe964b84  ONLINE       0     0     0
   gptid/a892dff0-0112-11eb-85ef-244bfe964b84  ONLINE       0     0     0
   gptid/a8c81491-0112-11eb-85ef-244bfe964b84  ONLINE       0     0     0
   gptid/a8c45c3f-0112-11eb-85ef-244bfe964b84  ONLINE       0     0     0
   gptid/a90173ee-0112-11eb-85ef-244bfe964b84  ONLINE       0     0     0
   gptid/a94bfc5b-0112-11eb-85ef-244bfe964b84  ONLINE       0     0     0

errors: No known data errors

The above information is from after I replaced the drive with itself, zpool cleared it, and ran a scrub. So is the following:

$ zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool  95.5G  8.96G  86.5G        -         -     9%     9%  1.00x    ONLINE  -
ssd2       1.81T   111G  1.70T        -         -     2%     5%  1.00x    ONLINE  /mnt
tank       90.9T  84.7T  6.19T        -         -    16%    93%  1.00x    ONLINE  /mnt

It’s running as I type this. It has over 90% remaining, and so will take a while.

Again, thanks a ton for your help. I’ll report back with the results of the long SMART test as soon as it’s done.

joeschmuck · November 18, 2024, 3:46am

The drive looks perfectly fine, and I’m 99% certain it still will after the Long SMART test as well.

The only problem I see right now is that you have passed the 90% storage mark on “tank”, and this is a bad thing. ZFS slows down during write operations due to fragmentation. However, this should not be causing any issues.
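As a rough sketch, a check like the following could flag any pool at or above a chosen fill level. The helper name pools_over_threshold and the 90% limit are assumptions for illustration; it expects the tab-separated output of zpool list -H -o name,capacity on stdin.

```shell
# Print "pool NN%" for every pool whose capacity is at or above the given
# percentage. Input: tab-separated "name<TAB>NN%" lines on stdin.
pools_over_threshold() {
    awk -F '\t' -v limit="$1" '
        {
            cap = $2
            sub(/%$/, "", cap)              # strip the trailing percent sign
            if (cap + 0 >= limit) print $1, cap "%"
        }
    '
}

# Usage:
#   zpool list -H -o name,capacity | pools_over_threshold 90
```

With the zpool list output posted above, only tank (93%) would be reported.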

With all this data, and the fact that you unfortunately already cleared the errors, we really have nothing to troubleshoot.

After the Long test is done, the only thing I could advise besides running the CPU Stress Test, RAM Test, and checking your hardware connections, is to run another Scrub on that pool to see if errors are generated.

For argument’s sake: if none of the tests fail but the Scrub generates errors, that means the problem is reproducible, which is a good thing. Now you make a change and run a Scrub again to see if you have isolated it.

Example: Your drive (Serial Number: VH1TPJVM) throws more errors. Now you swap the data and power connectors (at the drive side) from that drive with another drive and run the Scrub again. Did the problem move to the other drive or stay with the same drive? If the problem moved to the other drive, then the problem is the HBA or data cable, or possibly (but unlikely) the power cable. Next you would swap the two cables at the HBA and run a Scrub. If the problem moves to another drive, then the problem is in the HBA or data cable.

Now you need to figure out if it is always the same HBA connection (possibly the second SATA cable connector out of the SF connector regardless of which cable you use). You can see where this is going. Hopefully you can repeat the failure. But first, run the tests on the CPU and RAM, and make sure it all checks out.
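The swap-and-scrub routine above can be sketched as follows, so that each round leaves a short record of which device picked up errors. The helper name devices_with_errors and the gptid names in the example are hypothetical; the pool name tank comes from the output posted earlier. Large counts that zpool abbreviates (such as 1.2K) are reported as printed.

```shell
# Read `zpool status` output on stdin and print each vdev or device line
# whose READ, WRITE, or CKSUM column is not zero.
devices_with_errors() {
    awk '
        $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ && NF >= 5 &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage, repeated after every cable or port swap:
#   zpool scrub tank
#   zpool status tank | devices_with_errors   # run once the scrub has finished
```

Noting the output next to each change (which cable, which port) makes it easier to tell whether the errors follow the drive, the cable, or the HBA connection.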

With your particular problem, it is too bad you don’t have ECC RAM. There is a possibility your data is becoming corrupt with every Scrub, but hopefully the possibility is low. It is good you have a RAIDZ3.
