Getting IO error on 55 day old drive. Smart test passes. Please advise

Hi,

I have a drive that has started giving IO errors. The drive is only 55 days into use. Could this be a simple cable problem? I ran a short SMART test, which passed, and cleared the error, but it returned. I am running a long test now.

Please let me know how to confirm that it is the drive before I send it out for warranty.

Thanks

Steve.

truenas_admin@LlanelwyNAS[~]$ zpool status -v
pool: HDDs
state: ONLINE
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use ‘zpool clear’ to mark the device
repaired.
scan: scrub repaired 0B in 03:36:03 with 0 errors on Sun Mar 22 03:36:05 2026
expand: expanded raidz1-0 copied 11.5T in 11:38:55, on Sun Feb 15 15:54:57 2026
config:

    NAME                                      STATE     READ WRITE CKSUM
    HDDs                                      ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        2b377b42-8529-4a87-851e-6731e264ec69  ONLINE       0     0     0
        5703283f-33ed-435e-b481-94389a07fb14  ONLINE       0     0     0
        9a7093d2-039d-4b74-a2d3-bb6703e7dbe5  ONLINE       0     0     0
        260dd39c-6a68-4235-baa0-76f31e8d6cc6  ONLINE       0     0     0
        71f6fb01-7451-4a22-a8d2-b34cfc58f3b5  ONLINE       0     0     0
        5e3f77f3-1418-4e8f-9179-e1cc8e1820a7  ONLINE       0     0     0
        bb88c437-f4f0-42fc-b3d5-568b07bb47a6  FAULTED     28   138     0  too many errors

errors: No known data errors

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:59 with 0 errors on Fri Apr 10 03:46:01 2026
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sde3    ONLINE       0     0     0
        sdc3    ONLINE       0     0     0

errors: No known data errors

Short test passed ok. Long test running now.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.33-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Device Model: ST4000VN006-3CW104
Serial Number: ZW63YXYN
LU WWN Device Id: 5 000c50 0eb9a6fb8
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Apr 11 02:16:22 2026 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress…
90% of test remaining.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 447) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 078 064 006 Pre-fail Always - 66855096
3 Spin_Up_Time 0x0003 099 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 4
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 071 060 045 Pre-fail Always - 12022767
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1326
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 4
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 058 040 Old_age Always - 33 (Min/Max 31/33)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 55
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 80
194 Temperature_Celsius 0x0022 033 042 000 Old_age Always - 33 (0 22 0 0 0)
195 Hardware_ECC_Recovered 0x001a 078 064 000 Old_age Always - 66855096
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1313 (43 9 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 6987110328
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 58367426464

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Self-test routine in progress 90% 1326 -

2 Short offline Completed without error 00% 1326 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try ‘smartctl -x’ for more

Try this from the Resources section.

This is concerning but it does say the hardware corrected it.

What is the result of the Extended test?

Are you 100% certain this is the drive in question?

I don’t think you have a data cable or power cable issue, assuming this is the correct drive. Use lsblk -o NAME,PARTUUID to cross-reference the UUID to the drive ID.
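
As a rough sketch of that cross-reference, the helper below reads `lsblk -o NAME,PARTUUID` output on stdin and prints the partition that carries a given PARTUUID. The function name `find_part_by_uuid` is made up for illustration and is not part of TrueNAS; it only strips the tree-drawing characters lsblk puts in front of partition names.

```shell
# Hypothetical helper: print the partition whose PARTUUID equals $1,
# reading `lsblk -o NAME,PARTUUID` output on stdin.
find_part_by_uuid() {
    want="$1"
    # Partition lines look like "└─sda1 <partuuid>"; drop the leading tree glyphs.
    awk -v want="$want" '$2 == want {
        name = $1
        sub(/^[^A-Za-z0-9]+/, "", name)
        print name
    }'
}

# Usage on the NAS, with the UUID copied from the FAULTED line of zpool status:
#   lsblk -o NAME,PARTUUID | find_part_by_uuid bb88c437-f4f0-42fc-b3d5-568b07bb47a6
```

The partition name (for example `sda1`) gives the disk to query with smartctl. Checking the serial number on the drive label is still worthwhile, because the sdX letters can change between boots.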

I had a look at that flowchart and did not see the IO errors that I was alerted about listed.

Here is the result of that command. If I am reading it right, that is the right ID for the drive giving the error.

The extended test just finished

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 1333 -

2 Short offline Completed without error 00% 1326 -

truenas_admin@LlanelwyNAS[~]$ lsblk -o NAME,PARTUUID
NAME PARTUUID
loop1
sda
└─sda1 bb88c437-f4f0-42fc-b3d5-568b07bb47a6
sdb
└─sdb1 5e3f77f3-1418-4e8f-9179-e1cc8e1820a7
sdc
├─sdc1 67d8a465-deed-40d0-b5ea-99e7f9075e1e
├─sdc2 8872a193-7a12-4bb7-96ec-d880695c2be7
└─sdc3 d699d3a5-dbe0-4166-afb1-d4e4a7e12ee1
sdd
└─sdd1 71f6fb01-7451-4a22-a8d2-b34cfc58f3b5
sde
├─sde1 cef2c4ce-4f6f-45b0-ad02-4c4afb9aca16
├─sde2 f0e1a0bb-efdd-4d4f-a12d-4cfb20cee680
└─sde3 512abad4-d33b-4e8d-8740-4e80baf5579f
sdf
└─sdf1 5703283f-33ed-435e-b481-94389a07fb14
sdg
└─sdg1 260dd39c-6a68-4235-baa0-76f31e8d6cc6
sdh
└─sdh1 2b377b42-8529-4a87-851e-6731e264ec69
sdi
└─sdi1 9a7093d2-039d-4b74-a2d3-bb6703e7dbe5

I have shut the server down, moved the drive to a new SATA port on my HBA card, and swapped its power connection. Running a scrub now to see if I get any more errors.

If you came up with drive sda then you were correct. But remember that drives can change Drive IDs during a reboot or power cycling. It is generally not the SATA cable doing that.

If you ran the SCRUB, did you run the clear first? zpool clear HDDs to remove the recorded errors?
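
For anyone following along, the order is clear, scrub, then re-check the counters. Below is a minimal sketch for the re-check step: it reads `zpool status` output on stdin and prints only the rows whose READ, WRITE or CKSUM counter is non-zero. The helper name `zpool_error_rows` is an assumption for illustration, and it only understands plain integer counters (large counts that zpool abbreviates, such as values with a K suffix, would be skipped).

```shell
# Hypothetical helper: show device rows from `zpool status` (stdin) that
# have a non-zero READ, WRITE or CKSUM counter.
zpool_error_rows() {
    awk 'NF >= 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && ($3 + $4 + $5) > 0 {
        print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
    }'
}

# Typical sequence on the NAS (pool name taken from this thread):
#   sudo zpool clear HDDs
#   sudo zpool scrub HDDs
#   ... wait for the scrub to finish ...
#   sudo zpool status -v HDDs | zpool_error_rows
```

No output after a completed scrub means every counter is still zero since the clear.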

The Drive Troubleshooting Flowchart does not cover all situations, however I would like to update it periodically with new situations. Let me know what happens here. I’d like to know if swapping the SATA cable was the resolution. This would mean to me, the HBA was bad.

Speaking of HBA, which one do you have? Are you providing enough cooling? Cooling is a big problem for folks here who purchase and install a server HBA in a consumer case.

Out of curiosity, did you burn in the drive with badblocks before you installed it? A 4TB drive would not take terribly long, not a week, but possibly just over a day. I might be wrong on how long it would take; I only use it when I get a new drive.

Good Luck.

I did clear before the scrub. The scrub completed without any errors. Will see if it recurs.

It was still sda after the reboot, and I checked the serial on the drive to confirm before reseating it.

This is the HBA card I have. It is a Fujitsu 9211-8i that I got off eBay. It seems to work OK, and the other drives plugged into it seem to be OK too. (I can’t give you the link as it won’t let me post it on this forum.)

It has been solid for about a year; this is the first problem I have had. No idea what caused it; hopefully it was just a loose connection.

My case is a Mini ATX, but it has great airflow as it was my gaming system previously and has a silly number of fans in it.

I did not use badblocks before putting the drive in. I’m quite new to this and did not know of that command.

A 9211-8i card does require a lot of cooling. Hopefully all is okay.

It has been a few days, but the same drive has alerted again. Hmm, maybe it’s time to warranty the drive. The odd thing is that it is still showing clear on SMART checks though.

Can anyone recommend anything else I should check?

zpool status -v
pool: HDDs
state: ONLINE
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use ‘zpool clear’ to mark the device
repaired.
scan: scrub repaired 0B in 03:34:53 with 0 errors on Sat Apr 11 12:53:00 2026
expand: expanded raidz1-0 copied 11.5T in 11:38:55, on Sun Feb 15 15:54:57 2026
config:

    NAME                                      STATE     READ WRITE CKSUM
    HDDs                                      ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        2b377b42-8529-4a87-851e-6731e264ec69  ONLINE       0     0     0
        5703283f-33ed-435e-b481-94389a07fb14  ONLINE       0     0     0
        9a7093d2-039d-4b74-a2d3-bb6703e7dbe5  ONLINE       0     0     0
        260dd39c-6a68-4235-baa0-76f31e8d6cc6  ONLINE       0     0     0
        71f6fb01-7451-4a22-a8d2-b34cfc58f3b5  ONLINE       0     0     0
        5e3f77f3-1418-4e8f-9179-e1cc8e1820a7  ONLINE       0     0     0
        bb88c437-f4f0-42fc-b3d5-568b07bb47a6  FAULTED     35    90     0  too many errors

errors: No known data errors

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:01:05 with 0 errors on Fri Apr 17 03:46:06 2026
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdd3    ONLINE       0     0     0
        sda3    ONLINE       0     0     0

errors: No known data errors

Mind translating these values? Seagate isn’t instantly clear.

*Replace ‘X’ with the relevant drive identifier:
smartctl -a /dev/sdX -v 1,raw48:54 -v 7,raw48:54 -v 195,raw48:54

That’ll translate fields 1, 7, & 195 (Raw read error, seek error, and hardware ecc errors) to a value that can be read.
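
To show what that translation does, here is a small sketch that splits one of these 48-bit raw values by hand. The `raw48:54` format tells smartctl to print only bytes 5 and 4, i.e. the top 16 bits. Reading those top bits as the error count and the low 32 bits as an operation count is the commonly described Seagate layout, which is an assumption here and not something confirmed in this thread; `split_seagate_raw` is a made-up name.

```shell
# Hypothetical helper: split a Seagate-style 48-bit raw value.
# Assumed layout: top 16 bits = error count, low 32 bits = operation count.
split_seagate_raw() {
    raw="$1"
    errors=$(( (raw >> 32) & 0xFFFF ))     # bytes 5 and 4, what raw48:54 prints
    operations=$(( raw & 0xFFFFFFFF ))     # bytes 3 to 0
    echo "errors=$errors operations=$operations"
}

# Raw_Read_Error_Rate from the first SMART report in this thread:
#   split_seagate_raw 66855096    # prints: errors=0 operations=66855096
```

Under that assumption, the large raw number for attribute 1 is just a running count of reads, with zero errors in the upper bits.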

Here is the output from that command.

sudo smartctl -a /dev/sde -v 1,raw48:54 -v 7,raw48:54 -v 195,raw48:54
[sudo] password for truenas_admin:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.33-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Device Model: ST4000VN006-3CW104
Serial Number: ZW63YXYN
LU WWN Device Id: 5 000c50 0eb9a6fb8
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Apr 17 19:12:02 2026 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 447) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 080 064 006 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 098 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 66
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 072 060 045 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1487
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 66
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 058 040 Old_age Always - 31 (Min/Max 31/32)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 122
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 150
194 Temperature_Celsius 0x0022 031 042 000 Old_age Always - 31 (0 22 0 0 0)
195 Hardware_ECC_Recovered 0x001a 080 064 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1472 (235 110 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 7053112416
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 61603577760

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 1333 -

2 Short offline Completed without error 00% 1326 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try ‘smartctl -x’ for more

That is the odd thing. SMART shows no errors, but TrueNAS is giving IO errors.

Yeah looking clean there too - uhh, reseat cabling? Run memtest?

I reseated the cabling last time it happened. I thought that fixed it.

I just rebooted the server and got lots of IO errors from the drive on boot, and it was not detected in TrueNAS. I reseated the cables again and it reappeared. While it was giving IO errors, it was making a ping-ish noise, which was very odd.

Hmm - does swapping locations with a working drive make any difference?

That I have not tried. I will try it after the long SMART test and scrub end.

I swapped it with a known good drive, and the error moved to the other drive. It looks like any drive plugged into the second half of my HBA gets the error. I wonder if it is the cable from the HBA.

Does anyone know the spec of the cable I need to order for this HBA?

It could be the cable or the HBA itself.

Quick solution until you figure the HBA out, use a Motherboard SATA port if you have one free.

EDIT: And you can always swap the data cables, but make sure you connect the drives to the same port. In other words, remove the suspect cable, move the known good cable to the possibly bad port, and connect up that last drive. This will allow you to validate whether the HBA port or the cable is bad. Why buy a cable if it will not fix the problem?
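
The swap logic can be written down as a tiny decision helper. This is only a sketch of the reasoning in this thread, with a made-up function name and yes/no inputs that you fill in from your own tests.

```shell
# Hypothetical decision helper. Both arguments are "yes" or "no":
#   $1  after swapping two drives between ports, did the errors stay on the same port?
#   $2  with the known good cable fitted to that port, do the errors continue?
suspect_component() {
    stays_on_port="$1"
    persists_with_good_cable="$2"
    if [ "$stays_on_port" != "yes" ]; then
        echo "drive"       # errors followed the drive to its new port
    elif [ "$persists_with_good_cable" = "yes" ]; then
        echo "HBA port"    # good cable on the same port still fails
    else
        echo "cable"       # good cable on the same port cured it
    fi
}

# Example: errors stayed on the port and stopped with the good cable fitted
#   suspect_component yes no    # prints: cable
```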

BUT: Do not place all your drives on that suspect bad port, not unless you have a backup of your data. No need to see most of your drives with data errors.

Thinking out loud, I would have expected UDMA CRC Errors if the cable was bad. I’m thinking bad HBA, but it would be less expensive for you if it were the cable.
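
A quick way to watch that counter is to pull attribute 199 out of the SMART attribute table. The sketch below reads `smartctl -A` output on stdin and prints the raw value; `crc_error_count` is a hypothetical name, and it assumes the ten-column attribute layout shown in the SMART reports earlier in this thread.

```shell
# Hypothetical helper: print the raw UDMA_CRC_Error_Count (attribute 199)
# from `smartctl -A` output on stdin. The raw value is the 10th column.
crc_error_count() {
    awk '$1 == 199 && $2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Usage on the NAS (replace sdX with the drive in question):
#   sudo smartctl -A /dev/sdX | crc_error_count
```

A value that climbs between checks would point toward the link between drive and controller, while a steady 0, as in both reports here, is why the cable was not the first suspect.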

Which HBA? 99% chance “mini sas to sata” is what you want.

I’m glad it is at least not the drive, a ‘new’ used HBA or cable is likely cheaper to replace in today’s market :frowning:

You called it. Got one on order now. The (bad) cable is currently plugged into my boot mirror instead of the data drive, which has stopped the errors so far. Very possibly, since not much changes on the boot drive, there is too little data moving to trigger the error.

The new cable is on the way though. If that does not fix it, I think I’ll get another HBA and replace it.