S.M.A.R.T. error - Time to RMA the disk? - Help a noob

Hi all,

Got the message: "Device: /dev/sdd [SAT], 1 Offline uncorrectable sectors."
and the disk failed a S.M.A.R.T. test. Time to replace it, or do another LONG test and pray?

Thanks for your advice.

Here’s the full log:

```
=== START OF INFORMATION SECTION ===
Device Model:     ST22000NT001-3LS101
Serial Number:    ZX22425P
LU WWN Device Id: 5 000c50 0e80d9fef
Firmware Version: EN01
User Capacity:    22,000,969,973,760 bytes [22.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Apr  1 22:49:11 2025 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (  559) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1876) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       234197206
  3 Spin_Up_Time            0x0003   090   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       125545979
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7646
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       47
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   053   000    Old_age   Always       -       38 (Min/Max 22/40)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   086   086   000    Old_age   Always       -       29591
194 Temperature_Celsius     0x0022   038   047   000    Old_age   Always       -       38 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       5681 (143 251 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       49374710135
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       79941489706

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%      7646         -
# 2  Extended offline    Completed: read failure       90%      7640         697274656
# 3  Short offline       Completed without error       00%      7603         -
# 4  Short offline       Completed without error       00%      7435         -
# 5  Short offline       Completed without error       00%      7267         -
# 6  Short offline       Completed without error       00%      7099         -
# 7  Short offline       Completed without error       00%      6932         -
# 8  Short offline       Completed without error       00%      6764         -
# 9  Short offline       Completed without error       00%      6596         -
#10  Short offline       Completed without error       00%      6428         -
#11  Short offline       Completed without error       00%      6260         -
#12  Short offline       Completed without error       00%      6092         -
#13  Short offline       Completed without error       00%      5924         -
#14  Short offline       Completed without error       00%      5756         -
#15  Short offline       Completed without error       00%      5588         -
#16  Short offline       Completed without error       00%      5420         -
#17  Short offline       Completed without error       00%      5252         -
#18  Short offline       Completed without error       00%      5084         -
#19  Short offline       Completed without error       00%      4916         -
#20  Short offline       Completed without error       00%      4748         -
#21  Short offline       Completed without error       00%      4580         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

EDIT By JoeScmuck to fix formatting.

Realised it looks horrible; here it is in screenshots:

[Screenshots not preserved]

It only has 1 pending sector. If it’s still under warranty, yes, you can replace it; they will likely accept it with pending sectors. If it’s not under warranty, I likely would not for 1 bad sector, as ZFS would have corrected it. I would, however, make sure you have ongoing SMART tests and scrubs. Myself, I’d just monitor it for further issues; I can live with a few pending sectors without worrying much about it. That assumes you have redundancy, though, like raidz(1-3) or a mirror?
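For the kind of monitoring suggested here, a minimal Python sketch that pulls the sector-health counters out of the attribute table printed by `smartctl -A /dev/sdX` (the device path is a placeholder). The function names `sector_counters` and `worth_a_closer_look`, and the choice to watch only attributes 5, 197 and 198, are assumptions for illustration; the parser simply reads the leading integer of the RAW_VALUE column.

```python
import re

# Attributes relevant to this thread (IDs and names as printed by smartctl).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def sector_counters(smartctl_text: str) -> dict[int, int]:
    """Map each watched attribute ID to the leading integer of its RAW_VALUE."""
    counts: dict[int, int] = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            match = re.match(r"\d+", fields[9])
            if attr_id in WATCHED and match:
                counts[attr_id] = int(match.group())
    return counts


def worth_a_closer_look(counts: dict[int, int]) -> bool:
    """True if any watched counter is non-zero (a deliberately simple, assumed rule)."""
    return any(value > 0 for value in counts.values())
```

Fed the attribute table from the original post, this would report 0 reallocated, 1 pending and 1 offline-uncorrectable sector. Tracking those numbers over time is what tells you whether the drive is getting worse.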

If it’s running 24/7 with no spindown, 7,646 power-on hours is less than a year, so it should still be under warranty. However, I also note you have a load cycle count of 29,591, which is quite a lot for a year. My Seagate, with about the same number of hours, has a count of 2,597. Here is a reference thread in case it’s the same issue:

Likely the count isn’t a problem at all for those types of drives, but you can check the rated load/unload cycle limit for that drive with Seagate.

The best solution is to use the </> button, or to paste the text between two lines of three backquotes.

```
text
```

But the drive is due for RMA. One sector is not much YET, but failing a SMART test is terminal.
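The attribute counters alone would not show the failed test; that lives in the self-test log (for example `smartctl -l selftest /dev/sdX`). A small Python sketch that scans the log layout shown in the original post for failed runs; the regex and the function name `failed_self_tests` are hypothetical, and treating any status containing "failure" as a failed test is an assumption that ignores aborted or interrupted runs.

```python
import re

# One row of the self-test log, e.g.
# "# 2  Extended offline    Completed: read failure       90%      7640         697274656"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>\S+ \S+)\s+(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(smartctl_text: str) -> list[tuple[int, str, str, int, str]]:
    """Return (num, description, status, lifetime_hours, first_error_lba) per failed test."""
    failures = []
    for line in smartctl_text.splitlines():
        m = ROW.match(line.strip())
        # Assumed rule: a status mentioning "failure" means the test failed.
        if m and "failure" in m["status"]:
            failures.append(
                (int(m["num"]), m["desc"], m["status"], int(m["hours"]), m["lba"])
            )
    return failures
```

On the posted log this flags only test #2, the extended test that stopped at LBA 697274656 with 90% remaining at 7,640 hours.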

Thanks,
I think I’ll monitor it while I figure out whether I can RMA it, and will certainly look into the cycle count issue. Not sure if being set up as a Time Machine target and a (low usage) Plex server can do that to the cycle count. I’ve compared it to the other disk in the mirror and it shows the same.

Thanks for the tip!

Looking into the RMA; it should be under warranty.

It’s not really an issue unless the drive’s cycle count limit is not very high. If it’s half a million, it doesn’t matter. If it’s 100k, you are almost a third of the way there already. No, Time Machine or Plex will not do that; it’s likely a drive power-saving feature. But it may not matter: check the limit, then decide.
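The "almost a third" figure is plain arithmetic on attribute 193 (Load_Cycle_Count) and attribute 9 (Power_On_Hours). A Python sketch, assuming cycles keep accruing at the average rate seen so far; the 100k and 500k limits are only the hypothetical figures from this post, not Seagate's rating for this model, which has to come from the datasheet. `load_cycle_budget` is a hypothetical helper name.

```python
def load_cycle_budget(cycles: int, power_on_hours: int, rated_limit: int) -> dict[str, float]:
    """Rough load/unload budget, assuming the average rate so far continues."""
    rate_per_hour = cycles / power_on_hours
    remaining = max(rated_limit - cycles, 0)
    return {
        "fraction_used": cycles / rated_limit,
        "cycles_per_hour": rate_per_hour,
        "hours_left": remaining / rate_per_hour if rate_per_hour else float("inf"),
    }


if __name__ == "__main__":
    # Values from the posted SMART data: attribute 193 and attribute 9.
    for limit in (100_000, 500_000):  # hypothetical limits from the discussion
        b = load_cycle_budget(29_591, 7_646, limit)
        print(
            f"limit {limit:>7,}: {b['fraction_used']:.0%} used, "
            f"{b['cycles_per_hour']:.2f}/h, ~{b['hours_left'] / 8766:.1f} years left"
        )
```

At about 3.9 cycles per hour, a 100k limit would be roughly 30% used with around two years left at this pace, while a 500k limit leaves well over a decade.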

If it’s under warranty, definitely RMA it. There is no harm and if the drive gets worse, better now than later.

Thanks!

Also, as an add-on question for everyone…

My pool is a 2-disk mirror. Can I take the disk offline and out of the system (to RMA it) and continue using the remaining disk normally while I wait for my replacement?

Yes, you can. For the sake of redundancy, though, see if you can do an advance exchange. Get the new drive, burn it in, resilver it into the pool, then send back the old one.
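One way to "resilver it into the pool" without ever dropping to a single disk, a technique this thread does not spell out, is to attach the new disk to the existing mirror (temporarily making it three-way) and detach the failing disk only after the resilver finishes. The Python sketch below just builds that command sequence and executes nothing; the pool name `tank` and the devices `sdc`/`sde` are hypothetical placeholders, and the burn-in step happens beforehand and is not shown. On TrueNAS these steps are normally done from the web UI, which refers to partitions rather than raw names like `sdd`; `zpool replace` on the failing disk is an alternative that also keeps it in the vdev until the resilver completes.

```python
import shlex


def advance_exchange_plan(pool: str, healthy: str, failing: str, new: str) -> list[list[str]]:
    """Command sequence for swapping a mirror member while keeping redundancy.

    All names are placeholders; nothing is executed here.
    """
    return [
        # Attach the burned-in disk next to a healthy member: the mirror becomes
        # three-way and ZFS resilvers onto the new disk.
        ["zpool", "attach", pool, healthy, new],
        # Re-run until the resilver has finished before removing anything.
        ["zpool", "status", pool],
        # Only then drop the failing disk, returning to a two-way mirror.
        ["zpool", "detach", pool, failing],
    ]


if __name__ == "__main__":
    for cmd in advance_exchange_plan("tank", "sdc", "sdd", "sde"):
        print(shlex.join(cmd))
```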

The system will run fine. As Dan mentions, the problem is you have no redundancy in the meantime, so any parity or other error can mean data loss. This is why I keep a drive on hand for my system, so it can be replaced as soon as possible; ideally a hot spare, if you have ports for it. As he says, see if you can do an advance replacement for this reason. And if you can afford it, keep a spare drive on hand.

I’m looking at at least a month, given where Seagate wants me to send the disk. I’ll try to get another one, though, even if I have to buy it outright.

Yeah, which is why I keep a spare on hand. In my case, it is a hot spare. I will not go a month without any sort of redundancy. But not everyone can swing it.

If you decide to buy one, you’ll have that spare when they (eventually) send the replacement.

Mostly for price-to-capacity convenience, I have a mirror of 2 large disks. Since I might end up with 3 by buying one ASAP, would you suggest a three-way mirror or a raidz1?

Or keep my 2-disk mirror and a spare, as you suggest… I do have the bays to put it in, though.

With the little info I have, my choice is a mirror with a spare. If you convert to a raidz1 by moving your data off and back, you would still want a spare, and therefore a 4th drive.

Think of the situation you are in now (the same would apply with a raidz1): if the remaining drive fails some time in the next month, you lose everything. Waiting a month carries some risk of that.

No matter how you use the drives (mirror, which raidz level, etc.), you still need backups. So hopefully you are backing up your data, unless you simply don’t care if you lose it.

Mirror + cold spare is the most natural choice.
With several vdevs you could consider a hot spare; with a single vdev there’s no point in a hot spare over a 3-way mirror.
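The trade-off between the options discussed can be tallied in a few lines. A rough Python sketch assuming three identical 22 TB disks (the size in the posted log) and ignoring ZFS metadata, padding and TB/TiB differences; the `Layout` class and its fields are hypothetical. "Failures tolerated" counts failures at any one time, so the cold-spare option only returns to full redundancy once the spare is installed and resilvered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    disks_in_pool: int
    data_disks: int          # disks' worth of raw capacity left for data
    failures_tolerated: int  # simultaneous disk failures survivable


# The three options discussed for three 22 TB drives.
LAYOUTS = [
    Layout("2-way mirror + cold spare", 2, 1, 1),
    Layout("3-way mirror", 3, 1, 2),
    Layout("3-disk raidz1", 3, 2, 1),
]


def usable_tb(layout: Layout, disk_tb: float) -> float:
    """Raw usable capacity, ignoring all ZFS overhead."""
    return layout.data_disks * disk_tb


if __name__ == "__main__":
    for layout in LAYOUTS:
        print(
            f"{layout.name:<26} {usable_tb(layout, 22.0):>5.1f} TB usable, "
            f"survives {layout.failures_tolerated} failure(s)"
        )
```

The raidz1 doubles usable space but tolerates no more failures than the mirror, which is why it would still call for a spare.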

Thanks everyone.

In conclusion, I ordered another disk and am RMAing the faulty one today; I hope I only have to go without redundancy for a few days.

If you’ll indulge me, here’s a last question for you all. It’s easy enough to take the disk offline and extract it for replacement. Is there any way to also wipe it straight from TrueNAS, or should I move it to another system and do it there?

I want to wipe it before sending it in to RMA.

Thanks again!

You could wipe it on the TrueNAS machine; however, I would recommend putting the drive in a different machine to wipe it. Without redundancy, why risk screwing it up?
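On the other machine, the simplest wipe is a single pass of zeros over the whole device, similar in effect to `dd if=/dev/zero of=/dev/sdX`. A minimal Python sketch, assuming Linux, root access, and that the device path you pass (a placeholder here) really is the RMA disk with nothing on it mounted. It is destructive, and a single overwrite does not reach sectors the drive has already remapped, so it is not a guaranteed secure erase. `zero_fill` is a hypothetical helper; the demo below runs it on a temporary file, not a disk.

```python
import os
import tempfile

CHUNK = 4 * 1024 * 1024  # 4 MiB per write; an arbitrary choice


def zero_fill(path: str, chunk_size: int = CHUNK) -> int:
    """Overwrite every byte of `path` with zeros and return the bytes written.

    DESTRUCTIVE: pointed at a disk, this erases it.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        # Seeking to the end yields the size for regular files and Linux block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        os.lseek(fd, 0, os.SEEK_SET)
        zeros = memoryview(bytes(chunk_size))
        written = 0
        while written < size:
            # os.write may write less than asked; the loop simply continues.
            written += os.write(fd, zeros[: min(chunk_size, size - written)])
        os.fsync(fd)  # make sure the zeros actually reach the device
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Safe demonstration on a temporary file instead of a real disk.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(10_000))
    print(zero_fill(tmp.name, chunk_size=4096), "bytes zeroed in", tmp.name)
    os.remove(tmp.name)
```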

Great! Exactly my thought; wiping it and sending it out today.
Thanks