Need Help Evaluating Warning Message

I just switched to Goldeye from Fangtooth, and now, two days later, I received this message in the notification tray.

Warning: 3 uncorrectable errors reported for sdp

There is, of course, no longer a SMART GUI to investigate further, as GUIs are silly. That’s why I chose a NAS OS with a GUI-centric configuration method.
The glorious new “Disk Health” box proudly informs me that there are no temperature related issues and the “Disk Reports” screen helpfully notes that there is no data currently being written to the pool.
The “Storage Health” box reports “Online, No errors”

The drive in question is a 6 TB 7200 RPM Seagate Ironwolf HDD.

It is circa late 2017 / early 2018 so it is certainly an aging drive though it has not been in use continuously since then. By my interpretation of the readout it has about 4.64 years of uptime.
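For reference, the 4.64-year figure appears to match the Head_Flying_Hours raw value in the SMART output further down (which attribute was read is an assumption). A quick conversion sketch, assuming 8,766 hours per average year:

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours in an average year (assumption)

def hours_to_years(hours):
    """Convert a raw hour counter to years, rounded to two decimals."""
    return round(hours / HOURS_PER_YEAR, 2)

print(hours_to_years(40675))  # hours part of the Head_Flying_Hours raw value
print(hours_to_years(9539))   # Power_On_Hours raw value
```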
It is part of a RAIDZ1 array of five 6 TB IronWolf drives from various batches. This pool primarily holds very large (200 GB+) archives that are only accessed occasionally.

I pulled the SMART data from the shell.

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate     0x000f   100   064   044    Pre-fail  Always       -       682221
3 Spin_Up_Time            0x0003   085   084   000    Pre-fail  Always       -       0
4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       67
5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       16
7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1315304036
9 Power_On_Hours          0x0032   090   036   000    Old_age   Always       -       9539 (70 154 0)
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       23
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   001   000    Old_age   Always       -       8590131878
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   055   040    Old_age   Always       -       35 (Min/Max 33/37)
191 G-Sense_Error_Rate      0x0032   093   093   000    Old_age   Always       -       15325
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       21
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       282703
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   001   000    Old_age   Always       -       682221
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       40675h+06m+52.961s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       84733007586
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       319741340236

I’m guessing that the warning is referring to attribute 187 (Reported_Uncorrect).
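To pull individual attributes out of a pasted readout like the one above, a minimal parsing sketch could look like the following. It assumes the ten-column layout shown in the output (ID, name, flag, value, worst, thresh, type, updated, when-failed, raw); the function name parse_attributes is made up for illustration.

```python
def parse_attributes(text):
    """Return {id: (name, raw_value_string)} from smartctl attribute-table lines."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value such as
        # "35 (Min/Max 33/37)" stays together in the last field.
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit():
            attrs[int(parts[0])] = (parts[1], parts[9].strip())
    return attrs

# A few rows copied from the readout above.
sample = """\
5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       16
187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

attrs = parse_attributes(sample)
for attr_id in (5, 187, 197, 198):
    name, raw = attrs[attr_id]
    print(attr_id, name, raw)
```

The header row is skipped because its first field ("ID#") is not a number.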

I’m ordering a new drive to throw in as a hot spare but I don’t know enough to determine just how severe of an issue this is.
Should I put this array in a cold shutdown until I can definitively resilver, or is this a premature warning that I should be able to live with for a short while as long as I am preparing?
I was already developing a plan to fully replace the array but really can’t afford to fully implement it immediately.

I was trying to get the SMART results into collapsible Quote/Code/Spoiler tags but seem to have failed.

Looking at

5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       16
...
187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3

That means 16 bad sectors, of which 3 are recent. This reads as “a small number of bad blocks, but growing”.

Option 1. observe and wait; if raw values increase, replace at some point.

Option 2. replace right away.

With pre-AI HDD prices, I personally would replace as soon as the replacement is available (two days normally). With current prices, I would be more inclined to observe and wait assuming all other drives in the array are perfect.

Thanks for the explanation.

As far as pricing, that was basically what was running through my head.

As far as the other drives, I just got Scrutiny running and it is flagging attribute 188 Command_Timeout on all of them at nearly the same level. It reports it in the 650-690 range for all of them, though.

I’ve started looking into it, but it’s not really clear how much this value matters or whether it has any actionable meaning for Seagate drives.

The value itself does not matter; an increase does. For 188, if it grows, it could be a drive, controller, power, or cable issue. If it is about the same on all of them, I would guess there was (or is, if it is still growing) some issue with the controller.
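Since the advice is to watch for increases rather than absolute values, a small sketch of comparing two readings might help. The baseline numbers are the raw values from the output posted earlier; the later reading is entirely hypothetical, and the choice of watched attributes is an assumption based on the ones discussed in this thread.

```python
# Attributes discussed in this thread (the selection is an assumption).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    188: "Command_Timeout",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def increased(previous, current):
    """Return {attribute_id: (old, new)} for watched attributes whose raw value grew."""
    changes = {}
    for attr_id in WATCHED:
        old = previous.get(attr_id)
        new = current.get(attr_id)
        if old is not None and new is not None and new > old:
            changes[attr_id] = (old, new)
    return changes

# Raw values from the posted smartctl output.
baseline = {5: 16, 187: 3, 188: 8590131878, 197: 0, 198: 0}
# Hypothetical later reading, made up purely for illustration.
later = {5: 16, 187: 5, 188: 8590131878, 197: 2, 198: 0}

for attr_id, (old, new) in increased(baseline, later).items():
    print(f"{attr_id} {WATCHED[attr_id]}: {old} -> {new}")
```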

Is this a Seagate drive? If it is, some additional parameters are needed to make the values human-readable.
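As a rough illustration of why the Seagate raw numbers look so large: a commonly cited community interpretation (an assumption here, not something Seagate documents in this thread) is that these raw values pack several fields into one 48-bit number. The sketch below splits Command_Timeout into three 16-bit words, and splits attributes 1 and 7 into an assumed 16-bit error count plus a 32-bit operation count. Both function names are made up for illustration.

```python
def split_raw16(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high to low."""
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)

def split_errors_ops(raw):
    """Assumed Seagate layout for attributes 1 and 7:
    high 16 bits = error count, low 32 bits = operation count."""
    return (raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF

print(split_raw16(8590131878))       # Command_Timeout raw value
print(split_errors_ops(682221))      # Raw_Read_Error_Rate raw value
print(split_errors_ops(1315304036))  # Seek_Error_Rate raw value
```

Under these assumptions the low word of Command_Timeout comes out as 678, and the upper 16 bits of attributes 1 and 7 are both zero, which would mean no read or seek errors were counted despite the large raw numbers.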

Sorry for the slow response. It is a Seagate drive. Other drives (different pool) on the same backplane are not showing errors.

None of the attributes in question have changed again. I think the command errors may date from when the drives were previously used in a 5-bay USB enclosure.

From what I can tell, Scrutiny and Goldeye have a lower threshold for user alerts than SMART and Fangtooth, which is why they are suddenly alerting now.

Side note: is there a way to address drives in a shell SMART command other than /dev/sdX?
With TrueNAS shuffling the IDs on every boot, I have to manually rematch the IDs to the drives at issue every time.

Easiest is to just bring up the manual page for the command you are running and look at the options.

man smartctl

To my knowledge you’re stuck with the ever-changing sda/b/etc., and there is no way to address a drive in smartctl by UID, serial number, or anything else. The only hope is to run smartctl -a /dev/whatever and check the serial number at the top to make sure it is the right drive.

Actually, you can run smartctl -a /dev/disk/by-id/<udev persistent disk ID>. The sd[a-z] device names aren’t guaranteed to be stable boot-over-boot, but the udev persistent IDs are. On my system, the persistent names have the format <bus>-<disk model>_<disk serial number>.
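To find which persistent name currently points at which sdX device, the symlinks under /dev/disk/by-id can be resolved. A minimal sketch, assuming a Linux system with udev; the function name persistent_ids is made up for illustration:

```python
import os

def persistent_ids(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device path (e.g. /dev/sdp) to its persistent by-id names."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue
        # Follow the symlink to the kernel device node it currently points at.
        target = os.path.realpath(link)
        mapping.setdefault(target, []).append(name)
    return mapping

if __name__ == "__main__" and os.path.isdir("/dev/disk/by-id"):
    for device, names in sorted(persistent_ids().items()):
        print(device, "->", ", ".join(names))
```

Any of the listed names can then be passed to smartctl as /dev/disk/by-id/<name> in place of the sdX path.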


I knew that if I admitted I didn’t know something that someone smarter would come & correct me. Thank you for teaching me something new!
