Critical sector errors for one drive

Nikotine · April 22, 2024, 4:57pm

Hi,
I recently started using a new NAS, upgrading from my old NAS4Free box.
I started fresh with four new drives.
It took me a while to set up everything the way I wanted in TrueNAS and to transfer all the data to the new box, which took 4 days using rsync…

I forgot to burn in the new drives…
They are Seagate ST12000VN0008 12TB drives.

Now all of a sudden I am getting these errors for one of them:

I started a full SMART test, but it failed:

I’m currently running it again, which takes some time.
But I suspect something is seriously wrong.

I’d like to do the burn-in tests described here: Hard Drive Burn-in Testing | TrueNAS Community
Considering how long it took me to transfer all my data and set up shares and permissions… Is there a way to recover at least that setup work?
I have my four drives in a 2x mirror setup, so could I offline one drive, do the burn-in tests, and then resilver it, repeating this for each drive?
Worst case I still have my old NAS, and the data didn’t change since I transferred it.

Nikotine · April 22, 2024, 6:24pm

Full report, the second test failed again:

admin@nas[~]$ sudo smartctl -a /dev/sdc
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST12000VN0008-2YS101
Serial Number:    <redacted>
LU WWN Device Id: <redacted>
Firmware Version: SC60
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 22 20:18:12 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1039) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       212077992
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   045    Pre-fail  Always       -       8400691
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       165
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   054   000    Old_age   Always       -       44 (Min/Max 39/46)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       294
194 Temperature_Celsius     0x0022   044   046   000    Old_age   Always       -       44 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       157h+23m+34.230s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4218995240
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       144161467

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       165         -
# 2  Extended offline    Completed: read failure       90%       162         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Davvo · April 22, 2024, 6:34pm

RMA said drive, period. And cool the others down.

chuck32 · April 22, 2024, 6:36pm

I’m not the most experienced in assessing SMART data, but to me it looks like the drive is indeed faulty.
You could try switching cables and check if everything is seated properly, but if that fails I’d RMA the drive.

Offlining and burning in one drive at a time is possible, but with 12 TB drives it will take a very long time.
You’ll have no redundancy when you burn in the other drives and losing another disk will kill the whole pool.
I’d advise you get your other NAS up to speed on the data right now.
Do you have another source of backup?

Then you can check the cabling, if the drive still throws errors RMA it. Then burn in all drives simultaneously and recreate the pool from scratch.

Davvo · April 22, 2024, 6:37pm

In my experience you don’t get pending sectors and test failures from bad cables… not without CRC errors at the very least.
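To make that rule of thumb concrete, here is a minimal sketch that reads `smartctl -A` output on stdin and separates cable-type errors (attribute 199) from media-type errors (attributes 197 and 198). The function name `classify_smart` is made up for illustration, and the logic only encodes the heuristic discussed in this thread; it is not an official diagnostic.

```shell
# Classify a drive from `smartctl -A` output read on stdin.
# Heuristic only: CRC errors point at cabling, pending or
# uncorrectable sectors point at the drive itself.
classify_smart() {
    awk '
        BEGIN { pending = 0; uncorr = 0; crc = 0 }
        $1 == 197 { pending = $10 + 0 }   # Current_Pending_Sector raw value
        $1 == 198 { uncorr  = $10 + 0 }   # Offline_Uncorrectable raw value
        $1 == 199 { crc     = $10 + 0 }   # UDMA_CRC_Error_Count raw value
        END {
            if (crc > 0)
                print "check cabling: CRC errors = " crc
            if (pending > 0 || uncorr > 0)
                print "suspect drive: pending = " pending ", uncorrectable = " uncorr
            if (crc == 0 && pending == 0 && uncorr == 0)
                print "no cable or media errors reported"
        }
    '
}
```

It could be fed with something like `sudo smartctl -A /dev/sdc | classify_smart`. For the report above (2 pending, 2 uncorrectable, 0 CRC errors) it would flag the drive and not the cable.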

Nikotine · April 22, 2024, 6:37pm

Yes, I found this page: Hard Drive Troubleshooting Guide (All Versions of FreeNAS) | TrueNAS Community and it’s clear this drive is problematic.

But I want to fully test the three other drives so I only have to RMA once.
So any chance of doing the burn-in tests without redoing all the setup?

Davvo · April 22, 2024, 6:37pm

jgreco’s solnet-array-test | TrueNAS Community can burn-in without data loss.

chuck32 · April 22, 2024, 6:45pm

CRC errors are usually my go-to indicator for bad cables; I wasn’t sure whether OP’s SMART data ruled out cabling for certain.
I can’t really remember, but I think I also got pending sectors with a bad cable on one of my SSDs.

You still need to act now though; that is how I interpreted @Davvo’s remark about cooling down the other drives. Two mirrored vdevs with one lost drive are not safe at this point, especially with untested drives.

Get your backups in order ASAP.

You can check the three remaining drives and start the RMA process in the meantime. Waiting for the other drives to be burned in won’t speed anything up. If they all come back clean, you’re already a few days ahead because you started the RMA process on the known faulty one.

Davvo · April 22, 2024, 6:51pm

I meant literally cooling the drives since I do not like seeing them at 44°C.
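For keeping an eye on this, a small sketch that pulls the current temperature out of `smartctl -A` output (attribute 194) and compares it to a limit. The name `check_temp` and the 40 °C default are assumptions picked for illustration, not a Seagate specification; pass your own limit as the first argument.

```shell
# Report the drive temperature from `smartctl -A` output read on stdin.
# $1 is the warning limit in degrees Celsius; the default of 40 is an
# illustrative assumption, not a vendor figure.
check_temp() {
    limit="${1:-40}"
    awk -v limit="$limit" '
        $1 == 194 {                       # Temperature_Celsius
            temp = $10 + 0                # raw value, e.g. "44"
            if (temp > limit + 0)
                print "warm: " temp " C (limit " limit " C)"
            else
                print "ok: " temp " C"
        }
    '
}
```

Usage would look like `sudo smartctl -A /dev/sdc | check_temp 40`.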

As far as I understand, @Nikotine’s data is safely stored on his other NAS as well; if that’s not so, prioritize backing up your data.

ericloewe · April 22, 2024, 7:31pm

You don’t get pending sectors or self-test failures from bad cables at all, unless there’s a firmware bug in the disk. They’re both strictly internal things.

Nikotine · April 22, 2024, 8:19pm

Yeah, the temperatures are a bit too high, I agree. The NAS is a TerraMaster F4-424 Pro, and I just confirmed that the fan is spinning. I need to check whether I can increase the speed in the BIOS.

I still have my original NAS, and the only things added since the migration are some snapshots from my Proxmox server.

I’ll start the RMA process.

Nikotine · April 22, 2024, 8:27pm

That looks like a very handy script, although read-only?
Will it test the full disk this way?
Just for my understanding, I need to let this script run for a few passes, then do the SMART test again to check for errors?

Davvo · April 23, 2024, 4:03pm

Similarly to memtest, run the script for as long as you want, but for at least a single pass. Then run a long SMART test at the end.

This is usually enough for 90% of the drives. If you want to be more zealous, run badblocks.
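As a rough sketch of the "long test at the end" step: start the test with `sudo smartctl -t long /dev/sdX` (with `/dev/sdX` as a placeholder for the drive), wait for the recommended polling time (1039 minutes for the drive in the report above), then check the self-test log. The helper below, with the made-up name `count_failed_selftests`, counts failed entries in `smartctl -l selftest` output read on stdin; it assumes the log format shown in this thread.

```shell
# Count failed entries in `smartctl -l selftest` output read on stdin.
# Matches log rows such as:
#   # 1  Extended offline    Completed: read failure       90%   165   -
# Rows reading "Completed without error" are not counted.
count_failed_selftests() {
    awk '
        /^# *[0-9]+ / && /Completed: .*failure/ { n++ }
        END { print n + 0 }
    '
}
```

For example, `sudo smartctl -l selftest /dev/sdX | count_failed_selftests` should print 0 for a drive whose tests all passed. On the subject of badblocks, note that its write-mode test (`-w`) destroys the data on the disk, so it is only suitable for a drive that is not part of a live pool.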