Drives faulting quite often - TrueNAS Scale

czar.united · April 25, 2024, 10:14am

Dear,

I built a TrueNAS Scale system in mid-May 2023 with four 4 TB drives in a single RAIDZ2 vdev.

TrueNAS runs under Proxmox VE. I recently started passing the 4 drives through with an LSI HBA in IT mode; previously they were passed through from the SATA ports.

I use Seagate IronWolf ST4000VN006 drives which, if I understand correctly, are CMR disks.
However, within a year I have replaced 2 drives because TrueNAS reported them as faulted or degraded.

The first was in June 2023 (only a month after I built the system), and the second was in March 2024. Now a 3rd disk is in a faulted state.

I am curious: is it normal to have 3 drive failures in a year?

What should I do with the drive that is currently faulted? How can I check whether it really needs to be replaced? Previously, when I was in this situation, I contacted Seagate immediately for a replacement disk without checking anything; fortunately the drives were still under warranty.

What should I consider to keep drives from failing this often?

I only use TrueNAS to store my data, with SMB shares for my devices and NFS for backing up my VMs in Proxmox.

czar.united · April 25, 2024, 11:30am

The faulted disk is sdb, as shown in the screenshot below.

[Screenshot: Capture]

I ran sudo smartctl -a /dev/sdb; the result is below.


smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN006-3CW104
Serial Number:    ZW60SM3P
LU WWN Device Id: 5 000c50 0e6eb77c2
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 25 18:10:02 2024 WIB
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 464) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   006    Pre-fail  Always       -       77000152
  3 Spin_Up_Time            0x0003   096   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   045    Pre-fail  Always       -       17778721
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       913
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       20
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   060   040    Old_age   Always       -       38 (Min/Max 37/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       96
194 Temperature_Celsius     0x0022   038   040   000    Old_age   Always       -       38 (0 29 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   064   000    Old_age   Always       -       77000152
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       884 (95 22 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       5115255193
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2300006586

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       871         -
# 2  Short offline       Completed without error       00%       704         -
# 3  Short offline       Completed without error       00%       536         -
# 4  Short offline       Completed without error       00%       368         -
# 5  Short offline       Completed without error       00%       200         -
# 6  Short offline       Completed without error       00%        35         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I am now running smartctl -t long /dev/sdb and will post the output of smartctl -a once it finishes. It is expected to take 7 to 8 hours.
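
The 7 to 8 hour estimate follows from the "Extended self-test routine recommended polling time" of 464 minutes in the output above. A minimal sketch of that arithmetic (the helper name is only for illustration):

```python
def polling_minutes_to_hours(minutes: int) -> float:
    """Convert smartctl's recommended polling time (minutes) to hours."""
    return minutes / 60


# 464 is the extended self-test polling time reported for this drive.
extended_hours = polling_minutes_to_hours(464)
print(f"Extended self-test: roughly {extended_hours:.1f} hours")
```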

Kindly help me understand the situation.

Many thanks

czar.united · April 25, 2024, 11:43pm

Hello members,

Can anybody help me please?
Below is the SMART long test result for the faulted drive.

I am not very familiar with how to read this, but the “Completed without error” status and the 0 values for “Reallocated_Sector_Ct” and “Current_Pending_Sector” look good, don’t they?

sudo smartctl -a /dev/sdb
[sudo] password for admin: 
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN006-3CW104
Serial Number:    ZW60SM3P
LU WWN Device Id: 5 000c50 0e6eb77c2
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Apr 26 06:35:21 2024 WIB
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 464) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   006    Pre-fail  Always       -       77009784
  3 Spin_Up_Time            0x0003   096   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   045    Pre-fail  Always       -       20027650
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       926
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       21
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   060   040    Old_age   Always       -       37 (Min/Max 36/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       107
194 Temperature_Celsius     0x0022   037   040   000    Old_age   Always       -       37 (0 29 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   064   000    Old_age   Always       -       77009784
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       894 (216 218 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       5115255193
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2300016218

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       922         -
# 2  Extended offline    Interrupted (host reset)      00%       914         -
# 3  Short offline       Completed without error       00%       871         -
# 4  Short offline       Completed without error       00%       704         -
# 5  Short offline       Completed without error       00%       536         -
# 6  Short offline       Completed without error       00%       368         -
# 7  Short offline       Completed without error       00%       200         -
# 8  Short offline       Completed without error       00%        35         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
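
The large raw values for Raw_Read_Error_Rate and Seek_Error_Rate are not necessarily error counts. A commonly cited interpretation for Seagate drives (an assumption here, not something the smartctl output states) is that the 48-bit raw value packs an error count into the upper 16 bits and an operation count into the lower 32 bits. A minimal sketch under that assumption, with a hypothetical helper name:

```python
def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (errors, operations).

    Assumes the commonly cited Seagate layout: the upper 16 bits hold the
    error count and the lower 32 bits hold the number of operations.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


# Raw values taken from the smartctl output above.
for name, raw in [("Raw_Read_Error_Rate", 77009784),
                  ("Seek_Error_Rate", 20027650)]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors}, operations={operations}")
```

If that layout applies to this drive, both values decode to zero errors, since each is smaller than 2^32.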

ericloewe · April 25, 2024, 11:55pm

Not unheard of, but it is pretty nasty.

The SMART data for that disk looks okay. What does zpool status say, exactly?
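
A rough way to see why the SMART data looks okay: the attributes most often watched for media or cabling trouble all have a raw value of 0. The sketch below checks that on rows copied from the output above; the choice of watched attributes and the function name are assumptions for illustration, not an official rule.

```python
# Attributes whose raw value is usually expected to stay at 0 on a healthy disk
# (an illustrative selection, not an exhaustive or official list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Rows copied from the smartctl -a attribute table in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched(attribute_table: str) -> dict[str, int]:
    """Return the watched attributes whose raw value is not 0."""
    flagged = {}
    for line in attribute_table.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


print(nonzero_watched(SAMPLE) or "no watched attribute is above 0")
```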

czar.united · April 26, 2024, 12:32am

Hello

Many thanks for the reply.
I just restarted the system, and the faulted disk is back to normal.
The current zpool status is below; the pool is back to a normal condition.

  pool: Bigbre
 state: ONLINE
  scan: resilvered 90.0M in 00:00:24 with 0 errors on Fri Apr 26 06:46:32 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Bigbre                                    ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            553e4473-e078-4c0c-a542-cd1b0ef84674  ONLINE       0     0     0
            f688c6a0-b537-4c1f-826b-806ef81ccb0e  ONLINE       0     0     0
            9e9b2bd6-c910-4bda-b378-a0e50dc3fe1a  ONLINE       0     0     0
            8a39d49d-f643-4ca9-b63d-d757a733946c  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Mon Apr 22 03:45:09 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors

However, when the first 2 disks faulted, I restarted the system and they went back to normal, but the fault happened again within days, or even hours.

So I will definitely come back when the disk faults again.
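
When it happens again, the per-device STATE and READ/WRITE/CKSUM columns are the part of zpool status worth capturing before a restart. A minimal sketch of checking those columns, run here on the config rows pasted above (the function name is hypothetical):

```python
# Config rows copied from the zpool status output in this post.
SAMPLE = """\
        NAME                                      STATE     READ WRITE CKSUM
        Bigbre                                    ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            553e4473-e078-4c0c-a542-cd1b0ef84674  ONLINE       0     0     0
            f688c6a0-b537-4c1f-826b-806ef81ccb0e  ONLINE       0     0     0
            9e9b2bd6-c910-4bda-b378-a0e50dc3fe1a  ONLINE       0     0     0
            8a39d49d-f643-4ca9-b63d-d757a733946c  ONLINE       0     0     0
"""


def suspicious_devices(config: str) -> list[str]:
    """Return names of rows that are not ONLINE or show non-zero error counters."""
    bad = []
    for line in config.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a full device row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, state, counters = fields[0], fields[1], fields[2:5]
        if state != "ONLINE" or any(c != "0" for c in counters):
            bad.append(name)
    return bad


print(suspicious_devices(SAMPLE) or "all devices ONLINE with zero error counters")
```

Counters are compared as text so that abbreviated values such as 1.2K still count as non-zero.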

nickspacemonkey · April 26, 2024, 12:41am

Did these errors occur when using the onboard SATA? Also, what kind of HBA is it exactly? Some of them can be a bit finicky.

czar.united · April 26, 2024, 12:49am

If I remember correctly, the first 2 disks faulted while using the onboard SATA.
After those 2 disks faulted, I decided to pass the disks through with an Inspur 9207-8i LSI HBA that I had lying around (it was purchased from AliExpress). The 3rd disk faulted in this configuration.

Davvo · April 26, 2024, 7:04am

Output of sas3flash -list or, if it does not work, sas2flash -list.

What PSU are you using?

czar.united · April 26, 2024, 7:17am

The output of sas2flash -list is below:

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        Adapter Selected is a LSI SAS: SAS2308_1(D1) 

        Controller Number              : 0
        Controller                     : SAS2308_1(D1) 
        PCI Address                    : 00:00:10:00
        SAS Address                    : 56c92bf-0-0004-e695
        NVDATA Version (Default)       : 14.01.00.06
        NVDATA Version (Persistent)    : 14.01.00.06
        Firmware Product ID            : 0x2214 (IT)
        Firmware Version               : 20.00.06.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9207-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9207-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

The PSU is a Seasonic 80+ Bronze SS520GM Active PFC F3.
I think it is a Gen 1 Seasonic M12II semi-modular unit. It is an old PSU, purchased 8+ years ago.

There is a warranty label on it with the number 12 crossed out, which I think means it was purchased in 2012.

Davvo · April 26, 2024, 9:06am

The PSU or its cables could be a possibility worth investigating.

czar.united · April 26, 2024, 9:41am

Many thanks for your advice.
Yes, the PSU probably needs to be investigated further, since these 3 disks faulted under different connections to the system:
the first 2 faulted disks were on onboard SATA passthrough, and the 3rd is on HBA passthrough.

Thank you!

Stux · April 26, 2024, 11:08pm

A PSU that is failing can cause brownouts to the drives.

This can cause errors.

czar.united · April 29, 2024, 7:50am

When I opened the case, I noticed a weird sound coming from a hard disk,
“nguuukkkk”, repeating every 10 seconds. It is probably a sign that my PSU is going to die.

I just replaced the PSU with a Fractal Design 80+ Gold unit. Hopefully things get better.

How long does it usually take for a hard disk to die, at least in your experience?

nickspacemonkey · April 29, 2024, 1:25pm

Anywhere between a day and 20+ years.

Davvo · April 29, 2024, 7:30pm

I have heard of people having a drive die half an hour into the system… after burn-in. You don’t disrespect the HDD gods.

Your best shots are SMART tests. Long ones.