Do these warnings mean I have a drive failure? Why can't I manually run SMART?

Hi y’all, I received the following message yesterday, and then received the next messages today.

TrueNAS @ truenas alerts: 
  • Pool [poolName] state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
    
Current alerts: 
  • Pool [poolName] state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
    

And the following alerts came today

TrueNAS @ truenasFreddieNew alert:

  • Pool [poolName] state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state. The following devices are not healthy:
    
    • Disk WDC_WD40EZAX-00C8UB0 WD-WX32D83957EU is FAULTED
      
The following alert has been cleared: 
  • Pool [poolName] state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
    
 Current alerts: 
  • Pool [poolName] state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state. The following devices are not healthy:
    
    • Disk WDC_WD40EZAX-00C8UB0 WD-WX32D83957EU is FAULTED
      

    This NAS was set up 2 weeks ago with two refurbished 4 TB WD Blues from Western Digital’s website (they are mirrored). It is running on bare metal, installed on a 256 GB M.2 NVMe drive. Each hard disk is connected to a SATA port on the motherboard.

    When I first installed and set up TrueNAS, I tried to run SMART tests manually, but learned that this feature had been removed (automated in the background), as had the feature for scheduling SMART tests. I tried to set up a cron job that called smartctl, but it kept returning nothing/null, and I figured that the TrueNAS API had probably changed or hidden manual SMART calls.

    Anyway, my question is: do these two alerts definitely mean one of the drives is failing? What would the right steps to fix this be? My thoughts are:

    1. Locate which hard drive the error refers to by matching the model and serial number to the physical drive
    2. RMA the original drive (it should still be under warranty)
    3. Install the replacement drive and rebuild the pool.

    Does that sound right? Should I shut down and not touch the NAS in the meantime? Or should I just remove the bad drive immediately and keep running as normal, assuming the degraded pool will operate reasonably well (I have all the important data backed up in the cloud)?

Consumer drives such as WD Blue can take so long to respond while trying to read a dubious sector that ZFS will declare them faulted.
You should be able to manually run sudo smartctl -t long /dev/sdX against your drives (substitute X as required).

If the pool is online but degraded, your priority should be to make a backup, at least of the most important data. Then investigate what’s going on with the drives and pool.
sudo zpool status -v
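
To point smartctl at the right disk, the serial number from the alert can be matched to a device node first. A minimal sketch, assuming lsblk is available in the TrueNAS shell and reports the same serial string as the alert; find_dev_by_serial is a hypothetical helper name, not something TrueNAS ships:

```shell
# Hypothetical helper: print the /dev node whose serial matches $1.
# Reads "NAME SERIAL" pairs on stdin, as printed by: lsblk -dn -o NAME,SERIAL
find_dev_by_serial() {
    awk -v want="$1" '$2 == want { print "/dev/" $1 }'
}

# Usage on the NAS (commented out so the snippet is safe to source):
# dev=$(lsblk -dn -o NAME,SERIAL | find_dev_by_serial WD-WX32D83957EU)
# sudo smartctl -t long "$dev"   # start the extended self-test
# sudo smartctl -a "$dev"        # read attributes and the self-test log afterwards
```

The long test runs in the background on the drive itself, so the second smartctl call is only meaningful once the test has finished or aborted.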


This is the most important part. If you have a mirror and you mess up the data on the only good drive, well, it is a hard lesson to learn.

All the data you posted makes it look like a ZFS error; however, follow the flowcharts, as they will help you identify the issue. If it is only a ZFS error, then you likely have a system stability issue. A scrub will likely repair the damage. But follow the flowcharts. If you have a question about a command, just ask and someone will offer assistance.

@etorix “Go Go Gadget”

Run memtest86.

And/or reduce memory speed to standard.

If that works without error … are you using an Intel 14XXX?

Reduce the clock speed to 3000 MHz.

If you still have an unstable system, fault the drives.

Otherwise RMA the CPU.

Oh, and also: #man smartctl, so you understand what it does and doesn’t do.

Result of a scrub that I just ran:

Last Scan: Finished Scrub on 2025-12-11 17:35:16
Last Scan Errors: 0
Last Scan Duration: 1 hour 19 minutes 4 seconds

Results of zpool status -v:

pool: [poolName]
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 01:19:04 with 0 errors on Thu Dec 11 17:35:16 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        [poolName]                                DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            d417efca-457e-46e5-a620-6614bb26b1ac  FAULTED     24     0     0  too many errors
            81bdb5a6-e7f5-4dab-9698-8d4795e129cc  ONLINE       0     0     0

errors: No known data errors
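
For scripted checks, the device rows can be filtered out of output like the above. This is only a sketch: unhealthy_vdevs is a hypothetical helper, and it assumes the config table keeps the NAME STATE READ WRITE CKSUM column order shown here.

```shell
# Hypothetical helper: read `zpool status` text on stdin and print every
# row of the config table that is not ONLINE or has a non-zero error count.
unhealthy_vdevs() {
    awk '
        /^ *config:/ { in_cfg = 1; next }
        /^ *errors:/ { in_cfg = 0 }
        in_cfg && NF >= 5 && $1 != "NAME" &&
        ($2 != "ONLINE" || $3 != 0 || $4 != 0 || $5 != 0) {
            print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Usage on the NAS (commented out so the snippet is safe to source):
# sudo zpool status -v | unhealthy_vdevs
```

Run against the status above, this would list the pool, mirror-0, and the FAULTED member with its 24 read errors, and skip the healthy disk.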

Results of a long test on the troublesome drive:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.33-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EZAX-00C8UB0
Serial Number:    WD-WX32D83957EU
LU WWN Device Id: 5 0014ee 2c0c7514a
Firmware Version: 01.01A01
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec 12 13:37:21 2025 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  57) A fatal error or unknown test error
                                        occurred while the device was executing
                                        its self-test routine and the device 
                                        was unable to complete the self-test 
                                        routine.
Total time to complete Offline 
data collection:                (41280) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 430) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       161
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       314
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       30218
194 Temperature_Celsius     0x0022   118   112   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Fatal or unknown error        90%       295         -
# 2  Extended offline    Interrupted (host reset)      90%        30         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

If I followed the flowchart correctly, this indicates a failing armature/read head and would require a device replacement. Curious whether I understood it right and whether y’all agree. My reasoning is below.

  1. The ZFS status shows no corrupted files, and scrubbing did not change any of the read, write, or checksum values.

  2. None of the SMART long-test data shows any warnings for the critical drive issues.

  3. On the non-critical issues, the high raw read error rate, with no other discernible errors, suggests an armature failure.
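
The rows this reasoning relies on can be pulled out of a saved smartctl report with a small filter. It is a rough sketch, not an official health check: smart_summary is a hypothetical helper, and the choice of attribute IDs (1, 5, 197, 198, 199) is an assumption made for illustration. It also prints any self-test log entry that did not complete cleanly, such as the extended test in the report above that stopped with "Fatal or unknown error" at 90% remaining.

```shell
# Hypothetical helper: read `smartctl -a` text on stdin and print the raw
# values of a few attributes plus any self-test that did not finish cleanly.
smart_summary() {
    awk '
        # Attribute rows: first field is the numeric ID, tenth field is RAW_VALUE.
        $1 ~ /^[0-9]+$/ && NF >= 10 &&
        ($1 == 1 || $1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) {
            print "attr " $2 " raw=" $10
        }
        # Self-test log rows start with "#"; keep the ones that did not complete.
        /^# *[0-9]+ / && !/Completed without error/ { print "selftest " $0 }
    '
}

# Usage on the NAS (commented out so the snippet is safe to source):
# sudo smartctl -a /dev/sdX | smart_summary
```

As in the earlier command, substitute X as required.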


RMA
