Hi y’all, I received the following message yesterday, and then received the next messages today.
TrueNAS @ truenas alerts:
Pool [poolName] state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
Current alerts:
Pool [poolName] state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
And the following alerts came today:
TrueNAS @ truenasFreddie
New alert:
Pool [poolName] state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state. The following devices are not healthy:
Disk WDC_WD40EZAX-00C8UB0 WD-WX32D83957EU is FAULTED
The following alert has been cleared:
Pool [poolName] state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
Current alerts:
Pool [poolName] state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state. The following devices are not healthy:
Disk WDC_WD40EZAX-00C8UB0 WD-WX32D83957EU is FAULTED
This NAS was set up two weeks ago with two refurbished 4 TB WD Blues from Western Digital’s website (they are mirrored). It is running bare metal, with TrueNAS installed on a 256 GB M.2 NVMe drive. Each hard disk is connected to a SATA port on the motherboard.
When I first installed and set up TrueNAS I tried to manually run SMART tests, but learned that that feature had been removed (it is automated in the background), as had the feature for scheduling SMART tests. I tried to set up a cron job that called smartctl, but it kept returning nothing/null, and I figured the TrueNAS API has probably changed or hidden manual SMART calls.
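(It may also just have been the usual cron gotchas rather than the API: cron runs with a minimal PATH and silently discards output unless you redirect it. A minimal sketch of a cron line that should at least produce visible output; the smartctl path, device name, and log file below are placeholder assumptions:

# Run a short SMART self-test from cron; use the full path to smartctl
# (confirm with `which smartctl`) and redirect output so it is not lost.
# /dev/sda and the log path are placeholders, not my actual setup.
/usr/sbin/smartctl -t short /dev/sda >> /tmp/smart-cron.log 2>&1
)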
Anyway, my question is: do these two errors definitely mean one of the drives is failing? What would the right steps be to fix this? My thoughts are:
Locate which hard drive the error is referring to by matching the model and serial number to the physical drive (rough command sketch below)
RMA the original drive (it should still be under warranty)
Install the replacement drive and resilver/rebuild the mirror.
Does that sound right? Should I shut down and not touch the NAS in the meantime? Or should I just remove the bad drive immediately and keep running it like normal, assuming the degraded pool will operate reasonably fine (I have all the important data backed up in the cloud)?
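For step 1, my rough plan (assuming standard Linux tooling on TrueNAS SCALE; device names are placeholders) is:

# Match the FAULTED partition GUID from zpool status to a block device,
# then read that device's model and serial to identify the physical drive.
lsblk -o NAME,PARTUUID,MODEL,SERIAL
sudo smartctl -i /dev/sdX    # prints Device Model and Serial Number for that disk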
Consumer drives such as WD Blue can take so long to respond while trying to read a dubious sector that ZFS will declare them faulted.
You should be able to manually run sudo smartctl -t long /dev/sdX against your drives (substitute X as required).
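For example, something like this (sdX is a placeholder; check the device names on your system first):

sudo smartctl -t long /dev/sdX        # starts the extended self-test in the drive's firmware
sudo smartctl -l selftest /dev/sdX    # after it finishes, shows the self-test log
sudo smartctl -a /dev/sdX             # full attribute and health report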
If the pool is online but degraded, your priority should be to make a backup, at least of the most important data. Then investigate what’s going on with the drives and pool. sudo zpool status -v
This is the most important part. If you have a mirror and you mess up the data on the only good drive, well, it is a hard lesson to learn.
All the data you posted makes it look like a ZFS error; however, follow the flowcharts, as they will help you identify the issue. If it is only a ZFS error, then you likely have a system stability issue, and a scrub will likely repair the damage. But follow the flowcharts. If you have a question about a command, just ask and someone will offer assistance.
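If the flowcharts point to a transient cause (cabling, power, etc.) rather than the disk itself, the usual sequence is roughly this (substitute your pool name):

sudo zpool clear [poolName]       # reset the error counters on the pool
sudo zpool scrub [poolName]       # re-read and verify all data
sudo zpool status -v [poolName]   # confirm the scrub finished with no new errors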
Last Scan: Finished Scrub on 2025-12-11 17:35:16
Last Scan Errors: 0
Last Scan Duration: 1 hour 19 minutes 4 seconds
Results of a zpool status -v
pool: [poolName]
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0B in 01:19:04 with 0 errors on Thu Dec 11 17:35:16 2025
config:
NAME                                        STATE     READ WRITE CKSUM
[poolName]                                  DEGRADED     0     0     0
  mirror-0                                  DEGRADED     0     0     0
    d417efca-457e-46e5-a620-6614bb26b1ac    FAULTED     24     0     0  too many errors
    81bdb5a6-e7f5-4dab-9698-8d4795e129cc    ONLINE       0     0     0
errors: No known data errors
Results of a long test on the troublesome drive
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.33-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EZAX-00C8UB0
Serial Number: WD-WX32D83957EU
LU WWN Device Id: 5 0014ee 2c0c7514a
Firmware Version: 01.01A01
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Dec 12 13:37:21 2025 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 57) A fatal error or unknown test error
occurred while the device was executing
its self-test routine and the device
was unable to complete the self-test
routine.
Total time to complete Offline
data collection: (41280) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 430) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       161
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       314
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       30218
194 Temperature_Celsius     0x0022   118   112   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Fatal or unknown error          90%        295        -
# 2  Extended offline    Interrupted (host reset)        90%         30        -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
If I followed the flowchart correctly, this indicates to me a failing armature/read head, which would require a device replacement. Curious whether I understood it right and whether y’all agree. My reasoning is below:
The ZFS status shows no corrupted files, and scrubbing did not change any of the read, write, or checksum counts.
None of the SMART long-test data shows warnings on the critical drive attributes.
Among the non-critical attributes, the high raw read error rate, with no other discernible errors, suggests an armature failure.
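If y’all agree it needs replacing, my understanding is that the Replace action in the TrueNAS Storage UI is the normal way to do step 3; the plain-ZFS equivalent would be roughly this (the GUID comes from the zpool status above, and /dev/sdX is a placeholder for the new disk):

# Take the faulted member offline, physically swap the drive, then resilver.
sudo zpool offline [poolName] d417efca-457e-46e5-a620-6614bb26b1ac
# ...shut down, swap in the replacement disk, boot back up...
sudo zpool replace [poolName] d417efca-457e-46e5-a620-6614bb26b1ac /dev/sdX
sudo zpool status -v [poolName]    # watch the resilver progress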