Hello everyone!
I’m looking for some advice after encountering a drive issue in my TrueNAS system.
I’m running TrueNAS-SCALE-24.04.2.5 in a VM hosted by Proxmox. My specs:
Asus PRIME H510M-A WIFI
Intel Core i3-10300T
16 GB RAM for the VM
4 hard drives (4 TB each) connected to an HBA controller (IBM M1015, flashed to IT mode), which is passed through to the TrueNAS VM via PCI passthrough
Pool layout:
mirror 0
    WD Red (WDC_WD40EFRX-68N32N0)
    Seagate IRONWOLF (ST4000VN006-3CW104)
mirror 1
    WD Red (WDC_WD40EFPX-68C6CN0)
    Seagate IRONWOLF (ST4000VN006-3CW104)
After a scheduled scrub of my pool, I received the following notification:
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
NAME                      STATE     READ WRITE CKSUM
vault                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    <WD Red 0>            ONLINE       0     0     0
    <Seagate Ironwolf 0>  FAULTED     12     0     6  too many errors
  mirror-1                ONLINE       0     0     0
    <WD Red 1>            ONLINE       0     0     0
    <Seagate Ironwolf 1>  ONLINE       0     0     0
errors: No known data errors
To dig deeper, I ran a long SMART test on the FAULTED drive (Seagate IronWolf 0, model ST4000VN006-3CW104). As far as I can tell, the test completed without error, and the overall SMART status reports PASSED.
Here is the full log:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST4000VN006-3CW104
Serial Number: <Seagate Ironwolf 0>
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jun 17 10:21:24 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 453) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 100 064 006 - 43232
3 Spin_Up_Time PO---- 096 095 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 111
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 082 060 045 - 160706227
9 Power_On_Hours -O--CK 092 092 000 - 7129
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 111
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0
189 High_Fly_Writes -O-RCK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 069 057 040 - 31 (Min/Max 27/34)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 326
193 Load_Cycle_Count -O--CK 100 100 000 - 512
194 Temperature_Celsius -O---K 031 043 000 - 31 (0 20 0 0 0)
195 Hardware_ECC_Recovered -O-RC- 100 064 000 - 43232
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 452
240 Head_Flying_Hours ------ 100 253 000 - 7035 (97 50 0)
241 Total_LBAs_Written ------ 100 253 000 - 53800882549
242 Total_LBAs_Read ------ 100 253 000 - 116877034411
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x08 GPL R/O 2 Power Conditions log
0x09 SL R/W 1 Selective self-test log
0x0c GPL R/O 2048 Pending Defects log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x21 GPL R/O 1 Write stream error log
0x22 GPL R/O 1 Read stream error log
0x24 GPL R/O 512 Current Device Internal Status Data log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa1 GPL,SL VS 24 Device vendor specific log
0xa2 GPL VS 8160 Device vendor specific log
0xa6 GPL VS 192 Device vendor specific log
0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
0xab GPL VS 1 Device vendor specific log
0xb0 GPL VS 9048 Device vendor specific log
0xbe-0xbf GPL VS 65535 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL,SL VS 16 Device vendor specific log
0xc3 GPL,SL VS 8 Device vendor specific log
0xc4 GPL,SL VS 24 Device vendor specific log
0xd1 GPL VS 264 Device vendor specific log
0xd3 GPL VS 1920 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 7110 -
# 2 Short offline Completed without error 00% 7000 -
# 3 Extended offline Completed without error 00% 6912 -
# 4 Short offline Completed without error 00% 6832 -
# 5 Short offline Completed without error 00% 6592 -
# 6 Extended offline Completed without error 00% 6504 -
# 7 Short offline Completed without error 00% 6424 -
# 8 Short offline Completed without error 00% 6256 -
# 9 Extended offline Completed without error 00% 6183 -
#10 Short offline Completed without error 00% 6103 -
#11 Short offline Completed without error 00% 5887 -
#12 Extended offline Completed without error 00% 5799 -
#13 Short offline Completed without error 00% 5719 -
#14 Short offline Completed without error 00% 5551 -
#15 Extended offline Completed without error 00% 5463 -
#16 Short offline Completed without error 00% 5383 -
#17 Short offline Completed without error 00% 5144 -
#18 Extended offline Completed without error 00% 5057 -
#19 Short offline Completed without error 00% 4976 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
Device State: Active (0)
Current Temperature: 31 Celsius
Power Cycle Min/Max Temperature: 27/34 Celsius
Lifetime Min/Max Temperature: 20/43 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 3 minutes
Temperature Logging Interval: 94 minutes
Min/Max recommended Temperature: 1/61 Celsius
Min/Max Temperature Limit: 2/60 Celsius
Temperature History Size (Index): 128 (24)
Index Estimated Time Temperature Celsius
25 2025-06-09 03:22 30 ***********
... ..( 50 skipped). .. ***********
76 2025-06-12 11:16 30 ***********
77 2025-06-12 12:50 31 ************
78 2025-06-12 14:24 31 ************
79 2025-06-12 15:58 30 ***********
80 2025-06-12 17:32 30 ***********
81 2025-06-12 19:06 30 ***********
82 2025-06-12 20:40 31 ************
83 2025-06-12 22:14 31 ************
84 2025-06-12 23:48 31 ************
85 2025-06-13 01:22 30 ***********
... ..( 4 skipped). .. ***********
90 2025-06-13 09:12 30 ***********
91 2025-06-13 10:46 31 ************
... ..( 18 skipped). .. ************
110 2025-06-14 16:32 31 ************
111 2025-06-14 18:06 32 *************
112 2025-06-14 19:40 31 ************
... ..( 7 skipped). .. ************
120 2025-06-15 08:12 31 ************
121 2025-06-15 09:46 30 ***********
122 2025-06-15 11:20 31 ************
123 2025-06-15 12:54 31 ************
124 2025-06-15 14:28 31 ************
125 2025-06-15 16:02 30 ***********
126 2025-06-15 17:36 30 ***********
127 2025-06-15 19:10 30 ***********
0 2025-06-15 20:44 31 ************
1 2025-06-15 22:18 30 ***********
... ..( 6 skipped). .. ***********
8 2025-06-16 09:16 30 ***********
9 2025-06-16 10:50 33 **************
... ..( 3 skipped). .. **************
13 2025-06-16 17:06 33 **************
14 2025-06-16 18:40 31 ************
15 2025-06-16 20:14 31 ************
16 2025-06-16 21:48 31 ************
17 2025-06-16 23:22 30 ***********
... ..( 3 skipped). .. ***********
21 2025-06-17 05:38 30 ***********
22 2025-06-17 07:12 29 **********
23 2025-06-17 08:46 29 **********
24 2025-06-17 10:20 30 ***********
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 111 --- Lifetime Power-On Resets
0x01 0x010 4 7129 --- Power-on Hours
0x01 0x018 6 53856377809 --- Logical Sectors Written
0x01 0x020 6 326655901 --- Number of Write Commands
0x01 0x028 6 116869163343 --- Logical Sectors Read
0x01 0x030 6 130822314 --- Number of Read Commands
0x01 0x038 6 - --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 7119 --- Spindle Motor Power-on Hours
0x03 0x010 4 7078 --- Head Flying Hours
0x03 0x018 4 512 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 326 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 31 --- Current Temperature
0x05 0x010 1 30 --- Average Short Term Temperature
0x05 0x018 1 29 --- Average Long Term Temperature
0x05 0x020 1 43 --- Highest Temperature
0x05 0x028 1 25 --- Lowest Temperature
0x05 0x030 1 36 --- Highest Average Short Term Temperature
0x05 0x038 1 28 --- Lowest Average Short Term Temperature
0x05 0x040 1 32 --- Highest Average Long Term Temperature
0x05 0x048 1 29 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 70 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 269 --- Number of Hardware Resets
0x06 0x010 4 130 --- Number of ASR Events
0x06 0x018 4 452 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c)
No Defects Logged
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 9 Device-to-host register FISes sent due to a COMRESET
0x0001 2 452 Command failed due to ICRC error
0x0003 2 452 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
Seagate FARM log (GP Log 0xa6) supported [try: -l farm]
Unfortunately, I’m a bit unsure how to interpret this situation.
Should I trust the SMART test and use 'zpool clear' to clear the fault, or should I proactively replace this drive?
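To narrow down which parts of the log matter for this question, the error-related attributes with a non-zero raw value can be filtered out. Below is a minimal Python sketch, not part of the original troubleshooting. The names `SMART_LINES` and `nonzero_counters` are hypothetical, and the choice of which attribute rows to include is an assumption; the rows themselves are copied from the smartctl output above.

```python
# Attribute rows copied from the smartctl output above.
# Which rows to include is an assumption made for illustration.
SMART_LINES = """\
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 452
"""


def nonzero_counters(text):
    """Return (id, name, raw) for each attribute row whose raw value is not 0.

    Expects the column order ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW,
    as printed in the attribute table above.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not an attribute row (headers, blank lines).
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], int(fields[7])
        if raw != 0:
            hits.append((attr_id, name, raw))
    return hits


if __name__ == "__main__":
    for attr_id, name, raw in nonzero_counters(SMART_LINES):
        print(f"{attr_id:>3} {name}: {raw}")
```

On the rows above it prints only attribute 199 (UDMA_CRC_Error_Count) with a raw value of 452, the same number that appears as "Number of Interface CRC Errors" in the Device Statistics section and as "Command failed due to ICRC error" in the SATA Phy Event Counters.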
I appreciate any insights or recommendations. Thanks so much in advance for your help!
Best regards,
John