Just experienced my first raidz2 pool drive failure. The pool has a hot spare and the whole replacement process was 100% painless. However, the failed drive was usable in Windows and CrystalDiskInfo shows the drive is in good condition. Can I put it back in the pool as a spare again?
Appreciate any advice in advance.
Which version of TrueNAS?
What specific errors were reported in TrueNAS?
What does the S.M.A.R.T. report say? sudo smartctl -a /dev/XXXX
Run SMART tests and maybe even a pass of badblocks on that disk. Depending on the output, you can maybe put it back into service.
If the disk passes, might be worth looking into why it was dropped from the pool.
A ZFS failure does not mean a drive failure; however, it could have been caused by a drive failure. Looking at the SMART data will help, and if the drive hasn't had a Long/Extended test run in a while, that is what I'd run to try and validate the drive is actually good.
Looking at the TrueNAS logs may shed some light as well as to why the drive was replaced.
To all who replied, thank you. Unfortunately, I wasn't knowledgeable enough to run the suggested tests. My project for next weekend is to actually put it back in the pool as a spare and see whether it fails again. If it does, I will be armed and ready to test it. Thank you, all.
Before you put it into a pool is when you'd test.
Some of the more thorough tests (e.g. badblocks) will overwrite all data on a drive, several times over, to make sure every last bit on the HDD is still good.
Let us know when you've slapped that thing into any Linux OS & we'll be happy to try to help test it further.
Personally, I hate running badblocks on any system that has data I want to keep. The chance that I make a fat finger error & run the command on the wrong drive is too much for me to handle.
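One way to reduce that fat-finger risk is to compare the drive's serial number with the label on the disk before anything destructive runs. Below is a minimal shell sketch, not a vetted tool: `serial_matches` is a hypothetical helper name, and it assumes the `Serial Number:` line format that `smartctl -i` prints for ATA drives.

```shell
#!/bin/sh
# serial_matches EXPECTED
# Reads `smartctl -i` output on stdin and succeeds only if the
# "Serial Number:" line equals EXPECTED (hypothetical helper).
serial_matches() {
    expected="$1"
    actual="$(awk -F': *' '/^Serial Number:/ { print $2; exit }')"
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Intended use (assumption: /dev/sdX and EXAMPLE123 are placeholders):
#   sudo smartctl -i /dev/sdX | serial_matches EXAMPLE123 \
#       && sudo badblocks -wsv /dev/sdX
```

Because of the `&&`, badblocks would only start when the serial read from the device matches the one you typed from the label.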
I do have a Linux Mint laptop that I use occasionally with the GUI. I can easily connect the drive to it using a USB enclosure. However, I am not sure how to thoroughly test it. Any utility I could use? By the way, I am absolutely a novice on Linux commands and will need a lot of handholding, so to speak.
Appreciate further advice and thanks in advance. Oh, I will Google it too.
First thing would be a "smartctl -t long /dev/sd#", # being the letter assigned to the drive. Once that is done, post the results here & we can help advise if it is worth testing further or better to just RMA.
I had a guide for using badblocks but away from computer atm.
ChatGPT also showed me in detail how to use smartctl and badblocks. smartctl showed some pre-fail results but nothing alarming (no reallocated sectors and no pending reallocations). However, smartctl -H /dev/sdx reported that the drive is not responding properly. Maybe that is why TrueNAS failed it?
Not sure when I will do the badblocks -wsv /dev/sdx. The drive is 6TB. It will take hours and hours as I understand it. Since my pool is in good health and running with a hot spare, I might not pursue this anymore.
I really appreciate the help. Thank you again.
Do you mind posting the full outputs? Smartctl output can sometimes be unintuitive.
I don't typically see the -H flag, was that something the LLM output? Normally an -a is sufficient, sometimes an -x. You didn't use /dev/sdx, that was only illustrative, right?
/dev/sdx was only an illustration.
I am running badblocks -wsv on /dev/sdd now, but I used a separate terminal window to generate the smartctl -a /dev/sdd output as follows. By the way, I must have run smartctl on the wrong drive earlier. The /dev/sdd SMART attributes are all either pre-fail or old age! I checked my purchase history and found that this was a refurbished drive, which I don't think I should put back into service.
smartctl -H was recommended by ChatGPT to just retrieve the general health of the drive. The output was the first paragraph under the "=== START OF READ SMART DATA SECTION ===" heading.
Here is the smartctl -a output:
=== START OF INFORMATION SECTION ===
Device Model: OOS6000G
Serial Number: 000F3MR7
LU WWN Device Id: 5 000c50 0e7c284a0
Firmware Version: OOS1
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Sep 29 13:09:21 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 25) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 567) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 549) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 064 044 Pre-fail Always - 73399296
3 Spin_Up_Time 0x0003 093 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1925
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 073 060 045 Pre-fail Always - 18838485
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4706
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 566
18 Unknown_Attribute 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 075 000 Old_age Always - 25
190 Airflow_Temperature_Cel 0x0022 040 040 000 Old_age Always - 60 (Min/Max 25/60)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 74
193 Load_Cycle_Count 0x0032 090 090 000 Old_age Always - 20845
194 Temperature_Celsius 0x0022 060 060 000 Old_age Always - 60 (0 15 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 100 000 Old_age Offline - 997 (197 5 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 22444269349
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 15145211233
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Aborted by host 90% 4703 -
# 2 Short offline Completed without error 00% 4702 -
# 3 Extended offline Aborted by host 90% 4702 -
# 4 Short offline Completed without error 00% 4702 -
# 5 Short offline Completed without error 00% 4701 -
# 6 Short offline Completed without error 00% 4624 -
# 7 Short offline Completed without error 00% 4595 -
# 8 Short offline Completed without error 00% 4566 -
# 9 Short offline Completed without error 00% 4537 -
#10 Short offline Completed without error 00% 4488 -
#11 Short offline Completed without error 00% 4463 -
#12 Short offline Completed without error 00% 4420 -
#13 Short offline Completed without error 00% 4391 -
#14 Short offline Completed without error 00% 4362 -
#15 Short offline Completed without error 00% 4318 -
#16 Extended offline Interrupted (host reset) 00% 4308 -
#17 Short offline Completed without error 00% 4289 -
#18 Short offline Completed without error 00% 4260 -
#19 Short offline Completed without error 00% 4217 -
#20 Short offline Completed without error 00% 4188 -
#21 Short offline Completed without error 00% 4159 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
pretty damn toasty - good idea to get a fan pointed at it.
Uhh, guessing it is a Seagate drive?
smartctl -a -v 1,raw48:54 /dev/sdd -v 7,raw48:54 -v 195,raw48:54 should translate it into something I could actually read. Otherwise I see only short tests complete - so this isn't full info.
Current theory: drive got toasty, but based on this limited info, it should be healthy. Don't run any SMART test until badblocks is done doing its thing.
It'll take a long time, but badblocks followed by a smartctl -t long will let us know if the drive can be trusted.
A long/extended SMART test will take approximately 9 hours to complete (the drive reports a recommended polling time of 549 minutes). It looks like @twirl843 shut down or restarted before the test had completed the last three times it was started. It would be good to complete at least one now to see if anything crops up.
As I said, SMART reports are not exactly intuitive to decipher; none of the monitored attributes are actually signalling failure yet.
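To make "signalling failure" concrete: SMART treats an attribute as failed when its normalized VALUE drops to or below THRESH, which is also when smartctl fills in the WHEN_FAILED column. Here is a small sketch that applies that rule to the attribute table. `failing_attributes` is a made-up helper name, and it assumes the column layout shown in the output above.

```shell
#!/bin/sh
# failing_attributes: reads `smartctl -A` (or -a) output on stdin and prints
# every attribute row whose normalized VALUE is at or below THRESH.
# Columns assumed: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && ($4 + 0) <= ($6 + 0) {
             printf "%s %s value=%d thresh=%d\n", $1, $2, $4, $6
         }'
}

# Intended use (placeholder device name):
#   sudo smartctl -A /dev/sdX | failing_attributes
```

Fed the table from this thread it prints nothing, since for example Raw_Read_Error_Rate sits at 079 against a threshold of 044. Empty output only means no attribute has crossed its threshold; it does not replace a completed long test.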
It is running hot because it is in an external USB enclosure and I purposely closed the cover while badblocks is running.
I do have 120mm fans pointing at all the drives inside the TrueNAS case. They all run in the low 40s (degrees C) under heavy read/write load. Badblocks just started the read test. I guess it will be complete in another 9 hours, followed by another 9 hours of smartctl -t long. I will share the smartctl -a output as soon as I can or when I remember.
My bad, I turned off the machine, forgetting that smartctl was running.
I can tell badblocks is running because the drive light is flashing. smartctl -t long only returned an estimated time of completion, but the drive light is not flashing at all. Is there a way to check the progress status of the smartctl -t long command?
...Just a heads-up - badblocks -w does FOUR passes of write/reads by default (patterns 0xaa, 0x55, 0xff, 0x00). So expect this whole process to take like 4 days.
smartctl -a /dev/sdd will give the progress on any presently running SMART test in the output (under "Self-test execution status" in the General SMART Values section). Use the command I posted a few replies earlier & it'll hopefully provide a readable output for some of the important fields.
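If scrolling through the full report is a pain, the progress lines can be filtered out. A rough sketch, assuming the two-line "Self-test execution status" layout seen in the outputs in this thread; `selftest_status` is a hypothetical helper name:

```shell
#!/bin/sh
# selftest_status: reads `smartctl -c` (or -a) output on stdin and prints the
# "Self-test execution status" line plus the continuation line after it.
selftest_status() {
    awk '/Self-test execution status:/ { found = 1 }
         found { sub(/^[ \t]+/, ""); print; if (++n == 2) exit }'
}

# Intended use (placeholder device name):
#   sudo smartctl -c /dev/sdX | selftest_status
```

While a test is running, the second line should report the percentage of the test remaining. An aborted test shows up as "aborted by the host", as in the reports posted here.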
4 days! This thing is going to melt.
Assuming you would like to see the output of
smartctl -a -v 1,raw48:54 /dev/sdd -v 7,raw48:54 -v 195,raw48:54 while badblocks is running, here it is below. Would you kindly explain what the command options are for?
sudo smartctl -a -v 1,raw48:54 /dev/sdd -v 7,raw48:54 -v 195,raw48:54
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-84-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: OOS6000G
Serial Number: 000F3MR7
LU WWN Device Id: 5 000c50 0e7c284a0
Firmware Version: OOS1
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Sep 29 22:47:05 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 25) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 567) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 549) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 080 064 044 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 093 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1925
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 074 060 045 Pre-fail Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4715
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 566
18 Unknown_Attribute 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 075 000 Old_age Always - 25
190 Airflow_Temperature_Cel 0x0022 042 038 000 Old_age Always - 58 (Min/Max 25/62)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 74
193 Load_Cycle_Count 0x0032 090 090 000 Old_age Always - 20845
194 Temperature_Celsius 0x0022 058 062 000 Old_age Always - 58 (0 15 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 100 000 Old_age Offline - 1007 (109 96 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 29930283989
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 21675519841
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Aborted by host 90% 4703 -
# 2 Short offline Completed without error 00% 4702 -
# 3 Extended offline Aborted by host 90% 4702 -
# 4 Short offline Completed without error 00% 4702 -
# 5 Short offline Completed without error 00% 4701 -
# 6 Short offline Completed without error 00% 4624 -
# 7 Short offline Completed without error 00% 4595 -
# 8 Short offline Completed without error 00% 4566 -
# 9 Short offline Completed without error 00% 4537 -
#10 Short offline Completed without error 00% 4488 -
#11 Short offline Completed without error 00% 4463 -
#12 Short offline Completed without error 00% 4420 -
#13 Short offline Completed without error 00% 4391 -
#14 Short offline Completed without error 00% 4362 -
#15 Short offline Completed without error 00% 4318 -
#16 Extended offline Interrupted (host reset) 00% 4308 -
#17 Short offline Completed without error 00% 4289 -
#18 Short offline Completed without error 00% 4260 -
#19 Short offline Completed without error 00% 4217 -
#20 Short offline Completed without error 00% 4188 -
#21 Short offline Completed without error 00% 4159 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
Beauty - see how these two fields (Raw_Read_Error_Rate and Seek_Error_Rate) actually show "0" instead of some insanely large number? That's good!
Depending on how the results go after the three days of testing, I have a strong feeling that the drive might actually be fine. That's good news. The bad news is that you may have to investigate what caused it to drop out. Might be a loose or faulty wire. Maybe the HBA needs more airflow. Hard to say, but outside of @winnielinnie's favourite MemTest failures, those would probably be the two most common failures.
Edit:
Since you asked what the command options are for: -a outputs all SMART info, and each -v names a specific attribute whose raw value we want translated (for example, 1 - Raw_Read_Error_Rate and 7 - Seek_Error_Rate) & then how we want to translate it. This isn't required for all drives, but for Seagate specifically you generally need to do this to read some of the entries.
https://linux.die.net/man/8/smartctl
In case you want a full, detailed breakdown.
How do I know that we need to specify "raw48:54" to translate these values? I read it somewhere on the forums once when trying to figure out why WD drives gave me values like "0" for error rate, while Seagate gave "51239873" & was considered "in spec".
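For anyone curious why that works: as far as I understand the community explanation (Seagate does not document it, so treat this as an assumption), the 48-bit raw value packs an error count into the upper 16 bits and an operation count into the lower 32 bits, and raw48:54 tells smartctl to print only bytes 5 and 4, i.e. the upper 16 bits. A small shell sketch of the same split, with `seagate_split` being a made-up helper name:

```shell
#!/bin/sh
# seagate_split RAW
# Splits a 48-bit raw value into the upper 16 bits (assumed: error count)
# and the lower 32 bits (assumed: number of operations).
seagate_split() {
    raw="$1"
    errors=$(( $raw >> 32 ))
    operations=$(( $raw & 0xFFFFFFFF ))
    echo "errors=$errors operations=$operations"
}

seagate_split 73399296   # Raw_Read_Error_Rate from the first report
seagate_split 18838485   # Seek_Error_Rate from the first report
```

Both raw values are below 2^32 (4294967296), so the upper 16 bits are zero and both calls report errors=0, which lines up with the "0" that the raw48:54 output shows for those attributes.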
Edit 2:
This isn't to say WD is better than Seagate, just to explain that there are minor differences in how to get a readable output depending on vendor.
To all interested, badblocks finally finished and reported no bad blocks found with 0/0/0 errors.
I am not going to run smartctl -t long since I have a problem tracking its progress and already aborted it once today. I don't even know what I did, but smartctl -a showed another long offline test was aborted by the host. With this in my pocket, I will keep this drive as an emergency spare.
Thank you very much for all the help.
Cheers