Pool degraded after update to 25.04.0

It gives me this:

A 700W power supply should be plenty, correct? I wouldn’t think the size would be an issue now unless the power supply itself is going bad…

Thanks!

Hmm. Swear that’s worked before.

Elevate to a root shell with sudo su and then try:

for i in /sys/class/scsi_host/host*/link_power_management_policy; do echo max_performance > "$i"; done

Unless it’s a really low-quality 700W unit that can’t actually deliver its rated output, 3 drives shouldn’t be anywhere near enough to upset that supply. What we’re doing with the script above is setting the policy on the SATA link itself so that it never goes into a sleep/idle state.
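If it helps, here is the same idea as a small function that also reads the value back after writing it. This is only a sketch: the function name set_lpm_policy and its optional directory argument are made up for illustration (the directory argument just makes it possible to try it against a scratch folder first), and anything written to /sys this way will not survive a reboot.

```shell
# set_lpm_policy [BASE_DIR] [POLICY]   (hypothetical helper, not a TrueNAS tool)
# Writes POLICY to every host*/link_power_management_policy under BASE_DIR
# and prints each file with the value read back from it.
set_lpm_policy() {
    base="${1:-/sys/class/scsi_host}"
    policy="${2:-max_performance}"
    for f in "$base"/host*/link_power_management_policy; do
        # Skip an unexpanded glob or hosts that do not expose this attribute.
        [ -w "$f" ] || continue
        printf '%s\n' "$policy" > "$f"
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Run from the root shell it would just be set_lpm_policy with no arguments.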

This time it didn’t return anything. I ran it 2x to make sure I did it right the first time…

Thank you!

That should mean it took - you can check with

cat /sys/class/scsi_host/host*/link_power_management_policy

And see if it comes back with max_performance for all of them.
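For a quick yes/no answer instead of eyeballing the list, something along these lines should work. It is a sketch under the assumption that every SATA host exposes the attribute under /sys/class/scsi_host; the name check_lpm_policy is hypothetical.

```shell
# check_lpm_policy [BASE_DIR] [WANTED]   (hypothetical helper)
# Prints every host whose policy differs from WANTED and returns 1 if any do.
check_lpm_policy() {
    base="${1:-/sys/class/scsi_host}"
    wanted="${2:-max_performance}"
    bad=0
    for f in "$base"/host*/link_power_management_policy; do
        # Ignore an unexpanded glob or unreadable entries.
        [ -r "$f" ] || continue
        current=$(cat "$f")
        if [ "$current" != "$wanted" ]; then
            printf 'MISMATCH %s: %s\n' "$f" "$current"
            bad=1
        fi
    done
    return "$bad"
}
```

No output and an exit status of 0 means every link is already set the way you want.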

It returns this…

Thanks for the help!

I’m running a Scrub right now.

OK, figured out what was wrong. I missed the “/” at the very beginning of the location… Now I get:

Exactly what you expected…

If this was the issue, should the degraded status disappear by itself?

Thank you!

It should stop recurring. Do a zpool clear and then a scrub and see if it comes back. Some of the issues I saw that were related to this were also on older AMD-based builds, so I’m wondering if something in the SATA controllers in those systems was exposing an oddity.

The default setting for power control is to follow the BIOS/system defaults, and if those are a bit off, that might be causing the failed writes.
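Spelled out, the clear-then-scrub sequence looks roughly like this. The wrapper name clear_and_scrub is just for illustration; the pool name is whatever yours is called.

```shell
# clear_and_scrub POOL   (hypothetical wrapper around the stock zpool commands)
# Resets the pool's error counters, starts a scrub, then shows the status.
clear_and_scrub() {
    pool="$1"
    [ -n "$pool" ] || { echo "usage: clear_and_scrub POOL" >&2; return 2; }
    zpool clear "$pool" || return 1      # reset READ/WRITE/CKSUM counters
    zpool scrub "$pool" || return 1      # start the scrub (runs in the background)
    zpool status -v "$pool"              # progress plus any files flagged so far
}
```

Re-run zpool status -v on the pool once the scrub finishes to see whether the counters stayed at zero.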

@HoneyBadger

Well, I have done a number of things to see what would happen.

I ran another complete surface scan (took about 8 hrs or so) on the disks with the WD software, which turned up zero errors.

I loaded the drives (my data pool drives and my boot pool drive) into another computer I have that is much newer and built for much heavier lifting. It did seem to have fewer errors but they did not completely go away.

I have abandoned the backplane altogether because it did seem to cause more errors.

I now have all drives wired directly to the motherboard (back in the original computer).

I’m running my second scrub in 2 days. I ran one on the newer computer and am now running another on the original computer.

I noted that there were several files labeled as possibly damaged during the first scrub. They were all in the backup of my one laptop, so I tried to delete the complete backup. I was able to delete most of it, but for some reason it would not let me delete the main directory and a couple of subdirectories. It kept telling me I did not have permission when I tried to delete them from my MacBook, and when I tried to ssh in and delete them that way it kept telling me “invalid exchange”. I have no idea what that means and couldn’t find anything from searching… Anyway, I am not seeing any files listed as damaged so far during this latest scrub.

I am not getting any of the “Degraded” or “Faulted” messages on the drives but I am still getting checksum errors. After last boot (about 5-1/2 hours ago), I am up to 324 checksum errors on 2 of the 3 disks and 20 on the 3rd.
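In case it is useful to anyone following along, this is roughly how the per-disk checksum counts can be pulled out of the status output between scrubs. It is only a sketch: cksum_errors is a made-up name, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout of zpool status.

```shell
# cksum_errors   (hypothetical filter)
# Reads `zpool status` text on stdin and prints "device count" for every
# row of the config table whose CKSUM column is not 0.
cksum_errors() {
    awk '
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }   # table header
        in_cfg && NF == 0              { in_cfg = 0 }         # blank line ends the table
        in_cfg && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

Usage would be something like: zpool status WizzPool | cksum_errors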

I also have 2 more 2 TB WD Black drives that I tried to set up as a mirrored pair to create a second pool, but it failed when I tried to set it up?!

Do you think that upgrading my pool to the latest ZFS version might help? I have not researched what the latest update was for, so I don’t know.

Thanks for all the help!!!

@HoneyBadger So, now the pool and a disk show degraded again. UGH…

I got this message. I don’t know if it helps at all. I’m not sure what it’s telling me.

TrueNAS @ truenas

New alerts:

  • Pool WizzPool state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
    The following devices are not healthy:
    • Disk WDC_WD40EFRX-68WT0N0 WD-WCC4E7KPT054 is DEGRADED

Current alerts:

  • Failed to sync TRUENAS catalog: [EFAULT] Failed to clone ‘GitHub - truenas/apps’ repository at ‘/mnt/.ix-apps/truenas_catalog’ destination: [EFAULT] Failed to clone ‘GitHub - truenas/apps’ repository at ‘/mnt/.ix-apps/truenas_catalog’ destination: Cloning into ‘/mnt/.ix-apps/truenas_catalog’…
  • Pool WizzPool state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
    The following devices are not healthy:
    • Disk WDC_WD40EFRX-68WT0N0 WD-WCC4E7KPT054 is DEGRADED

Thanks!

@HoneyBadger

Well, I don’t think it’s necessarily a hardware issue. I mean it could still be but I changed to my other computer and it is still getting checksum errors and lots of them…

I saw in the TrueNAS documentation that the checksum setting in each of the datasets should be set to “SHA512”. Do you think that is correct? I went ahead and changed all mine, but it seems to have made it worse instead of better…

This system is:

AMD Ryzen 7 5800X 8-Core Processor
Corsair Vengeance Pro 16GB x 4 (64 GB)
ROG STRIX X570-E Gaming Motherboard

Thanks…

Though you’ve done extensive testing & have validated the drives using the manufacturer’s software, I’m curious about the output of ‘smartctl -a /dev/sd#’ (replace # with the relevant drive letter)… Maybe your drives are actually failing?..
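The full output is long, so if you only want the handful of attributes that usually matter for this kind of problem, a filter like the one below can trim it down. This is a sketch, not an official tool: smart_summary is a made-up name, and the choice of attribute IDs (1, 5, 197, 198, 199, 200) is my own shortlist.

```shell
# smart_summary   (hypothetical filter)
# Reads `smartctl -A` or `smartctl -a` output on stdin and prints the
# normalized VALUE, THRESH and RAW_VALUE for a few key attributes.
smart_summary() {
    awk '
        # Attribute rows have 10+ columns and a 0x.... flag in column 3,
        # which keeps other numbered tables in the output from matching.
        $1 ~ /^(1|5|197|198|199|200)$/ && $3 ~ /^0x/ && NF >= 10 {
            printf "%-24s value=%s thresh=%s raw=%s\n", $2, $4, $6, $10
        }
    '
}
```

For example: smartctl -A /dev/sdc | smart_summary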

Yes, maybe, and I will try that. The curious thing is… They have exactly the same number of errors. All 3 are exactly the same. Just seems a little weird to me. But who knows.

Thanks for your advice!

Here is Disk “1”…

Machine  ix-applications
root@truenas[/mnt/LilWizz]# smartctl -a /dev/sdc 
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E4EE670X
LU WWN Device Id: 5 0014ee 20b484633
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 11 23:44:56 2025 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(50760) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 508) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   178   021    Pre-fail  Always       -       7241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       150
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18887
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       149
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       107
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       484
194 Temperature_Celsius     0x0022   120   104   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
ATA Error Count: 12373 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 12373 occurred at disk power-on lifetime: 18761 hours (781 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:31:21.317  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.316  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:31:21.316  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:31:21.315  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.315  IDENTIFY DEVICE

Error 12372 occurred at disk power-on lifetime: 18761 hours (781 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:31:21.316  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:31:21.315  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.315  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      00:31:21.261  SET FEATURES [Enable SATA feature]

Error 12371 occurred at disk power-on lifetime: 18761 hours (781 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:31:21.315  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.315  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      00:31:21.261  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.260  IDENTIFY DEVICE

Error 12370 occurred at disk power-on lifetime: 18761 hours (781 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:31:21.261  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.260  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:31:21.260  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:31:21.259  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.259  IDENTIFY DEVICE

Error 12369 occurred at disk power-on lifetime: 18761 hours (781 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:31:21.260  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:31:21.259  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:31:21.259  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      00:31:21.210  SET FEATURES [Enable SATA feature]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      10%     18874         -
# 2  Short offline       Interrupted (host reset)      10%     18761         -
# 3  Short offline       Completed without error       00%     18699         -
# 4  Extended offline    Completed without error       00%     18695         -
# 5  Short offline       Completed without error       00%     18633         -
# 6  Short offline       Completed without error       00%     18490         -
# 7  Short offline       Completed without error       00%     18322         -
# 8  Short offline       Completed without error       00%     18155         -
# 9  Short offline       Completed without error       00%     17987         -
#10  Short offline       Completed without error       00%     17819         -
#11  Extended offline    Completed without error       00%     17257         -
#12  Extended offline    Completed without error       00%     15797         -
#13  Extended offline    Completed without error       00%     14382         -
#14  Extended offline    Completed without error       00%     12929         -
#15  Extended offline    Completed without error       00%     11501         -
#16  Extended offline    Completed without error       00%     10072         -
#17  Extended offline    Completed without error       00%      8648         -
#18  Extended offline    Completed without error       00%      7186         -
#19  Extended offline    Completed without error       00%      5748         -
#20  Extended offline    Completed without error       00%      4285         -
#21  Extended offline    Completed without error       00%      2822         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Here is Disk “2”…

root@truenas[/mnt/LilWizz]# smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4EF16HEXF
LU WWN Device Id: 5 0014ee 2b544f4f1
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 11 23:49:22 2025 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(52320) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 523) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       101
  3 Spin_Up_Time            0x0027   193   175   021    Pre-fail  Always       -       7350
  4 Start_Stop_Count        0x0032   001   001   000    Old_age   Always       -       100650
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       77038
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2847
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       420
193 Load_Cycle_Count        0x0032   167   167   000    Old_age   Always       -       101276
194 Temperature_Celsius     0x0022   117   092   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
ATA Error Count: 2008 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2008 occurred at disk power-on lifetime: 26512 hours (1104 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:03:36.037  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:03:36.036  IDENTIFY DEVICE
  c8 00 60 20 00 00 e0 08      00:03:35.998  READ DMA
  ec 00 00 00 00 00 a0 08      00:03:35.988  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:03:35.988  SET FEATURES [Set transfer mode]

Error 2007 occurred at disk power-on lifetime: 26512 hours (1104 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 60 20 00 00 e0  Device Fault; Error: ABRT 96 sectors at LBA = 0x00000020 = 32

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 60 20 00 00 e0 08      00:03:35.998  READ DMA
  ec 00 00 00 00 00 a0 08      00:03:35.988  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:03:35.988  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:03:35.988  IDENTIFY DEVICE
  c8 00 60 20 00 00 e0 08      00:03:35.949  READ DMA

Error 2006 occurred at disk power-on lifetime: 26512 hours (1104 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:03:35.988  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:03:35.988  IDENTIFY DEVICE
  c8 00 60 20 00 00 e0 08      00:03:35.949  READ DMA
  ec 00 00 00 00 00 a0 08      00:03:35.940  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:03:35.939  SET FEATURES [Set transfer mode]

Error 2005 occurred at disk power-on lifetime: 26512 hours (1104 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 60 20 00 00 e0  Device Fault; Error: ABRT 96 sectors at LBA = 0x00000020 = 32

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 60 20 00 00 e0 08      00:03:35.949  READ DMA
  ec 00 00 00 00 00 a0 08      00:03:35.940  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:03:35.939  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:03:35.939  IDENTIFY DEVICE
  c8 00 60 20 00 00 e0 08      00:03:35.901  READ DMA

Error 2004 occurred at disk power-on lifetime: 26512 hours (1104 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:03:35.939  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:03:35.939  IDENTIFY DEVICE
  c8 00 60 20 00 00 e0 08      00:03:35.901  READ DMA
  ec 00 00 00 00 00 a0 08      00:03:35.891  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:03:35.891  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      10%     11489         -
# 2  Short offline       Completed without error       00%     11376         -
# 3  Short offline       Completed without error       00%     11313         -
# 4  Extended offline    Aborted by host               90%     11313         -
# 5  Extended offline    Completed without error       00%     11310         -
# 6  Short offline       Completed without error       00%     11248         -
# 7  Short offline       Completed without error       00%     11242         -
# 8  Conveyance offline  Completed without error       00%     11218         -
# 9  Short offline       Completed without error       00%     11208         -
#10  Extended offline    Completed without error       00%     11183         -
#11  Short offline       Completed without error       00%     11171         -
#12  Short offline       Completed without error       00%     11104         -
#13  Short offline       Completed without error       00%     10937         -
#14  Short offline       Completed without error       00%     10770         -
#15  Short offline       Completed without error       00%     10602         -
#16  Short offline       Completed without error       00%     10434         -
#17  Extended offline    Completed without error       00%      9872         -
#18  Extended offline    Completed without error       00%      9604         -
#19  Extended offline    Completed without error       00%      9382         -
#20  Extended offline    Completed without error       00%      7931         -
#21  Extended offline    Completed without error       00%      6502         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Here is Disk “3”…

root@truenas[/mnt/LilWizz]# smartctl -a /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E7KPT054
LU WWN Device Id: 5 0014ee 2b5f38577
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 11 23:50:17 2025 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(51540) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 515) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   178   021    Pre-fail  Always       -       7233
  4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always       -       8385
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   004   004   000    Old_age   Always       -       70667
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       947
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       739
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8729
194 Temperature_Celsius     0x0022   119   099   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      10%      5117         -
# 2  Short offline       Completed without error       00%      5004         -
# 3  Short offline       Completed without error       00%      4942         -
# 4  Extended offline    Completed without error       00%      4938         -
# 5  Short offline       Completed without error       00%      4876         -
# 6  Short offline       Completed without error       00%      4733         -
# 7  Short offline       Completed without error       00%      4565         -
# 8  Short offline       Completed without error       00%      4397         -
# 9  Short offline       Completed without error       00%      4230         -
#10  Short offline       Completed without error       00%      4062         -
#11  Extended offline    Completed without error       00%      3500         -
#12  Extended offline    Completed without error       00%      2040         -
#13  Extended offline    Completed without error       00%       625         -
#14  Extended offline    Completed without error       00%     64708         -
#15  Extended offline    Completed without error       00%     63279         -
#16  Extended offline    Completed without error       00%     61851         -
#17  Extended offline    Completed without error       00%     60427         -
#18  Extended offline    Completed without error       00%     58965         -
#19  Extended offline    Completed without error       00%     57527         -
#20  Extended offline    Completed without error       00%     56065         -
#21  Extended offline    Completed without error       00%     54601         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

I’d actually argue that drives 1 & 2 do/did have a mechanical fault, given that Multi_Zone_Error_Rate is non-zero (raw value 1) on both. I don’t know if these drives are in warranty, but given that output I’d try to RMA them.

WD, in my experience, are pretty reasonable & have RMA’d drives with similar outputs for me, even though they ‘passed’.

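Disk 2 is the worst of the bunch, with Raw_Read_Error_Rate creeping up (raw value 101). Note that the THRESH of 051 is compared against the normalized VALUE column, which still reads 200, so SMART does not consider it close to failing yet, but I’d keep an eye on that raw count.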

Disk 3 looks fine imo.

…uhh, I’d recommend starting to back up your data.

Another thing to note: disk 2 shows ~77k power-on hours, but the most recent test in its log is stamped at ~11k hours of life (disk 3 is similar, at ~70k hours with its latest test stamped at ~5k). Not sure if anything went wrong with your scheduled tests? I’m also not sure whether I’d recommend running full SMART long tests on all of them for more recent results, or focusing on backups ASAP.
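One possible explanation for the odd timestamps, assuming the LifeTime(hours) column in the legacy self-test log is a 16-bit counter that wraps at 65536 hours: disk 3’s log jumps from 64708 (entry #14) to 625 (entry #13), which is what a wrap would look like. Under that assumption, the arithmetic below suggests the latest entries are only hours old. The function name hours_since_test is made up for this sketch.

```shell
# hours_since_test POWER_ON_HOURS LOGGED_TEST_HOURS   (hypothetical helper)
# Assumes the self-test log stores lifetime hours modulo 65536 (16 bits).
# Prints how many hours ago the logged test ran, under that assumption.
hours_since_test() {
    poh="$1"
    logged="$2"
    echo $(( ( (poh % 65536) - logged + 65536 ) % 65536 ))
}

hours_since_test 18887 18874   # disk 1: prints 13
hours_since_test 77038 11489   # disk 2: prints 13
hours_since_test 70667  5117   # disk 3: prints 14
```

If that holds, the most recent entry on each disk (the interrupted extended test) was logged about 13 to 14 hours before these outputs were taken, so the scheduled tests are probably fine.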

I have opened a ticket with WD. If nothing else, I want their opinion on what is going on with these disks.

I have monthly long SMART tests scheduled and weekly short tests. I also ran long tests on all 3 of these disks manually when I started having these issues, so I don’t know why it would show the last long test was sooooo long ago… Doesn’t make sense!

Thanks for your input! I appreciate it.

Well, WD said they believe there isn’t anything wrong with the drives. Their comment was… If you ran our software and it passed, the drives are good.

They suggested that I reformat the drives and start over (which is where I was going next).

:man_shrugging: