Two ZFS drives failed at once? Hmm

I’m using the old TrueNAS Core, 13.0-U6.2 (though honestly this seems like more of a ZFS issue).

I have a ZFS storage pool configured as RAIDZ2, and last week two of the disks were suddenly labeled as “bad”.

Here is the info on my drives:

partition  label                                       zpool      device  disk                     size  type  serial                 rpm  sas-location
---------------------------------------------------------------------------------------------------------------------------------------------------------
ada0p2     gptid/2dc5870c-3eb5-11ee-bf88-0cc47a84a594  boot-pool  ada0    SuperMicro SSD             63  SSD   SMC0515D92221CN36281     0
ada1p2     gptid/2dd0931a-3eb5-11ee-bf88-0cc47a84a594  boot-pool  ada1    SuperMicro SSD             63  SSD   SMC0515D92221CN23213     0
ada2p1     gptid/752998f6-094f-11ec-9a66-0cc47a84a594  tank       ada2    Crucial CT525MX300SSD1    525  SSD   164414822ADC             0
ada3p2     gptid/df3e5f72-5c3a-11ef-acb2-0cc47a84a594  tank       ada3    WDC WD6003FFBX-68MU3N0   6001  HDD   V701D2SH              7200
da0p2      gptid/565aaa4b-d2c4-11ef-810f-0cc47a84a594  tank       da0     ATA ST10000NE0008-2P    10000  HDD   ZS50R086              7200  SAS3008(0):1#0
da1p2      gptid/2eff3431-d2f0-11e6-8e60-0cc47a84a594  tank       da1     ATA WDC WD60EFRX-68L     6001  HDD   WDWX11D86KCCFY        5700  SAS3008(0):1#1
da2p2      gptid/7dd7ae93-5817-11ef-a6d5-0cc47a84a594  tank       da2     ATA ST10000NE0008-2P    10000  HDD   ZS50QL2Y              7200  SAS3008(0):1#2
da3p2      gptid/d8bc1263-ddbc-11ef-81fa-0cc47a84a594  tank       da3     ATA ST10000NE0008-2P    10000  HDD   ZS50R98N              7200  SAS3008(0):1#3
da4p2      gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594  tank       da4     ATA WDC WD60EFRX-68L     6001  HDD   WDWX11D86KCXE8        5700  SAS3008(0):1#4
da5p2      gptid/94e2a444-5608-11ed-baa2-0cc47a84a594  tank       da5     ATA WDC WD6003FFBX-6     6001  HDD   V7GDJHZH              7200  SAS3008(0):1#5
da6p2      gptid/2c91672c-2547-11ef-9f56-0cc47a84a594  tank       da6     ATA WDC WD6003FFBX-6     6001  HDD   V701UA4H              7200  SAS3008(0):1#6
da7p2      gptid/a60274b0-2e65-11ef-8343-0cc47a84a594  tank       da7     ATA ST10000NE0008-2P    10000  HDD   ZS50QKLA              7200  SAS3008(0):1#7

Here is the zpool status:
# zpool status -v tank

  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 25 13:12:25 2025
	7.96T scanned at 4.72G/s, 7.96T issued at 4.72G/s, 19.4T total
	0B resilvered, 40.99% done, no estimated completion time
config:

	NAME                                              STATE     READ WRITE CKSUM
	tank                                              ONLINE       0     0     0
	  raidz2-0                                        ONLINE       0     0     0
	    gptid/565aaa4b-d2c4-11ef-810f-0cc47a84a594    ONLINE       0     0     0
	    spare-1                                       ONLINE       0     0     1
	      gptid/2eff3431-d2f0-11e6-8e60-0cc47a84a594  ONLINE       0     0     0
	      gptid/df3e5f72-5c3a-11ef-acb2-0cc47a84a594  ONLINE       0     0     0
	    gptid/7dd7ae93-5817-11ef-a6d5-0cc47a84a594    ONLINE       0     0     0
	    gptid/d8bc1263-ddbc-11ef-81fa-0cc47a84a594    ONLINE       0     0     0
	    gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
	    gptid/94e2a444-5608-11ed-baa2-0cc47a84a594    ONLINE       0     0     0
	    gptid/2c91672c-2547-11ef-9f56-0cc47a84a594    ONLINE       0     0     0
	    gptid/a60274b0-2e65-11ef-8343-0cc47a84a594    ONLINE       0     0    12
	cache
	  gptid/752998f6-094f-11ec-9a66-0cc47a84a594      ONLINE       0     0     0
	spares
	  gptid/df3e5f72-5c3a-11ef-acb2-0cc47a84a594      INUSE     currently in use

errors: No known data errors
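
As a minimal sketch for reading a status listing like the one above, the rows with non-zero READ, WRITE or CKSUM counters can be filtered out with a little awk. The function name `zpool_error_rows` is made up for illustration, and the filter assumes the five-column config layout shown here.

```shell
# Print vdev rows from `zpool status` whose READ, WRITE or CKSUM counter is
# non-zero. Reads the status text on stdin, so it also works on saved output.
zpool_error_rows() {
    awk '
        # Config rows have five fields: NAME STATE READ WRITE CKSUM.
        # Requiring the last three to start with a digit skips the header,
        # the spares row and the "errors:" line.
        NF == 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage on a live system:
#   zpool status -v tank | zpool_error_rows
```

Fed the output above, this would list only the `spare-1` row and the `gptid/a60274b0-...` member.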

The drive with the 12 checksum errors is /dev/da7.
If I run smartctl on /dev/da7, I’m not seeing any errors. Failure is definitely possible, but the drive is only about 1 year old:

# smartctl -a /dev/da7
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0008-2PL103
Serial Number:    ZS50QKLA
LU WWN Device Id: 5 000c50 0db628590
Firmware Version: EN02
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 25 13:42:53 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(  567) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 930) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x50bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   044    Pre-fail  Always       -       162539760
  3 Spin_Up_Time            0x0003   092   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   089   060   045    Pre-fail  Always       -       807279464
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6552
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   050   040    Old_age   Always       -       28 (Min/Max 20/28)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       290
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6541h+57m+21.567s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       68685369919
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       62830272714

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6552         -
# 2  Short offline       Completed without error       00%      6461         -
# 3  Short offline       Completed without error       00%      6294         -
# 4  Short offline       Completed without error       00%      6126         -
# 5  Short offline       Completed without error       00%      5958         -
# 6  Short offline       Completed without error       00%      5790         -
# 7  Short offline       Completed without error       00%      5622         -
# 8  Short offline       Completed without error       00%      5454         -
# 9  Short offline       Completed without error       00%      5286         -
#10  Short offline       Completed without error       00%      5118         -
#11  Short offline       Completed without error       00%      4952         -
#12  Short offline       Completed without error       00%      4784         -
#13  Short offline       Completed without error       00%      4616         -
#14  Short offline       Completed without error       00%      4448         -
#15  Short offline       Completed without error       00%      4280         -
#16  Short offline       Completed without error       00%      4112         -
#17  Short offline       Completed without error       00%      3944         -
#18  Short offline       Completed without error       00%      3776         -
#19  Short offline       Completed without error       00%      3608         -
#20  Short offline       Completed without error       00%      3440         -
#21  Short offline       Completed without error       00%      3271         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

As I post this, /dev/da1 is being replaced by the hot spare and a resilver is in progress. The smartctl output for /dev/da1 is the following:

# smartctl -a /dev/da1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX11D86KCCFY
LU WWN Device Id: 5 0014ee 20de9f367
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 25 13:47:06 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		( 6524) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 719) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       11
  3 Spin_Up_Time            0x0027   197   197   021    Pre-fail  Always       -       9150
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       124
  5 Reallocated_Sector_Ct   0x0033   186   186   140    Pre-fail  Always       -       428
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   011   011   000    Old_age   Always       -       65016
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       124
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       122
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2907
194 Temperature_Celsius     0x0022   124   104   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   187   187   000    Old_age   Always       -       13
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     64935         -
# 2  Short offline       Completed without error       00%     64591         -
# 3  Short offline       Completed without error       00%     64423         -
# 4  Short offline       Completed without error       00%     64255         -
# 5  Short offline       Completed without error       00%     64088         -
# 6  Short offline       Completed without error       00%     63919         -
# 7  Short offline       Completed without error       00%     63752         -
# 8  Short offline       Completed without error       00%     63584         -
# 9  Short offline       Completed without error       00%     63417         -
#10  Short offline       Completed without error       00%     63250         -
#11  Short offline       Completed without error       00%     63082         -
#12  Short offline       Completed without error       00%     62914         -
#13  Short offline       Completed without error       00%     62746         -
#14  Short offline       Completed without error       00%     62578         -
#15  Short offline       Completed without error       00%     62411         -
#16  Short offline       Completed without error       00%     62243         -
#17  Short offline       Completed without error       00%     62075         -
#18  Short offline       Completed without error       00%     61907         -
#19  Short offline       Completed without error       00%     61738         -
#20  Short offline       Completed without error       00%     61571         -
#21  Short offline       Completed without error       00%     61403         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
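
Long smartctl dumps like the two above are easy to misread, so here is a rough filter that keeps only a handful of attributes commonly associated with failing media or a bad link, and only when their raw value is non-zero. The choice of IDs (5, 187, 196, 197, 198, 199) and the name `smart_red_flags` are assumptions made for this sketch, not an official list. Attributes 1 and 7 are left out on purpose, since Seagate drives report large raw values there in normal operation.

```shell
# Show selected SMART attributes only when their raw value is non-zero.
# IDs: 5 reallocated sectors, 187 reported uncorrectable, 196 reallocation
# events, 197 pending sectors, 198 offline uncorrectable, 199 UDMA CRC errors.
# Reads `smartctl -A` or `smartctl -a` output on stdin.
smart_red_flags() {
    awk '
        # Attribute rows have at least 10 fields; field 10 is RAW_VALUE.
        # The NF check skips the selective self-test rows that start with 5.
        $1 ~ /^(5|187|196|197|198|199)$/ && NF >= 10 && $10 + 0 > 0 {
            printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}

# Usage on a live system:
#   smartctl -A /dev/da1 | smart_red_flags
```

With the two dumps above, this would print nothing for /dev/da7 and would report attributes 5 and 196 for /dev/da1.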

My motherboard is a Supermicro X11SSL-CF with 8 SAS3 (12Gbps) ports via the Broadcom® 3008 SW controller built into the board. I have 2 SAS cables, each a SAS to 4 SATA adapter. Drives /dev/da0->/dev/da3 are on one cable and drives /dev/da4->/dev/da7 are on the other cable. With drives /dev/da1 and /dev/da7 listed as “failed”, that means one drive on each of the separate cables has failed, which just seems strange.

I’ve looked in forums (both the new and old TrueNAS forums), and I’ve seen the usual advice: check cables, check HBA, check RAM, etc.

Is there anything specifically recommended that I should troubleshoot in this case? The SAS cables seem to fit well and are firmly seated on the drives. smartctl doesn’t show anything.

Correct firmware?

So this suggests a dead/dying drive.

It’s not uncommon at all for a disk to die and then while being replaced another to fail due to the stress of the resilver. This is the main reason RAID-Z2 is recommended over Z1.

Defo worth checking the firmware of the controller as mentioned above.

Hey, thanks @pmh and @Johnny_Fartpants for the replies. Looking at my board (damn, it’s old; probably time to think about replacement), I have:



Firmware Revision : 01.15
Firmware Build Time : 02/19/2016
BIOS Version : 1.0a
BIOS Build Time : 01/29/2016
CPLD Version : 02.b1.01
Redfish Version : 1.0.1
IP address : 010.000.001.198
BMC MAC address : 0c:c4:7a:8b:f2:8d
System LAN1 MAC address : 0c:c4:7a:84:a5:94
System LAN2 MAC address : 0c:c4:7a:84:a5:95

Looking at the Supermicro website for the board (I tried posting a link to the firmware webpage for the X11SSL-CF but wasn’t allowed to), it definitely looks like a newer revision has been released. Would this be considered a firmware-related issue?

I can understand the WD Red drive dying, since it’s honestly older, but my new IronWolf drive (argh) is barely a year old.

What does sas3flash -list give you?

# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

	Adapter Selected is a Avago SAS: SAS3008(C0)

	Controller Number              : 0
	Controller                     : SAS3008(C0)
	PCI Address                    : 00:02:00:00
	SAS Address                    : 5003048-0-1884-4402
	NVDATA Version (Default)       : 0b.02.30.26
	NVDATA Version (Persistent)    : 0b.02.30.26
	Firmware Product ID            : 0x2221 (IT)
	Firmware Version               : 12.00.02.00
	NVDATA Vendor                  : LSI
	NVDATA Product ID              : LSI3008-IT
	BIOS Version                   : 08.29.01.00
	UEFI BSD Version               : 14.00.00.00
	FCODE Version                  : N/A
	Board Name                     : LSI3008-IT
	Board Assembly                 : N/A
	Board Tracer Number            : N/A

	Finished Processing Commands Successfully.
	Exiting SAS3Flash.

(LSI 9300-xx Firmware Update | TrueNAS Community)

That’s ancient.
16.00.12.00 is what you would want to see there.
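
For anyone comparing their own controller, here is a small sketch that pulls the `Firmware Version` line out of `sas3flash -list` output and compares it field by field against a target. The helper name `fw_compare` is hypothetical, and the default target is simply the 16.00.12.00 mentioned above.

```shell
# Compare the "Firmware Version" from `sas3flash -list` (read on stdin)
# against a dotted target version, numerically, one field at a time.
fw_compare() {
    target=${1:-16.00.12.00}
    awk -v target="$target" '
        /Firmware Version/ {
            cur = $NF
            n = split(cur, a, ".")
            split(target, b, ".")
            for (i = 1; i <= n; i++) {
                if (a[i] + 0 < b[i] + 0) { print cur " is older than " target; exit }
                if (a[i] + 0 > b[i] + 0) { print cur " is newer than " target; exit }
            }
            print cur " matches " target
            exit
        }
    '
}

# Usage on a live system:
#   sas3flash -list | fw_compare
#   sas3flash -list | fw_compare 16.00.14.00
```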

The second drive (WD-WX11D86KCCFY) is showing trouble in the SMART report: Reallocated_Sector_Ct has a raw value of 428.
I recommend you kick off a long SMART test on both and report back in a day, especially since you haven’t done one for quite a while, if ever.
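
A minimal sketch of what kicking off the long tests could look like. The helper names are made up for illustration. `long_test_cmds` only prints the commands so they can be reviewed first, and `minutes_to_hours` turns the "Extended self-test routine recommended polling time" from the smartctl output into a rough wait time (930 minutes for da7 and 719 minutes for da1 in the dumps posted earlier).

```shell
# Print (not run) the command that starts an extended self-test on each
# device. Pipe the output to sh to actually start the tests.
long_test_cmds() {
    for dev in "$@"; do
        printf 'smartctl -t long /dev/%s\n' "$dev"
    done
}

# Convert a polling time in minutes to whole hours, rounded up.
minutes_to_hours() {
    echo $(( ($1 + 59) / 60 ))
}

long_test_cmds da1 da7     # prints the two smartctl commands
minutes_to_hours 930       # da7 (IronWolf Pro): prints 16
minutes_to_hours 719       # da1 (WD Red): prints 12

# Once a test has finished, the result appears in the self-test log:
#   smartctl -l selftest /dev/da7
```

The polling times are the drive's own estimate for an idle disk, so a test that runs alongside other I/O may well take longer.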

Wow, thanks for the firmware link. I guess I’m a bit behind :rofl:. I think I’ll let the resilver onto the hot spare finish first, then flash the LSI controller, and then run a long test on the IronWolf drive. The WD Red drive is very old and it’s probably time to replace it anyway.

Looks like I’ll report back in a few days as these resilverings take a very long time to complete.

I’d update the firmware first and then let resilver proceed. Do not introduce further errors in there.

The WD Red with reallocated sectors is due for the bin if out of warranty.

Not trying to play dumb, but is flashing the controller actually OK during the midst of an active resilvering process?

Yeah let the active resilver finish first.

Flashing is best done from the UEFI shell, so while the pool is NOT mounted.

Hey, I was reading up on the flashing process, and there’s a lot of information. Honestly, I was just going to unplug the drives, boot to FreeNAS, and then run sas3flash -o -f SAS9300_xx_IT.bin from within the FreeNAS shell. The linked post mentions 9300 cards. I have a 3008. Is this even compatible?

Considering the drive is only 1 year old, maybe check its FARM data to verify it’s not a second-hand drive from one of those Chia farms sold as brand new.
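
One caveat worth checking before trying this: as far as I know, smartctl only learned to read the Seagate FARM log (`-l farm`) in smartmontools 7.4, and the output earlier in the thread shows 7.2. The sketch below treats that 7.4 cutoff as an assumption and just inspects the version banner. `smartctl_has_farm` is a made-up helper name.

```shell
# Read a smartctl version banner on stdin and report whether the version is
# at least 7.4, the release assumed here to have added `-l farm`.
smartctl_has_farm() {
    awk '
        $1 == "smartctl" {
            split($2, v, ".")
            if (v[1] + 0 > 7 || (v[1] + 0 == 7 && v[2] + 0 >= 4)) {
                print "smartctl " $2 ": -l farm should be available"
            } else {
                print "smartctl " $2 ": too old for -l farm"
            }
            exit
        }
    '
}

# Usage on a live system:
#   smartctl --version | smartctl_has_farm
# If it is available, the FARM log itself would be read with:
#   smartctl -l farm /dev/da7
```

If the installed smartctl is too old, a newer smartmontools build run from another machine or a live environment is one way to read the log.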

“3008” is the controller chip.
“9300” is the (family of) add-in card(s) which use the 3008 chip. So all is in order.

Really, flash from the UEFI shell rather than the OS shell: It is safer and easier.

So Supermicro have a slightly newer version of the firmware hosted on their website www.supermicro.com - /wdl/driver/SAS/Broadcom/3008/Firmware/

3008_FW_PH16.00.14.00

As @etorix has already said, for all intents and purposes they are one and the same, as they run the 3008 chip. However, when you download from the link in the TrueNAS forums, it does want you to select your specific HBA, which you don’t have since yours is onboard. The closest option you could choose would be the 9300-8i, as your motherboard appears to have two internal SAS3 ports.

If you would prefer going with the latest firmware from Supermicro, as it’s an exact match for your controller, then I think that would be fine too, as I haven’t heard of any issues with it.

Wow! Any indication on what this .14 version fixes over .10 or .12?

Not that I am aware of.

The latest version provided by Supermicro used to be .10.

Then they developed .12 in collaboration with iX who diagnosed a rare race condition (?) with FreeBSD and the .10 firmware. AFAIK .12 was never released by Supermicro.

Going to check the .14 release notes. If it contains the .12 fixes, great. Good to have a single source official distribution again.
