Strange behavior - drive FAULTED, then ONLINE, then Other drive FAILED, then ONLINE again

suhu · July 29, 2025, 9:11pm

I got a series of weird alert messages from TrueNAS

MESSAGE 1, 03:50
New alerts:

Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
- Disk 14009742386182016449 is FAULTED

MESSAGE 2, 03:53
New alert:

Pool mainpool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

The following alert has been cleared:

Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
- Disk 14009742386182016449 is FAULTED

MESSAGE 3, 03:54
New alert:

Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED

The following alert has been cleared:

Pool mainpool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

MESSAGE 4, 03:56
The following alert has been cleared:

Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED

MESSAGE 5, 4:28
New alert:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

The following alert has been cleared:

Pool mainpool state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED
  (// This WD drive is fine now and works OK without any errors)

MESSAGE 6, 4:32
New alert:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED
  (// THIS IS A DIFFERENT DRIVE - Seagate - that actually failed. It had about 1000 errors and was making grinding noises)

The following alert has been cleared:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

MESSAGE 7, 5:10
New alert:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

The following alert has been cleared:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED
  (// This failed Seagate drive was somehow cleared of all errors and came back online, even though I could still hear strange scratching and grinding noises. I offlined and disconnected this drive)

Current alerts:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

zpool status (before Seagate failure):

admin@truenas[~]$ sudo zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:49 with 0 errors on Tue Jul 29 03:46:51 2025
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  sda3      ONLINE       0     0     0

errors: No known data errors

  pool: mainpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jul 30 03:56:27 2025
	6.84T / 13.4T scanned at 10.2G/s, 0B / 7.14T issued
	0B resilvered, 0.00% done, no estimated completion time
config:

	NAME                                      STATE     READ WRITE CKSUM
	mainpool                                  ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    bff12767-9075-47ae-a5ce-5735cc528250  ONLINE       0     0     0
	    6798ec85-c57a-49f9-9e25-d7de4e7ebf8d  ONLINE       0     0     0
	    f05c0713-0581-49c6-b99b-cd60b5941f43  ONLINE       0     0     0
	    a7672c77-b913-4921-a4dd-21d8932394ec  ONLINE       0     0     0
	    d3a8290d-1a24-4c53-849e-6e144bd782ba  ONLINE       0     0     0

zpool status now (with Seagate being offline and WD (which was marked as faulted several times) working fine)

admin@truenas[~]$ sudo zpool status
[sudo] password for admin: 
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:49 with 0 errors on Tue Jul 29 03:46:51 2025
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  sda3      ONLINE       0     0     0

errors: No known data errors

  pool: mainpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 229M in 00:00:41 with 0 errors on Wed Jul 30 05:11:11 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	mainpool                                  DEGRADED     0     0     0
	  raidz1-0                                DEGRADED     0     0     0
	    bff12767-9075-47ae-a5ce-5735cc528250  ONLINE       0     0     0
	    6798ec85-c57a-49f9-9e25-d7de4e7ebf8d  ONLINE       0     0     0
	    f05c0713-0581-49c6-b99b-cd60b5941f43  ONLINE       0     0     0
	    a7672c77-b913-4921-a4dd-21d8932394ec  ONLINE       0     0     0
	    d3a8290d-1a24-4c53-849e-6e144bd782ba  OFFLINE      0     0     0

errors: No known data errors

I have a RAIDZ1 pool of 5×4 TB drives: a couple of IronWolfs, and the rest are WD Red Plus

Supermicro X9DRL-iF, drives connected directly to motherboard

UPDATE: (copied from my recent comment)
I have automatic SMART tests enabled: daily SHORT, weekly LONG. But both problematic drives (WD that was flagged as FAULTED and Seagate that actually FAILED) show a perfect status (all drives are less than a year old). No SMART errors at all

But the situation is very strange. As I mentioned in the first message, one of the drives (WD) was marked as FAULTED, then everything was fine, then FAULTED again, then fine again, all within the span of 10 minutes. After that, I started hearing strange noises, and TrueNAS began reporting hundreds of errors on the disk. But this was a different disk - the Seagate one. There were over 1000 errors, and the disk status changed to FAILED. About 20 minutes later, TrueNAS cleared all the errors and the disk returned to OK status. I have no idea why that happened, especially since the disk was still making strange sounds and grinding noises. I manually offlined the Seagate drive and started making a fresh backup.

Why did the WD disk status flip between FAULTED and OK, and why does this drive now just work? Why did the Seagate disk go from FAILED back to HEALTHY with all errors cleared (from 1138 to 0) despite obvious problems? This is very strange behavior, so I hope someone will help me understand

P.S. Please ignore my boot ssd /dev/sda, I know it has pending sectors. This topic is about my main storage pool with HDDs

update 2: Photo of my drives

update 3: smartctl -a for both drives

NugentS · July 29, 2025, 9:15pm

sda has a fault. But you have no ZFS errors

Run a long smart test on sda and see what happens

suhu · July 29, 2025, 9:17pm

sda is my boot drive (sata ssd) and this topic is not about this error. I’m asking about alerts from my mainpool with hard drives

suhu · July 29, 2025, 9:20pm

It’s now started resilvering one of the drives in mainpool

admin@truenas[~]$ sudo zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:49 with 0 errors on Tue Jul 29 03:46:51 2025
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  sda3      ONLINE       0     0     0

errors: No known data errors

  pool: mainpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jul 30 03:56:27 2025
	11.8T / 13.4T scanned, 35.2G / 2.15T issued at 293M/s
	7.10G resilvered, 1.60% done, 02:06:18 to go
config:

	NAME                                      STATE     READ WRITE CKSUM
	mainpool                                  ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    bff12767-9075-47ae-a5ce-5735cc528250  ONLINE       0     0     0  (resilvering)
	    6798ec85-c57a-49f9-9e25-d7de4e7ebf8d  ONLINE       0     0     0
	    f05c0713-0581-49c6-b99b-cd60b5941f43  ONLINE       0     0     0
	    a7672c77-b913-4921-a4dd-21d8932394ec  ONLINE       0     0     0
	    d3a8290d-1a24-4c53-849e-6e144bd782ba  ONLINE       0     0     0

errors: No known data errors
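As a rough sanity check on the estimate in this output, here is a small sketch. It assumes the M/G/T suffixes are binary units and that the remaining time is roughly the data still to be issued divided by the current issue rate; neither assumption comes from this thread, and the function names are made up for illustration.

```python
# Rough check of the resilver estimate shown in the zpool status output.
# Assumption: zpool's M/G/T suffixes are powers of 1024 (values kept in MiB).
MIB = 1
GIB = 1024 * MIB
TIB = 1024 * GIB


def eta_seconds(total_mib, issued_mib, rate_mib_per_s):
    """Seconds left if the remaining data is issued at a constant rate."""
    return (total_mib - issued_mib) / rate_mib_per_s


def as_hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    seconds = int(round(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Values copied from "35.2G / 2.15T issued at 293M/s".
eta = eta_seconds(2.15 * TIB, 35.2 * GIB, 293 * MIB)
print(as_hms(eta))
```

Under those assumptions the result lands within a few seconds of the reported 02:06:18; the small difference is expected because the displayed sizes and rate are rounded.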

But I still don’t understand what happened. I didn’t do anything at all

suhu · July 29, 2025, 9:28pm

So, never mind, my drive failed (I’m hearing scratching and weird noises)

SmallBarky · July 29, 2025, 9:31pm

You have possibly failing drives. If you lose more than one at a time on your data pool, it’s gone.

Check your drives as @NugentS said.

Back up all your data and get a current System Configuration download in case the boot device fails.

joeschmuck · July 30, 2025, 1:07pm

Are you running regular SMART Long tests? With how few drives you have, I’d recommend once a week for each drive. Short daily tests as well. Maybe you are doing this already.

suhu · July 30, 2025, 2:58pm

I have automatic SMART tests enabled: daily SHORT, weekly LONG. But both problematic drives (WD that was flagged as FAULTED and Seagate that actually FAILED) show a perfect status (all drives are less than a year old). No SMART errors at all

But the situation is very strange. As I mentioned in the first message, one of the drives (WD) was marked as FAULTED, then everything was fine, then FAULTED again, then fine again, all within the span of 10 minutes. After that, I started hearing strange noises, and TrueNAS began reporting hundreds of errors on the disk. But this was a different disk - the Seagate one. There were over 1000 errors, and the disk status changed to FAILED. About 20 minutes later, TrueNAS cleared all the errors and the disk returned to OK status. I have no idea why that happened, especially since the disk was still making strange sounds and grinding noises. I manually offlined the Seagate drive and started making a fresh backup.

Why did the WD disk status flip between FAULTED and OK, and why does this drive now just work? Why did the Seagate disk go from FAILED back to HEALTHY with all errors cleared (from 1138 to 0) despite obvious problems? This is very strange behavior, so I hope someone will help me understand

P.S. Please ignore my boot drive /dev/sda, I know it has pending sectors. This topic is about my main storage pool with HDDs, not my ssd boot pool

suhu · July 30, 2025, 3:12pm

What I got after the initial post:

MESSAGE 5, 4:28
TrueNAS @ truenas

New alert:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

The following alert has been cleared:

Pool mainpool state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED

MESSAGE 6, 4:32

TrueNAS @ truenas

New alert:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED

The following alert has been cleared:

Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

↑ Those are two different drives

suhu · July 30, 2025, 3:17pm

I’ve made a major update to the initial post with a bunch of new information, check it again please. I still find this behavior very strange, so I hope someone could provide more information about what actually happened here

PaulDaisy · July 30, 2025, 3:35pm

Could it be a bad cable?

PK1048 · July 30, 2025, 3:45pm

This feels to me like a SATA port multiplier issue, but … see above.

I am dubious about simultaneous cable failures.

What is common between the 2 drives that reported issues? Are they in one cage and the other drives are in a different cage? Are they adjacent to each other? Mechanical vibrations may cause drive errors.

suhu · July 30, 2025, 4:07pm

The failed Seagate was purchased on October 12th, 2024, and had audible noises and read/write errors before failure

The WD that was previously marked faulted and is now working fine has 4000 h on it (powered up since the day of purchase, so about half a year)

suhu · July 30, 2025, 4:12pm

SMART data from about 20 hours ago

WD

admin@truenas[~]$ sudo smartctl -a /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (CMR)
Device Model:     WDC WD40EFPX-68C6CN0
Serial Number:    WD-[redacted]
LU WWN Device Id: 5 0014ee 26ba6a6d3
Firmware Version: 81.00A81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5787
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Wed Jul 30 04:48:43 2025 +07
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(40680) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 424) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3039)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   212   207   021    Pre-fail  Always       -       2358
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4126
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1724
194 Temperature_Celsius     0x0022   114   110   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4076         -
# 2  Extended offline    Completed without error       00%      4061         -
# 3  Short offline       Completed without error       00%      4028         -
# 4  Short offline       Completed without error       00%      4004         -
# 5  Short offline       Completed without error       00%      3980         -
# 6  Short offline       Completed without error       00%      3956         -
# 7  Short offline       Completed without error       00%      3932         -
# 8  Short offline       Completed without error       00%      3908         -
# 9  Extended offline    Completed without error       00%      3893         -
#10  Short offline       Completed without error       00%      3860         -
#11  Short offline       Completed without error       00%      3836         -
#12  Short offline       Completed without error       00%      3812         -
#13  Short offline       Completed without error       00%      3788         -
#14  Short offline       Completed without error       00%      3764         -
#15  Short offline       Completed without error       00%      3740         -
#16  Extended offline    Completed without error       00%      3729         -
#17  Short offline       Completed without error       00%      3693         -
#18  Short offline       Completed without error       00%      3669         -
#19  Short offline       Completed without error       00%      3645         -
#20  Short offline       Completed without error       00%      3623         -
#21  Short offline       Completed without error       00%      3599         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Failed Seagate (brief moment where it was available)

admin@truenas[~]$ sudo smartctl -a /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN006-3CW104
Serial Number:    [redacted]
LU WWN Device Id: 5 000c50 0f6c2fa86
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5787
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Wed Jul 30 05:15:00 2025 +07
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 476) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x70bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   006    Pre-fail  Always       -       204714872
  3 Spin_Up_Time            0x0003   098   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   045    Pre-fail  Always       -       233233554
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6963
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       64
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   061   040    Old_age   Always       -       33 (Min/Max 33/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       77
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12022
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   083   064   000    Old_age   Always       -       204714872
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       3613h+24m+19.236s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       53959195008
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       112435003623

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      00%      6960         -
# 2  Short offline       Interrupted (host reset)      00%      6936         -
# 3  Short offline       Completed without error       00%      6912         -
# 4  Extended offline    Completed without error       00%      6898         -
# 5  Short offline       Completed without error       00%      6864         -
# 6  Short offline       Completed without error       00%      6840         -
# 7  Short offline       Completed without error       00%      6816         -
# 8  Short offline       Completed without error       00%      6792         -
# 9  Short offline       Completed without error       00%      6768         -
#10  Short offline       Completed without error       00%      6744         -
#11  Extended offline    Completed without error       00%      6730         -
#12  Short offline       Completed without error       00%      6696         -
#13  Short offline       Completed without error       00%      6672         -
#14  Short offline       Completed without error       00%      6648         -
#15  Short offline       Completed without error       00%      6624         -
#16  Short offline       Completed without error       00%      6600         -
#17  Short offline       Completed without error       00%      6576         -
#18  Extended offline    Completed without error       00%      6566         -
#19  Short offline       Completed without error       00%      6530         -
#20  Short offline       Completed without error       00%      6506         -
#21  Short offline       Completed without error       00%      6482         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
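One detail visible in both outputs above is the line `SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)`: both drives report a link running below the speed they support. A lower negotiated speed can have several causes (port capability, cabling, or the kernel lowering the speed after link errors), so treat it as something to check rather than a diagnosis. A minimal sketch for spotting it in saved `smartctl -a` text follows; the function names are hypothetical, and the regular expression is only an assumption based on the line format shown in this thread.

```python
import re

# Matches smartctl's SATA line as it appears in this thread, e.g.
# "SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)"
_SATA_LINE = re.compile(
    r"SATA Version is:\s+.*?,\s*([\d.]+) Gb/s \(current:\s*([\d.]+) Gb/s\)"
)


def link_speeds(smartctl_text):
    """Return (supported_gbps, current_gbps), or None if the line is absent."""
    match = _SATA_LINE.search(smartctl_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def link_is_downgraded(smartctl_text):
    """True when the negotiated speed is below what the drive supports."""
    speeds = link_speeds(smartctl_text)
    return speeds is not None and speeds[1] < speeds[0]
```

Feeding it the text of each `smartctl -a /dev/sdX` run makes it easy to compare all five drives and see whether only the two problem drives are affected.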

suhu · July 30, 2025, 4:21pm

Well, it may be. But it doesn’t explain why the Seagate drive that actually failed became «ONLINE» and was cleared of all errors after being marked as FAILED

I recorded a video with the noises from the failed Seagate drive (but I think it has failed based only on those noises; SMART says it’s fine for some reason…)

Protopia · July 30, 2025, 4:25pm

Multiple drive issues for no reason sounds like a non-drive hardware problem to me. Possible causes are:

- Bad memory or memory card needs reseating: reseat the memory cards and run a memory test
- Overheating or faulty SATA controller: if you have an HBA, check the temps and add active cooling; run motherboard diagnostics.

But a noisy drive which is also disappearing sounds like a failed disk (even though it is less than 1 year old).

Get another 4TB NAS drive and swap out the Seagate which is making noises for later testing in a different system and / or sending it back under warranty.

joeschmuck · July 30, 2025, 5:03pm

That is fairly noisy and if it is making that kind of sound, I’d replace it. It should be under warranty.

With respect to the Seagate drive you posted the SMART data for, you have a lot of head parks/loads for the number of hours on the drive. You also have a few ECC corrections.

For the WD drive that you posted SMART data for, I have a similar observation: quite a few load cycles for the number of hours. Not nearly as many as the Seagate drive, but it is up there.
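To put numbers on that observation, the load cycle rate can be worked out from the two SMART reports posted earlier (attribute 193 Load_Cycle_Count divided by attribute 9 Power_On_Hours). This is only a back-of-the-envelope sketch; the helper name is made up for illustration.

```python
def load_cycles_per_hour(load_cycle_count, power_on_hours):
    """Average head load/unload cycles per power-on hour."""
    return load_cycle_count / power_on_hours


# (Load_Cycle_Count, Power_On_Hours) copied from the smartctl output above.
drives = {
    "Seagate ST4000VN006": (12022, 6963),
    "WD WD40EFPX": (1724, 4126),
}

for name, (cycles, hours) in drives.items():
    rate = load_cycles_per_hour(cycles, hours)
    print(f"{name}: {rate:.2f} load cycles per power-on hour")
```

That works out to roughly 1.73 cycles per hour for the Seagate and 0.42 for the WD, so the Seagate parks its heads about four times as often.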

Are you sleeping the drives? What are your power settings for the drives?

With all this said, I also think it is not just a drive failure. Sure, the noisy drive is suspected of failing soon, but did it contribute to this problem? I don’t think so.

My suspicions are Power Supply, then MB (CPU/RAM/MB). You also never mentioned which version of TrueNAS you are running nor your system hardware other than the MB and a few drives.

Check the fans in your system, ensure they are all spinning properly. And with all that dust in the photo, hit it with some compressed air! Dust causes electrical problems and of course cooling problems.

My recommendations (process of elimination):

Backup any data you must retain.
Run Memtest86+ on the system and get 5 (five) Complete passes.
Run Prime95 (or similar) for 2 hours.
The first two steps will help identify if the system is stable. Just because it was stable when you built it, does not mean it still is.
Ensure you are tracking the drives by serial number or it will likely bite you.
Run a Scrub, post the results.
Fix the sda drive issue.
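On tracking drives by serial number: `zpool status` in this thread lists pool members by what look like partition UUIDs, not by model or serial. One way to tie the two together is `lsblk` with JSON output. The sketch below assumes the pool member names are partition UUIDs and that `lsblk -J -o NAME,SERIAL,PARTUUID` nests partitions under a `children` key of each disk; the function names are hypothetical.

```python
import json
import subprocess


def partuuid_to_disk(lsblk_json):
    """Map each partition's PARTUUID to (disk name, disk serial)."""
    mapping = {}
    for disk in json.loads(lsblk_json).get("blockdevices", []):
        for part in disk.get("children", []):
            if part.get("partuuid"):
                mapping[part["partuuid"]] = (disk["name"], disk.get("serial"))
    return mapping


def read_lsblk():
    """Run lsblk and return its JSON text (needs a Linux host with lsblk)."""
    result = subprocess.run(
        ["lsblk", "-J", "-o", "NAME,SERIAL,PARTUUID"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `partuuid_to_disk(read_lsblk())` on the NAS and looking up the identifier of the OFFLINE or FAULTED member gives the serial to match against the label on the physical drive before pulling anything.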

Please understand that none of us know you nor what you do or do not know and/or understand. I may ask some things that sound stupid or obvious to you, however when dealing with things like this, assumptions cause problems. I will not assume you know anything, it is safer this way. You should do the same thing, do not assume we know what you are talking about. Be descriptive. It is okay to add a few extra words to explain something.

Also, were the drives “new” or “refurbished”? Just curious.

As for mounting the SSD, all you need to do is drill two small holes into the bottom of the case, diagonally opposite each other, and mount the drive to the bottom of the case. Or use a side wall. If you already have a few holes to take advantage of, use them. Always remove the electronics from the case if you are going to drill, then ensure you clean it out very well.

Cheers

suhu · July 31, 2025, 11:36am

No. HDD Standby = Always on, Advanced Power Management = Disabled (on all drives)

My full specs:
PSU Corsair TX650M, bought brand new, about 1.5 years of usage total
(only 1 8-pin EPS connector is connected)
MB Supermicro X9DRL-iF, old ~2012 motherboard, bought it used
CPU Intel Xeon E5-2667v2 (1 cpu), used
MEMORY 2×32GB Samsung m386b4g70dm0-cma3 (ECC LRDIMM), used
(Ran a full memory test for 50 hours or so about a year ago) (No errors during testing or use)

I’m running TrueNAS SCALE 24.10.2.2

RAIDZ1, all drives bought brand new from reputable stores
3×4 TB WD Red Plus
2×4 TB Seagate IronWolf

(I took this screenshot when Seagate started making noises)

Now I’ve just reconnected my “Failed” Seagate drive and there are no weird noises or errors. Screenshot I took just now:

All 5 are spinning. 2 intake, 1 exhaust behind cpu exhaust and 2 more exhaust on top

Those are my temperatures:

I have specific sata SSD mounts in my case, I just don’t have spare power cable for it. I’ve already ordered one so I’ll mount it properly next week

I’m still backing up my data, so I haven’t run a scrub or any other tests yet. I’ll post an update when I get results

Thanks for the detailed reply!

joeschmuck · July 31, 2025, 1:25pm

@suhu
Good to get that data backed up, sooner the better.

I read that earlier but thought it may have been a typo since the drive is plugged in. Sounds good, you are all over that.

The MB temps look okay, not great in my opinion. I’d like to see the RAM and PCH temps a little lower but maybe that is normal for that board.

As I stated earlier (I think I did), even if the system tested good before, you should rerun the RAM test and the CPU stress test. Components fail, they rarely just keep on working like Voyager 1 and Voyager 2, but those do have the benefit of extreme cooling due to outer space. Your system may be stable, but you should verify it when you can. It is one of those easy things to eliminate as the cause. With the problems you have, you need to eliminate as much as possible. Even if you replace, say, the power supply, rerun the stability checks.

Do not trust that drive to last very long, I’m sure you do not either. Did you have to run zpool clear on that pool to clear the CKSUM errors or did they clear on their own? If they cleared on their own, that is bizarre.

Then that makes me wonder why the load cycles are so high. It was not due to power issues or you would have high power-off retract counts, which you do not. Very odd. Were some of those drives used in a different system before? I’m just trying to figure out this puzzle.

suhu · July 31, 2025, 3:32pm

That’s the thing - they got cleared on their own! I think it is very weird. Maybe there are some logs I can look at to find out why this happened?

No, I bought them specifically for this NAS and they run 24/7 (all power cycles in the SMART data are me restarting the server, no outages)
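On the question of logs: OpenZFS keeps its own records that may help here. `zpool events -v` lists recent error and state-change events, and `zpool history -il mainpool` shows commands and internal events recorded for the pool. As far as I know the event list is held in memory and is lost on reboot, so it is worth capturing soon. The sketch below tallies event classes from plain `zpool events` output; it assumes each data row ends with the event class and that the header row starts with `TIME`, and the helper names are hypothetical.

```python
import subprocess
from collections import Counter


def count_event_classes(zpool_events_text):
    """Count event classes in plain `zpool events` output."""
    counts = Counter()
    for line in zpool_events_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "TIME ... CLASS" header row.
        if not fields or fields[0] == "TIME":
            continue
        counts[fields[-1]] += 1
    return counts


def run(cmd):
    """Run a command and return its stdout (likely needs root for zpool)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `count_event_classes(run(["zpool", "events"]))` gives a quick overview of what happened, and `run(["zpool", "history", "-il", "mainpool"])` can then be read around the 03:50 to 05:10 window. The kernel log (`journalctl -k`) for the same window may show whether the disks reset or dropped off the bus.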

I don’t have a UPS, but I have very stable electricity here (not a single outage in 1.5 years). And my PSU seems pretty good to me

Absolutely, I will get to it once my backup is done. Thanks!

P.S. About backups. Backing up 10 TB isn’t cheap or fast. For now, I’ve rented a dedicated server for €42/month (with hourly billing, so I can cancel anytime). It’s an older machine, but it has 4×10 TB drives and was available instantly.
I’m still trying to figure out a reliable long-term backup solution. So far, only USB HDDs come to mind, but they’re slow and inconvenient. If you have any suggestions, I’d be glad to hear them. Building a second machine is too expensive for me right now. Maybe there are some more affordable options?