I got a series of weird alert messages from TrueNAS:
MESSAGE 1, 03:50
New alerts:
-
Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:- Disk 14009742386182016449 is FAULTED
MESSAGE 2, 03:53
New alert:
- Pool mainpool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The following alert has been cleared:
-
Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:- Disk 14009742386182016449 is FAULTED
MESSAGE 3, 03:54
New alert:
-
Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED
The following alert has been cleared:
- Pool mainpool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
MESSAGE 4, 03:56
The following alert has been cleared:
-
Pool mainpool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED
MESSAGE 5, 04:28
New alert:
- Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following alert has been cleared:
-
Pool mainpool state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED
(// This WD drive is fine now and works OK without any errors)
- Disk WDC_WD40EFPX-68C6CN0 WD-[redacted] is FAULTED
MESSAGE 6, 04:32
New alert:
-
Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED
(// THIS IS A DIFFERENT DRIVE, a Seagate, and the one that actually failed. It had about 1000 errors and was making grinding noises)
- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED
The following alert has been cleared:
- Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
MESSAGE 7, 05:10
New alert:
- Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following alert has been cleared:
-
Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED
(// This failed Seagate drive somehow had all its errors cleared and came back online, even though I could still hear strange scratching and grinding noises. I offlined and disconnected this drive)
- Disk ST4000VN006-3CW104 WW62ZGQ7 is FAULTED
Current alerts:
- Pool mainpool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
zpool status (before Seagate failure):
admin@truenas[~]$ sudo zpool status
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:01:49 with 0 errors on Tue Jul 29 03:46:51 2025
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sda3 ONLINE 0 0 0
errors: No known data errors
pool: mainpool
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Jul 30 03:56:27 2025
6.84T / 13.4T scanned at 10.2G/s, 0B / 7.14T issued
0B resilvered, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
mainpool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
bff12767-9075-47ae-a5ce-5735cc528250 ONLINE 0 0 0
6798ec85-c57a-49f9-9e25-d7de4e7ebf8d ONLINE 0 0 0
f05c0713-0581-49c6-b99b-cd60b5941f43 ONLINE 0 0 0
a7672c77-b913-4921-a4dd-21d8932394ec ONLINE 0 0 0
d3a8290d-1a24-4c53-849e-6e144bd782ba ONLINE 0 0 0
zpool status now (with the Seagate offline and the WD, which was marked as faulted several times, working fine):
admin@truenas[~]$ sudo zpool status
[sudo] password for admin:
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:01:49 with 0 errors on Tue Jul 29 03:46:51 2025
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sda3 ONLINE 0 0 0
errors: No known data errors
pool: mainpool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 229M in 00:00:41 with 0 errors on Wed Jul 30 05:11:11 2025
config:
NAME STATE READ WRITE CKSUM
mainpool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
bff12767-9075-47ae-a5ce-5735cc528250 ONLINE 0 0 0
6798ec85-c57a-49f9-9e25-d7de4e7ebf8d ONLINE 0 0 0
f05c0713-0581-49c6-b99b-cd60b5941f43 ONLINE 0 0 0
a7672c77-b913-4921-a4dd-21d8932394ec ONLINE 0 0 0
d3a8290d-1a24-4c53-849e-6e144bd782ba OFFLINE 0 0 0
errors: No known data errors
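For anyone comparing the two outputs above, here is a minimal sketch of pulling out only the rows that are not ONLINE. The helper name `unhealthy_rows` is made up for illustration; it only filters text on stdin, so it can be tried on saved `zpool status` output as well as on a live pool.

```shell
# Hypothetical helper (not part of ZFS or TrueNAS): print "NAME STATE" for
# every config-table row whose STATE is not ONLINE. This includes the pool
# and raidz rows themselves, not only the individual disks.
unhealthy_rows() {
  # A config row has at least 5 fields, and its READ/WRITE/CKSUM columns
  # start with a digit (large counts may be abbreviated, e.g. 1.11K).
  awk 'NF >= 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ && $2 != "ONLINE" { print $1, $2 }'
}

# Usage on the NAS (assumes the pool name from this post):
#   sudo zpool status mainpool | unhealthy_rows
```

On the second output above this should list mainpool, raidz1-0 and the d3a8290d-... member, which is the drive that was taken offline.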
I have a RAIDZ1 pool of 5×4TB drives: a couple are IronWolfs and the others are WD Red Plus.
The motherboard is a Supermicro X9DRL-iF, and the drives are connected directly to it.
UPDATE: (copied from my recent comment)
I have automatic SMART tests enabled: a SHORT test daily and a LONG test weekly. But both problematic drives (the WD that was flagged as FAULTED and the Seagate that actually FAILED) show a perfect SMART status, and all drives are less than a year old. There are no SMART errors at all.
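A passing overall SMART status does not by itself show what the raw counters say, so here is a small sketch for looking at a few of them directly. The helper name `smart_key_attrs` and the choice of these four attributes are assumptions for illustration, and `/dev/sdX` is a placeholder, not one of the real device names. It only filters the attribute table printed by `smartctl -A`.

```shell
# Hypothetical helper: from `smartctl -A` output on stdin, print the name and
# raw value of a few attributes often checked for media or cabling problems.
smart_key_attrs() {
  # In the attribute table, field 2 is the attribute name and field 10 is
  # the start of the RAW_VALUE column.
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ { print $2, $10 }'
}

# Usage on the NAS (replace /dev/sdX with the actual disk):
#   sudo smartctl -A /dev/sdX | smart_key_attrs
```

Not every drive reports all four attributes, so missing lines in the output are not necessarily a problem.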
But the situation is very strange. As I mentioned in the first message, one of the drives (the WD) was marked as FAULTED, then everything was fine, then it was FAULTED again, then fine again, all within the span of 10 minutes. After that, I started hearing strange noises, and TrueNAS began reporting hundreds of errors on a disk. But this was a different disk: the Seagate one. There were over 1000 errors, and the disk status changed to FAILED. About 20 minutes later, TrueNAS cleared all the errors and the disk returned to OK status. I have no idea why that happened, especially since the disk was still making strange sounds and grinding noises. I manually offlined the Seagate drive and started making a fresh backup.
Why did the WD disk status flip between FAULTED and OK, and why does this drive now just work? Why did the Seagate disk go from FAILED back to HEALTHY with all errors cleared (from 1138 to 0) despite obvious problems? This is very strange behavior, so I hope someone can help me understand it.
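One way to get more detail on the timeline than the alert emails give is the ZFS event log. Below is a rough sketch that tallies events by class. The helper name `count_event_classes` is made up, and it assumes the default `zpool events` layout, where the event class is the last field on each line.

```shell
# Hypothetical helper: count ZFS events per class from `zpool events` output
# on stdin, most frequent first.
count_event_classes() {
  # Skip the "TIME CLASS" header line; the class is the last field of each row.
  awk '$NF != "CLASS" && NF > 0 { count[$NF]++ } END { for (c in count) print count[c], c }' | sort -rn
}

# Usage on the NAS:
#   sudo zpool events | count_event_classes
# For full details of each event (device, error type, and so on):
#   sudo zpool events -v
```

Lining up the event timestamps with the kernel log (for example `dmesg -T`) may show whether both drives dropped out because of the same link or controller resets, but that is only a guess until the logs are checked.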
P.S. Please ignore my boot SSD (/dev/sda); I know it has pending sectors. This topic is about my main storage pool with HDDs.
UPDATE 2: Photo of my drives
UPDATE 3: smartctl -a for both drives