Hi!
I have a system built in 2023 with a simple ZFS mirror pool of 2x14TB drives from Seagate. The pool is named rust. The two drives have serial numbers:
ZHZ3Q546. SMART log: smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local b - Pastebin.com
WAINR7DV. SMART log: smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local b - Pastebin.com
I do not think it is relevant, but the system also has 4 SSDs: one for boot, 2 in mirror for important data, and 1 for L2ARC of the rust pool.
Yesterday, I got a notification from TrueNAS that one of the drives failed:
New alerts:
* Pool rust state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
* Disk ST14000NE0008-2JK101 ZHZ3Q546 is FAULTED
After one minute:
New alert:
* Pool rust state is SUSPENDED: One or more devices are faulted in response to IO failures.
The following devices are not healthy:
* Disk ST14000NE0008-2JK101 ZHZ3Q546 is REMOVED
The following alert has been cleared:
* Pool rust state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
* Disk ST14000NE0008-2JK101 ZHZ3Q546 is FAULTED
At this point, just to be on the safe side in case it was a fluke, I got home and rebooted.
The system notified:
* Pool rust state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
These alerts have been cleared:
* Pool rust state is SUSPENDED: One or more devices are faulted in response to IO failures.
The following devices are not healthy:
* Disk ST14000NE0008-2JK101 ZHZ3Q546 is REMOVED
* Snapshot Task For Dataset "rust/data" failed: cannot open 'rust/data': pool I/O is currently suspended usage: snapshot [-r] [-o property=value] ... @ ... For the property list, run: zfs set|get For the delegated permission list, run: zfs allow|unallow For further help on a command or topic, run: zfs help []..
I think at this point I rebooted again, and now I got this:
New alert:
* Pool rust state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
* Disk ST14000NE0008-2RX103 WAINR7DV is DEGRADED
The following alert has been cleared:
* Pool rust state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
NOTICE THAT THE DEGRADED DISK WAS NOW THE OTHER ONE, NOT ZHZ3Q546!!! Very worrying… what might be happening? Did both drives die at the same time? They were sourced from 2 different batches.
The system started resilvering. It was SLOW!!! Even logging in via SSH or the web shell proved extremely difficult. Once I got in, the load average was very high, even though CPU usage was fine. The system seemed to be starved for I/O. The resilver itself was crawling: zpool status -v reported rates as low as 600KB/s, with an estimated completion in about 2 weeks:
  pool: rust
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Sep 14 17:40:32 2025
        5.42T / 6.20T scanned, 6.57G / 869G issued at 644K/s
        6.55G resilvered, 0.76% done, no estimated completion time
config:
        NAME                                      STATE     READ WRITE CKSUM
        rust                                      DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            74479132-8181-482e-82d0-1e823c43acb6  ONLINE       0     0     0  (resilvering)  <------ this is ZHZ3Q546
            57f25ff4-df8e-4958-b09e-8482aa339520  DEGRADED     0   475     0  too many errors  <------ this is WAINR7DV
        cache
          sdb1                                    ONLINE       0     0     0
errors: No known data errors
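For reference, the "2 weeks" figure can be roughly reproduced from the scan lines above. This is only a back-of-the-envelope sketch: it assumes the sizes use binary prefixes (1K = 1024 bytes) and that the 644K/s rate stays constant, which it almost certainly would not. The function names are made up for illustration.

```python
# Rough resilver ETA from the "issued" figures printed by zpool status.
# Assumption: K/M/G/T are binary prefixes and the rate stays constant.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a size string such as '6.57G' or '644K' to bytes."""
    size = size.strip()
    suffix = size[-1].upper()
    if suffix in UNITS:
        return float(size[:-1]) * UNITS[suffix]
    return float(size)  # plain number of bytes, no suffix


def resilver_eta_days(issued, total, rate_per_sec):
    """Days left to issue the remaining data at the given rate per second."""
    remaining = to_bytes(total) - to_bytes(issued)
    return remaining / to_bytes(rate_per_sec) / 86400


if __name__ == "__main__":
    # Values copied from the status output: 6.57G / 869G issued at 644K/s
    days = resilver_eta_days("6.57G", "869G", "644K")
    print(f"about {days:.1f} days remaining")
```

With the numbers from the output this comes to a bit over 16 days, in line with the two-week estimate.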
After that, I powered off the system, removed the WAINR7DV drive, and imported the pool from it on another system, where everything looked fine.
However, the TrueNAS system… is now in a reboot loop with no additional info I can find: https://youtube.com/shorts/szzX1YWKbTg?feature=share
I also ran a scrub on WAINR7DV to check everything, and it did not find any issue, nor did it generate any additional SMART errors.
What is the suggested course of action now? I would like to keep the system functional, even with a single drive in degraded state, while I source another drive. The data on this pool is not critical, but the rest of the system is, so I cannot afford to keep it offline.
Also, any idea what happened, and what mistakes I made in the recovery process? What should I have done?
Thanks in advance!
EDIT:
I decided to trust the first alert rather than what the SMART values seemed to indicate (1 1 1 in command_timeout), and placed WAINR7DV back in the system. The system boots, with the pool in a degraded state. I started a long SMART test. Will update.
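A note on the "1 1 1" value, in case it helps anyone reading the logs: as far as I understand, smartctl prints the Command_Timeout raw value as three 16-bit words taken from one 48-bit number, so the same reading can show up elsewhere as a single large integer. The sketch below only shows that conversion. What each word actually counts is vendor-specific and is not something I can confirm, and the function names are hypothetical.

```python
# Convert between a 48-bit SMART raw value and the three 16-bit words
# that smartctl shows for attributes printed in its "raw16" style.
# Assumption: Command_Timeout is displayed this way on these drives.


def split_raw16(raw_value):
    """Split a 48-bit raw value into three 16-bit words, high word first."""
    return [(raw_value >> shift) & 0xFFFF for shift in (32, 16, 0)]


def join_raw16(words):
    """Combine three 16-bit words (high word first) into one raw value."""
    high, mid, low = words
    return (high << 32) | (mid << 16) | low


if __name__ == "__main__":
    print(join_raw16([1, 1, 1]))     # the single-integer form of "1 1 1"
    print(split_raw16(4295032833))   # back to the three-word form
```

So a tool reporting 4295032833 for this attribute is showing the same thing as "1 1 1", not four billion timeouts.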