Hi all!
The problem is as follows: two disks on the server suddenly switched to the DEGRADED state, and automatic replacement with the two spare disks kicked in.
The resilver (scanning) process started.
Since the disk array is large (22 disks of 20 TB each), the estimated recovery time was 10 days. After 8 days a large number of errors had appeared. All the while the storage was being monitored and the data (I did not check everything) was still readable. A short while later the system froze completely; I had to restart it, and the resilver started over from the beginning, though the data remained intact. A couple of days later errors appeared again, and now almost all the disks are shown as degraded.
Is there any point in waiting for it to finish? What if everything freezes again…
root@shd[~]# zpool status
  pool: SHD
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 15 08:18:59 2025
        109T scanned at 1.71M/s, 105T issued at 1.65M/s, 301T total
        11.2T resilvered, 34.80% done, no estimated completion time
config:

        NAME                                              STATE     READ WRITE CKSUM
        SHD                                               DEGRADED     0     0     0
          raidz3-0                                        DEGRADED     0     0     0
            gptid/e6908c2d-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.57K  too many errors
            gptid/e6326f77-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.54K  too many errors
            gptid/e6a0e61f-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.53K  too many errors
            gptid/5edc97a7-036e-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.49K  too many errors
            gptid/e63f5946-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.50K  too many errors
            gptid/e648a880-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.49K  too many errors
            gptid/e644b932-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.54K  too many errors
            gptid/e64d6b0b-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.52K  too many errors
            gptid/e5bc82af-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.58K  too many errors
            gptid/24f2167e-1007-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.56K  too many errors
            gptid/e6acd1bc-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.69K  too many errors
            gptid/e655ecf7-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.68K  too many errors
            gptid/e6584567-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.91K  too many errors
            gptid/e68a209b-0040-11ef-9262-3cecef93ed84    ONLINE       0     0 1.93K  (resilvering)
            gptid/e6a655aa-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.91K  too many errors
            spare-15                                      ONLINE       0     0 1.92K
              gptid/1f2e4ff2-10d1-11ef-9262-3cecef93ed84  ONLINE       0     0     0  (resilvering)
              ada9p2                                      ONLINE       0     0     0  (resilvering)
            gptid/e8141e7a-0040-11ef-9262-3cecef93ed84    DEGRADED 4.92K     0 1.91K  too many errors
            gptid/e84ac734-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.91K  too many errors
            spare-18                                      ONLINE       0     0 1.94K
              gptid/a778750f-11b5-11ef-9262-3cecef93ed84  ONLINE       0     0     0  (resilvering)
              ada20p2                                     ONLINE       0     0     0  (resilvering)
            gptid/e83e3def-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.80K  too many errors
            gptid/e83e8140-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.69K  too many errors
            gptid/e8363e55-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.62K  too many errors
        spares
          gptid/20108e05-0df0-11f0-9262-3cecef93ed84      INUSE     currently in use
          gptid/201bb553-0df0-11f0-9262-3cecef93ed84      INUSE     currently in use

errors: 14350774 data errors, use '-v' for a list

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Fri Apr 18 03:45:06 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvd0p2     ONLINE       0     0     0
Hi and welcome to the forums.
Really sorry to hear about your issues. Clearly something more than plain drive failures has happened here, but it is hard to guess what at the moment without system specs.
Could you describe your hardware, in particular your chassis and HBA?
Yes, of course!
TrueNAS-13.0-U6.7
Motherboard: Supermicro X11SSH-LN4F
CPU: Intel(R) Pentium(R) G4400 @ 3.30 GHz
RAM: 4x 16 GB ECC DIMM
Pool: 22-disk RAIDZ3, TOSHIBA MG10ACA20TE 20 TB SATA
Controller: ASMedia ASM1064 SATA
Case: ExeGate Pro 4U660-HS24
A 22-disk RAIDZ3 vdev is too wide. And there's no (good) way all those disks are attached to a single ASM1064, so some parts of the description are still missing.
14M data errors… Do you have a backup?
No, there are no backups. There is too much data… (230 TiB). The pool writes a video archive from 140 cameras 24/7; surprisingly, it worked for a year without problems or interruptions.
I found that ada17 has many errors according to SMART, so I replaced it; the rebuild now goes faster and no new errors have appeared so far. Does it make sense to wait for the recovery to finish? What are the possible scenarios in this case? Maybe it would be easier to replace the controller and rebuild the array, also replacing the problematic disks? Which controller can you recommend? I have also thought about replacing the CPU…
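For reference, this is roughly how I checked the SMART data on the suspect disk (the device name is from my system; it will differ on other setups):

root@shd[~]# smartctl -a /dev/ada17
root@shd[~]# smartctl -a /dev/ada17 | grep -i -E 'reallocated|pending|uncorrectable|crc'

The first command prints the full health report; the second just pulls out the error counters I was watching.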
Controller: A SAS HBA; given the number of drives, either a 9305-24i or (rather) a 9207/9300-8i and an expander (e.g. AEC-82885T).
Hopefully, taking out the poor SATA controller(s) will prevent further errors, but it will not cure the errors which have already crept into the pool, so you will lose some data; and if errors have crept into the ZFS metadata, you may need to destroy the pool and rebuild it to clear them.
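Once the resilver settles down, you can get an idea of how much was hit with something like this (pool name taken from your output; treat it as a sketch):

root@shd[~]# zpool status -v SHD

That prints the list of files with permanent errors, i.e. data the resilver could not repair; anything on that list would have to come from a backup or be re-recorded.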
That is a failure mode I have seen: one SATA drive fails, and many (if not all) of the drives that share the SATA port multiplier are reported as bad. You have found the truly bad drive; now you should be able to replace it and let the resilver fix what it can.
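From the shell, the replacement would look roughly like this (a sketch: the gptid is the member showing read errors in your output, and the target device name is only a placeholder for the new disk; on TrueNAS CORE the usual route is the pool status page in the web UI, which also takes care of partitioning):

root@shd[~]# zpool replace SHD gptid/e8141e7a-0040-11ef-9262-3cecef93ed84 /dev/ada22p2

Then let the resilver run to completion before touching anything else.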
I would take the advice from etorix and move to a good SAS HBA ASAP.