Hi all!
The problem is as follows: two disks on the server suddenly switched to the DEGRADED state, and automatic replacement with the two spare disks kicked in.
The resilver (scanning) process started.
Since the disk array is large (22 disks of 20 TB each), the estimated recovery time was 10 days. After 8 days a large number of errors had appeared. All the while the storage was being monitored and the data (I did not check everything) was still readable. A short while later the system froze completely; I had to restart it, and the resilver started over from the beginning, though the data remained intact. A couple of days later errors appeared again, and now almost all the disks are shown as degraded.
Is there any point in waiting for it to finish? What if everything freezes again…
root@shd[~]# zpool status
  pool: SHD
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 15 08:18:59 2025
        109T scanned at 1.71M/s, 105T issued at 1.65M/s, 301T total
        11.2T resilvered, 34.80% done, no estimated completion time
config:

        NAME                                              STATE     READ WRITE CKSUM
        SHD                                               DEGRADED     0     0     0
          raidz3-0                                        DEGRADED     0     0     0
            gptid/e6908c2d-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.57K  too many errors
            gptid/e6326f77-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.54K  too many errors
            gptid/e6a0e61f-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.53K  too many errors
            gptid/5edc97a7-036e-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.49K  too many errors
            gptid/e63f5946-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.50K  too many errors
            gptid/e648a880-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.49K  too many errors
            gptid/e644b932-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.54K  too many errors
            gptid/e64d6b0b-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.52K  too many errors
            gptid/e5bc82af-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.58K  too many errors
            gptid/24f2167e-1007-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.56K  too many errors
            gptid/e6acd1bc-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.69K  too many errors
            gptid/e655ecf7-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.68K  too many errors
            gptid/e6584567-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.91K  too many errors
            gptid/e68a209b-0040-11ef-9262-3cecef93ed84    ONLINE       0     0 1.93K  (resilvering)
            gptid/e6a655aa-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.91K  too many errors
            spare-15                                      ONLINE       0     0 1.92K
              gptid/1f2e4ff2-10d1-11ef-9262-3cecef93ed84  ONLINE       0     0     0  (resilvering)
              ada9p2                                      ONLINE       0     0     0  (resilvering)
            gptid/e8141e7a-0040-11ef-9262-3cecef93ed84    DEGRADED 4.92K     0 1.91K  too many errors
            gptid/e84ac734-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.91K  too many errors
            spare-18                                      ONLINE       0     0 1.94K
              gptid/a778750f-11b5-11ef-9262-3cecef93ed84  ONLINE       0     0     0  (resilvering)
              ada20p2                                     ONLINE       0     0     0  (resilvering)
            gptid/e83e3def-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.80K  too many errors
            gptid/e83e8140-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.69K  too many errors
            gptid/e8363e55-0040-11ef-9262-3cecef93ed84    DEGRADED     0     0 1.62K  too many errors
        spares
          gptid/20108e05-0df0-11f0-9262-3cecef93ed84      INUSE     currently in use
          gptid/201bb553-0df0-11f0-9262-3cecef93ed84      INUSE     currently in use

errors: 14350774 data errors, use '-v' for a list

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Fri Apr 18 03:45:06 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvd0p2     ONLINE       0     0     0
Hi and welcome to the forums.
Really sorry to hear about your issues. Clearly something more than plain drive failures has happened here, but it is hard to guess what at the moment without system specs.
Could you describe your hardware, in particular your chassis and HBA?
Yes, of course!
TrueNAS-13.0-U6.7
Motherboard: Supermicro X11SSH-LN4F
CPU: Intel(R) Pentium(R) G4400 @ 3.30 GHz
RAM: 4x 16 GB ECC DIMM
Pool: 22-disk RAIDZ3, TOSHIBA MG10ACA20TE 20 TB SATA
Controller: ASMedia ASM1064 SATA
Case: ExeGate Pro 4U660-HS24
A 22-disk RAIDZ3 vdev is too wide. And there's no (good) way all those disks are attached to a single ASM1064, so some parts of the description are still missing.
14M data errors… Do you have a backup?
No, there are no backups. There is too much data… (230 TiB). The pool writes a video archive from 140 cameras 24/7; surprisingly, it worked for a year without problems or interruptions.
I found that ada17 has many errors according to SMART, so I replaced it; the rebuild now goes faster and no new errors have appeared so far. Does it make sense to wait for the recovery to finish? What are the possible scenarios in this case? Maybe it would be easier to replace the controller and rebuild the array, also replacing the problematic disks? Which controller can you recommend? I have also thought about replacing the CPU…
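For reference, this is roughly how I checked the SMART data on the suspect disk (the device name is from my system; it will differ on other setups):

root@shd[~]# smartctl -a /dev/ada17
root@shd[~]# smartctl -a /dev/ada17 | grep -i -E 'reallocated|pending|uncorrectable|crc'

The first command prints the full health report; the second just pulls out the error counters I was watching.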
Controller: A SAS HBA; given the number of drives, either a 9305-24i or (rather) a 9207/9300-8i and an expander (e.g. AEC-82885T).
Hopefully, taking out the poor SATA controller(s) will prevent further errors, but it will not cure the errors which have already crept into the pool, so you will lose some data; and if errors have crept into the ZFS metadata, you may need to destroy the pool and rebuild it to clear them.
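Once the resilver settles down, you can get an idea of how much was hit with something like this (pool name taken from your output; treat it as a sketch):

root@shd[~]# zpool status -v SHD

That prints the list of files with permanent errors, i.e. data the resilver could not repair; anything on that list would have to come from a backup or be re-recorded.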
That is a failure mode I have seen: one SATA drive fails, and many (if not all) of the drives that share the SATA port multiplier are reported as bad. You have found the truly bad drive; now you should be able to replace it and let the resilver fix what it can.
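From the shell, the replacement would look roughly like this (a sketch: the gptid is the member showing read errors in your output, and the target device name is only a placeholder for the new disk; on TrueNAS CORE the usual route is the pool status page in the web UI, which also takes care of partitioning):

root@shd[~]# zpool replace SHD gptid/e8141e7a-0040-11ef-9262-3cecef93ed84 /dev/ada22p2

Then let the resilver run to completion before touching anything else.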
I would take the advice from etorix and move to a good SAS HBA ASAP.