Endless resilvering then crash on TrueNAS CORE 13.0-U6.2

Hi,

My TrueNAS CORE server just started having issues after 4 years of being rock solid. I have 7 mirror vdevs of 16TB drives. Recently it started resilvering 3 of the vdevs out of nowhere. At around 85% completion it crashes and I get this message:

plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.

Then I force a reboot, and after about 1-2 hours it happens again. The resilvering starts at around 79% every time. Here is the output of zpool status -v right after a reboot:

FreeBSD 13.1-RELEASE-p9 n245431-b8ec9bde091 TRUENAS
root@truenas[~]# zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul 22 02:58:15 2024
        29.2T scanned at 40.1G/s, 29.0T issued at 40.0G/s, 36.6T total
        3.32G resilvered, 79.37% done, 00:03:13 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/fff331a3-ceaf-11ee-a4b5-902b3450e752  ONLINE       0     0     0
            da11                                        ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/7fc1acb0-e95a-11ed-b0f5-902b3450e752  ONLINE       0     0     0  (resilvering)
            gptid/0dd96bd8-e77d-11ed-ab34-902b3450e752  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/43426026-e77d-11ed-ab34-902b3450e752  ONLINE       0     0     0
            gptid/d2dab17e-ec2d-11ed-ade9-902b3450e752  ONLINE       0     0     0  (awaiting resilver)
          mirror-3                                      ONLINE       0     0     0
            gptid/6fe762fb-edc0-11ed-8ff9-902b3450e752  ONLINE       0     0     0
            gptid/6e721b6e-e77d-11ed-ab34-902b3450e752  ONLINE       0     0     0
          mirror-4                                      ONLINE       0     0     0
            gptid/a68f46df-edc0-11ed-8ff9-902b3450e752  ONLINE       0     0     0
            gptid/93162bf0-e77d-11ed-ab34-902b3450e752  ONLINE       0     0     0
          mirror-5                                      ONLINE       0     0     0
            gptid/bf33b99d-e95a-11ed-b0f5-902b3450e752  ONLINE       0     0     0  (resilvering)
            gptid/8e36489a-e6f3-11ed-8d23-902b3450e752  ONLINE       0     0     0
          mirror-6                                      ONLINE       0     0     0
            gptid/66699f5f-eb6b-11ed-a583-902b3450e752  ONLINE       0     0     0
            gptid/006fbaac-ec2e-11ed-ade9-902b3450e752  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x1344>:<0x1c8da5>

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:51 with 0 errors on Mon Jul 22 03:45:56 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

Any pointers? Thank you.
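
For anyone reading along, the scan lines in that output can be sanity-checked with a little arithmetic. The following is only a rough sketch and not something from the original post: it copies a few lines from the Data pool status into a string, lists the mirror members flagged as resilvering, and recomputes the percentage and time remaining from the issued and total figures. The helper names (`to_bytes`, `resilver_members`, `progress`) and the unit table are assumptions made for illustration, and the printed figures are rounded, so the results only approximately match what zpool reports.

```python
import re

# A few lines copied from the `zpool status -v` output for the Data pool.
STATUS = """\
        29.2T scanned at 40.1G/s, 29.0T issued at 40.0G/s, 36.6T total
        3.32G resilvered, 79.37% done, 00:03:13 to go
            gptid/7fc1acb0-e95a-11ed-b0f5-902b3450e752  ONLINE       0     0     0  (resilvering)
            gptid/0dd96bd8-e77d-11ed-ab34-902b3450e752  ONLINE       0     0     0
            gptid/d2dab17e-ec2d-11ed-ade9-902b3450e752  ONLINE       0     0     0  (awaiting resilver)
            gptid/bf33b99d-e95a-11ed-b0f5-902b3450e752  ONLINE       0     0     0  (resilvering)
"""

# Assumption: zpool prints sizes with binary (1024-based) unit suffixes.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def to_bytes(text):
    """Convert a size such as '29.0T' into a number of bytes."""
    match = re.fullmatch(r"([\d.]+)([BKMGTP])", text)
    return float(match.group(1)) * UNITS[match.group(2)]


def resilver_members(status):
    """Return (device, note) pairs for lines ending in a resilver note."""
    pattern = re.compile(
        r"\s+(\S+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\((resilvering|awaiting resilver)\)"
    )
    found = []
    for line in status.splitlines():
        match = pattern.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def progress(status):
    """Recompute percent issued and seconds remaining from the scan line."""
    match = re.search(
        r"([\d.]+[BKMGTP]) issued at ([\d.]+[BKMGTP])/s, ([\d.]+[BKMGTP]) total",
        status,
    )
    issued, rate, total = (to_bytes(group) for group in match.groups())
    return 100 * issued / total, (total - issued) / rate


for device, note in resilver_members(STATUS):
    print(device, note)
percent, seconds_left = progress(STATUS)
print(f"{percent:.1f}% issued, about {seconds_left:.0f} s left")
```

The recomputed values come out at roughly 79.2% and a bit over three minutes, which is consistent with the "79.37% done, 00:03:13 to go" line. Note that 40 G/s is far more than 14 hard drives can physically read, so that rate should probably not be taken as real disk throughput.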

How are drives attached? How full are pools?

You should give us a full listing of hardware and pool setup. It’s not clear.

Hi @SmallBarky, thanks for your quick reply. Here are the details:

  • PSU: EVGA Supernova 750 G+
  • Motherboard: Gigabyte GA-7PESH2
  • HP 24 Bay 3GB SAS Expander Card
  • 128GB RAM
  • 14 x 16TB Seagate Ironwolf
  • 1 x 128GB SSD (boot-pool)

Drives are connected to the SAS expander card, with SFF-8087 to SFF-8482 cables. The ‘Data’ pool is only 34% full.

Is that a plain HBA card and not RAID? The only thing I can think of is checking the card, cables and chassis / backplane.

Are all fans working well? I'm thinking something might be overheating.
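
One software-side way to look for cabling trouble on SATA drives is SMART attribute 199 (UDMA_CRC_Error_Count), which typically climbs when the link between drive and controller is flaky. The sketch below is only an illustration under several assumptions: that `smartctl` is available and can reach the drives through the HBA, that the drives report attribute 199 with a plain integer raw value, and that the table layout matches the usual `smartctl -A` format. The function names and the sample text are made up for the example; the sample is not data from this system.

```python
import subprocess


def crc_error_count(smart_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def read_attributes(device):
    """Run `smartctl -A` on a device such as /dev/da11 (needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return result.stdout


# Made-up sample in the usual `smartctl -A` table layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9000
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

print(crc_error_count(SAMPLE))
```

On the real system one would call `crc_error_count(read_attributes("/dev/da11"))` for each disk and compare the counts before and after reseating cables. A counter that keeps increasing would point at the cable, backplane or expander rather than the drive itself.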

It’s a plain HBA card. I just checked every cable and replugged everything; the fans are at 100% at all times and temps are normal. I was able to get it to stay up for a few hours, but the resilvering keeps restarting when it reaches about 87%. I’m wondering if the SAS card could be giving up?
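
To pin down exactly when the resilver restarts, the "% done" figure can be logged over time. This is a minimal sketch, not something used in the thread: `percent_done`, `find_restarts` and `watch` are hypothetical helper names, the 0.5 point tolerance and 60 second interval are arbitrary choices, and `watch` simply shells out to `zpool status` for the pool named Data.

```python
import re
import subprocess
import time


def percent_done(status):
    """Extract the '% done' figure from zpool status text, or None."""
    match = re.search(r"([\d.]+)% done", status)
    return float(match.group(1)) if match else None


def find_restarts(samples, tolerance=0.5):
    """Return indexes where progress fell by more than `tolerance` points."""
    return [
        i for i in range(1, len(samples))
        if samples[i] < samples[i - 1] - tolerance
    ]


def watch(pool="Data", interval=60):
    """Poll `zpool status` forever and report whenever progress goes backwards."""
    last = None
    while True:
        result = subprocess.run(
            ["zpool", "status", pool], capture_output=True, text=True, check=False
        )
        now = percent_done(result.stdout)
        if now is not None:
            if last is not None and now < last - 0.5:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                print(f"{stamp} resilver restarted: {last}% -> {now}%")
            last = now
        time.sleep(interval)
```

Running `watch()` in a tmux or screen session and comparing its timestamps with the system log around each restart could help show whether the restarts line up with controller or disk errors.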

It ended up finishing the resilver and seems stable now. So replugging everything might have helped, but I feel like I could upgrade the hardware soon. Thanks for your help!