My pool appears to be in serious trouble

I have TrueNAS installed on a Minisforum MS-01, with an 8-bay QNAP TL-D800S JBOD attached. It hosts a RAIDZ2 pool of eight 6 TB disks. I do have backups of the data (which I only started taking 3 weeks ago, whew!).

A bit over a week ago, one of the drives started throwing errors, so I replaced it with a 14 TB disk. As soon as that resilver finished, two more disks started throwing errors (they were all from nearly the same manufacture date in 2016-17), so I replaced those, one at a time, with 14 TB disks. Then one of the 14 TB disks was failed out a day later, due to having 7 uncorrectable read errors, so this afternoon I replaced it with another 14 TB disk. The only other issue, this morning, was that one 6 TB drive had a single checksum error.
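
For reference, each swap used the standard ZFS replace flow (I actually drove it through the TrueNAS UI); roughly, from the shell, with placeholder device names:

        sudo zpool offline tns-qnap-2 <failing-disk-guid>                                # stop using the bad disk
        sudo zpool replace tns-qnap-2 <failing-disk-guid> /dev/disk/by-partuuid/<new-partuuid>
        sudo zpool status tns-qnap-2                                                     # watch the resilver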

Tonight, the resilver was nearly complete, then got stuck at 99.93% done. It sat that way for a long time, and then I got a bunch of errors saying that no S.M.A.R.T. tests could be run. I ran zpool status, and this is the mess it reported:

admin@tns-qnap-2[~]$ sudo zpool status -v
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Thu Sep 25 03:45:04 2025
config:

        NAME           STATE     READ WRITE CKSUM
        boot-pool      ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: tns-qnap-2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep 29 14:04:53 2025
        20.2T / 20.2T scanned, 20.1T / 20.2T issued at 661M/s
        2.46T resilvered, 99.93% done, 00:00:21 to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        tns-qnap-2                                  DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     4     0
            12e3e46e-49c6-426e-9e12-2c3134062479    ONLINE       0     0     0
            a92c8bc4-4b86-4c35-83b9-128f0ae22c58    ONLINE       0     0     0
            replacing-2                             DEGRADED     0     0     0
              ad2d653a-0b4a-4eb2-bd83-b920e2f6ee72  REMOVED      0     0     0
              052fe9c6-4cf0-4539-b99b-db58a1812528  ONLINE       0     0     0  (resilvering)
            5313d473-44bf-4590-b392-182fbfbff735    ONLINE       0     0     0
            00a8dbb3-c67c-4a74-b240-a1f55a188f0f    ONLINE       3     4     0
            15d57333-0b20-4b2a-851a-0b0d1a05939e    ONLINE       3     4     1
            937669cf-0dfd-4c6c-b5c5-211b1195ecc7    ONLINE       3     7     0
            315bfe10-63da-47dc-91a5-0268fdda7630    ONLINE       3     8     0

errors: List of errors unavailable: pool I/O is currently suspended
admin@tns-qnap-2[~]$

Those four erroring drives are all but one of the remaining 6 TB drives.

Is there any way to save this pool, or is it toast, meaning I’ll have to restore from backup (ZFS snapshots, replicated to a TrueNAS device dedicated to just storing backups) once all the bad drives are replaced?
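
If it comes to that, my understanding is the restore is just the replication run in reverse - roughly the following, with made-up host, pool, and snapshot names, though I’d drive it through the TrueNAS replication UI rather than by hand:

        ssh backup-nas sudo zfs send -R backup/tns-qnap-2@latest | sudo zfs recv -F tns-qnap-2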

I haven’t touched it since getting that status report. I’m currently out of spares, but I happen to have two more 14 TB drives in transit to me now, and I can order more.

What should I do now, and then once I have enough replacement drives in (assuming it’s even possible to save this)?

I’m not familiar with the QNAP HBA in use, or whether its “JBOD mode” is suitable for ZFS. Usually JBOD mode is not good enough; IT mode is needed.

I don’t believe the QNAP is the culprit. It’s been working fine for over a year, as have the other three systems I have.

I’m asking for what steps to take to save my pool, if possible, given that it has so many drives with errors and all I/O to it is suspended.

Unfortunately, that doesn’t mean anything. We have seen lots of cases here on the forums where “everything was fine” - until it wasn’t.

Did you check the SMART test results of the drives? Are the drives really all failing at the same time?

Well, the good news is that it isn’t a USB device.
Also good news: QNAP means for it to be used with QuTS, which I believe uses ZFS.

Bad news: you appear to have some serious issues. The checksum error is usually (but not exclusively) a cabling problem.

You need to run SMART tests on all drives and look at the results.

SMART tests cannot run when all I/O to the pool is suspended.

Have you tried powering it all off and reseating the drives and checking cabling?

No, as mentioned above, I have not touched it. I was concerned that there may be things I should do, or attempt to do, before powering it off, and was waiting to hear what you folks had to say.

This is not my area of expertise, but I would probably start gathering debug logs.

In particular, I’d probably copy the output of these somewhere:

  • sudo cat /proc/spl/kstat/zfs/dbgmsg
  • sudo dmesg

The non-persistent ones in particular (i.e., those that are lost after a reboot) might be worth copying.
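
For example (the file names are just placeholders; write them somewhere off the affected pool, such as your home directory on the boot pool):

        sudo cat /proc/spl/kstat/zfs/dbgmsg > ~/zfs-dbgmsg-$(date +%F).txt
        sudo dmesg > ~/dmesg-$(date +%F).txt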

What’s the controller card for the QNAP enclosure and how is it cooled in the MS-01?
What are the drives? Any SMR in there?

The controller is described here:

Controller card:
  QXP-800eS-A1164 - https://www.qnap.com/en-us/product/qxp-800es-a1164

Disks:
  WDC_WD60EFRX    - Western Digital 6TB WD Red Plus NAS Internal Hard Drive
  WUH721414ALE601 - Western Digital Ultrastar DC HC530 14 TB SATA Hard Drive

Cooling of the controller card is just the fans that come with the MS-01. I don’t think any of the disks are SMR.

I’ve saved the two you mentioned. Of the many logs in /var/log, I’m not certain which are ephemeral and which are not.

I saved a number of other logs, all with timestamps around when everything blew up yesterday.

I then shut the system down. It hung trying to unmount filesystems, so I eventually just held the power button in.

I reseated the cables to the JBOD.

I powered the system back on. This is the current zpool status:

admin@tns-qnap-2[~]$ sudo zpool status -v 
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Thu Sep 25 03:45:04 2025
config:

        NAME           STATE     READ WRITE CKSUM
        boot-pool      ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: tns-qnap-2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep 29 14:04:53 2025
        19.8T / 20.2T scanned at 42.8G/s, 13.7T / 20.2T issued at 535M/s
        1.68T resilvered, 67.97% done, 03:31:04 to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        tns-qnap-2                                  DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     0     0
            12e3e46e-49c6-426e-9e12-2c3134062479    ONLINE       0     0     0
            a92c8bc4-4b86-4c35-83b9-128f0ae22c58    ONLINE       0     0     0
            replacing-2                             DEGRADED     0     0     0
              ad2d653a-0b4a-4eb2-bd83-b920e2f6ee72  OFFLINE      0     0     0
              052fe9c6-4cf0-4539-b99b-db58a1812528  ONLINE       0     0     0  (resilvering)
            5313d473-44bf-4590-b392-182fbfbff735    ONLINE       0     0     0
            00a8dbb3-c67c-4a74-b240-a1f55a188f0f    ONLINE       0     0     0
            15d57333-0b20-4b2a-851a-0b0d1a05939e    ONLINE       0     0     0
            937669cf-0dfd-4c6c-b5c5-211b1195ecc7    ONLINE       0     0     0
            315bfe10-63da-47dc-91a5-0268fdda7630    ONLINE       0     0     0

errors: No known data errors
admin@tns-qnap-2[~]$

Looking at “Manage Devices” under Storage → Topology, no drives are showing any errors; it just shows the one drive being resilvered. Interestingly, the little spinner that’s usually in the upper-right corner of the GUI, which you can click to see resilver progress, is not present. Storage → Disk Health is green. Scrutiny shows no drives with issues, but it hasn’t looked at the drives today - those are yesterday’s results. Not sure how to force it to scan them on-demand.
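
Possibly by exec’ing the collector inside the container - something like the line below if it’s running the “omnibus” image (the container name and the collector path here are guesses on my part) - but I haven’t tried it:

        sudo docker exec scrutiny /opt/scrutiny/bin/scrutiny-collector-metrics run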

It does not need to be the bundled card; any SATA or SAS controller with 8 external ports would do. Here it’s probably better not to use a SAS HBA because, if I remember the thread on STH correctly, the MS-01 has no cooling for PCIe cards: cards are expected to provide their own, blower-type cooling if needed.

smartctl -t long /dev/sdX
But all drives are busy resilvering; I would let them do that and re-test later.
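
Once the resilver finishes, something like this (device letters are illustrative - match them to your actual pool members) starts a long test on every drive, and the self-test log can be read back once they complete:

        for d in /dev/sd{a..h}; do sudo smartctl -t long "$d"; done                          # kick off long self-tests
        for d in /dev/sd{a..h}; do echo "== $d =="; sudo smartctl -l selftest "$d"; done     # check results hours later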

Given the current state of things, I now wonder if any of the drives I removed are actually bad. I will have to put them in a spare system and check them out …

Too late. :slight_smile: I’d figured out how to adjust the cron entry in the Scrutiny container to run before midnight. All the drives appear fine, according to it.

The resilver has about 50 minutes to go, it says.

admin@tns-qnap-2[~]$ sudo zpool status -v
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Thu Sep 25 03:45:04 2025
config:

        NAME           STATE     READ WRITE CKSUM
        boot-pool      ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: tns-qnap-2
 state: ONLINE
  scan: resilvered 2.47T in 1 days 02:05:52 with 0 errors on Tue Sep 30 16:10:45 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        tns-qnap-2                                ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            12e3e46e-49c6-426e-9e12-2c3134062479  ONLINE       0     0     0
            a92c8bc4-4b86-4c35-83b9-128f0ae22c58  ONLINE       0     0     0
            052fe9c6-4cf0-4539-b99b-db58a1812528  ONLINE       0     0     0
            5313d473-44bf-4590-b392-182fbfbff735  ONLINE       0     0     0
            00a8dbb3-c67c-4a74-b240-a1f55a188f0f  ONLINE       0     0     0
            15d57333-0b20-4b2a-851a-0b0d1a05939e  ONLINE       0     0     0
            937669cf-0dfd-4c6c-b5c5-211b1195ecc7  ONLINE       0     0     0
            315bfe10-63da-47dc-91a5-0268fdda7630  ONLINE       0     0     0

errors: No known data errors
admin@tns-qnap-2[~]$

Thank you, everyone, for your suggestions, and support. I was biting my nails, but knew that, worst case, I did have backups.

@shaunterickson One other note. Sometimes less expensive SATA PCIe cards use SATA port multipliers. These are not recommended for TrueNAS or ZFS due to irregular behavior.

The good news is that QNAP did the “right thing” and avoided that problem. They use 2 x ASMedia ASM1164 controllers (each with 4 SATA ports), driven by a PCIe switch. All good engineering.

So, while you do want cooling on the PCIe switch (which is probably under the heat sink), the overall design of that SATA expansion card is good.
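
If you ever want to confirm what the card presents to the host, lspci on the TrueNAS box should show those two controllers as ordinary AHCI SATA devices (the grep terms here are just a convenience):

        sudo lspci | grep -i -e asmedia -e ahci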
