5 of 6 Drives in zpool Dropped Into Degraded State

This afternoon, 5 of the 6 drives in my RAIDZ1 vdev dropped into a degraded state almost simultaneously, each with a nearly identical number of read errors. I’m nearly certain it’s not a drive-side issue, since they were all healthy as of yesterday, and a full scrub earlier this week reported no issues:

truenas_admin@truenas[~]$ zpool status -v
pool: bulk
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see:
scan: scrub repaired 0B in 4 days 21:46:50 with 0 errors on Thu Jan 15 21:46:53 2026
config:
    NAME                                      STATE     READ WRITE CKSUM
    bulk                                      ONLINE       0     0     0
      raidz1-0                                ONLINE     829     2     0
        fb824d60-21f2-4088-80a4-12a3cf58eaef  ONLINE       0     0     0
        9da156e5-3907-47c9-895e-3edc70c00479  DEGRADED   374     3     0  too many errors
        21ac9411-1bce-4af5-855d-cc3f8984e20d  DEGRADED   352     3     0  too many errors
        c9a7b6eb-37a7-4667-b3fa-0826b419b25b  DEGRADED   351     3     0  too many errors
        e504cf77-316f-49c4-b101-dfbef49e0208  DEGRADED   353     2     0  too many errors
        8884b9a5-e950-4365-b35e-b8e25776d2a7  DEGRADED   304     3     0  too many errors
errors: List of errors unavailable: pool I/O is currently suspended

Since they all went down within a span of 5 minutes and the temperature history looks fine, this looks like a controller issue to me, right? Is there anything else I should check before replacing the SATA controller?
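For anyone who wants to sanity-check the "same failure, same time" reasoning against their own output, here is a minimal Python sketch. It only parses the member rows pasted above; the helper names (`parse_rows`, `summarize`) are made up for illustration, and it assumes the counters are plain integers (larger counts may be printed in abbreviated form, which this does not handle).

```python
import re

# Member rows copied from the `zpool status -v` output in this post.
STATUS_ROWS = """\
fb824d60-21f2-4088-80a4-12a3cf58eaef  ONLINE       0     0     0
9da156e5-3907-47c9-895e-3edc70c00479  DEGRADED   374     3     0  too many errors
21ac9411-1bce-4af5-855d-cc3f8984e20d  DEGRADED   352     3     0  too many errors
c9a7b6eb-37a7-4667-b3fa-0826b419b25b  DEGRADED   351     3     0  too many errors
e504cf77-316f-49c4-b101-dfbef49e0208  DEGRADED   353     2     0  too many errors
8884b9a5-e950-4365-b35e-b8e25776d2a7  DEGRADED   304     3     0  too many errors
"""

# name, state, READ, WRITE, CKSUM (assumption: integer counters only)
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def parse_rows(text):
    """Turn status rows into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            name, state, rd, wr, ck = m.groups()
            rows.append({"name": name, "state": state,
                         "read": int(rd), "write": int(wr), "cksum": int(ck)})
    return rows


def summarize(rows):
    """Count non-ONLINE members and report the range of their read errors."""
    bad = [r for r in rows if r["state"] != "ONLINE"]
    reads = [r["read"] for r in bad]
    return {
        "degraded": len(bad),
        "total": len(rows),
        "min_read": min(reads, default=0),
        "max_read": max(reads, default=0),
        "all_cksum_zero": all(r["cksum"] == 0 for r in rows),
    }


if __name__ == "__main__":
    print(summarize(parse_rows(STATUS_ROWS)))
```

Five members with read-error counts in a narrow band and no checksum errors anywhere is at least consistent with a shared-path problem (controller, cabling, power) rather than five independent disk failures. It is a hint, not proof.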

Any tips or advice on how best to shut everything down, swap the SATA controller, and bring things back up without accidentally obliterating 70TB of data?

The obligatory details, to preempt the “well, we need to know what hardware you are running” first comment:

truenas_admin@truenas[~]$ lscpu
Architecture:                x86_64
Vendor ID:                   GenuineIntel
Model name:                  12th Gen Intel(R) Core(TM) i7-12700K
truenas_admin@truenas[~]$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        29Gi       1.6Gi       164Mi       1.3Gi       2.0Gi
truenas_admin@truenas[~]$ lsblk | grep disk
sda           8:0    0   1.8T  0 disk
sdb           8:16   0   3.6T  0 disk
sdc           8:32   0   3.6T  0 disk
sdd           8:48   0   3.6T  0 disk
sde           8:64   0  14.6T  0 disk
sdf           8:80   0   3.6T  0 disk
sdg           8:96   0  14.6T  0 disk
sdh           8:112  0  14.6T  0 disk
sdi           8:128  0  14.6T  0 disk
sdj           8:144  0  14.6T  0 disk
sdk           8:160  0  14.6T  0 disk
nvme0n1     259:0    0 931.5G  0 disk
truenas_admin@truenas[~]$ lspci | grep SATA
00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA Controller [AHCI Mode] (rev 11)
01:00.0 SATA controller: ASMedia Technology Inc. ASM1064 Serial ATA Controller (rev 02)
truenas_admin@truenas[~]$ lsscsi -g
[0:0:0:0]    disk    ATA      WDC WDS200T2B0B- 90WD  /dev/sda   /dev/sg1
[4:0:0:0]    disk    ATA      CT4000BX500SSD1  082   /dev/sdc   /dev/sg2
[5:0:0:0]    disk    ATA      CT4000BX500SSD1  082   /dev/sdb   /dev/sg3
[6:0:0:0]    disk    ATA      CT4000BX500SSD1  082   /dev/sdd   /dev/sg4
[7:0:0:0]    disk    ATA      CT4000BX500SSD1  082   /dev/sdf   /dev/sg5
[8:0:0:0]    enclosu AHCI     SGPIO Enclosure  2.00  -          /dev/sg0
[9:0:0:0]    disk    ATA      ST16000DM001-3Y4 DN01  /dev/sdg   /dev/sg6
[12:0:0:0]   disk    ATA      ST16000DM001-3Y4 DN01  /dev/sde   /dev/sg7
[29:0:0:0]   disk    ATA      ST16000DM001-3Y4 DN01  /dev/sdi   /dev/sg8
[30:0:0:0]   disk    ATA      ST16000DM001-3Y4 DN01  /dev/sdh   /dev/sg9
[31:0:0:0]   disk    ATA      ST16000DM001-3Y4 DN01  /dev/sdj   /dev/sg10
[32:0:0:0]   disk    ATA      ST16000DM001-3Y4 DN01  /dev/sdk   /dev/sg11
[N:0:6:1]    disk    Samsung SSD 980 PRO 1TB__1                 /dev/nvme0n1  -

Is that a port multiplier instead of a HBA?

What’s all the noise about HBA’s, and why can’t I use a RAID controller?

Multiply your problems with SATA Port Multipliers and cheap SATA controllers

It’s not a port multiplier.
It’s just a SATA PCIe expansion card. Reliability concerns aside, I didn’t need SAS, the scale, or the throughput of an HBA, considering this “server” was built from nothing but leftover and reused parts.

That’s probably the reason.

Statistically speaking, there’s no chance 5 different drives all start failing with I/O errors at the same time.

Yeah, that’s what I figured, particularly since they all dropped within the same 5-minute period.

Maybe my rambling in the initial post didn’t make the root question very clear… My bad!
Since I’m not usually a TrueNAS/ZFS guy, I was more interested in:

  • What’s the safe procedure for powering everything down, reseating/reconnecting everything, and bringing it back up (to see if it’s healthy)?
  • Assuming that doesn’t work because the cheap expansion card is dying (most likely), what is the procedure for rebuilding a vdev after I replace the card?

I just haven’t found good literature on the topic. Replacing a failed drive, yes; rebuilding the vdev/pool from the (probably) good drives, not so much.

Any tips on that? Or got a decent guide to point to?

Thanks

At this point, I think you should power the system down and get an HBA card like the one linked, along with setting up very good cooling airflow over the card, or find a system that can take all your drives directly on the motherboard. The cards usually specify something like 150-200 linear feet per minute of airflow over them. Loud, rack-server airflow.
I suggested a temporary cooling solution in a previous post and another user made it a meme. I was serious about using a blower fan on an open case, or blowing air through the system, during recovery. Memes! TrueNAS, ZFS, and related | (Share your own!) - #425 by winnielinnie

You may already be past the point of being able to recover that pool. We would have to bring up that pool and all the drives on a good system and see if we could get it back online. It may have to be mounted read-only or have transactions rolled back, or it may be unrecoverable without considerable effort or a tool like Klennet Recovery and a paid license. Browse the website for info on the product, tutorials, etc. That may be one option if you don’t have a backup source for all this data.

Well, I would hope the data on the disks is fine, considering it wasn’t doing any writes when it died. It was just trying to do a read when things went bad. So I would like to think the vdev is recoverable, assuming ZFS can correctly identify the drives and rebuild the pool.

But I guess we will see.
Thanks for the info

In case anyone in the future bumps into this thread because their SATA controller died and forced multiple drives into a degraded state:

  • Take a config backup just to be safe (System > Advanced Settings > Manage Configuration dropdown > Download File)
  • Power down the server
  • Replace/reseat the controller
  • Reseat the drives/cables
  • Restart the server

If all goes well, TrueNAS should automatically re-detect the vdev (although you may need to import the pool manually at Storage > Import Pool).

After reseating everything:

truenas_admin@truenas[~]$ zpool status -v bulk
  pool: bulk
 state: ONLINE
  scan: scrub repaired 0B in 4 days 21:46:50 with 0 errors on Thu Jan 15 21:46:53 2026
config:

        NAME                                      STATE     READ WRITE CKSUM
        bulk                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            fb824d60-21f2-4088-80a4-12a3cf58eaef  ONLINE       0     0     0
            9da156e5-3907-47c9-895e-3edc70c00479  ONLINE       0     0     0
            21ac9411-1bce-4af5-855d-cc3f8984e20d  ONLINE       0     0     0
            c9a7b6eb-37a7-4667-b3fa-0826b419b25b  ONLINE       0     0     0
            e504cf77-316f-49c4-b101-dfbef49e0208  ONLINE       0     0     0
            8884b9a5-e950-4365-b35e-b8e25776d2a7  ONLINE       0     0     0

errors: No known data errors

Now we wait through another 5 days of scrubbing to verify everything.
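If you would rather script the "is it clean?" check than eyeball it after the scrub, something like the Python sketch below works on saved `zpool status` text. The function name `pool_is_clean` is made up for illustration, and the rule it applies (pool state ONLINE, every row ONLINE with all-zero counters, and the "No known data errors" line present) is my own conservative reading of the output, not an official health definition.

```python
import re

# The post-reseat output from this thread, used as sample input.
STATUS_AFTER = """\
  pool: bulk
 state: ONLINE
  scan: scrub repaired 0B in 4 days 21:46:50 with 0 errors on Thu Jan 15 21:46:53 2026
config:

        NAME                                      STATE     READ WRITE CKSUM
        bulk                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            fb824d60-21f2-4088-80a4-12a3cf58eaef  ONLINE       0     0     0
            9da156e5-3907-47c9-895e-3edc70c00479  ONLINE       0     0     0
            21ac9411-1bce-4af5-855d-cc3f8984e20d  ONLINE       0     0     0
            c9a7b6eb-37a7-4667-b3fa-0826b419b25b  ONLINE       0     0     0
            e504cf77-316f-49c4-b101-dfbef49e0208  ONLINE       0     0     0
            8884b9a5-e950-4365-b35e-b8e25776d2a7  ONLINE       0     0     0

errors: No known data errors
"""

# A config row: name, a known state keyword, then three counter columns.
# The "NAME STATE READ WRITE CKSUM" header is skipped because STATE is not
# one of the keywords.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def pool_is_clean(text):
    """Return True only if nothing in the status text looks suspicious."""
    state = re.search(r"^\s*state:\s*(\S+)", text, re.MULTILINE)
    if not state or state.group(1) != "ONLINE":
        return False
    rows = [m.groups() for m in map(ROW.match, text.splitlines()) if m]
    if not rows:
        return False  # nothing parsed: treat as "unknown", not "clean"
    for _name, row_state, rd, wr, ck in rows:
        if row_state != "ONLINE" or (rd, wr, ck) != ("0", "0", "0"):
            return False
    return "errors: No known data errors" in text


if __name__ == "__main__":
    print(pool_is_clean(STATUS_AFTER))
```

Counters are compared as strings on purpose, so anything other than a literal `0` in any column fails the check.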

ASM1064 is a SATA controller, sure (though an old one, and that’s already a possible concern), but the question is whether there is a port multiplier downstream of it; a card with more than 4 ports would indicate so.
Best replace it at the earliest opportunity… and a SAS HBA is the best candidate, even though you have only SATA drives.

This suggests that you have no backup. A raidz1 with 16 TB drives. And a dubious SATA controller.

Your data, your choice.