Replacing a broken disk - server is often unresponsive

Hi all! I have a disk that is failing (a growing number of uncorrectable errors), and since I already have a spare in place, I started the replace procedure.

From time to time the server becomes very unresponsive, though (I suppose when it’s trying to access the failing disk), and some basic commands don’t go through either (e.g. zpool status hangs indefinitely, and some pages in the UI fail to load).

Is there something I can do to keep the resilver going without taking such a performance hit from the failing disk? Can I shut down the server and physically remove the disk without affecting the resilver? (The disks are hot-swappable, so the shutdown would just be to lower the risks.)

Thanks!

Just to add, these are the “zpool events” logs from when this happened:

Mar 2 2026 09:11:22.504196688 ereport.fs.zfs.delay
Mar 2 2026 09:20:00.131489345 resource.fs.zfs.statechange
Mar 2 2026 09:20:00.131489345 resource.fs.zfs.removed
Mar 2 2026 09:20:00.643488645 sysevent.fs.zfs.config_sync
Mar 2 2026 09:20:30.171448292 sysevent.fs.zfs.vdev_attach
Mar 2 2026 09:20:31.375446647 sysevent.fs.zfs.resilver_start
Mar 2 2026 09:20:31.375446647 sysevent.fs.zfs.history_event
Mar 2 2026 09:21:00.155407317 sysevent.fs.zfs.config_sync
Mar 2 2026 09:21:10.263393503 sysevent.fs.zfs.history_event
Mar 2 2026 09:21:51.575337047 resource.fs.zfs.statechange
Mar 2 2026 09:21:53.119334939 sysevent.fs.zfs.vdev_online
Mar 2 2026 09:28:22.334802154 ereport.fs.zfs.deadman
Mar 2 2026 09:29:23.774718014 ereport.fs.zfs.deadman
Mar 2 2026 09:30:25.214633880 ereport.fs.zfs.deadman
Mar 2 2026 09:31:26.654549747 ereport.fs.zfs.deadman
Mar 2 2026 09:32:28.094465607 ereport.fs.zfs.deadman
[.. a lot more deadman lines ..]

TrueNAS version is ElectricEel-24.10.2.2.

You have to give a bit more information about your system, starting with your pool layout. I would suggest against removing the failing drive before the resilver finishes; if your pool is resilient enough, TrueNAS can decide on its own whether the disk is dead enough and kick it out autonomously.

Do note that you can set up resilver priority: Managing Scrub Tasks | TrueNAS Documentation Hub
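Beyond the UI setting, on a Linux-based system like SCALE you can also nudge resilver throughput via the OpenZFS module parameters. A minimal sketch, assuming the standard sysfs paths; the value shown is illustrative, not a recommendation, so check your version’s defaults first:

```shell
# How many milliseconds per txg ZFS dedicates to resilver I/O
# (OpenZFS default is typically 3000 ms)
cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms

# Temporarily give the resilver more time per txg (resets on reboot)
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
```

Note these are runtime-only changes; they revert on reboot unless made persistent.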

Yes, more information on hardware, especially disks:

  • Make and exact model of the disks
  • How the disks are wired to the computer:
    • Built-in SATA
    • Add-on SAS card
    • NVMe
    • USB?

Plus, software configuration, like:

  • Which sharing protocols do you use, SMB?
  • Do you run Apps?

Thanks for your replies @Arwen @Davvo, in the end something changed and I solved my issue. I’ll post what happened to keep track of the issue for future users and to express some concerns with the web interface.

My setup is 4 VDEVs of 10×22TB disks each, protected with RAIDZ2.

After the disk started showing errors, I immediately issued a replace from the web interface. That worked, but the system was still very unresponsive from time to time (I suppose whenever the kernel tried to access the disk and failed).

One big concern: the web interface, specifically the “Storage” tab, was completely unusable the whole time. It just kept loading forever, to the point that after 5 minutes my session was disconnected (and I did try this multiple times). I had to do everything from the CLI.

After a while, “zpool status” started working again and the system started behaving correctly (just a bit slower than usual). Both the replaced and the replacing disks showed ONLINE in zpool status; the “scanned” speed was good enough, but the “issued” speed was abysmal, around 100 KB/s.

At some point everything became sluggish again, and I decided to “zpool offline” the failing disk. I waited more than an hour for the command to return, tried to interrupt it with CTRL+C (which didn’t work), waited some more, and at some point it actually exited. I confirmed that the failing disk’s status was indeed OFFLINE.
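For anyone following the same path, the offline step looks like this. The pool name and device name here are placeholders; use whatever identifier your own `zpool status` shows:

```shell
# Take the failing disk out of service so it stops stalling pool I/O.
# "tank" and "sdq" are placeholders for your pool and device.
zpool offline tank sdq

# Alternatively, address the disk by GUID (shown by `zpool status -g`),
# which is safer if device names shift between boots.

# Confirm the device now reports OFFLINE and the resilver continues
zpool status tank
```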

The resilver then proceeded as intended; it’s at 25% now after around 12 hours of work, expected to finish in around two days.

I usually work from the command line, so it wasn’t a big issue for me. However, the fact that the “Storage” section of the web interface was constantly stuck isn’t great.


This is not expected behaviour. How are you connecting all those drives? More info on your system is needed to try to understand what happened; start from Arwen’s list.

Glad you identified the problem and fixed it.

Sometimes the “replace in place” disk replacement method does not work well. If the bad/source disk is really bad, which it sounds like it was in this case, it creates slowdowns and problems. In that case, the normal method of offlining the bad disk and recovering from redundancy is better.

For those who don’t know, “replace in place” is, or at least was, something of a ZFS-unique feature. Other RAID schemes may have it today, but 20 years ago, when ZFS was new, “replace in place” was a newish concept and feature.

As for how “replace in place” works: if the disk to be replaced is not totally dead (or you want larger storage), you can have ZFS temporarily mirror the disk being replaced. Once the data is copied over completely, the old/bad disk is removed and the pool should be healthy. (At least healthy with respect to this specific issue…)

During the temporary mirror process, if there is a bad block on the bad/source disk AND redundancy is available, ZFS reads the redundancy, uses it to reconstruct the missing block, and then writes that block out to the replacement disk.

This is different from an offline disk replacement, which uses vDev redundancy 100% of the time to re-create the disk being replaced. In the ideal situation, “replace in place” is much faster, since it is essentially a simple mirror re-sync. Yet in certain cases “replace in place” does not work well.
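In command terms, the two methods differ only in whether the old disk stays attached during the copy. A sketch with placeholder pool/device names (adjust to your own layout):

```shell
# Method 1: "replace in place" - the old disk stays online and is
# temporarily mirrored onto the new one; ZFS falls back to vdev
# redundancy only when it hits a bad block on the source.
zpool replace tank old_disk new_disk

# Method 2: offline first - the rebuild comes entirely from vdev
# redundancy, avoiding slow reads from the dying disk.
zpool offline tank old_disk
zpool replace tank old_disk new_disk
```

With a very sick source disk, Method 2 avoids the stalls the OP saw, at the cost of running with reduced redundancy during the resilver.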

  • Server: Supermicro SSG-540P-E1CTR45L with no add-on cards

  • Disks: 42x Seagate Exos X22 SATA 22TB - ST22000NM001E-3HM103

  • Server has just one main pool, composed of 4 RAIDZ2 VDEVs of 10 disks each

  • The data is exported via NFS to some servers, and there is a virtual machine running Debian Linux that mounts the NFS share locally and re-exports it via Samba (Samba access is seldom used).