How I replaced a failed HDD and fixed the errors that came with it (worked out with some help from AI)

TrueNAS CORE: Replacing a Failed Disk in a Degraded ZFS Pool
(with Fix for “Operation not permitted /dev/adaX”)


This guide covers replacing a failed disk in a degraded but still operational ZFS pool (any topology: mirror, RAIDZ1/2/3).
It also resolves the common error: Operation not permitted: /dev/adaX
This occurs when FreeBSD's GEOM layer refuses writes to a disk it still considers in use, which blocks the wipe or the replacement.

This example uses RAIDZ2, but the process applies to any ZFS topology with enough redundancy left to keep the pool online, and the GEOM error can occur in any of them. RAIDZ2 simply leaves more fault tolerance, which makes troubleshooting safer while the system stays fully accessible.

This procedure was developed while repairing my own system after a failed disk replacement attempt that introduced additional complications. Some symptoms described here may not appear in every case, but the overall solution remains applicable.

My hardware for this process was an HP N36L Home Server with 6 HDDs, 8 GB RAM, AMD Turion II CPU, running TrueNAS CORE 13.3-U1.2 from a USB boot device.


Use this procedure when:

• GUI disk wipe fails
• CLI wipe fails with: Operation not permitted
• Disk replacement fails or behaves inconsistently
• ZFS shows missing or UNAVAIL device
• Replacement disk is detected but unusable


  1. Confirm degraded pool
    GUI: Storage → Pools → Status
    CLI: zpool status
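
With several pools, the full zpool status output gets long. The helper below is a minimal sketch (POSIX sh syntax, so start sh first if your login shell is csh) that lists only the pools that are not ONLINE. The function name unhealthy_pools is my own; it only reads the "pool:" and "state:" lines that zpool status prints.

```shell
# List every pool whose state is not ONLINE.
# Reads `zpool status` text on stdin and prints "<pool> <state>" per match.
unhealthy_pools() {
  awk '
    $1 == "pool:"  { pool = $2 }
    $1 == "state:" && $2 != "ONLINE" { print pool, $2 }
  '
}

# On the NAS:
#   zpool status | unhealthy_pools
```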

  1. Replace physical disk

• Shut down system
• Replace failed disk
• Boot system


  1. Verify new disk detected
    GUI: Storage → Disks
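
The same check can be done from the CLI, which also helps match the adaX name to the serial number printed on the drive label. This is a sketch that assumes geom disk list reports the serial on an "ident:" line; disk_serials is a hypothetical helper name.

```shell
# Print "<device> <serial>" for each disk.
# Reads `geom disk list` text on stdin.
disk_serials() {
  awk '
    $1 == "Geom" && $2 == "name:" { dev = $3 }
    $1 == "ident:"                { print dev, $2 }
  '
}

# On the NAS:
#   geom disk list | disk_serials
```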

  1. Run initial SMART tests (NEW DISK)
    GUI: Storage → Disks → SMART Tests

Run:
• Conveyance test
• Short test

If either fails → replace the disk before continuing.
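
If the GUI test dialog misbehaves, the same tests can be started with smartctl. The wrappers below are a sketch; smart_test and smart_log are hypothetical helper names and ada3 stands in for the new disk. A drive runs one self-test at a time, so start the second test only after the first has finished.

```shell
# Start a single SMART self-test on a disk.
# Usage: smart_test <disk> <short|long|conveyance>
smart_test() {
  case "$2" in
    short|long|conveyance) ;;
    *) echo "usage: smart_test <disk> short|long|conveyance" >&2; return 2 ;;
  esac
  smartctl -t "$2" "/dev/$1"
}

# Show the self-test log so the result of a finished test can be read.
smart_log() {
  smartctl -l selftest "/dev/$1"
}

# On the NAS:
#   smart_test ada3 conveyance
#   smart_log ada3        # once the test has had time to finish
#   smart_test ada3 short
```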


  1. Enable SSH (if not already enabled)
    GUI: Services → SSH

• Start Automatically
• Allow Password Authentication
• Permit root login with password

Required for CLI steps below.


  1. Enable GEOM override (REQUIRED)
    CLI: sysctl kern.geom.debugflags=0x10
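
The geom(4) man page calls flag 0x10 "allow foot shooting": it permits writes to providers that GEOM would otherwise protect because they are open. The setting is a runtime sysctl, so it does not survive a reboot. A small toggle like the sketch below (geom_override is a hypothetical name) makes it harder to forget to switch it back.

```shell
# Toggle the GEOM override: "on" sets 0x10, "off" restores the default 0.
geom_override() {
  case "$1" in
    on)  sysctl kern.geom.debugflags=0x10 ;;
    off) sysctl kern.geom.debugflags=0 ;;
    *)   echo "usage: geom_override on|off" >&2; return 2 ;;
  esac
}

# On the NAS:
#   geom_override on     # before the gpart step
#   geom_override off    # after the replace has started, or simply reboot later
```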

  1. Create correct partition layout
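    CLI: gpart backup ada0 | gpart restore -F ada3

Here ada0 is a healthy pool member used as the template and ada3 is the new disk; substitute your own device names. The -F flag destroys any existing partition table on the target, so double-check the target name before running it.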
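
Because this command overwrites the partition table of its target, a guard against swapped or identical device names is cheap insurance. The function below is a sketch, not part of TrueNAS; clone_layout is a hypothetical name, and it only wraps the same gpart backup and restore calls and then prints the resulting table with gpart show.

```shell
# Copy the partition layout from a healthy disk to the new disk.
# Usage: clone_layout <healthy_disk> <new_disk>
clone_layout() {
  src=$1
  dst=$2
  if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
    echo "usage: clone_layout <healthy_disk> <new_disk> (names must differ)" >&2
    return 2
  fi
  # -F wipes the existing table on the target before restoring.
  gpart backup "$src" | gpart restore -F "$dst" && gpart show "$dst"
}

# On the NAS (device names are examples):
#   clone_layout ada0 ada3
```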

  1. Verify ZFS partition exists
    CLI: glabel status | grep ada3

Expected:
gptid/… ada3p2

CRITICAL: ada3p2 must exist before continuing.
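
The gptid on that line is what the replace step needs, so it is handy to pull it out on its own. A minimal sketch, assuming the three-column glabel status layout (Name, Status, Components); gptid_of is a hypothetical helper name.

```shell
# Print the gptid label of one partition.
# Reads `glabel status` text on stdin; returns non-zero if no label is found.
gptid_of() {
  awk -v part="$1" '
    $1 ~ /^gptid\// && $3 == part { print $1; found = 1 }
    END { exit !found }
  '
}

# On the NAS:
#   glabel status | gptid_of ada3p2
```

A non-zero exit status with no output means the p2 partition has no gptid label yet, which is the case the CRITICAL note warns about.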


  1. Replace disk in pool
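    CLI: zpool replace -f <POOL> <OLD_ID> gptid/<NEW_ID>

<POOL> is the pool name. <OLD_ID> is the failed device exactly as zpool status shows it (a gptid or a numeric GUID). <NEW_ID> is the gptid of the new disk's p2 partition from the previous step.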
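
To find the old device identifier, look in the config table of zpool status for a row whose state is not usable. The filter below is a sketch that assumes the usual NAME STATE READ WRITE CKSUM columns; failed_devices is a hypothetical name, and it would also print a pool or vdev row if that row itself were in one of these states.

```shell
# Print the NAME column of config rows that are UNAVAIL, FAULTED, REMOVED or OFFLINE.
# Reads `zpool status` text on stdin.
failed_devices() {
  awk 'NF >= 5 && $2 ~ /^(UNAVAIL|FAULTED|REMOVED|OFFLINE)$/ { print $1 }'
}

# On the NAS:
#   zpool status | failed_devices
```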

  1. Monitor resilver
    GUI or CLI: zpool status

  1. Wait for completion (monitor progress)

Look for:
• ETA displayed during process
• Resilver completes
• Pool becomes ONLINE
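
Instead of re-running zpool status by hand, a small loop can poll until the resilver is done. This sketch relies on the "resilver in progress" text that zpool status prints on its scan line while a resilver runs; resilver_running and the POOL variable are my own names.

```shell
# Succeeds (exit 0) while the `zpool status` text on stdin reports a running resilver.
resilver_running() {
  grep -q "resilver in progress"
}

# On the NAS (set POOL to your pool name):
#   while zpool status "$POOL" | resilver_running; do sleep 60; done
#   zpool status "$POOL"
```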


  1. Run extended SMART test (AFTER resilver)
    GUI: Storage → Disks → SMART Tests

Run:
• LONG test

Note: Running the LONG SMART test after the resilver avoids I/O contention during the rebuild and checks the disk after it has taken a full, real write load.


  1. Reboot system (IMPORTANT)

This:
• clears GEOM override (kern.geom.debugflags returns to 0)
• normalizes disk naming
• resolves GUI inconsistencies


  1. Verify final state after reboot

CLI: zpool status
CLI: sysctl kern.geom.debugflags

Expected:
• pool ONLINE
• debugflags = 0
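
Both checks can be combined into one pass/fail command. A minimal sketch, assuming zpool status -x prints "all pools are healthy" when nothing is wrong and sysctl -n prints only the value; final_check is a hypothetical name.

```shell
# Return 0 only if every pool is healthy and the GEOM override is off.
final_check() {
  health=$(zpool status -x)
  flags=$(sysctl -n kern.geom.debugflags)
  if [ "$health" != "all pools are healthy" ]; then
    echo "pool problem: $health" >&2
    return 1
  fi
  if [ "$flags" -ne 0 ]; then
    echo "kern.geom.debugflags is still $flags" >&2
    return 1
  fi
  echo "OK: pools healthy, debugflags = 0"
}

# On the NAS:
#   final_check
```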


  1. Recreate SMB shares (if needed)
    GUI: Sharing → Windows Shares (SMB)

Example:
/mnt//


  1. Backup configuration (CRITICAL)

GUI: System → General → Save Config

• Include secret seed
• Save .tar file intact


If adaXp2 does NOT exist

CLI: gpart backup ada0 | gpart restore -F adaX

Then verify again with: glabel status | grep adaX


Common Failures (Summary)

• GUI wipe fails
• CLI wipe fails
• Replace fails
• Disk not selectable

Cause: GEOM write protection


Notes

• A pool export/import may help clear stale device entries early in troubleshooting
• It is not required once the system is stable
• GUI and CLI naming differences are normal


Expected Result

• Pool ONLINE
• All disks properly mapped
• SMB restored
• System stable


Quick Version

Fix: TrueNAS CORE disk replace fails with “Operation not permitted”

  1. Enable SSH

  2. sysctl kern.geom.debugflags=0x10

  3. gpart backup ada0 | gpart restore -F adaX

  4. glabel status → confirm adaXp2

  5. zpool replace -f <POOL> <OLD_ID> gptid/<NEW_ID>

  6. Wait for resilver

  7. Run LONG SMART test

  8. Reboot

  9. Backup config