Failed disk, replacement not working

Hello Everyone

I’m running FreeNAS-11.3-U5 with 4 x 2 TB drives; only 5.6 TB was in use.

One drive failed and is reported as REMOVED in the pool. I tried to replace it with a new, clean drive already in the system, but I’m getting the error “Error: [EZFS_POOLUNAVAIL] pool I/O is currently suspended“
I tried rebooting the system; the failed drive comes online for about a minute and then the pool goes UNAVAIL again.
If I keep the failed drive connected I lose access to the UI, which is weird.

I powered off the server and tried to replace the failed drive with the new one, and now I get a Pool UNKNOWN message; status shows nothing.
I’ve been trying to fix it for over a day now. I’ve tried everything I could find on this and other forums; no luck so far.

The funny thing is that if I connect the failed drive to my Windows machine, I can see the disk in Disk Management, and using a data recovery tool I can see and browse the ZFS partition (empty).

Any suggestions would be greatly appreciated; I’m running out of ideas :frowning:

zpool status -v output when the failed disk is connected:

root@freenas:~ # zpool status -v
pool: NasStorage
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run ‘zpool clear’.
scan: resilvered 0 in 0 days 00:00:08 with 0 errors on Sat Dec 27 11:32:46 2025
config:

    NAME                                          STATE     READ WRITE CKSUM
    NasStorage                                    UNAVAIL      1    39     0
      gptid/f695ba15-0312-11e9-867e-f04da2fb2f63  ONLINE       0     0     0
      972476780144021897                          REMOVED      0     0     0  was /dev/gptid/f75ccb2b-0312-11e9-867e-f04da2fb2f63
      gptid/f86c0718-0312-11e9-867e-f04da2fb2f63  ONLINE       0     0     0
      gptid/f96a2c60-0312-11e9-867e-f04da2fb2f63  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    <metadata>:<0x0>
    <metadata>:<0x115>
    <metadata>:<0x215>
    <metadata>:<0x319>
    <metadata>:<0x339>
    <metadata>:<0x148>
    <metadata>:<0x15d>
    <metadata>:<0x183>
    <metadata>:<0x29e>
    <metadata>:<0x2a2>
    <metadata>:<0x1a9>
    <metadata>:<0x1bd>
    <metadata>:<0x1d6>
    <metadata>:<0x2e5>
    <metadata>:<0x2ea>
    <metadata>:<0xf3>
    <metadata>:<0xfb>
    NasStorage/.system/rrd-a421eaccddb44c098d96b72146b5211d:<0x7a>
    NasStorage/iocage/jails/plexserver/root:<0x1d743>
    NasStorage/iocage/jails/plexserver/root:<0x1e953>

pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0 days 00:25:58 with 0 errors on Wed Dec 24 04:10:58 2025

It would appear that you have a 4-disk stripe, i.e. a pool with no redundancy.
With a striped pool, if one drive fails, it’s all over. The pool is lost and any data on it is unusable.

The only way you can recover from this is if you somehow manage to get your failed drive back up and running again. Tinkering with the drive at this stage is risky because it’s easy to make things worse; if you do, your priority should be to make an exact clone of it before it dies permanently.

Is the data important? Do you have a backup?

Thank you for your answer, I was afraid that would be the case.

The data wasn’t too important, which is why I set it up this way a few years back. But recently I temporarily shifted some files there that I’d like to get back, and the disk failed before I could copy them back :sob:

I will try to get the failed disk online; perhaps replacing the drive’s electronics (its controller board) can help.

Replacing the electronics is beyond my skillset. I recommend trying a different cable and maybe a different port first. Those are the relatively easy troubleshooting steps. Typically you would see checksum errors if it’s related to the cable, but you won’t know until you try…

Will try that, thank you for suggestion :+1:

I think it’s not a good idea to connect a ZFS disk, or a disk that was part of a ZFS pool, to a Windows system. Windows might write metadata like System Volume Information or $Recycle.Bin, which can overwrite blocks ZFS uses for pool metadata and potentially corrupt the pool.

If it is not a cable-related issue but a failing disk, you could perhaps try to clone it first using ddrescue on a Linux system and work with the clone instead of the original.


That sounds interesting. I don’t have a Linux box set up atm, but I will look into that now.

Would you suggest spinning up a VM with Linux onboard, or using something else?

I myself would try it on a bare-metal Linux system rather than a VM.

Running ddrescue now using live usb linux distro.

14 h passed, 13% recovered, 3 read errors so far, estimated time left 1800 d :rofl:
Any way to speed it up?

If I stop it and run reverse, will that help?

I’m not experienced with ddrescue.

Don’t touch it.

Thanks!

I will leave it running then :+1:

The total time depends on the drive size and read speed, but the rate you’re seeing is quite low.
If 13 percent took 14 hours, a full copy could easily take 100-plus hours.
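For what it’s worth, that figure is just a linear extrapolation of the numbers you posted (elapsed time divided by the fraction done); ddrescue’s own 1800 d estimate balloons because it slows down sharply on bad areas, so treat any such number as a rough lower bound:

```shell
# Rough lower bound on total ddrescue runtime by linear extrapolation:
# elapsed hours divided by the fraction already completed.
elapsed=14      # hours run so far (from the post above)
done_frac=0.13  # fraction of the disk copied
estimate=$(awk -v h="$elapsed" -v f="$done_frac" \
    'BEGIN { t = h / f; printf "total %.0f h, %.0f h left", t, t - h }')
echo "$estimate"
```

With the numbers above this prints roughly 108 hours total, which matches the "100-plus hours" guess.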

If you use ddrescue with a logfile, for example by creating it first with
sudo touch /root/ddrescue.log

and then running (for example)
sudo ddrescue -d -r3 -f /dev/sda /dev/sdb /root/ddrescue.log

you could stop the process anytime with Ctrl+C. When you restart with the same command and logfile, ddrescue should continue from where it left off, only trying the remaining or problematic blocks. But as @neofusion already stated, let it run.
Note:
If possible, try to ensure good cooling for the HDD. If the disk is really failing, you want to provide the most optimal conditions possible to avoid further damage during the rescue.

That’s great, I will leave it running for now. The HDD I’m trying to recover is 2 TB and I’m cloning it to another drive; perhaps cloning to an image would be faster? This is the command I used:

sudo ddrescue -f /dev/sda /dev/sdb /home/live/Desktop/log1.log

It did create the log file on the desktop

-r3 retries bad sectors 3 times; I didn’t know that might be necessary. I didn’t use the -d flag either. Lack of experience I guess :slight_smile:

The logfile should not become very large and its size mainly depends on the number of errors found. The -d parameter enables direct disk access. As far as I know, -d is recommended for failing or dying disks, as it bypasses the OS cache and allows ddrescue to read blocks more accurately. Using -d makes the overall process slower, since caching and readahead are skipped.
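As a side note on that logfile (ddrescue calls it a mapfile): it is plain text, with one `pos size status` line per block range, so you can tally progress yourself. A minimal sketch, using a made-up sample mapfile rather than your real one at /root/ddrescue.log:

```shell
# Made-up example mapfile; point the loop at your real one instead.
cat > /tmp/ddrescue-sample.log <<'EOF'
# Mapfile. Created by GNU ddrescue
# current_pos  current_status
0x00010200     ?
#      pos        size  status
0x00000000  0x00010000  +
0x00010000  0x00000200  -
0x00010200  0x00010000  ?
EOF

# Tally rescued (+) and bad (-) bytes; sizes are hex, and shell
# arithmetic accepts the 0x prefix directly.
ok=0; bad=0
while read -r pos size status; do
    case "$pos" in '#'*|'') continue ;; esac  # skip comments and blanks
    [ -n "$status" ] || continue              # skip the current-position line
    case "$status" in
        +) ok=$((ok + size)) ;;
        -) bad=$((bad + size)) ;;
    esac
done < /tmp/ddrescue-sample.log
echo "rescued: $ok bytes, bad: $bad bytes"
```

The `?` ranges are blocks ddrescue has not tried yet, so they show up in neither total.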

Let’s see what kind of miracle ddrescue can achieve… in the end, it all comes down to what can be rescued 1:1 and whether ZFS can reassemble the pool from it.
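Looking ahead one step (an assumption on my part, not something I have tested against a damaged pool): once the clone finishes, attach it in place of the failed disk and try a read-only import first, so nothing is written to the pool while you check what is readable:

```shell
zpool import                               # see whether NasStorage is visible at all
zpool import -o readonly=on -f NasStorage  # import without allowing any writes
zpool status -v NasStorage                 # then check which files survived
```

If the read-only import works, copy off what you need before attempting anything that writes to the pool.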
