Issue replacing disk (FreeNAS 11.3-U5)

hello all,

Never had an issue with replacing a disk yet in my encrypted pool. But now I’m stuck.

After pulling the disk with SMART errors and putting in a new one (refurbished, though), I replaced the drive with the new one in the GUI, but it returned an error. I forget what it was exactly, something along the lines of "failure to write …".

I tried to wipe the new disk, as I read that refurbished drives aren't in the state FreeNAS is expecting. That failed as well (permission error). I then wiped it in another machine and tried the replace again in the GUI. This time it worked without errors, but the old .eli entries are still there and the pool isn't healthy after the resilver. How do I get the pool back to healthy?

You can assume I'm an idiot and don't know what I'm doing. Thanks for your help; let me know if you need other info from logs or commands.

zpool status:
pool: Raid_A
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: illumos FMA Message Registry
scan: scrub repaired 0 in 0 days 07:17:51 with 7 errors on Sat Sep 14 20:34:14 2024
config:

NAME                                                  STATE     READ WRITE CKSUM
Raid_A                                                DEGRADED     0     0 66.1K
  raidz1-0                                            DEGRADED     0     0  220K
    replacing-0                                       DEGRADED     0     0 75.5M
      15286113899740930804                            UNAVAIL      0     0     0  was /dev/gptid/6aa31491-b342-11e6-8b36-d05099c19a75.eli
      1563577997621378361                             UNAVAIL      0     0     0  was /dev/gptid/f56edf84-714a-11ef-9ac0-d05099c3a25c.eli
      gptid/1c4b57cf-71c2-11ef-8eb2-d05099c3a25c.eli  ONLINE       0     0     0
    gptid/05798f7f-43b0-11ef-975a-d05099c3a25c.eli    DEGRADED     0     0     0  
    gptid/7ebbacae-f251-11e9-b2e6-d05099c3a25c.eli    ONLINE       0     0     0
    gptid/33dcceb7-309f-11e9-8edb-d05099c3a25c.eli    DEGRADED     0     0     0  
    gptid/13bd90a9-2442-11ef-a821-d05099c3a25c.eli    DEGRADED     0     0     0  

Zpool history (snip of replacement of disk):
2024-09-12.20:12:12 zpool import 4712614360291001932 Raid_A
2024-09-12.20:12:12 zpool set cachefile=/data/zfs/zpool.cache Raid_A
2024-09-12.23:04:19 zpool set cachefile=/data/zfs/zpool.cache Raid_A
2024-09-12.23:07:46 zpool replace Raid_A 15286113899740930804 /dev/gptid/f56edf84-714a-11ef-9ac0-d05099c3a25c.eli
2024-09-13.13:16:21 zpool set cachefile=/data/zfs/zpool.cache Raid_A
2024-09-13.13:20:34 zpool replace Raid_A 15286113899740930804 /dev/gptid/1c4b57cf-71c2-11ef-8eb2-d05099c3a25c.eli
2024-09-13.13:32:48 zpool set cachefile=/data/zfs/zpool.cache Raid_A

I really want to help with this one, but your zpool status is hurting my head :face_with_head_bandage:

Could you please provide your hardware specs, especially how your drives are connected?

Thanks

Out of interest, did you offline the drive first?

So it looks like you had a 5-disk Z1, correct? However, you currently have 3 drives degraded and two online. I'm sure you already know this, but a Z1 can only tolerate 1 drive failing, and you currently have 3.

FreeNAS version: FreeNAS-11.3-U5
General hardware information:
ASRock C2550D4I
2x Crucial 8GB DDR3L 1600MHz PC3L-12800 ECC

Raid_A is raidz1, 5 SATA hard disk drives (2x 3TB, 3x 12TB; slowly upgrading the pool to 5x 12TB), connected to the ASRock onboard SATA ports. I just replaced one 3TB with one 12TB.

I followed this blog for a setup a while ago:

Do you have enough info? Thanks!

I did not, but I haven't done that with previous replacements in the past either.

I'm assuming the degraded state is due to several errors in my jails. Cleaning those up:

errors: Permanent errors have been detected in the following files:

    Raid_A/iocage/jails/transmission/root:<0x0>
    Raid_A/iocage/jails/radarr/root:<0x0>
    <0xffffffffffffffff>:<0x0>

If I fix these errors, should the degraded state then be resolved?
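Would the right approach be to delete the affected files, clear the error counters, and scrub again? Something along these lines (just guessing at the commands here):

zpool clear Raid_A
zpool scrub Raid_A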

3 of your 5 drives are DEGRADED, which means they are no longer healthy members of your pool. A Z1 keeps a single distributed parity, which allows one drive to fail and still reconstruct your data. With 3 drives in trouble it can no longer do this, and that affects ALL data in the pool, not just your jails.

Your best hope here is a cable issue, so I'd suggest you power down your machine and check all the cables. Power up, and then let's see if you're still missing 3 drives.

It's always good practice, if you can, to offline the failing drive first; it's a bit like a graceful shutdown as opposed to just pulling the plug.
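If you ever want to do it from the shell instead of the GUI, it's a one-liner against the pool member; for the disk you just pulled it would have been something like (using the gptid from your status output):

zpool offline Raid_A gptid/6aa31491-b342-11e6-8b36-d05099c19a75.eli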

I checked cables. Everything looks fine.

All the disks are connected, no? 10 disks + a flash drive for the OS (I have 2 raids running, 2 x 5).

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: ACS-3 ATA SATA 3.x device
ada0: Serial Number ZJV4XEN3
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 11444224MB (23437770752 512 byte sectors)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: ACS-3 ATA SATA 3.x device
ada1: Serial Number ZJV5DQ37
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 11444224MB (23437770752 512 byte sectors)
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
ada2: ACS-3 ATA SATA 3.x device
ada2: Serial Number ZGY6S2KF
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 3815447MB (7814037168 512 byte sectors)
ada3 at ahcich5 bus 0 scbus5 target 0 lun 0
ada3: ATA8-ACS SATA 3.x device
ada3: Serial Number 69AA8ZNAS
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors)
ada4 at ahcich10 bus 0 scbus10 target 0 lun 0
ada4: ACS-3 ATA SATA 3.x device
ada4: Serial Number ZDH8NYQR
ada4: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 3815447MB (7814037168 512 byte sectors)
ada5 at ahcich11 bus 0 scbus11 target 0 lun 0
ada5: <ST4000DM004-2CV104 0001> ACS-3 ATA SATA 3.x device
ada5: Serial Number WFN1JYWH
ada5: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 3815447MB (7814037168 512 byte sectors)
ada5: quirks=0x1<4K>
ada6 at ahcich12 bus 0 scbus12 target 0 lun 0
ada6: ACS-3 ATA SATA 3.x device
ada6: Serial Number ZGY7334J
ada6: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada6: Command Queueing enabled
ada6: 3815447MB (7814037168 512 byte sectors)
ada7 at ahcich13 bus 0 scbus13 target 0 lun 0
ada7: ACS-3 ATA SATA 3.x device
ada7: Serial Number ZDH95N9R
ada7: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada7: Command Queueing enabled
ada7: 3815447MB (7814037168 512 byte sectors)
ada8 at ahcich14 bus 0 scbus14 target 0 lun 0
ada8: ATA8-ACS SATA 3.x device
ada8: Serial Number 98J0GRPAS
ada8: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada8: Command Queueing enabled
ada8: 2861588MB (5860533168 512 byte sectors)
ada9 at ahcich15 bus 0 scbus15 target 0 lun 0
ada9: ACS-4 ATA SATA 3.x device
ada9: Serial Number ZTN0HR4A
ada9: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada9: Command Queueing enabled
ada9: 11444224MB (23437770752 512 byte sectors)
da0 at umass-sim0 bus 0 scbus17 target 0 lun 0
da0: <SanDisk Ultra Fit 1.00> Removable Direct Access SPC-4 SCSI device
da0: Serial Number 4C531001410814116442
da0: 40.000MB/s transfers
da0: 14663MB (30031250 512 byte sectors)
da0: quirks=0x2<NO_6_BYTE>

pool status:
pool: Raid_A
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Sep 15 13:07:48 2024
1.25T scanned at 1.50G/s, 172G issued at 207M/s, 11.9T total
14.4M resilvered, 1.41% done, 0 days 16:28:18 to go
config:

NAME                                                  STATE     READ WRITE CKSUM
Raid_A                                                DEGRADED     0     0   387
  raidz1-0                                            DEGRADED     0     0 1.51K
    replacing-0                                       DEGRADED     0     0     0
      15286113899740930804                            UNAVAIL      0     0     0  was /dev/gptid/6aa31491-b342-11e6-8b36-d05099c19a75.eli
      1563577997621378361                             UNAVAIL      0     0     0  was /dev/gptid/f56edf84-714a-11ef-9ac0-d05099c3a25c.eli
      gptid/1c4b57cf-71c2-11ef-8eb2-d05099c3a25c.eli  ONLINE       0     0     0
    gptid/05798f7f-43b0-11ef-975a-d05099c3a25c.eli    DEGRADED     0     0     0  too many errors
    gptid/7ebbacae-f251-11e9-b2e6-d05099c3a25c.eli    ONLINE       0     0     0
    gptid/33dcceb7-309f-11e9-8edb-d05099c3a25c.eli    DEGRADED     0     0     0  too many errors
    gptid/13bd90a9-2442-11ef-a821-d05099c3a25c.eli    DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

    <0x205>:<0x0>
    <0x126>:<0x0>
    <0xffffffffffffffff>:<0x0>

Not entirely sure what's happened here, but you're currently 3 drives down in a 5-disk Z1, which means there is not a lot we can do.

When you started replacing this most recent drive, was the pool in a good state? My only thought is that the rebuild process put strain on your other drives, and that was enough to tip them over the edge.

Yes, the pool was healthy except for one drive showing SMART errors, which is why I was replacing it. My files are still available in that pool; I can browse them. If I were down 3 out of 5, I wouldn't be able to do that, right? Sorry about these relatively noob questions…

This sounds really odd. Did you say you have two zpools? Can you share the unfiltered output of zpool status, please?

Sure. I've deleted the 3 files in error in Raid_A.

pool: Raid_A
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Sep 15 13:07:48 2024
1.84T scanned at 448M/s, 876G issued at 209M/s, 11.9T total
71.9M resilvered, 7.22% done, 0 days 15:19:46 to go
config:

NAME                                                  STATE     READ WRITE CKSUM
Raid_A                                                DEGRADED     0     0 1.87K
  raidz1-0                                            DEGRADED     0     0 7.49K
    replacing-0                                       DEGRADED     0     0     0
      15286113899740930804                            UNAVAIL      0     0     0  was /dev/gptid/6aa31491-b342-11e6-8b36-d05099c19a75.eli
      1563577997621378361                             UNAVAIL      0     0     0  was /dev/gptid/f56edf84-714a-11ef-9ac0-d05099c3a25c.eli
      gptid/1c4b57cf-71c2-11ef-8eb2-d05099c3a25c.eli  ONLINE       0     0     0
    gptid/05798f7f-43b0-11ef-975a-d05099c3a25c.eli    DEGRADED     0     0     0  too many errors
    gptid/7ebbacae-f251-11e9-b2e6-d05099c3a25c.eli    ONLINE       0     0     0
    gptid/33dcceb7-309f-11e9-8edb-d05099c3a25c.eli    DEGRADED     0     0     0  too many errors
    gptid/13bd90a9-2442-11ef-a821-d05099c3a25c.eli    DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

    <0x205>:<0x0>
    <0x126>:<0x0>
    <0xffffffffffffffff>:<0x0>

pool: Raid_B
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using ‘zpool upgrade’. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0 in 0 days 08:46:44 with 0 errors on Sun Sep 15 08:46:46 2024
config:

NAME                                            STATE     READ WRITE CKSUM
Raid_B                                          ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    gptid/d4ac1f3a-c741-11ea-864c-d05099c3a25c  ONLINE       0     0     0
    gptid/dfb37b4f-3261-11eb-a427-d05099c3a25c  ONLINE       0     0     0
    gptid/977f5942-0fa2-11eb-a15d-d05099c3a25c  ONLINE       0     0     0
    gptid/98d08c21-9b43-11ea-b7ca-d05099c3a25c  ONLINE       0     0     0
    gptid/6be46d78-5a3c-11e9-9833-d05099c3a25c  ONLINE       0     0     0

errors: No known data errors

pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0 days 00:08:02 with 0 errors on Fri Sep 13 03:53:03 2024
config:

NAME                                          STATE     READ WRITE CKSUM
freenas-boot                                  ONLINE       0     0     0
  gptid/52e20026-b27a-11e6-afbf-d05099c19a75  ONLINE       0     0     0

errors: No known data errors

And you can still access data in Raid_A?

Yes, everything looks normal. Opened files here and there.

What’s the name of the dataset you’re working in? Can you share the output of zfs list?

Sure. Output of ‘zfs list’:

NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
Raid_A                                                   9.47T  1.08T  9.21T  /mnt/Raid_A
Raid_A/.system                                            546M  1.08T   450M  /mnt/Raid_A/.system
Raid_A/.system/configs-810048d7feed436fae88f4409435135f  78.2M  1.08T  78.2M  /mnt/Raid_A/.system/configs-810048d7feed436fae88f4409435135f
Raid_A/.system/cores                                     1.49M  1.08T  1.49M  /mnt/Raid_A/.system/cores
Raid_A/.system/rrd-810048d7feed436fae88f4409435135f       153K  1.08T   153K  /mnt/Raid_A/.system/rrd-810048d7feed436fae88f4409435135f
Raid_A/.system/samba4                                     582K  1.08T   582K  /mnt/Raid_A/.system/samba4
Raid_A/.system/syslog-810048d7feed436fae88f4409435135f   16.2M  1.08T  16.2M  /mnt/Raid_A/.system/syslog-810048d7feed436fae88f4409435135f
Raid_A/iocage                                            1.58G  1.08T  8.87M  /mnt/Raid_A/iocage
Raid_A/iocage/download                                    272M  1.08T   141K  /mnt/Raid_A/iocage/download
Raid_A/iocage/download/11.2-RELEASE                       272M  1.08T   272M  /mnt/Raid_A/iocage/download/11.2-RELEASE
Raid_A/iocage/images                                      141K  1.08T   141K  /mnt/Raid_A/iocage/images
Raid_A/iocage/jails                                       141K  1.08T   141K  /mnt/Raid_A/iocage/jails
Raid_A/iocage/log                                         141K  1.08T   141K  /mnt/Raid_A/iocage/log
Raid_A/iocage/releases                                   1.30G  1.08T   141K  /mnt/Raid_A/iocage/releases
Raid_A/iocage/releases/11.2-RELEASE                      1.30G  1.08T   141K  /mnt/Raid_A/iocage/releases/11.2-RELEASE
Raid_A/iocage/releases/11.2-RELEASE/root                 1.30G  1.08T  1.27G  /mnt/Raid_A/iocage/releases/11.2-RELEASE/root
Raid_A/iocage/templates                                   141K  1.08T   141K  /mnt/Raid_A/iocage/templates
Raid_A/jails                                              560M  1.08T   185K  /mnt/Raid_A/jails
Raid_A/jails/.warden-template-pluginjail-10.3-x64         560M  1.08T   549M  /mnt/Raid_A/jails/.warden-template-pluginjail-10.3-x64
Raid_A/unsorted                                           252G  1.08T   252G  /mnt/Raid_A/unsorted
Raid_B                                                   9.35T  4.68T  9.34T  /mnt/Raid_B
Raid_B/.system                                            267M  4.68T   594K  legacy
Raid_B/.system/configs-810048d7feed436fae88f4409435135f   222M  4.68T   222M  legacy
Raid_B/.system/cores                                     3.07M  4.68T  3.07M  legacy
Raid_B/.system/rrd-810048d7feed436fae88f4409435135f      41.4M  4.68T  41.4M  legacy
Raid_B/.system/samba4                                     601K  4.68T   601K  legacy
Raid_B/.system/syslog-810048d7feed436fae88f4409435135f    141K  4.68T   141K  legacy
Raid_B/.system/webui                                      141K  4.68T   141K  legacy
Raid_B/iso                                                141K  4.68T   141K  /mnt/Raid_B/iso
Raid_B/ubuntu                                            10.2G  4.69T  3.10G  -
freenas-boot                                             8.69G  5.12G    31K  none
freenas-boot/ROOT                                        8.64G  5.12G    25K  none
freenas-boot/ROOT/11.1-RELEASE                           10.4M  5.12G   838M  /
freenas-boot/ROOT/11.1-U1                                11.0M  5.12G   841M  /
freenas-boot/ROOT/11.1-U6                                8.97M  5.12G   852M  /
freenas-boot/ROOT/11.1-U7                                15.4M  5.12G   761M  /
freenas-boot/ROOT/11.2-U3                                18.7M  5.12G   781M  /
freenas-boot/ROOT/11.2-U4.1                              12.1M  5.12G   774M  /
freenas-boot/ROOT/11.2-U6                                17.4M  5.12G   781M  /
freenas-boot/ROOT/11.2-U8                                16.7M  5.12G   781M  /
freenas-boot/ROOT/11.3-U5                                8.51G  5.12G  1.02G  /
freenas-boot/ROOT/9.10.2-U6                              12.4M  5.12G   650M  /
freenas-boot/ROOT/Initial-Install                           1K  5.12G   622M  legacy
freenas-boot/ROOT/default                                14.9M  5.12G   637M  legacy
freenas-boot/grub                                        6.95M  5.12G  6.95M  legacy

You can use the GUI to detach the two unavailable drives, which should be the old disk you pulled and the leftover ghost of the first failed replace attempt.
But the pool has grown errors in its metadata, and those cannot be repaired.
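If the GUI won't do it, detaching from the shell goes by the GUID of the stale vdev; for the leftover from the first attempt that would be something like:

zpool detach Raid_A 1563577997621378361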

Prepare to build a new pool (possibly a safer raidz2, and with native ZFS encryption rather than GELI) and restore Raid_A from a known good backup.
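Note that native ZFS encryption isn't available on FreeNAS 11.3, so this assumes you move to TrueNAS CORE 12 or later first. From the shell, a natively encrypted raidz2 would be created roughly like this (disk names are placeholders; in practice you'd let the GUI lay out the partitions):

zpool create -O encryption=aes-256-gcm -O keyformat=passphrase NewPool raidz2 ada0 ada1 ada2 ada3 ada4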

OK, thanks. Detaching gives an error; I tried it before. But I was already thinking of starting with a clean pool and, as you say, going for a Z2 instead.

Error: concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "libzfs.pyx", line 369, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in <lambda>
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "libzfs.pyx", line 1774, in libzfs.ZFSVdev.detach
libzfs.ZFSException: no valid replicas

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 97, in main_worker
    res = loop.run_until_complete(coro)
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 53, in _run
    return await self._call(name, serviceobj, methodobj, params=args, job=job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 45, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 45, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 965, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in detach
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 249, in __zfs_vdev_operation
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_NOREPLICAS] no valid replicas
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 130, in call_method
    io_thread=False)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1084, in _call
    return await methodobj(*args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 961, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 1198, in detach
    await self.middleware.call('zfs.pool.detach', pool['name'], found[1]['guid'])
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1141, in call
    app=app, pipes=pipes, job_on_progress_cb=job_on_progress_cb, io_thread=True,
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1081, in _call
    return await self._call_worker(name, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1101, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1036, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1010, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_NOREPLICAS] no valid replicas