A bumpy road into a tight corner

The important question: how do I replace a failed raidz1 member without the UI?

The story so far…
I started out on CORE and it has proven very reliable and stable.
A few months ago my onboard network card died (Asus B550-F) and it has remained very dead ever since. I plugged in a USB network dongle and, after a quick configuration, it just worked!
Well, after a month of use it hung in the middle of the night, but it recovered when unplugged and re-inserted.

I then wanted to expand a raidz1 with an additional drive, and that functionality was evidently on the horizon in SCALE.
I took the plunge and upgraded to 24.04.2.5.

After a short while (half an hour-ish) the network became immensely unreliable. The console was barfing up messages about the network chip being restarted over and over again.
The culprit was obviously the Realtek chip in the USB dongle. The switch from FreeBSD to Linux had reintroduced an old nemesis: Realtek drivers under Linux. In well over 20 years I have never had a Realtek chip that worked under Linux, be it Yellow Dog, Fedora, CentOS, Debian… They have never ever worked reliably!
So I popped open the machine and inserted an Intel NIC, powered up, configured it, and had no more problems with the network.

Still no raidz1 expansion in the UI, and by now 24.10.0.2 was available, so I did another upgrade. This time there were bugs somewhere:
As soon as I enter the Storage Dashboard there is a pop-up showing

File "/usr/lib/python3/dist-packages/middlewared/plugins/disk_/availability.py", line 36, in details_impl
    in_use_disks_imported[in_use_disk] = info['pool_name']
                                         ~~~~^^^^^^^^^^^^^
KeyError: 'pool_name'

and the list of disks is unavailable! :frowning:
I thought I’d take a look at the source code, but I haven’t got around to it yet.

And tonight one of the members of a raidz1 pool went completely belly-up. No contact, just I/O errors.
But with the ‘pool_name’ bug I can’t use the UI to replace the disk. There is an extra disk available that I had intended to expand the raidz1 with, but there seems to be a better use for it now.

So, is there a known fix for the ‘pool_name’ problem or can the disk be reliably replaced from the command line?

You can definitely replace the disk from the command line.

Can you start by showing us the output of zpool status -v?
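In broad strokes it comes down to partitioning the new disk to match a healthy member and then swapping the devices over. Something like the sketch below, where the pool and device names are pure placeholders, which is exactly why the status output matters first:

# zpool replace tank /dev/sdX2 /dev/sdY2
# zpool status -v tank

(Faulted device first, then its replacement; the second command lets you follow the resilver.)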

Thanks for the response!

That would be

# zpool status -v hot
  pool: hot
 state: ONLINE
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 11:14:22 with 0 errors on Sun Oct 27 10:14:23 2024
config:

        NAME        STATE     READ WRITE CKSUM
        hot         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdj2    FAULTED      3 2.65K     0  too many errors
            sdg2    ONLINE       0     0     0
            sdk2    ONLINE       0     0     0
        logs
          sdi1      ONLINE       0     0     0

errors: No known data errors

The failed disk:

# fdisk -l /dev/sdj
fdisk: cannot open /dev/sdj: Input/output error

Sibling disk:

# fdisk -l /dev/sdg
Disk /dev/sdg: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: WDC WD80EFBX-68A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 80B5804A-7C28-11EC-B134-E03F49A22F42

Device       Start         End     Sectors  Size Type
/dev/sdg1      128     4194431     4194304    2G FreeBSD swap
/dev/sdg2  4194432 15628053127 15623858696  7.3T FreeBSD ZFS

Unused disk:

# fdisk -l /dev/sdl
Disk /dev/sdl: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: TOSHIBA HDWG480
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 3428DDDC-734F-7E4C-8EAB-D673FC2EC473

Thanks for the link!

Looking at the zpool status output, I’m somewhat stumped. All the UUIDs that were dancing all over the screen in CORE have been replaced with direct device partitions in the SCALE install that resulted from the upgrade. In fact, the commands refuse to deal with partition UUIDs.
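(For anyone comparing on their own box: the PARTUUID-to-device mapping can still be listed, for example with

# ls -l /dev/disk/by-partuuid/
# lsblk -o NAME,PARTUUID,SIZE

but as said, the pool members themselves now show up as plain sdX2 names.)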

In the end I did a few things differently:

  1. Since neither parted nor fdisk would start the swap partition (partition 1) at sector 128, insisting on 2048 instead, partition 2 would have ended up too small. I worked around that by creating partition 2 first, simply copying its start/end sectors from one of the operational drives, and then creating partition 1 in the leftover space at the front.
  2. The suggested partition types in the link did not match the existing ones on the other drives, so I used fdisk types 140 & 142 to get identical partitions (matching the FreeBSD swap and FreeBSD ZFS types on the existing members).
  3. And as noted above, all the UUIDs are gone from the system, so I ran zpool replace hot /dev/sdj2 /dev/sdl2 instead (rough command sketch after this list). Still almost 12 hours to go…
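
Roughly, the tail end of the sequence looked like the sketch below. The sgdisk lines are the shortcut I could have used instead of the manual fdisk juggling, so treat them as untested here; the device names are from my box.

# sgdisk -R /dev/sdl /dev/sdg     (copy sdg's partition table onto sdl)
# sgdisk -G /dev/sdl              (then randomize the GUIDs so they don't collide)
# zpool replace hot /dev/sdj2 /dev/sdl2
# zpool status -v hot             (resilver progress and ETA)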

Sounds like you are on the way to fixing this. Fingers crossed for you.

P.S. I see you have an SLOG on this pool. Are you doing synchronous writes that will use it?

The SLOG looks like it is being used with 87TB written so far (and 0GB read).
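(If anyone wants to check the same thing on their own pool, per-vdev traffic, log device included, can be watched with e.g.

# zpool iostat -v hot 5

where the 5 is just a sampling interval in seconds.)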

I don’t think the SLOG gets used only for synchronous writes, though; doesn’t it take all writes to whatever pool it is assigned to?

Did you try force-refreshing the page, or clearing your browser’s cache?

That appears to be a side effect of the CORE to SCALE upgrade.

When you replace a disk on SCALE (via the UI) it will use a PARTUUID. I resolved it permanently in my case by replacing each disk with a spare in turn, cycling the freshly replaced disk in as the spare for the next one.

I think it’s only sync writes.

The ‘pool_name’ bug is very persistent. I see it using Firefox, Chrome and Safari on Windows, Linux, iOS…
Could it have something to do with the new notice that I now see on the boot-pool?

One or more features are enabled on the pool despite not being requested by the ‘compatibility’ property.

I never saw that one on CORE.

That means you are definitely doing synchronous writes. Without any details of what is creating them it is impossible to say whether they are needed in your specific use case, but if they are not needed then you would likely see a significant performance increase by turning them off. (If they are for a VM, for example, they are probably necessary. Ditto if they are for a database of some sort. If they come from e.g. NFS, then it is possible that you can switch to asynchronous writes for these.)
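You can check and change this per dataset. A quick sketch, where hot/somedataset is a made-up name:

# zfs get -r sync hot
# zfs set sync=disabled hot/somedataset

The first shows what each dataset is set to (standard / always / disabled); only use the second once you are sure that dataset can afford to lose the last few seconds of writes after a crash.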

This is not the case as far as I am aware (but I may be wrong).

No - that functionality is a recent ZFS enhancement, and unless you apply compatibility restrictions to your pools yourself via the CLI, it is only used in TrueNAS SCALE to protect the boot-pool from being upgraded with new ZFS features (which, if done, effectively trashes your boot pool).
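You can see what is actually set; a quick sketch (the compatibility property defaults to off on pools you created yourself):

# zpool get compatibility
# ls /usr/share/zfs/compatibility.d/

The first lists the property for every imported pool (TrueNAS sets it on the boot-pool, which is where your message comes from); the second shows the named feature sets that the message refers to.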

It is a Python error apparently caused by the O/S not giving it a value for pool_name in some sort of query. It may have nothing to do with your hot pool but rather some other pool.

If you could now give us the output of sudo zpool status -v as previously requested without limiting it to the single hot pool with the disk error, we can see if we can spot the cause.

For completeness: sdp and sdn are sitting on USB. And the system doesn’t like the dead sdj any longer (not capable of a SMART self-check). But the problem existed before sdj died.

# zpool status -v
  pool: boot-pool
 state: ONLINE
status: One or more features are enabled on the pool despite not being
        requested by the 'compatibility' property.
action: Consider setting 'compatibility' to an appropriate value, or
        adding needed features to the relevant file in
        /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d.
  scan: scrub repaired 0B in 00:00:27 with 0 errors on Tue Nov 19 03:45:28 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdh2    ONLINE       0     0     0

errors: No known data errors

  pool: cool
 state: ONLINE
  scan: scrub repaired 0B in 10:46:10 with 0 errors on Sun Oct 27 09:46:12 2024
config:

        NAME        STATE     READ WRITE CKSUM
        cool        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdd2    ONLINE       0     0     0
            sde2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
            sdc2    ONLINE       0     0     0
        logs
          sdf1      ONLINE       0     0     0

errors: No known data errors

  pool: disks
 state: ONLINE
  scan: scrub repaired 0B in 05:31:59 with 0 errors on Sat Nov 16 16:24:42 2024
config:

        NAME        STATE     READ WRITE CKSUM
        disks       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdp2    ONLINE       0     0     0
            sdn2    ONLINE       0     0     0

errors: No known data errors

  pool: fast
 state: ONLINE
  scan: scrub repaired 0B in 00:06:07 with 0 errors on Sun Oct 27 00:06:09 2024
config:

        NAME        STATE     READ WRITE CKSUM
        fast        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdm2    ONLINE       0     0     0
            sdo2    ONLINE       0     0     0

errors: No known data errors

  pool: hot
 state: ONLINE
  scan: resilvered 4.29T in 15:41:53 with 0 errors on Mon Nov 25 07:57:30 2024
config:

        NAME        STATE     READ WRITE CKSUM
        hot         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdl2    ONLINE       0     0     0
            sdg2    ONLINE       0     0     0
            sdk2    ONLINE       0     0     0
        logs
          sdi1      ONLINE       0     0     0

errors: No known data errors

Nothing jumps out at me that would cause the ‘pool_name’ python exception.

You need to raise a ticket with iX and provide a debug file.

I’ve now looked at the code. There seems to be a genuine bug:

    async def details_impl(self, data):
        # see `self.details` for arguments and their meaning
        in_use_disks_imported = {}
        for in_use_disk, info in (
            await self.middleware.call('zpool.status', {'real_paths': True})
        )['disks'].items():
            in_use_disks_imported[in_use_disk] = info['pool_name']

As far as I can tell, the call to ‘zpool.status’ returns a dict keyed by pool name, so the [‘disks’] lookup should normally find nothing. But I have a pool actually called ‘disks’, so the lookup succeeds, and the per-disk entries it returns have no ‘pool_name’ key :stuck_out_tongue:
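To make the collision concrete, here is a toy reconstruction. The real structure returned by ‘zpool.status’ may differ in detail; this is just the shape as I read it:

    status = {
        'hot':   {'sdg2': {'state': 'ONLINE'}},
        'disks': {'sdp2': {'state': 'ONLINE'}},   # my pool really is called 'disks'
    }
    try:
        for in_use_disk, info in status['disks'].items():
            pool_name = info['pool_name']
    except KeyError as err:
        print('KeyError:', err)   # prints: KeyError: 'pool_name'

With a pool literally named ‘disks’, the [‘disks’] lookup lands on my pool’s member disks instead of whatever top-level ‘disks’ key the code expects, and none of those entries carry ‘pool_name’.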

Put in a bug ticket and debug dump, and specifically mention your pool name and what you believe is wrong. They may need to add checks to prevent this, or change the variable to something nobody would ever name a pool.

Top right of your GUI should be a :grinning:. Feedback / Report a Bug.

You can also do it at the top right of this forum, Report a Bug

If only it looked like a bug report icon :wink:

Never, never ever underestimate human perversity… :roll_eyes:
