The important question: how do I replace a failed raidz1 member without the UI?
The story so far…
I started out on Core and it has proven very reliable and stable.
A few months ago my onboard network card died (Asus B550-F) and has remained very dead ever since. I plugged in a USB network dongle and after a quick configuration it just worked!
Well, after a month of use it hung in the middle of the night, but recovered when unplugged and re-inserted.
I then wanted to expand a raidz with an additional drive and that functionality was evidently on the horizon on Scale.
I took the plunge and upgraded to 24.04.2.5.
After a short while (half an hour-ish) the network became immensely unreliable. The console was barfing up messages about the network chip being restarted over and over again.
It was obviously the Realtek chip in the USB dongle. The switch from FreeBSD to Linux had reintroduced an old nemesis: Realtek drivers under Linux. For well over 20 years I have never had a Realtek chip that worked under Linux, be it Yellow Dog, Fedora, CentOS, Debian… They have never ever worked reliably!
So I popped open the machine and inserted an Intel NIC, powered up and configured it - no more problems with the network.
Still no expansion of raidz1 in the UI, and now 24.10.0.2 was available, so I did another upgrade, and this time there were bugs somewhere:
As soon as I enter the Storage Dashboard there is a pop-up showing
  File "/usr/lib/python3/dist-packages/middlewared/plugins/disk_/availability.py", line 36, in details_impl
    in_use_disks_imported[in_use_disk] = info['pool_name']
                                         ~~~~^^^^^^^^^^^^^
KeyError: 'pool_name'
and the list of disks is unavailable!
I thought I'd take a look at the source code, but I haven't gotten around to it yet.
And tonight one of the members of a raidz1 pool went completely belly-up. No contact - just I/O errors.
But with the 'pool_name' bug I can't use the UI to replace the disk. There is an extra disk available that I had intended to expand the raidz1 with, but there seems to be a better use for it now.
So, is there a known fix for the ‘pool_name’ problem or can the disk be reliably replaced from the command line?
# zpool status -v hot
  pool: hot
 state: ONLINE
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 11:14:22 with 0 errors on Sun Oct 27 10:14:23 2024
config:

        NAME        STATE     READ WRITE CKSUM
        hot         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdj2    FAULTED      3 2.65K     0  too many errors
            sdg2    ONLINE       0     0     0
            sdk2    ONLINE       0     0     0
        logs
          sdi1      ONLINE       0     0     0

errors: No known data errors
The failed disk:
# fdisk -l /dev/sdj
fdisk: cannot open /dev/sdj: Input/output error
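For reference, the /dev/disk/by-id links are a handy way to confirm which physical drive sdj actually is before pulling it, since the link names include the model and serial printed on the drive label (smartctl may or may not answer on a disk this dead):
# ls -l /dev/disk/by-id/ | grep sdj
# smartctl -i /dev/sdj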
Looking at the zpool status output, I'm somewhat stumped. All the UUIDs that were dancing all over the screen in CORE have been replaced with plain device partitions in the SCALE install that came out of the upgrade. In fact, the commands refuse to deal with partition UUIDs.
In the end I did a few things differently:
Since neither parted nor fdisk managed to start the swap partition (the first partition) at sector 128 but only at 2048, partition 2 ended up too small. So I worked around that by creating partition 2 first, simply copying its start/end from one of the operational drives, and then creating partition 1 in the leftover empty space.
The suggested partition types in the link did not match the existing ones on the other drive, so I used fdisk types 140 & 142 to get identical partitions.
And as noted above, all the UUIDs are gone from the system, so I did zpool replace hot /dev/sdj2 /dev/sdl2 instead. Still almost 12 hours to go…
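In hindsight, something like the following rough sketch (not what I actually ran) should give identical partitions with less fiddling: sgdisk clones the partition table from the healthy member sdg onto the new disk sdl, new random GUIDs are generated so the copy does not collide with the original, partprobe (or a reboot) gets the kernel to re-read the table, and the faulted sdj2 is then replaced:
# sgdisk --replicate=/dev/sdl /dev/sdg
# sgdisk --randomize-guids /dev/sdl
# partprobe /dev/sdl
# zpool replace hot /dev/sdj2 /dev/sdl2
# zpool status -v hot
Note that sgdisk's --replicate argument is the destination and the trailing device is the source, so double-check the order before running it.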
That appears to be a side effect of the CORE to SCALE upgrade.
When you replace a disk on SCALE (via the UI) it will use a PARTUUID. I resolved this permanently in my case by replacing each disk with a spare, then cycling the disk that had just been freed in as the replacement for the next one.
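If you want a CLI replacement to end up referenced by PARTUUID the way a UI replacement would, one option (the UUID below is just a placeholder) is to look the id up and hand that path to zpool replace instead of the sdX name:
# lsblk -o NAME,SIZE,PARTUUID /dev/sdl
# zpool replace hot /dev/sdj2 /dev/disk/by-partuuid/<partuuid-of-sdl2>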
The 'pool_name' bug is very persistent. I see it using Firefox, Chrome and Safari on Windows, Linux, iOS…
Can it have something to do with the new notice that I now see on the boot-pool?
One or more features are enabled on the pool despite not being requested by the 'compatibility' property.
That means you are definitely doing synchronous writes. Without any details of what is creating them it is impossible to say whether they are needed in your specific use case, but if they are not needed then you would likely see a significant performance increase by turning them off. (If they are for a VM, for example, they are probably necessary. Ditto if they are for a database of some sort. If they come from e.g. NFS, then it is possible that you can switch to asynchronous writes for these.)
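If you want to check or change that, sync is an ordinary per-dataset property; the dataset name below is only an example:
# zfs get -r sync hot
# zfs set sync=disabled hot/example-dataset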
This is not the case as far as I am aware (but I may be wrong).
No - that functionality is a recent ZFS enhancement, and unless you apply compatibility restrictions to your pools yourself via the CLI, it is only used in TrueNAS SCALE to protect the boot-pool from being upgraded with new ZFS capabilities (which, if done, effectively trashes your boot-pool).
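If you are curious what the boot-pool is actually pinned to, the compatibility setting is visible as an ordinary pool property:
# zpool get compatibility boot-pool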
It is a Python error apparently caused by the O/S not giving it a value for pool_name in some sort of query. It may have nothing to do with your hot pool but rather some other pool.
If you could now give us the output of sudo zpool status -v as previously requested without limiting it to the single hot pool with the disk error, we can see if we can spot the cause.
For completeness: sdp and sdn are sitting on USB, and the system no longer likes the dead sdj ("not capable of SMART self-check"). But the problem existed before sdj died.
# zpool status -v
  pool: boot-pool
 state: ONLINE
status: One or more features are enabled on the pool despite not being
        requested by the 'compatibility' property.
action: Consider setting 'compatibility' to an appropriate value, or
        adding needed features to the relevant file in
        /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d.
  scan: scrub repaired 0B in 00:00:27 with 0 errors on Tue Nov 19 03:45:28 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdh2    ONLINE       0     0     0

errors: No known data errors

  pool: cool
 state: ONLINE
  scan: scrub repaired 0B in 10:46:10 with 0 errors on Sun Oct 27 09:46:12 2024
config:

        NAME        STATE     READ WRITE CKSUM
        cool        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdd2    ONLINE       0     0     0
            sde2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
            sdc2    ONLINE       0     0     0
        logs
          sdf1      ONLINE       0     0     0

errors: No known data errors

  pool: disks
 state: ONLINE
  scan: scrub repaired 0B in 05:31:59 with 0 errors on Sat Nov 16 16:24:42 2024
config:

        NAME        STATE     READ WRITE CKSUM
        disks       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdp2    ONLINE       0     0     0
            sdn2    ONLINE       0     0     0

errors: No known data errors

  pool: fast
 state: ONLINE
  scan: scrub repaired 0B in 00:06:07 with 0 errors on Sun Oct 27 00:06:09 2024
config:

        NAME        STATE     READ WRITE CKSUM
        fast        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdm2    ONLINE       0     0     0
            sdo2    ONLINE       0     0     0

errors: No known data errors

  pool: hot
 state: ONLINE
  scan: resilvered 4.29T in 15:41:53 with 0 errors on Mon Nov 25 07:57:30 2024
config:

        NAME        STATE     READ WRITE CKSUM
        hot         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdl2    ONLINE       0     0     0
            sdg2    ONLINE       0     0     0
            sdk2    ONLINE       0     0     0
        logs
          sdi1      ONLINE       0     0     0

errors: No known data errors
I’ve now looked at the code. There seems to be a genuine bug:
async def details_impl(self, data):
    # see `self.details` for arguments and their meaning
    in_use_disks_imported = {}
    for in_use_disk, info in (
        await self.middleware.call('zpool.status', {'real_paths': True})
    )['disks'].items():
        in_use_disks_imported[in_use_disk] = info['pool_name']
The call to 'zpool.status' returns a dict with the pool names as keys, so the ['disks'] lookup should normally find nothing (as far as I can test). But I have a pool that is actually called 'disks', so the lookup succeeds, the loop runs over that pool's entries, and they have no 'pool_name' key - hence the KeyError.
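A quick way to see the collision from the shell is to list the pool names; one of them is literally the same word the code uses as its dictionary key:
# zpool list -H -o name
boot-pool
cool
disks
fast
hot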
Put in a bug ticket with a debug dump, and specifically mention your pool name and what you believe is wrong. They may need to add checks to prevent this, or change the key to something nobody would ever name a pool.
At the top right of your GUI there should be a Feedback / Report a Bug option.
You can also do it at the top right of this forum: Report a Bug.