Pool degraded after update from 24.10.2.1 to 25.04.0

After the update I discovered that the pool has degraded because one of the four disks is marked FAULTED under Storage > Pool Devices. It has no ZFS errors and a SMART scan returns no errors.
When trying to set the disk Online or add it back to the pool I get:


More info reveals this:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 515, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 560, in run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/service/crud_service.py", line 287, in nf
    rv = await func(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 48, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 174, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/pool.py", line 762, in do_update
    await self.middleware.call('pool.format_disks', job, disks, 0, 80)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 977, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 692, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/format_disks.py", line 27, in format_disks
    await asyncio_map(format_disk, disks.items(), limit=16)
  File "/usr/lib/python3/dist-packages/middlewared/utils/asyncio_.py", line 19, in asyncio_map
    return await asyncio.gather(*futures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/utils/asyncio_.py", line 16, in func
    return await real_func(arg)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/format_disks.py", line 22, in format_disk
    await self.middleware.call('disk.format', disk, config.get('size'))
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 977, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 703, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 596, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/disk_/format.py", line 33, in format
    self.middleware.call_sync('disk.wipe', disk, 'QUICK', False).wait_sync(raise_error=True)
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 487, in wait_sync
    raise CallError(self.error)
middlewared.service_exception.CallError: [EFAULT] [Errno 5] Input/output error
Any ideas what has gone wrong?

Bumping for awareness; I am also having this exact issue.

“I/O Error” is rather ambiguous, and could be for any number of reasons.

Under Storage → Disks, does the device show up as an sdX device?
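
If you want to cross-check from the shell as well (the device name below is just an example), something along these lines will show whether the kernel still sees the disk and whether it is logging I/O errors against it:

```
# Does the disk still enumerate, and under which name?
lsblk -o NAME,SIZE,MODEL,SERIAL

# Any kernel-level errors logged for the device?
dmesg | grep -i sdd
```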

@Robin_Portwood You may want to spin up your own troubleshooting thread, so that readers don’t get confused between the two system configurations. Even though you might be receiving the same symptom, you may have a different root cause.

Thanks for your input :+1:
The disk is in the list, but marked N/A. It's also the only disk that shows the Wipe button.

Let’s start at the bottom of the model and work up - first, checking physical drive health.

Please get the raw SMART output for the “N/A” drive with a shell command:

smartctl -A /dev/sdd

Post the output inside "Code" formatting (Ctrl+E, the </> button on the text formatting bar, or enclosed in triple backticks ``` on the lines before and after).

@HoneyBadger I don’t want to overload you BUT… there have been quite a few threads about losing a pool after an update/sidegrade to 24.10 or 25.04. It would be somewhat reassuring if there were a common root cause, such as unresolved issues with the disk partitioning code since iX removed the swap partition and buffer.

I’m actually tracking along with the thread of “Updated from Core to Scale now missing pool” as @DjP-iX jumped in there to try to track down the process - as the first step to resolution is reproducing the problem.

I know there were partition sizing and offset changes made in the middleware as part of the CORE/SCALE migration and swap removal, but those code paths won’t trigger on their own as part of a system or even a pool upgrade - they would be provoked by something like an attempt to resize/expand a pool. Changing swap enablement would only affect newly created drives - existing partitions shouldn’t be touched, and are simply ignored, since the pool import code looks for the PARTUUID/ZFS label, not an offset.
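
For anyone who wants to sanity-check that on their own disks (the device names below are placeholders), the PARTUUID and ZFS label that the import path relies on can be read directly from the shell:

```
# Partition layout and PARTUUIDs as the kernel sees them
lsblk -o NAME,SIZE,PARTUUID,FSTYPE /dev/sdd

# ZFS label on the data partition: pool name, GUIDs and vdev layout
zdb -l /dev/sdd1
```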

But at this point once we rule out hardware I’m likely going to be opening up an inbound dropsite link and asking a couple folks to dd me the first 32MB of their drives/partitions, so I can carve 'em up with a hex editor and see if there’s a consistent pattern.
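
If it comes to that, the capture itself is a one-liner (the output filename here is arbitrary):

```
# Copy the first 32 MiB of the suspect disk for offline inspection
dd if=/dev/sdd of=/tmp/sdd-first32m.bin bs=1M count=32 status=progress
```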

root@truenas[~]# smartctl -A /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Read SMART Data failed: scsi error device not ready

=== START OF READ SMART DATA SECTION ===
root@truenas[~]# 

Have you powered down completely and checked the connections to the drive?

Yes. Powered down, removed power cable, reseated the SATA connectors on all drives. No luck.

Can you please verify that the drive in question is still labeled sdd? It can change during boot.
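
One way to avoid chasing the shifting sdX names (the ID below is a placeholder) is to address the drive by its persistent by-id path, which encodes the model and serial number:

```
# Whole-disk entries (no -partN suffix) map model/serial to the current sdX name
ls -l /dev/disk/by-id/ | grep -v part

# smartctl accepts the by-id path just like /dev/sdX
smartctl -A /dev/disk/by-id/ata-<MODEL>_<SERIAL>
```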

Nope! It has changed to sde:

Okay, and still the same “Read SMART Data failed: scsi error device not ready” error when you run smartctl on sde I take it?

To confirm, this drive is attached directly to a SATA port on the motherboard?

Correct, SMART fails.
The drives are connected directly to the motherboard SATA ports.

Do you have a different SATA cable you could try?
At this point it’s looking more and more like a (half) dead drive.

I will try a different cable.
Would connecting the drive to a Win11 PC via a USB-C dock make it possible to run diagnostic tools, or would ZFS make that impossible?

You could try, but attaching a drive using USB is one possible reason for smartctl failing, which is why I was asking questions as to how the drive was interfacing with the motherboard earlier.

The USB->SATA conversion often (but not always) prevents smartctl from getting the access it needs.
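
If you do go the USB route, it is sometimes possible to get through the bridge with smartctl's SAT pass-through option, assuming the enclosure supports it:

```
# Ask smartctl to use SCSI-to-ATA Translation (SAT) through the USB bridge
smartctl -d sat -A /dev/sdX
```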

I have now run SeaTools (on Windows) on both HDDs (they are identical). One of them gives off some rather harrowing clicking noises and will not report its health to SeaTools.
Thanks to all of you for the input on my troubles.

Here is another one, @etorix.

@etorix @winnielinnie

Thanks for all the warnings about these misbehaviors.

I found one other instance in our bug tickets:
https://ixsystems.atlassian.net/browse/NAS-135061

It was inconclusive and was fixed by moving to 24.10.2.1.

Ideally, if we have a user that follows a simple upgrade process and sees a problem, we should get a bug report ticket made.

The transition from 24.04 (and Core) to 24.10 is pretty major… both the ZFS 2.3 and Linux kernel updates. It looks like all the cases you mentioned were upgrades to 24.10. If we could confirm that, it would be useful.
