Pool degraded after update from 24.10.2.1 to 25.04.0

After the update I discovered that the pool has degraded because one of the four disks is marked FAULTED under Storage > Pool Devices. It has no ZFS errors and a SMART scan returns no errors.
When trying to set the disk Online or add it back to the pool I get:


More info reveals this:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 515, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 560, in run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/service/crud_service.py", line 287, in nf
    rv = await func(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 48, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 174, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/pool.py", line 762, in do_update
    await self.middleware.call('pool.format_disks', job, disks, 0, 80)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 977, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 692, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/format_disks.py", line 27, in format_disks
    await asyncio_map(format_disk, disks.items(), limit=16)
  File "/usr/lib/python3/dist-packages/middlewared/utils/asyncio_.py", line 19, in asyncio_map
    return await asyncio.gather(*futures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/utils/asyncio_.py", line 16, in func
    return await real_func(arg)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/format_disks.py", line 22, in format_disk
    await self.middleware.call('disk.format', disk, config.get('size'))
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 977, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 703, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 596, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/disk_/format.py", line 33, in format
    self.middleware.call_sync('disk.wipe', disk, 'QUICK', False).wait_sync(raise_error=True)
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 487, in wait_sync
    raise CallError(self.error)
middlewared.service_exception.CallError: [EFAULT] [Errno 5] Input/output error
Any ideas what has gone wrong?

Bumping for awareness; I am also having this exact issue.

“I/O Error” is rather ambiguous, and could be for any number of reasons.

Under Storage → Disks, does the device show up as an sdX device?
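
If you want to cross-check from the shell as well (the device name below is just an example), something along these lines will show whether the kernel still sees the disk and whether it is logging I/O errors against it:

```
# Does the disk still enumerate, and under which name?
lsblk -o NAME,SIZE,MODEL,SERIAL

# Any kernel-level errors logged for the device?
dmesg | grep -i sdd
```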

@Robin_Portwood You may want to spin up your own troubleshooting thread, so that readers don’t get confused between the two system configurations. Even though you might be receiving the same symptom, you may have a different root cause.

Thanks for your input :+1:
The disk is in the list, but marked N/A. It's also the only disk that shows the Wipe button.

Let’s start at the bottom of the model and work up - first, checking physical drive health.

Please get the raw SMART output for the “N/A” drive with a shell command:

smartctl -A /dev/sdd

Post the output inside "Code" formatting (Ctrl+E, the </> button on the text formatting bar, or enclosed in triple backticks ``` on the lines before and after).

@HoneyBadger I don’t want to overload you BUT… there have been quite a few threads about losing a pool after an update/sidegrade to 24.10 or 25.04. It would be somewhat reassuring if there were a common root cause, such as unresolved issues with the disk partitioning code since iX removed the swap partition and buffer.

I’m actually tracking along with the thread of “Updated from Core to Scale now missing pool” as @DjP-iX jumped in there to try to track down the process - as the first step to resolution is reproducing the problem.

I know there were partition sizing and offset changes made in the middleware as part of the CORE/SCALE migration and swap removal, but those code paths won’t trigger on their own as part of a system or even a pool upgrade - they would be provoked by something like an attempt to resize/expand a pool. Changing swap enablement would only affect newly created drives - existing partitions shouldn’t be touched, and are simply ignored, since the pool import code looks for the PARTUUID/ZFS label, not an offset.
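
For anyone who wants to sanity-check that on their own disks (the device names below are placeholders), the PARTUUID and ZFS label that the import path relies on can be read directly from the shell:

```
# Partition layout and PARTUUIDs as the kernel sees them
lsblk -o NAME,SIZE,PARTUUID,FSTYPE /dev/sdd

# ZFS label on the data partition: pool name, GUIDs and vdev layout
zdb -l /dev/sdd1
```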

But at this point once we rule out hardware I’m likely going to be opening up an inbound dropsite link and asking a couple folks to dd me the first 32MB of their drives/partitions, so I can carve 'em up with a hex editor and see if there’s a consistent pattern.
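
If it comes to that, the capture itself is a one-liner (the output filename here is arbitrary):

```
# Copy the first 32 MiB of the suspect disk for offline inspection
dd if=/dev/sdd of=/tmp/sdd-first32m.bin bs=1M count=32 status=progress
```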

root@truenas[~]# smartctl -A /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Read SMART Data failed: scsi error device not ready

=== START OF READ SMART DATA SECTION ===
root@truenas[~]# 

Have you powered down completely and checked the connections to the drive?

Yes. Powered down, removed power cable, reseated the SATA connectors on all drives. No luck.

Can you please verify that the drive in question is still labeled sdd? It can change during boot.
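
One way to avoid chasing the shifting sdX names (the ID below is a placeholder) is to address the drive by its persistent by-id path, which encodes the model and serial number:

```
# Whole-disk entries (no -partN suffix) map model/serial to the current sdX name
ls -l /dev/disk/by-id/ | grep -v part

# smartctl accepts the by-id path just like /dev/sdX
smartctl -A /dev/disk/by-id/ata-<MODEL>_<SERIAL>
```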

Nope! It has changed to sde:

Okay, and still the same “Read SMART Data failed: scsi error device not ready” error when you run smartctl on sde I take it?

To confirm, this drive is attached directly to a SATA port on the motherboard?

Correct, SMART fails.
The drives are connected directly to the motherboard SATA ports.

Do you have a different SATA cable you could try?
At this point it’s looking more and more like a (half) dead drive.

I will try a different cable.
Would connecting the drive to a Win11 PC via a USB-C dock make it possible to run diagnostic tools, or would ZFS make that impossible?

You could try, but attaching a drive using USB is one possible reason for smartctl failing, which is why I was asking questions as to how the drive was interfacing with the motherboard earlier.

The USB->SATA conversion often (but not always) prevents smartctl from getting the access it needs.
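
If you do go the USB route, it is sometimes possible to get through the bridge with smartctl's SAT pass-through option, assuming the enclosure supports it:

```
# Ask smartctl to use SCSI-to-ATA Translation (SAT) through the USB bridge
smartctl -d sat -A /dev/sdX
```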

I have now run SeaTools (on Windows) on both HDDs (they are identical). One of them gives off some rather harrowing clicking noises and will not report its health to SeaTools.
Thanks to all of you for the input on my troubles.

Here is another one, @etorix.

@etorix @winnielinnie

Thanks for all the warnings about these misbehaviors.

I found one other instance in our bug tickets:
https://ixsystems.atlassian.net/browse/NAS-135061

It was inconclusive and was fixed by moving to 24.10.2.1.

Ideally, if we have a user that follows a simple upgrade process and sees a problem, we should get a bug report ticket made.

The transition from 24.04 (and Core) to 24.10 is pretty major… both the ZFS 2.3 and Linux kernel updates. It looks like all the cases you mentioned were upgrades to 24.10. If we could confirm that, it would be useful.
