I urgently need help. I am quite a noob and would be very thankful if any of you could help. Please be patient with me if I don’t understand something: I am not an expert, but I am very willing to learn and find my mistakes, because I have gotten myself into quite a quagmire. Let me explain.
I built this TrueNAS server two and a half years ago after superficially learning from YouTube videos (Level1Techs, etc.). I actually only use it as a movie server and as a secondary backup for some of my important data, and until now everything worked absolutely flawlessly. I don’t have a backup of all the data on the pool, but none of it is of paramount importance — still, I would love to not lose it, and I am willing to do everything to achieve this goal.
I know you need my hardware and software layout to help, so here it is. I am running TrueNAS 25.10.4. The motherboard is an ASRock B650M Pro RS, the CPU is an AMD Ryzen 7 7700, the RAM is Crucial Pro DDR5 96GB (2x48GB) 5600MHz, and the HBA is an LSI 9300-16i (refurbished, but from a reputable seller of used server gear). I know this is not ideal hardware, but it’s what I had lying around or what was very cheap two years ago.
Pool layout: RAIDZ2, one vdev with 11 disks: 10x Seagate Exos ST16000NM000H-3KW103 16TB (all refurbished, bought from Amazon) and one Seagate Exos ST18000NM000J 18TB (bought new in 2023 and added in 2025 using the expand function after it became available), plus one 1TB M.2 drive as cache. All HDDs are attached via the HBA through the SATA backplane of a Jonsbo N5 case. The boot pool consists of two cheap SanDisk SSDs in a mirror, attached directly to the motherboard.
Now for my problem: three weeks ago, the Seagate Exos ST18000NM000J spontaneously showed roughly 5,000 read and write errors and faulted. My last scrub had been roughly two weeks prior and completed fine, showing no problems. I was annoyed but not really concerned, since it’s a RAIDZ2 pool. This drive was still under warranty, so I was happy I could file a claim with Seagate, which they immediately accepted — a relief, since buying a replacement drive is, as you’re certainly painfully aware, extremely expensive right now. So I offlined the drive without further investigation (I know, another mistake) and mailed it to Seagate. The pool was degraded but everything functioned perfectly fine.
This Tuesday, I received the replacement from Seagate (a factory-refurbished drive). I put it in the same empty bay the original drive had occupied, and it showed up fine. I went ahead and started a replace task in the afternoon (no, I didn’t run a SMART test or burn-in first — I know, another mistake, but I didn’t know any better). Everything seemed to work fine; the resilver started and progressed normally, estimating roughly 24 hours. I was confident and went to bed.
I awoke to an email notification that the new drive I’d added had also faulted, producing roughly 5,000 read and write errors. I thought this was a very weird coincidence and let the resilver finish, which it did in roughly 24 hours. I then offlined the new replacement drive and decided to check whether the SATA cable to the backplane for that drive was seated correctly. So I shut down the system, unplugged and replugged the SATA cable on the backplane.
I turned the system back on, and the drive showed up fine. I onlined it, and it reported no errors, so I figured maybe it was just a loose SATA connection, and I started the replace task again. But the same thing happened: the resilver progressed, and after roughly 20%, the replacement drive faulted again with a few thousand read and write errors. Now I started getting concerned, but from here things really went downhill.
The resilver continued, progressing at the same speed as before. This confused me: the GUI showed the drive as faulted, meaning there should have been no write activity to it, while all other drives showed heavy read activity. I decided to shut the system down, not knowing this wouldn’t stop the resilver (I guess shutting down was another terrible mistake). At that point the resilver was at roughly 33%, with roughly 16 hours remaining. I shut down because I wanted to check whether, for example, the fan I’d mounted on the HBA had failed and the HBA was overheating, or whether any of the SFF-8643-to-4x-SATA cables had come slightly loose.
I opened up the NAS and checked everything, but it all seemed fine. So I rebooted the system. It took a very long time to reboot, and once I could finally access the web interface, I was shocked to see that two other drives were now reporting read and write errors — one faulted, the other degraded — and all other drives showed 21,000 checksum errors, with the following error message:
“Pool Volume1 state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. The following devices are not healthy:
- Disk ST16000NM000H-3KW103 ZYD01C31 is DEGRADED
- Disk ST16000NM000H-3KW103 ZYD00P5Y is FAULTED”
Interestingly, the 18TB drive I was trying to use as a replacement for the original failed drive no longer showed any read or write errors and was online. The resilver task stopped appearing on the dashboard, and in the Storage tab I could see the resilver was continuing but had ground nearly to a halt, with the estimated time remaining already at 4 weeks.
I panicked and shut down the system. I opened it up again and noticed that the two drives now reporting errors were physically right next to the bay where the original failed drive — and now the replacement drive — sat, meaning all three shared the same SFF-8643-to-4x-SATA cable to the HBA. So I decided to unplug all SFF-8643 connectors from the HBA and replug them, still hoping I was only dealing with a bad connection.
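To make the shared-cable suspicion concrete, here is a minimal Python sketch of the check I did by eye: group the bays that reported errors by the breakout cable feeding them. The bay numbers and cable labels are hypothetical placeholders, not my actual wiring; the real mapping has to be traced inside the case.

```python
from collections import defaultdict

# Hypothetical bay-to-cable map: bay numbers and cable labels are made up
# for illustration and do not describe the real wiring of my case.
BAY_TO_CABLE = {
    1: "cable-A", 2: "cable-A", 3: "cable-A", 4: "cable-A",
    5: "cable-B", 6: "cable-B", 7: "cable-B", 8: "cable-B",
    9: "cable-C", 10: "cable-C", 11: "cable-C",
}


def errors_by_cable(bays_with_errors, bay_to_cable=BAY_TO_CABLE):
    """Group the bays that reported errors by the breakout cable feeding them."""
    grouped = defaultdict(list)
    for bay in bays_with_errors:
        grouped[bay_to_cable[bay]].append(bay)
    return dict(grouped)


def single_shared_cable(bays_with_errors, bay_to_cable=BAY_TO_CABLE):
    """Return the cable label if every erroring bay hangs off one cable, else None."""
    grouped = errors_by_cable(bays_with_errors, bay_to_cable)
    return next(iter(grouped)) if len(grouped) == 1 else None


if __name__ == "__main__":
    # Placeholder bays standing in for the replacement drive and its two neighbours.
    print(errors_by_cable([5, 6, 7]))
    print(single_shared_cable([5, 6, 7]))
```

If all erroring bays map to one cable, that cable, its HBA port, or that section of the backplane looks like a more plausible common cause than three drives failing independently, although this grouping alone cannot tell those apart.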
I then rebooted the server. This time it took slightly less time than before, but still longer than usual. I was greeted with the following notification:
“Pool Volume1 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.”
All drives showed up and were online, and none were reporting read and write errors — only Disk ST16000NM000H-3KW103 ZYD00P5Y was reporting one checksum error. So I thought this had done the trick and was relieved. The resilver continued, but it was no longer at 31% — it had dropped to roughly 25% — and was progressing at roughly the same speed as before, though it was no longer visible from the dashboard, only from the Storage tab.
I thought everything might turn out fine, but after roughly 20 minutes, the same disk that had reported the one checksum error faulted again, and the resilver ground to a halt. Even worse, the usage section of the storage dashboard suddenly vanished, and when I went to the Datasets dashboard, I was greeted with an error message stating: “Volume1: pool I/O is currently suspended.”
I panicked again and shut down the system. I then decided to try swapping the physical location of the disk ST16000NM000H-3KW103 ZYD00P5Y, since it seemed to be the one causing this behavior whenever it faulted. So I put that disk in a different bay and moved the disk that had been in that bay into ZYD00P5Y’s original bay.
I rebooted the server one more time, and the exact same behavior occurred as before: a long boot time, a notification that one or more devices were being resilvered, all drives showing up and online, no errors reported except for one checksum error from ST16000NM000H-3KW103 ZYD00P5Y. The resilver continued at its normal speed, managing roughly 2% in 20 minutes, before disk ST16000NM000H-3KW103 ZYD00P5Y degraded again and the resilver ground to a halt. The estimated time remaining only grew, and pool I/O was once again suspended. I received the following error messages:
Error Name: EINVAL
Error Code: 22
Reason: [EZFS_POOLUNAVAIL]: zfs_open() failed - cannot open 'Volume1': pool I/O is currently suspended
Error Class: ZFSException
Trace: Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 361, in process_method_call
result = await method.call(app, id_, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 57, in call
result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 956, in call_with_audit
result = await self._call(method, serviceobj, methodobj, params, app=app,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 773, in _call
return await methodobj(*prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 108, in wrapped
result = await func(*args)
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/dataset_quota.py", line 163, in get_quota
quota_list = await self.middleware.call(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1053, in call
return await self._call(
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 784, in _call
return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 667, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/dataset_quota.py", line 116, in get_quota_impl
rsrc = tls.lzh.open_resource(name=ds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
truenas_pylibzfs.ZFSException: [EZFS_POOLUNAVAIL]: zfs_open() failed - cannot open 'Volume1': pool I/O is currently suspended
I then tried to take only two disks offline: the degraded disk and the one I had tried to replace. However, I only received the following error messages:
Error Name: EZFS_NOREPLICAS
Error Code: 2019
Reason: [EZFS_NOREPLICAS] cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas
Error Class: CallError
Trace: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 66, in __zfs_vdev_operation
with libzfs.ZFS() as zfs:
File "libzfs.pyx", line 562, in libzfs.ZFS.__exit__
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 71, in __zfs_vdev_operation
op(target, *args)
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 100, in <lambda>
self.__zfs_vdev_operation(name, label, lambda target: target.offline())
^^^^^^^^^^^^^^^^
File "libzfs.pyx", line 2432, in libzfs.ZFSVdev.offline
libzfs.ZFSException: cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 115, in main_worker
res = MIDDLEWARE._run(*call_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 48, in _run
return self._call(name, serviceobj, methodobj, args, job=job)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 42, in _call
return methodobj(*params)
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 100, in offline
self.__zfs_vdev_operation(name, label, lambda target: target.offline())
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 73, in __zfs_vdev_operation
raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_NOREPLICAS] cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 361, in process_method_call
result = await method.call(app, id_, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 57, in call
result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 956, in call_with_audit
result = await self._call(method, serviceobj, methodobj, params, app=app,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 773, in _call
return await methodobj(*prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 108, in wrapped
result = await func(*args)
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/pool_disk_operations.py", line 113, in offline
await self.middleware.call('zfs.pool.offline', pool['name'], found[1]['guid'])
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1053, in call
return await self._call(
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 781, in _call
return await self._call_worker(name, *prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 787, in _call_worker
return await self.run_in_proc(main_worker, name, args, job)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 683, in run_in_proc
return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 667, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
middlewared.service_exception.CallError: [EZFS_NOREPLICAS] cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas
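My rough reading of EZFS_NOREPLICAS, as a minimal Python sketch: a RAIDZ2 vdev can only reconstruct data while at most two members are missing or incomplete, so taking another disk offline is refused once that budget is used up. The function name and its parameters are my own invention for illustration, and this is a simplified counting model, not the check ZFS actually performs.

```python
def can_offline_one_more(parity: int, members_not_fully_usable: int) -> bool:
    """Simplified model (an assumption, not ZFS's real logic).

    parity: number of parity disks in the vdev (2 for RAIDZ2).
    members_not_fully_usable: members that are faulted, offline, or still
        resilvering and therefore cannot be counted on for every block.

    Returns True if taking one more member offline would still leave
    enough members to reconstruct the data.
    """
    if parity < 0 or members_not_fully_usable < 0:
        raise ValueError("counts must be non-negative")
    return members_not_fully_usable + 1 <= parity


if __name__ == "__main__":
    # Healthy RAIDZ2: one disk can be taken offline.
    print(can_offline_one_more(parity=2, members_not_fully_usable=0))
    # RAIDZ2 with one member mid-resilver and one faulted: no budget left.
    print(can_offline_one_more(parity=2, members_not_fully_usable=2))
```

Under this model, a RAIDZ2 vdev with a half-resilvered replacement plus one faulted disk has no redundancy left to give, which would match the offline request being rejected.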
So I have had to come to terms with the fact that I am way out of my depth, being a total noob, and that I have probably messed this whole thing up pretty badly. I decided to shut the system down one last time and write this long post for the forum instead.
I would be very thankful if anyone could help or tell me how I should proceed in a risk-averse manner. I would be very grateful for any hints on how to fix this mess, whether it’s fixable at all, and — if possible — for someone to point out all the mistakes I’ve made that I haven’t yet realized.