Bad Resilver problem and getting worse

Max_Margreiter · June 20, 2026, 3:01pm

I need help urgently. Sadly, I am quite a noob and would be very thankful if any of you could help. Please be patient with me if I don’t understand something — I am not an expert, but I am very willing to learn and find my mistakes, because I have put myself in quite a quagmire. Let me explain.

I built this TrueNAS server two and a half years ago after superficially learning from YouTube videos (Level1Techs, etc.). I actually only use it as a movie server and as a secondary backup for some of my important data, and until now everything worked absolutely flawlessly. I don’t have a backup of all the data on the pool, but none of it is of paramount importance — still, I would love to not lose it, and I am willing to do everything to achieve this goal.

I know you need to know my hardware and software layout for any help, so: I am running TrueNAS 25.10.4. Motherboard is an ASRock B650M Pro RS, CPU is an AMD Ryzen 7700, RAM is Crucial Pro DDR5 96GB (2x48GB) 5600MHz, HBA is an LSI 9300-16i (refurbished, but from a reputable seller of used server gear — I know this is not ideal hardware, but it’s what I had lying around or what was very cheap two years ago).

Pool layout: RAIDZ2, 1 VDEV, 10x Seagate Exos ST16000NM000H-3KW103 16TB (all refurbished, bought from Amazon), and one Seagate ST18000NM000J (bought new in 2023, which I added later in 2025 using the expand function after it became available), plus one M.2 1TB drive as cache. All HDDs are attached via the HBA through the SATA backplane of a Jonsbo N5 case. The boot pool consists of two cheap SanDisk SSDs in a mirrored pool, attached directly to the motherboard.

Now for my problem: three weeks ago, the Seagate Exos ST18000NM000J spontaneously showed roughly 5,000 read and write errors and faulted. My last scrub had been roughly two weeks prior and completed fine, showing no problems. I was annoyed but not really concerned, since it’s a RAIDZ2 pool. This drive was still under warranty, so I was happy I could file a claim with Seagate, which they immediately accepted — a relief, since buying a replacement drive is, as you’re certainly painfully aware, extremely expensive right now. So I offlined the drive without further investigation (I know, another mistake) and mailed it to Seagate. The pool was degraded but everything functioned perfectly fine.

This Tuesday, I received the replacement from Seagate (a factory-refurbished drive). I put it in the same empty bay the original drive had occupied, and it showed up fine. I went ahead and started a replace task in the afternoon (no, I didn’t run a SMART test or burn-in first — I know, another mistake, but I didn’t know any better). Everything seemed to work fine; the resilver started and progressed normally, estimating roughly 24 hours. I was confident and went to bed.

I awoke to an email notification that the new drive I’d added had also faulted, producing roughly 5,000 read and write errors. I thought this was a very weird coincidence and let the resilver finish, which it did in roughly 24 hours. I then offlined the new replacement drive and decided to check whether the SATA cable to the backplane for that drive was seated correctly. So I shut down the system, unplugged and replugged the SATA cable on the backplane.

I turned the system back on, and the drive showed up fine. I onlined it, and it reported no errors, so I figured maybe it was just a loose SATA connection, and I started the replace task again. But the same thing happened: the resilver progressed, and after roughly 20%, the replacement drive faulted again with a few thousand read and write errors. Now I started getting concerned, but from here things really went downhill.

The resilver continued, progressing at the same speed as before. I was confused about what was happening, as the GUI showed the drive as faulted — meaning there should have been no write activity — while all other drives showed heavy read activity. I decided to shut the system down, not knowing this wouldn’t stop the resilver. The resilver was then at roughly 33%, estimating roughly 16 hours remaining (shutting down was, I guess, another terrible mistake). I shut down because I wanted to check whether, for example, the fan I’d mounted on the HBA had failed and the HBA was overheating, or whether any of the SFF-8643-to-4x-SATA cables had come slightly loose.

I opened up the NAS and checked everything, but it all seemed fine. So I rebooted the system. It took a very long time to reboot, and once I could finally access the web interface, I was shocked to see that two other drives were now reporting read and write errors — one faulted, the other degraded — and all other drives showed 21,000 checksum errors, with the following error message:

“Pool Volume1 state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. The following devices are not healthy:

Disk ST16000NM000H-3KW103 ZYD01C31 is DEGRADED
Disk ST16000NM000H-3KW103 ZYD00P5Y is FAULTED”

Interestingly, the 18TB drive I was trying to use as a replacement for the original failed drive no longer showed any read or write errors and was online. The resilver task stopped appearing on the dashboard, and in the Storage tab I could see the resilver was continuing but had ground nearly to a halt, with the estimated time remaining already at 4 weeks.

I panicked and shut down the system. I opened it up again and noticed that the two drives now reporting errors were physically right next to the bay where the original failed drive — and now the replacement drive — sat, meaning all three shared the same SFF-8643-to-4x-SATA cable to the HBA. So I decided to unplug all SFF-8643 connectors from the HBA and replug them, still hoping I was only dealing with a bad connection.

I then rebooted the server. This time it took slightly less time than before, but still longer than usual. I was greeted with the following notification:

“Pool Volume1 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.”

All drives showed up and were online, and none were reporting read and write errors — only Disk ST16000NM000H-3KW103 ZYD00P5Y was reporting one checksum error. So I thought this had done the trick and was relieved. The resilver continued, but it was no longer at 31% — it had dropped to roughly 25% — and was progressing at roughly the same speed as before, though it was no longer visible from the dashboard, only from the Storage tab.

I thought everything might turn out fine, but after roughly 20 minutes, the same disk that had reported the one checksum error faulted again, and the resilver ground to a halt. Even worse, the usage section of the storage dashboard suddenly vanished, and when I went to the Datasets dashboard, I was greeted with an error message stating: “Volume1: pool I/O is currently suspended.”

I panicked again and shut down the system. I then decided to try swapping the physical location of the disk ST16000NM000H-3KW103 ZYD00P5Y, since it seemed to be the one causing this behavior whenever it faulted. So I put that disk in a different bay and moved the disk that had been in that bay into ZYD00P5Y’s original bay.

I rebooted the server one more time, and the exact same behavior occurred as before: a long boot time, a notification that one or more devices were being resilvered, all drives showing up and online, no errors reported except for one checksum error from ST16000NM000H-3KW103 ZYD00P5Y. The resilver continued at its normal speed, managing roughly 2% in 20 minutes, before disk ST16000NM000H-3KW103 ZYD00P5Y degraded again and the resilver ground to a halt. The time remaining only grew, and the pool was once again I/O suspended. I received the following error messages:

Error Name: EINVAL
Error Code: 22
Reason: [EZFS_POOLUNAVAIL]: zfs_open() failed - cannot open 'Volume1': pool I/O is currently suspended
Error Class: ZFSException
Trace: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 361, in process_method_call
    result = await method.call(app, id_, params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 57, in call
    result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 956, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 773, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 108, in wrapped
    result = await func(*args)
             ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/dataset_quota.py", line 163, in get_quota
    quota_list = await self.middleware.call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1053, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 784, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 667, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/dataset_quota.py", line 116, in get_quota_impl
    rsrc = tls.lzh.open_resource(name=ds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
truenas_pylibzfs.ZFSException: [EZFS_POOLUNAVAIL]: zfs_open() failed - cannot open 'Volume1': pool I/O is currently suspended

I then tried taking only the degraded disk and the one I had tried to replace offline, but I only received the following error messages:

Error Name: EZFS_NOREPLICAS
Error Code: 2019
Reason: [EZFS_NOREPLICAS] cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas
Error Class: CallError
Trace: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 66, in __zfs_vdev_operation
    with libzfs.ZFS() as zfs:
  File "libzfs.pyx", line 562, in libzfs.ZFS.__exit__
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 71, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 100, in <lambda>
    self.__zfs_vdev_operation(name, label, lambda target: target.offline())
                                                          ^^^^^^^^^^^^^^^^
  File "libzfs.pyx", line 2432, in libzfs.ZFSVdev.offline
libzfs.ZFSException: cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 115, in main_worker
    res = MIDDLEWARE._run(*call_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 48, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 42, in _call
    return methodobj(*params)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 100, in offline
    self.__zfs_vdev_operation(name, label, lambda target: target.offline())
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/pool_actions.py", line 73, in __zfs_vdev_operation
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_NOREPLICAS] cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 361, in process_method_call
    result = await method.call(app, id_, params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 57, in call
    result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 956, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 773, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 108, in wrapped
    result = await func(*args)
             ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/pool_disk_operations.py", line 113, in offline
    await self.middleware.call('zfs.pool.offline', pool['name'], found[1]['guid'])
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1053, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 781, in _call
    return await self._call_worker(name, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 787, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 683, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 667, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
middlewared.service_exception.CallError: [EZFS_NOREPLICAS] cannot offline /dev/disk/by-partuuid/4bef9b6b-b96a-4698-85e2-8ba9705a5450: no valid replicas

So I have had to come to terms with the fact that I am way out of my depth, being a total noob, and that I have probably messed this whole thing up pretty badly. I decided to shut the system down one last time and write this long post for the forum instead.

I would be very thankful if anyone could help or tell me how I should proceed in a risk-averse manner. I would be very grateful for any hints on how to fix this mess, whether it’s fixable at all, and — if possible — for someone to point out all the mistakes I’ve made that I haven’t yet realized.

winnielinnie · June 20, 2026, 3:32pm

Time to rule out some things. This could be an issue or combination of issues with the HBA, cables, temperatures, or RAM.

Did you actually try new cables or did you keep reusing the same ones? I know you replugged them, but it sounds like you’re using the same cables.

If possible, run a short SMART selftest on all 11 drives at the same time. I suggest this to be done when there’s no disk activity or use of the system.

If all 11 pass their short tests, you’re not in the clear yet. They need to pass long tests too. If a drive failed a short test, consider it as “failed” and no longer eligible to be used in the pool. (It can remain temporarily, until you can replace it.)
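As a rough sketch of the short-test step, the snippet below builds the `smartctl` invocations for a set of drives. It assumes smartmontools is installed and that the drives appear as `/dev/sda` through `/dev/sdk`; those device names are hypothetical, so check the real ones (for example with `lsblk`) first. `smartctl -t short` starts a short self-test and `smartctl -l selftest` prints the self-test log afterwards. `dry_run` defaults to `True`, so nothing is executed until it is switched off.

```python
import subprocess
from string import ascii_lowercase


def drive_paths(count):
    """Return /dev/sda, /dev/sdb, ... for `count` drives (assumed naming)."""
    return [f"/dev/sd{letter}" for letter in ascii_lowercase[:count]]


def short_test_commands(devices):
    """One 'start short self-test' command per device."""
    return [["smartctl", "-t", "short", dev] for dev in devices]


def selftest_log_commands(devices):
    """One 'show self-test log' command per device, to run once the tests finish."""
    return [["smartctl", "-l", "selftest", dev] for dev in devices]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # smartctl uses its exit status as a bitmask, so do not raise on non-zero.
            subprocess.run(cmd, check=False)


devices = drive_paths(11)  # hypothetical: 11 pool drives, sda..sdk
run_all(short_test_commands(devices))
```

Starting all tests back to back like this has the drives testing concurrently, which matches the idea of exercising every drive at once.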

To save time, I would boot into a live USB to run multiple memtest passes to rule out any memory issues. I would do this before running the long SMART tests, since those will take a very long time on 18 TB drives.

As for the HBA, I’ll leave that to others with more experience on troubleshooting HBAs.

Max_Margreiter · June 20, 2026, 3:41pm

Thanks for the suggestion. No, I didn’t use new cables because I don’t have any. I would need to order some (which I will certainly do). The thing is, as soon as I reboot the system, I guess it will immediately try to continue the resilver that was in progress, so there will be lots of activity on the pool. Should I run the SMART tests nevertheless?

winnielinnie · June 20, 2026, 3:43pm

You can boot into a live Linux ISO from a USB and run the short SMART selftests in that session. The same can be done for the long tests too.

The latest GParted Live should come with smartmontools preinstalled.

For memtest, there’s a bootable USB for that too.

Samuel_Tai · June 20, 2026, 4:04pm

I think things might be past the point of recovery with this pool. Once you’ve entered this death spiral of continuously failing and restarting resilvers, the only way out I’m aware of is to destroy the pool and start over, because there’s no mechanism in ZFS to stop or cancel resilvers.

A 10-wide RAIDZ2 VDEV is really stretching things. I would go with a 2-wide stripe of 4-wide RAIDZ2 VDEVs + 2 spares.

Max_Margreiter · June 20, 2026, 4:28pm

OK, first, thank you so much for your help. I made bootable USB sticks for those two ISOs and will try to do as you said. Should I start with the short SMART test or with the memtest one? Once again, thank you very much for your patience and for helping such a noob like me.

Max_Margreiter · June 20, 2026, 4:28pm

Oh, that sounds really bad, but I don’t understand. Since it’s a RAIDZ2 pool, can’t I somehow offline the one drive that seems to be causing problems (ST16000NM000H-3KW103 ZYD00P5Y), which makes the resilver grind to a halt as soon as it degrades, and then try to get as much data as possible off the pool? I would hope that the resilver then continues and finishes just like the first one did, even though I don’t understand what it did when the drive that was being replaced failed.

Samuel_Tai · June 20, 2026, 4:31pm

Offlining the bad drive doesn’t stop the resilver, but it potentially buys you time to offload data off the pool.

Max_Margreiter · June 20, 2026, 4:51pm

Also, thank you for your help. I mean, offlining that drive was what I wanted to do yesterday but couldn’t, as the error message I sent showed — which I sadly don’t understand, as I have no clue. I would like to think that getting as much data off the pool and then destroying it are measures of last resort, so I will resort to them if necessary. But isn’t there any way to understand what is actually happening? The behavior I’m describing, at least from my uneducated perspective, seems quite weird and unintuitive. Or is this how pools typically fail — and if so, what have I actually done wrong, except for the things I already identified, and not choosing another pool layout — either RAIDZ3, or, as you suggested, a 2-wide stripe of 4-wide RAIDZ2 vdevs + 2 spares, which I’d guess would also have halved the usable capacity.
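The capacity guess can be checked with back-of-the-envelope arithmetic. The sketch below counts only data disks (width minus parity, per vdev) and ignores ZFS padding, metadata overhead, and the eleventh 18 TB drive; it assumes ten 16 TB drives for all three layouts. The function name is made up for illustration.

```python
DRIVE_TB = 16  # nominal size of each Exos drive in this comparison


def data_capacity_tb(vdevs, width, parity, drive_tb=DRIVE_TB):
    """Nominal data capacity: each RAIDZ vdev gives (width - parity) drives of space."""
    return vdevs * (width - parity) * drive_tb


single_z2 = data_capacity_tb(vdevs=1, width=10, parity=2)   # one 10-wide RAIDZ2
single_z3 = data_capacity_tb(vdevs=1, width=10, parity=3)   # one 10-wide RAIDZ3
striped_z2 = data_capacity_tb(vdevs=2, width=4, parity=2)   # 2 x 4-wide RAIDZ2, 2 spares unused

print(f"10-wide RAIDZ2:        {single_z2} TB")
print(f"10-wide RAIDZ3:        {single_z3} TB")
print(f"2 x 4-wide RAIDZ2 + 2: {striped_z2} TB")
print(f"striped / single Z2:   {striped_z2 / single_z2:.2f}")
```

Under these assumptions the striped layout comes out at half the nominal capacity of the single 10-wide RAIDZ2, while RAIDZ3 on the same ten drives gives up one drive's worth of space.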

winnielinnie · June 20, 2026, 5:16pm

Short tests first since they should complete within a few minutes. I would run all 11 drives at the same time to maximize the power draw.

If there are no errors, reboot into the memtest boot and run at least 4 full passes with zero errors.

The long tests will take a long time so I would run them overnight or when you don’t need to try anything else to troubleshoot the system.

Samuel_Tai · June 20, 2026, 6:34pm

This sort of failure mode is what you see when you have an HBA going south (multiple drives acting squirrelly) or a power supply starting to die (multiple devices doing funny things as they react to micro voltage sags). Less common is a deteriorating motherboard, but not outside the realm of possibility.

Max_Margreiter · June 20, 2026, 8:06pm

I did not manage to boot into GParted Live. It told me — once I got to the boot option selection — that it stalled on:

Begin: Running /scripts/live-bottom

Running scripts/init-bottom Timed out while waiting for udev queue to empty.

done

and then refused to do anything for roughly ten minutes.

Is this any hint as to what might be wrong with my system, or just normal flaky behaviour?

I will now try another Linux distro for running the SMART tests. At the moment I am running MemTest86, and the first two passes have completed fine.

But the behavior is getting even stranger, at least from my uneducated perspective. I mistakenly booted into TrueNAS again when I meant to boot into the Linux image to run the SMART test.

As always, I was first greeted by the message:

Pool Volume1 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

All 12 disks show up and report no errors, but it now also shows the disk ST16000NM000H-3KW103 ZYD00P5Y as unassigned.

The resilver continues as always at the normal speed, then after ten minutes I get the message:

Pool Volume1 state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. The following devices are not healthy:



Disk ST16000NM000H-3KW103 ZYD00P5Y is REMOVED

and then, one minute later:

Pool Volume1 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. The following devices are not healthy:



Disk ST16000NM000H-3KW103 ZYD00P5Y is DEGRADED

with the disk once again showing lots of read and write errors.

Then the same thing as before happens: the resilver grinds to a halt, the usage section of the storage dashboard vanishes, and the datasets dashboard greets me with the error message stating that “Volume1: pool I/O is currently suspended.”

I shut the system down again.

I am very thankful for your support and for helping me so thoroughly. The positive thing here is that I think I am learning a lot. I will ask for help again once I have managed to run the SMART tests.

Arwen · June 21, 2026, 2:53am

Okay, we need to cover some background.

ZFS was designed to be always consistent on disk. This means there is no file system check at boot or pool import. That is what a scrub is supposed to do (scheduled at the sysadmin’s preference, not at boot). ZFS assumes the pool is good (until it finds out otherwise) and can verify data as it is read via checksums.

Generally when a ZFS pool becomes I/O suspended, there are several causes:

Too many disks now report bad blocks or failures, such that ZFS deems the vDev potentially dead.
Reading a piece of critical metadata has failed. All copies are corrupt (there are at least 3 by default, even on RAID-Zx), and all redundancy (which allows 2 column failures in RAID-Z2) is not able to recover the critical data.
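The first cause can be pictured with a deliberately simplified model: a RAIDZ vdev can keep serving data only while the number of unavailable member disks stays at or below its parity level. ZFS actually tracks errors per block rather than per whole disk, so this is only a sketch. The disk-to-cable mapping below is hypothetical, assuming 11 disks spread over breakout cables that carry up to four drives each.

```python
from collections import Counter


def vdev_can_serve(parity, unavailable_disks):
    """Simplified rule: a RAIDZ vdev tolerates at most `parity` unavailable disks."""
    return unavailable_disks <= parity


def cables_exceeding_parity(disk_to_cable, parity):
    """Cables whose loss alone would take out more disks than the vdev tolerates."""
    disks_per_cable = Counter(disk_to_cable.values())
    return sorted(cable for cable, n in disks_per_cable.items() if n > parity)


# Hypothetical layout: disk0..disk10, four drives per breakout cable.
layout = {f"disk{i}": f"cable{i // 4}" for i in range(11)}

print(vdev_can_serve(parity=2, unavailable_disks=2))  # still within RAIDZ2 redundancy
print(vdev_can_serve(parity=2, unavailable_disks=3))  # beyond RAIDZ2 redundancy
print(cables_exceeding_parity(layout, parity=2))
```

In this made-up layout every cable carries more than two members of the same RAIDZ2 vdev, so a single flaky cable or HBA port could push the vdev past its redundancy without any drive being faulty.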

Start FIRST with the memory test. Bad memory is how some critical metadata corruption occurs, and potentially continued corruption. (Meaning the critical metadata was damaged before the ZFS checksum was computed, and thus written to the pool bad…)

Next, heat on LSI HBAs is known to cause problems, so is yours properly cooled?
Perhaps with its own fan?

Now the way out may be to verify your pool’s integrity. This can be done by removing the disk that is resilvering, thus stopping the behavior that leads to the I/O suspended state. Once that is done, you can run a scrub. If you get another I/O suspension, we have proof you have serious pool problems.

If you can’t get the scrub to finish, perhaps because of the same I/O suspension, then your last resort might be to import the pool read-only and copy the data off. I know many people don’t have additional storage for that copy, and I can’t help with that…
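For reference, a read-only import from a shell usually looks like `zpool import -o readonly=on`. The sketch below only assembles and prints the commands. It assumes the pool is currently exported, that `/mnt` is an acceptable alternate root, and that a manual import outside the TrueNAS middleware is appropriate here, none of which has been confirmed in this thread. The helper names are made up for illustration; the pool name `Volume1` comes from the error messages above.

```python
def readonly_import_command(pool, altroot="/mnt"):
    """Build a 'zpool import' that mounts the pool read-only under an alternate root."""
    return ["zpool", "import", "-o", "readonly=on", "-R", altroot, pool]


def status_command(pool):
    """Build a 'zpool status -v' call, which lists per-device state and known errors."""
    return ["zpool", "status", "-v", pool]


# Printed rather than executed: review the commands before running them as root.
for cmd in (readonly_import_command("Volume1"), status_command("Volume1")):
    print(" ".join(cmd))
```

With the pool imported read-only, nothing is written to it, so the copy to other storage can proceed without the pool's contents changing underneath it.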

Max_Margreiter · June 21, 2026, 6:10pm

First off, I am terribly sorry if anything I wrote sounded entitled. I am very, very thankful for your support, and I appreciate it a lot that you are helping me learn and understand the problems that I am facing.

I set up monthly scrub tasks, as I already read something about this when I first built this server. Is this too long an interval between scrubs?

I ran the memory test first, and it completed in 10 hours: four passes without any errors. So memory could be fine, I guess.

Now I have run short SMART tests using GSmartControl on all 11 drives of the pool. They all completed successfully, including all the drives that showed the behavior I told you about. Should I now run long SMART tests on all drives?

I read about HBA cooling quite a lot on the old forum when I built this server. I know that HBAs are intended for server chassis and their airflow patterns. So I stuck a Noctua 80mm fan running at 65% speed onto the heatsink, and another 120mm Noctua fan is blowing fresh air from the front of the chassis toward the HBA. I don’t know if this is sufficient, but I think I can’t get more air to it with my Jonsbo N5 chassis.

So what do you mean by “remove”? Just physically detach the drive? Because when I tried to offline it, I got the error message that I posted in my first message.

So does that mean that even if the pool is, for whatever reason, beyond repair, I might still be able to get some or all of the data off it — even if it’s annoying, difficult, and expensive to do so? That would be great news!

etorix · June 21, 2026, 7:26pm

That can’t hurt. Note that you may run the tests in TrueNAS, even while the pool is working.
You may also post some test results here (sudo smartctl -x /dev/sdZ, where ‘Z’ is the appropriate drive letter; then paste the result here and format the whole thing in one go with the </> button).

What’s the firmware version of your HBA? (I have confused your thread with another, where another 9300-16i was found to be on an outdated version.)
sudo sas3flash -listall

Yes, so let us know if you have enough available storage at hand to back up the pool (or if you have backups!), as that would influence further advice: backup first, repair later!
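Once a `smartctl -x` report is in hand, the attribute table can be skimmed programmatically. This is a minimal sketch that assumes the ATA attribute layout smartctl prints (ID, name, flags, value, worst, threshold, fail, raw). The set of watched IDs is a common rule of thumb, not an official list, and the sample rows follow the format of the report posted later in this thread. A clean result here does not clear a drive; it only means these particular counters are zero.

```python
import re

# ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW (first integer of the raw column)
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+(\d+)")

# Assumed watch list: reallocated, reported uncorrectable, command timeout,
# pending sectors, offline uncorrectable, and interface CRC errors.
WATCHED = {5, 187, 188, 197, 198, 199}


def parse_attributes(report):
    """Map attribute ID -> (name, raw value) for every table row in the report."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def nonzero_watched(attrs):
    """Return only the watched attributes whose raw value is not zero."""
    return {i: v for i, v in attrs.items() if i in WATCHED and v[1] != 0}


sample = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   092   092   000    -    7630
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
"""

attrs = parse_attributes(sample)
print(nonzero_watched(attrs) or "no watched counters are non-zero")
```

A non-zero `UDMA_CRC_Error_Count` would point toward the link between drive and controller (cable, backplane, port) rather than the platters, which is why it is worth checking on every drive.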

Max_Margreiter · June 21, 2026, 7:54pm

OK, would that be preferable to what I did before, booting into a SystemRescue ISO and running them in GSmartControl there? As soon as I boot into TrueNAS it tries to continue the resilver, and there is heavy read activity on all drives. Should I run the command you wrote down for all drives or just for the ones that behaved strangely?

No, sadly I don’t have a backup and don’t have enough storage to make one, and buying enough is rather expensive at the moment. I bought those drives refurbished at roughly 120 euros each two years ago (when refurbished Exos drives were incredibly cheap in Europe), so 1,200 euros for the whole pool. Now the cheapest drives I can find are refurbished 28 TB Exos drives at 600 euros, so I would need to spend at least a few thousand to get the data onto a new pool with sufficient redundancy. Quite an investment.

etorix · June 21, 2026, 8:24pm

OK, so we need to be careful—or consider whether there is some very important data in there which you could rescue, short of doing a full backup.

Whatever works for you…
I leave it to your discretion to post only the output for the main offender, for some drives or for the whole set of eleven reports.
I think that @Arwen suggested to disconnect the misbehaving drive, but let her confirm that.

For now, I’m most curious about the HBA firmware, and also about the length of your cables?
If the answers are “P16” and “50 cm or less”, all seems fine there, and we may be looking at hardware failure. If not, some upgrade is in order.
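The firmware half of that check can be read straight off the `sas3flash -listall` table. The sketch below assumes the tabular layout that tool prints (Num, Ctlr, FW Ver, ...) and treats "P16" as meaning a firmware string whose first field is 16. That mapping is an assumption, as are the function names.

```python
import re

# Row layout assumed: Num, Ctlr, FW Ver, then further columns that are ignored here.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s")


def parse_controllers(listing):
    """Return (number, controller, firmware version) for each controller row."""
    rows = []
    for line in listing.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(1)), match.group(2), match.group(3)))
    return rows


def is_phase_16(fw_version):
    """Assumption: phase 16 firmware reports a version beginning with '16.'."""
    return fw_version.split(".")[0] == "16"


sample = """\
Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------
0  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:03:00:00
1  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:05:00:00
"""

for num, ctlr, fw in parse_controllers(sample):
    print(num, ctlr, fw, "phase 16" if is_phase_16(fw) else "older or unknown phase")
```

A 9300-16i shows up as two SAS3008 controllers, so both rows need to report the same firmware.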

Oh, and what’s the “1 TB cache”? L2ARC or SLOG? (It seems too large for either role anyway.)

Max_Margreiter · June 21, 2026, 8:48pm

Adapter Selected is a Avago SAS: SAS3008(C0)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:03:00:00
1  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:05:00:00

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.
admin@truenas[~]$ 

=== START OF INFORMATION SECTION ===
Device Model:     ST16000NM000H-3KW103
Serial Number:    ZYD00P5Y
LU WWN Device Id: 5 000c50 0e7fa6d6d
Firmware Version: EN01
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-5 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jun 21 22:36:13 2026 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Disabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error aborted command
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1374) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   070   064   044    -    9546809
  3 Spin_Up_Time            PO----   091   090   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    435
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   087   060   045    -    520270291
  9 Power_On_Hours          -O--CK   092   092   000    -    7630
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    435
 18 Unknown_Attribute       PO-R--   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   066   036   000    -    34 (Min/Max 34/34)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    262
193 Load_Cycle_Count        -O--CK   097   097   000    -    7440
194 Temperature_Celsius     -O---K   034   064   000    -    34 (0 21 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Multi_Zone_Error_Rate   PO---K   100   100   001    -    0
240 Head_Flying_Hours       ------   100   100   000    -    4344 (219 130 0)
241 Total_LBAs_Written      ------   100   253   000    -    73966167809
242 Total_LBAs_Read         ------   100   253   000    -    558276191732
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0a       GPL     R/W      8  Device Statistics Notification
0x0c       GPL     R/O   2048  Pending Defects log
0x0f       GPL     R/O      2  Sense Data for Successful NCQ Cmds log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    768  Current Device Internal Status Data log
0x2f       GPL     R/O      1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS     160  Device vendor specific log
0xa2       GPL     VS   16320  Device vendor specific log
0xa4       GPL,SL  VS     160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xb1       GPL,SL  VS     160  Device vendor specific log
0xb4       GPL,SL  VS      16  Device vendor specific log
0xb6       GPL     VS    1920  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS       8  Device vendor specific log
0xc3       GPL,SL  VS      32  Device vendor specific log
0xc6       GPL     VS    5184  Device vendor specific log
0xc7       GPL,SL  VS       8  Device vendor specific log
0xc9       GPL,SL  VS       8  Device vendor specific log
0xca       GPL,SL  VS      16  Device vendor specific log
0xcd       GPL,SL  VS       1  Device vendor specific log
0xce       GPL     VS       1  Device vendor specific log
0xcf       GPL     VS     512  Device vendor specific log
0xd1       GPL     VS     656  Device vendor specific log
0xd2       GPL     VS   10256  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xda       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7630         -
# 2  Extended offline    Interrupted (host reset)      00%      7507         -
# 3  Extended offline    Interrupted (host reset)      00%      7464         -
# 4  Short offline       Completed without error       00%      7330         -
# 5  Extended offline    Interrupted (host reset)      00%      7311         -
# 6  Short offline       Completed without error       00%      7242         -
# 7  Extended offline    Completed without error       00%      7165         -
# 8  Short offline       Completed without error       00%      7040         -
# 9  Extended offline    Interrupted (host reset)      00%      6994         -
#10  Short offline       Completed without error       00%      6971         -
#11  Extended offline    Interrupted (host reset)      00%      6959         -
#12  Short offline       Completed without error       00%      6902         -
#13  Extended offline    Completed without error       00%      6875         -
#14  Extended offline    Interrupted (host reset)      00%      6784         -
#15  Short offline       Completed without error       00%      6596         -
#16  Extended offline    Interrupted (host reset)      00%      6535         -
#17  Extended offline    Completed without error       00%      6492         -
#18  Extended offline    Completed without error       00%      6451         -
#19  Short offline       Completed without error       00%      6379         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     34/34 Celsius
Lifetime    Min/Max Temperature:     21/64 Celsius
Specified Max Operating Temperature:    60 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         5 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     10/40 Celsius
Min/Max Temperature Limit:            5/60 Celsius
Temperature History Size (Index):    128 (18)

Index    Estimated Time   Temperature Celsius
  19    2026-06-16 16:57     ?  -
  20    2026-06-16 17:56    26  *******
  21    2026-06-16 18:55    37  ******************
  22    2026-06-16 19:54     ?  -
  23    2026-06-16 20:53    26  *******
  24    2026-06-16 21:52    36  *****************
  25    2026-06-16 22:51    36  *****************
  26    2026-06-16 23:50    33  **************
  27    2026-06-17 00:49    33  **************
  28    2026-06-17 01:48     ?  -
  29    2026-06-17 02:47    26  *******
  30    2026-06-17 03:46    43  ************************
  31    2026-06-17 04:45    45  **************************
  32    2026-06-17 05:44    44  *************************
  33    2026-06-17 06:43    40  *********************
  34    2026-06-17 07:42     ?  -
  35    2026-06-17 08:41    26  *******
  36    2026-06-17 09:40    37  ******************
  37    2026-06-17 10:39    48  *****************************
  38    2026-06-17 11:38    50  *******************************
  39    2026-06-17 12:37    49  ******************************
  40    2026-06-17 13:36    48  *****************************
  41    2026-06-17 14:35    48  *****************************
  42    2026-06-17 15:34    49  ******************************
  43    2026-06-17 16:33    50  *******************************
  44    2026-06-17 17:32    49  ******************************
  45    2026-06-17 18:31    49  ******************************
  46    2026-06-17 19:30    49  ******************************
  47    2026-06-17 20:29    50  *******************************
  48    2026-06-17 21:28    49  ******************************
  49    2026-06-17 22:27    50  *******************************
  50    2026-06-17 23:26    50  *******************************
  51    2026-06-18 00:25    49  ******************************
 ...    ..(  3 skipped).    ..  ******************************
  55    2026-06-18 04:21    49  ******************************
  56    2026-06-18 05:20    48  *****************************
  57    2026-06-18 06:19    49  ******************************
 ...    ..(  2 skipped).    ..  ******************************
  60    2026-06-18 09:16    49  ******************************
  61    2026-06-18 10:15    48  *****************************
  62    2026-06-18 11:14    38  *******************
  63    2026-06-18 12:13    38  *******************
  64    2026-06-18 13:12    47  ****************************
  65    2026-06-18 14:11    51  ********************************
  66    2026-06-18 15:10    51  ********************************
  67    2026-06-18 16:09    50  *******************************
 ...    ..(  2 skipped).    ..  *******************************
  70    2026-06-18 19:06    50  *******************************
  71    2026-06-18 20:05    49  ******************************
  72    2026-06-18 21:04     ?  -
  73    2026-06-18 22:03    33  **************
  74    2026-06-18 23:02     ?  -
  75    2026-06-19 00:01    36  *****************
  76    2026-06-19 01:00     ?  -
  77    2026-06-19 01:59    27  ********
  78    2026-06-19 02:58     ?  -
  79    2026-06-19 03:57    31  ************
  80    2026-06-19 04:56     ?  -
  81    2026-06-19 05:55    32  *************
  82    2026-06-19 06:54     ?  -
  83    2026-06-19 07:53    28  *********
  84    2026-06-19 08:52     ?  -
  85    2026-06-19 09:51    32  *************
  86    2026-06-19 10:50     ?  -
  87    2026-06-19 11:49    32  *************
  88    2026-06-19 12:48     ?  -
  89    2026-06-19 13:47    33  **************
  90    2026-06-19 14:46     ?  -
  91    2026-06-19 15:45    34  ***************
  92    2026-06-19 16:44     ?  -
  93    2026-06-19 17:43    34  ***************
  94    2026-06-19 18:42     ?  -
  95    2026-06-19 19:41    29  **********
  96    2026-06-19 20:40     ?  -
  97    2026-06-19 21:39    33  **************
  98    2026-06-19 22:38     ?  -
  99    2026-06-19 23:37    35  ****************
 100    2026-06-20 00:36     ?  -
 101    2026-06-20 01:35    35  ****************
 102    2026-06-20 02:34     ?  -
 103    2026-06-20 03:33    39  ********************
 104    2026-06-20 04:32     ?  -
 105    2026-06-20 05:31    39  ********************
 106    2026-06-20 06:30     ?  -
 107    2026-06-20 07:29    39  ********************
 108    2026-06-20 08:28     ?  -
 109    2026-06-20 09:27    40  *********************
 110    2026-06-20 10:26     ?  -
 111    2026-06-20 11:25    37  ******************
 112    2026-06-20 12:24     ?  -
 113    2026-06-20 13:23    39  ********************
 114    2026-06-20 14:22     ?  -
 115    2026-06-20 15:21    40  *********************
 116    2026-06-20 16:20    37  ******************
 117    2026-06-20 17:19    37  ******************
 118    2026-06-20 18:18    37  ******************
 119    2026-06-20 19:17     ?  -
 120    2026-06-20 20:16    37  ******************
 121    2026-06-20 21:15     ?  -
 122    2026-06-20 22:14    39  ********************
 123    2026-06-20 23:13     ?  -
 124    2026-06-21 00:12    28  *********
 125    2026-06-21 01:11     ?  -
 126    2026-06-21 02:10    34  ***************
 127    2026-06-21 03:09     ?  -
   0    2026-06-21 04:08    34  ***************
   1    2026-06-21 05:07     ?  -
   2    2026-06-21 06:06    35  ****************
   3    2026-06-21 07:05     ?  -
   4    2026-06-21 08:04    38  *******************
   5    2026-06-21 09:03     ?  -
   6    2026-06-21 10:02    39  ********************
   7    2026-06-21 11:01     ?  -
   8    2026-06-21 12:00    38  *******************
   9    2026-06-21 12:59     ?  -
  10    2026-06-21 13:58    27  ********
  11    2026-06-21 14:57     ?  -
  12    2026-06-21 15:56    34  ***************
  13    2026-06-21 16:55     ?  -
  14    2026-06-21 17:54    36  *****************
  15    2026-06-21 18:53     ?  -
  16    2026-06-21 19:52    31  ************
  17    2026-06-21 20:51     ?  -
  18    2026-06-21 21:50    34  ***************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             435  ---  Lifetime Power-On Resets
0x01  0x010  4            7630  ---  Power-on Hours
0x01  0x018  6     72175362799  ---  Logical Sectors Written
0x01  0x020  6      1578807047  ---  Number of Write Commands
0x01  0x028  6    544071278135  ---  Logical Sectors Read
0x01  0x030  6      1773116978  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            7415  ---  Spindle Motor Power-on Hours
0x03  0x010  4            4344  ---  Head Flying Hours
0x03  0x018  4            7440  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4             262  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x04  0x018  4               0  -D-  Physical Element Status Changed
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              48  ---  Average Short Term Temperature
0x05  0x018  1              40  ---  Average Long Term Temperature
0x05  0x020  1              63  ---  Highest Temperature
0x05  0x028  1              24  ---  Lowest Temperature
0x05  0x030  1              60  ---  Highest Average Short Term Temperature
0x05  0x038  1              34  ---  Lowest Average Short Term Temperature
0x05  0x040  1              45  ---  Highest Average Long Term Temperature
0x05  0x048  1              38  ---  Lowest Average Long Term Temperature
0x05  0x050  4            3916  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             288  ---  Number of Hardware Resets
0x06  0x010  4              35  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

admin@truenas[~]$ 



=== START OF INFORMATION SECTION ===
Device Model:     ST18000NM000J-2TV103
Serial Number:    ZR51QRFY
LU WWN Device Id: 5 000c50 0db2f7b9c
Firmware Version: SN06
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jun 21 22:37:42 2026 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Disabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error aborted command
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  559) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1500) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   082   064   044    -    171676181
  3 Spin_Up_Time            PO----   090   090   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    19
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   060   060   045    -    1097016
  9 Power_On_Hours          -O--CK   100   100   000    -    60
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    19
 18 Unknown_Attribute       PO-R--   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   065   050   000    -    35 (Min/Max 31/35)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    11
193 Load_Cycle_Count        -O--CK   100   100   000    -    71
194 Temperature_Celsius     -O---K   035   050   000    -    35 (0 26 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Multi_Zone_Error_Rate   PO---K   100   100   001    -    0
240 Head_Flying_Hours       ------   100   100   000    -    7 (142 152 0)
241 Total_LBAs_Written      ------   100   253   000    -    2361608184
242 Total_LBAs_Read         ------   100   253   000    -    7334215
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0a       GPL     R/W      8  Device Statistics Notification
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    768  Current Device Internal Status Data log
0x2f       GPL     R/O      1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS     160  Device vendor specific log
0xa2       GPL     VS   16320  Device vendor specific log
0xa4       GPL,SL  VS     160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xb1       GPL,SL  VS     160  Device vendor specific log
0xb6       GPL     VS    1920  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS       8  Device vendor specific log
0xc3       GPL,SL  VS      24  Device vendor specific log
0xc6       GPL     VS    5184  Device vendor specific log
0xc7       GPL,SL  VS       8  Device vendor specific log
0xc9       GPL,SL  VS       8  Device vendor specific log
0xca       GPL,SL  VS      16  Device vendor specific log
0xcd       GPL,SL  VS       1  Device vendor specific log
0xce       GPL     VS       1  Device vendor specific log
0xcf       GPL     VS     512  Device vendor specific log
0xd1       GPL     VS     656  Device vendor specific log
0xd2       GPL     VS   10256  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xda       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        60         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    35 Celsius
Power Cycle Min/Max Temperature:     31/35 Celsius
Lifetime    Min/Max Temperature:     25/67 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         4 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     10/40 Celsius
Min/Max Temperature Limit:            5/60 Celsius
Temperature History Size (Index):    128 (79)

Index    Estimated Time   Temperature Celsius
  80    2026-06-16 16:57     ?  -
 ...    ..( 46 skipped).    ..  -
 127    2026-06-18 15:10     ?  -
   0    2026-06-18 16:09    67  ***************************************+
   1    2026-06-18 17:08     ?  -
   2    2026-06-18 18:07    26  *******
   3    2026-06-18 19:06    40  *********************
   4    2026-06-18 20:05    42  ***********************
   5    2026-06-18 21:04    42  ***********************
   6    2026-06-18 22:03    39  ********************
   7    2026-06-18 23:02     ?  -
   8    2026-06-19 00:01    26  *******
   9    2026-06-19 01:00    37  ******************
  10    2026-06-19 01:59    48  *****************************
  11    2026-06-19 02:58     ?  -
  12    2026-06-19 03:57    49  ******************************
  13    2026-06-19 04:56    47  ****************************
  14    2026-06-19 05:55    45  **************************
  15    2026-06-19 06:54    45  **************************
  16    2026-06-19 07:53    45  **************************
  17    2026-06-19 08:52    46  ***************************
  18    2026-06-19 09:51    45  **************************
  19    2026-06-19 10:50    45  **************************
  20    2026-06-19 11:49    46  ***************************
 ...    ..(  3 skipped).    ..  ***************************
  24    2026-06-19 15:45    46  ***************************
  25    2026-06-19 16:44    45  **************************
 ...    ..(  7 skipped).    ..  **************************
  33    2026-06-20 00:36    45  **************************
  34    2026-06-20 01:35    46  ***************************
  35    2026-06-20 02:34    45  **************************
  36    2026-06-20 03:33    39  ********************
  37    2026-06-20 04:32    39  ********************
  38    2026-06-20 05:31     ?  -
  39    2026-06-20 06:30    48  *****************************
  40    2026-06-20 07:29    48  *****************************
  41    2026-06-20 08:28    47  ****************************
  42    2026-06-20 09:27    46  ***************************
  43    2026-06-20 10:26    46  ***************************
  44    2026-06-20 11:25    46  ***************************
  45    2026-06-20 12:24    47  ****************************
  46    2026-06-20 13:23    46  ***************************
  47    2026-06-20 14:22     ?  -
  48    2026-06-20 15:21    29  **********
  49    2026-06-20 16:20     ?  -
  50    2026-06-20 17:19    27  ********
  51    2026-06-20 18:18     ?  -
  52    2026-06-20 19:17    28  *********
  53    2026-06-20 20:16     ?  -
  54    2026-06-20 21:15    29  **********
  55    2026-06-20 22:14     ?  -
  56    2026-06-20 23:13    37  ******************
  57    2026-06-21 00:12    39  ********************
  58    2026-06-21 01:11    39  ********************
  59    2026-06-21 02:10    38  *******************
  60    2026-06-21 03:09     ?  -
  61    2026-06-21 04:08    37  ******************
  62    2026-06-21 05:07     ?  -
  63    2026-06-21 06:06    28  *********
  64    2026-06-21 07:05     ?  -
  65    2026-06-21 08:04    35  ****************
  66    2026-06-21 09:03     ?  -
  67    2026-06-21 10:02    38  *******************
  68    2026-06-21 11:01     ?  -
  69    2026-06-21 12:00    39  ********************
  70    2026-06-21 12:59     ?  -
  71    2026-06-21 13:58    38  *******************
  72    2026-06-21 14:57     ?  -
  73    2026-06-21 15:56    27  ********
  74    2026-06-21 16:55     ?  -
  75    2026-06-21 17:54    34  ***************
  76    2026-06-21 18:53     ?  -
  77    2026-06-21 19:52    35  ****************
  78    2026-06-21 20:51     ?  -
  79    2026-06-21 21:50    31  ************

SCT Error Recovery Control:
           Read:    100 (10.0 seconds)
          Write:    100 (10.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              19  ---  Lifetime Power-On Resets
0x01  0x010  4              60  ---  Power-on Hours
0x01  0x018  6      2144199112  ---  Logical Sectors Written
0x01  0x020  6        14096394  ---  Number of Write Commands
0x01  0x028  6         7084873  ---  Logical Sectors Read
0x01  0x030  6           21048  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              45  ---  Spindle Motor Power-on Hours
0x03  0x010  4               6  ---  Head Flying Hours
0x03  0x018  4              72  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              11  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x04  0x018  4               0  -D-  Physical Element Status Changed
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              35  ---  Current Temperature
0x05  0x010  1              44  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              50  ---  Highest Temperature
0x05  0x028  1              27  ---  Lowest Temperature
0x05  0x030  1              44  ---  Highest Average Short Term Temperature
0x05  0x038  1              44  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4              16  ---  Number of Hardware Resets
0x06  0x010  4               2  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

Those are the outputs for the two drives that are causing problems: the 18TB one that is being resilvered, and the 16TB one that always reports errors.

Max_Margreiter · June 21, 2026, 8:49pm

L2ARC

Max_Margreiter · June 21, 2026, 9:28pm

  Adapter Selected is a Avago SAS: SAS3008(C0)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:03:00:00
1  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:05:00:00

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

So it seems to be P16. The cables are either 50 or 75 cm, depending on the location of the drive; I need the 75 cm ones to reach the last part of the backplane. That's just the way the Jonsbo N5 case is built.
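
For anyone who wants to check this without reading the table by eye, here is a rough sketch in Python that pulls the firmware version of each controller out of a listing like the one above. It assumes the output has been saved as plain text in the same column layout, and that the first field of the firmware version is the phase number (16.00.12.00 read as P16); the function name `parse_controllers` and the `SAMPLE` string are made up for illustration and are not part of any tool.

```python
import re

# One controller row looks like:
#   0  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:03:00:00
# Columns: Num, Ctlr, FW Ver, NVDATA, x86-BIOS, PCI Addr
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)

def parse_controllers(listing: str) -> list[dict]:
    """Return one dict per controller row found in the listing text."""
    controllers = []
    for line in listing.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or status line
        num, ctlr, fw, nvdata, bios, pci = match.groups()
        controllers.append({
            "num": int(num),
            "ctlr": ctlr,
            "fw": fw,
            # Assumption: the leading field of the version is the phase.
            "phase": int(fw.split(".")[0]),
            "nvdata": nvdata,
            "bios": bios,
            "pci": pci,
        })
    return controllers

# Hypothetical saved copy of the two rows from the listing above.
SAMPLE = """\
Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:03:00:00
1  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.02.00     00:05:00:00
"""

if __name__ == "__main__":
    for c in parse_controllers(SAMPLE):
        print(f"controller {c['num']} at {c['pci']}: FW {c['fw']} (P{c['phase']})")
```

Both SAS3008 chips on the card show up as separate controllers, so it is worth confirming that they report the same firmware version.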
