Pool state is ONLINE: One or more devices is currently being resilvered

Pool storage1 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

This critical error appeared yesterday, but the pool is reporting fine and there are no other indications of any errors. Resilvering is definitely not happening. Just wondering if anyone has any advice on how to validate (or invalidate) this error, or dig into what it actually means.

Many thanks
P

Hello,
I would go to the shell/cli and check the pool:
sudo zpool status

and you can always clear the errors on a device if you are sure it is fine (e.g. SMART data checked) with:
sudo zpool clear <poolname>

It will then resilver if a device has been offline/reported errors.
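
For the SMART check, something like this is usually enough (a rough sketch; substitute your actual device node for /dev/sdX):

sudo smartctl -a /dev/sdX    # health verdict, attribute table and self-test log in one go

If that all looks clean, clearing the pool is reasonably safe.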

Thanks for chiming in, Etienne! Yep, looks fine to me. I wonder what generated the error then?

pool: storage1
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 12:31:26 with 0 errors on Mon Oct 13 02:29:01 2025
config:

    NAME                                      STATE     READ WRITE CKSUM
    storage1                                  ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        7912023c-7e6b-4bef-8445-3cef8f64fefe  ONLINE       0     0     0
        43133b2a-0ae2-4104-b4dc-686d734cbb42  ONLINE       0     0     0
        9cdf928f-7d07-4e26-a103-05f58bfc7171  ONLINE       0     0     0
        77292dd5-47c2-4bb9-87bd-7c9d8b276a06  ONLINE       0     0     0

Try the following & scroll down through the logs to try to get more info on what happened:
more /var/log/messages

Here is an example of when I recently had a port fail (luckily I have spare ports):

Oct  8 10:31:18 truenas kernel: sd 4:0:0:0: [sdf] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct  8 10:31:18 truenas kernel: sd 4:0:0:0: [sdf] tag#3 CDB: ATA command pass through(16) 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00
Oct  8 10:31:18 truenas kernel: sd 4:0:0:0: [sdf] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct  8 10:31:18 truenas kernel: sd 4:0:0:0: [sdf] tag#0 CDB: ATA command pass through(16) 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00

Some output referencing your disks will provide at least some insight as to what may have caused the pool to resilver, but I’m guessing a drive at least briefly dropped.

Edit: Maybe a bad example, as it was the next set of logs (where I reseated the drive & it failed to negotiate link speed) that clued me in to the port being the fault, but you get the gist.

Also, to my knowledge, logs don’t persist past a reboot, so hopefully that wasn’t your first instinct in this case (I’ll be very happy if someone corrects me on this).
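
In addition to /var/log/messages, you can ask ZFS itself what it saw. Roughly (these are standard OpenZFS/systemd tools, but exact availability and how long the journal is kept depend on how your box is set up):

sudo zpool events -v storage1                   # recent ZFS error/state-change events (in-memory, so also lost on reboot)
sudo zpool history -i storage1 | tail -n 50     # internally logged operations, including scrub/resilver starts
sudo journalctl -k --since yesterday | grep -iE 'fail|error|reset'    # kernel messages from around when the alert fired

The grep pattern is just an example; tweak it for whatever your controller driver and disks show up as.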

Thanks Fleshmauler! See below from msglog

Oct 12 09:42:45 truenas kernel: sd 10:0:1:0: device_block, handle(0x000a)
Oct 12 09:42:47 truenas kernel: sd 10:0:1:0: device_unblock and setting to running, handle(0x000a)
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] tag#9451 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] tag#9451 CDB: Read(16) 88 00 00 00 00 02 07 7b c9 58 00 00 00 08 00 00
Oct 12 09:42:49 truenas kernel: zio pool=storage1 vdev=/dev/disk/by-partuuid/7912023c-7e6b-4bef-8445-3cef8f64fefe error=5 type=1 offset=4462328590336 size=4096 flags=3145904
Oct 12 09:42:49 truenas kernel: zio pool=storage1 vdev=/dev/disk/by-partuuid/7912023c-7e6b-4bef-8445-3cef8f64fefe error=5 type=1 offset=4467012415488 size=237568 flags=2148533424
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] tag#9453 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] tag#9454 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] tag#9453 CDB: Read(16) 88 00 00 00 00 02 08 07 53 10 00 00 00 08 00 00
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] tag#9454 CDB: Read(16) 88 00 00 00 00 02 08 07 54 28 00 00 00 08 00 00
Oct 12 09:42:49 truenas kernel: zio pool=storage1 vdev=/dev/disk/by-partuuid/7912023c-7e6b-4bef-8445-3cef8f64fefe error=5 type=1 offset=4467010707456 size=4096 flags=3145904
Oct 12 09:42:49 truenas kernel: zio pool=storage1 vdev=/dev/disk/by-partuuid/7912023c-7e6b-4bef-8445-3cef8f64fefe error=5 type=1 offset=4467010850816 size=4096 flags=3145904
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] Synchronizing SCSI cache
Oct 12 09:42:49 truenas kernel: sd 10:0:1:0: [sdh] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: removing handle(0x000a), sas_addr(0x4433221101000000)
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: enclosure logical id(0x5d05099000052908), slot(2)
Oct 12 09:42:49 truenas kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( )
Oct 12 09:42:55 truenas netdata[3849466]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Oct 12 09:43:07 truenas kernel: mpt3sas_cm0: handle(0xa) sas_address(0x4433221101000000) port_type(0x1)
Oct 12 09:43:07 truenas kernel: scsi 10:0:2:0: Direct-Access ATA ST20000NT001-3LT EN01 PQ: 0 ANSI: 6
Oct 12 09:43:07 truenas kernel: scsi 10:0:2:0: SATA: handle(0x000a), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)
Oct 12 09:43:07 truenas kernel: scsi 10:0:2:0: enclosure logical id (0x5d05099000052908), slot(2)
Oct 12 09:43:07 truenas kernel: scsi 10:0:2:0: enclosure level(0x0000), connector name( )
Oct 12 09:43:07 truenas kernel: scsi 10:0:2:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Oct 12 09:43:07 truenas kernel: scsi 10:0:2:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: Attached scsi generic sg10 type 0
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: Power-on or device reset occurred
Oct 12 09:43:07 truenas kernel: end_device-10:2: add: handle(0x000a), sas_addr(0x4433221101000000)
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: [sdh] 39063650304 512-byte logical blocks: (20.0 TB/18.2 TiB)
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: [sdh] 4096-byte physical blocks
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: [sdh] Write Protect is off
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: [sdh] Write cache: enabled, read cache: enabled, supports DPO and FUA
Oct 12 09:43:07 truenas kernel: sdh: sdh1
Oct 12 09:43:07 truenas kernel: sd 10:0:2:0: [sdh] Attached SCSI disk

Well, that gives us clues that something caused a blip on drive sdh, but the drive recovered. This is where you'd want to run/review some SMART tests on that drive, check physical connections, etc.

I'm sadly not wise enough to decode what caused sdh to drop from those logs, but at least you now know what went wrong & which drive to focus your attention on.
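
Kicking off and reviewing those tests would look roughly like this (assuming the drive came back as /dev/sdh again; device letters can move after a reconnect, so confirm the serial first):

sudo smartctl -i /dev/sdh           # identity: check the model & serial match the suspect drive
sudo smartctl -t short /dev/sdh     # short self-test, a couple of minutes
sudo smartctl -t long /dev/sdh      # full-surface test, many hours on a 20 TB drive
sudo smartctl -l selftest /dev/sdh  # self-test results once they finish
sudo smartctl -A /dev/sdh           # watch reallocated/pending sectors and the UDMA CRC error count

A climbing UDMA CRC error count in particular tends to point at the cable/backplane rather than the drive itself.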

Many thanks, mate, to you and EtienneB for your input. I'll interrogate that disk and perhaps reseat the connections.

Best Rgds
P

I have seen these errors too from time to time.
I honestly used ChatGPT or Claude to help me decode them, and it gave me some commands to identify which controller/port/drive, etc. (I have 2 SAS controllers).
It even gave me some suggestions for BIOS settings on my Supermicro board that could conflict with the SAS controller.

Check the cabling, reseat the drive if possible, or try replacing the SATA cable as a starter. Could be a one-off, but as Fleshmauler said, run a smartctl test (which will probably be fine; at least mine were).
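
For reference, mapping a drive back to a controller and port can be done with stock tools; a rough sketch (column availability varies a bit between versions):

lsblk -o NAME,MODEL,SERIAL,HCTL,TRAN    # HCTL lines up with the "sd 10:0:1:0" style addresses in the kernel log
ls -l /dev/disk/by-path/                # shows which PCI device (i.e. which HBA) and port each disk hangs off

That's usually enough to tell whether errors are clustered on one HBA, one cable, or one drive.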

This forum is great for support! The way it does real-time search of what you type for related issues is a very savvy feature. Lots of cases to read through, and I'll bet it will be source material for a TN bot in the future, which is an ideal use of the technology.

The drive passed a short test and is well into a long test; we'll see what shakes out.

Thanks again!
P

Final update: the drive has passed its SMART tests and the pool has scrubbed fine. Chalking it up to space radiation for now.

P

Ah, yes. Good ole’ BOFH excuse #2. Solar Flares.

lol, 'tis only sorta a joke; there are documented cases of bit flips from such. This could be an indication of electronics failing, but the longer it goes without recurring, the more likely it is that space particles caused the issue.