Recovering from catastrophic drive loss

I didn’t find anything quite hitting the spot for this, but def open to being pointed to any resource.

That said, I discovered today that while a new replacement drive was resilvering after a single-drive failure, a 2nd drive suddenly started erroring out until TN removed it from the Z1 pool.

I do have all of this data backed up, but still not a great tech day. I’ve recovered from a single drive failure before and it is relatively painless, even if time consuming. This is the first time I’ve had 2 drives fail at the same time.

I will buy another replacement drive, but how does one recover from catastrophic drive loss in terms of data recovery? Afaik, the data on the good drives is recoverable, and then I have the backup. So do I just need to rebuild the pool with the existing and replacement drives, then restore the data lost to the bad drives?

Is there a way to understand exactly what was lost? I was thinking something like Beyond Compare but I feel like there is a better way?
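One way to approach the "what was lost" question, as a rough sketch rather than a recommendation: walk the backup, hash each file, and compare it with the copy on the pool. The function names and the two root paths are hypothetical, and the sketch assumes the backup and the pool share the same directory layout. Note that this reads every file on the pool, which puts extra load on disks that are already struggling.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(backup_root, pool_root):
    """Compare every file under backup_root with its counterpart under pool_root.

    Returns three lists of relative paths: missing, unreadable, different.
    """
    missing, unreadable, different = [], [], []
    backup_root, pool_root = Path(backup_root), Path(pool_root)
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue
        rel = backup_file.relative_to(backup_root)
        pool_file = pool_root / rel
        if not pool_file.is_file():
            missing.append(str(rel))
            continue
        try:
            pool_hash = file_digest(pool_file)
        except OSError:
            # A read that fails with an I/O error lands here instead of
            # silently producing a hash of bad data.
            unreadable.append(str(rel))
            continue
        if pool_hash != file_digest(backup_file):
            different.append(str(rel))
    return missing, unreadable, different
```

Called as `compare_trees("/path/to/backup", "/mnt/PLEX")` (both paths are placeholders), the `missing` and `unreadable` lists would be the files to restore from backup.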

Also, 70% sure I’m missing something completely as well. Thank you in advance to the braintrust for putting me on the right path.

If you lost two drives in a Raid-Z1 VDEV, you lost the entire pool.
I think the following will help you understand

Try and make sure that it’s actually the disk itself that has failed. Have you done a sound check? Do they spin up ok? Is there any scratching or clacking? (I’m just assuming they’re HDDs).

There are a lot of other components that can fail. It’s not possible for TrueNAS or ZFS to tell reliably which hardware component has failed. Multiple drives failing at the same time, in particular, might suggest an issue somewhere else.

If you want some advice on the forums, you might want to post the output of the following commands:

  • sudo zpool status -v
  • sudo lsblk -o NAME,SIZE,TYPE,SERIAL,LABEL,PARTUUID
  • sudo dmesg

Also S.M.A.R.T output for each drive if you know how to do it.

Bacon, do you want those commands posted as Preformatted text?

If yes, OP please post the results from those commands using the Preformatted text (Ctrl+e), looks like </> in toolbar where you reply

looks like this when posted

sudo zpool status -v

Another question is how the drives are connected, what kind of failure it is, etc. Maybe a cable was just knocked slightly loose while swapping drives. Maybe it is an HBA overheating & in need of a fan pointed at it. It could be that the situation isn’t a total disaster.

Another thing, if possible, is to have all drives & the replacement drive connected at the same time. If not mistaken, it could save you in situations where you have drives a,b,c,d,e with e failing & being replaced by x, but c randomly throws a checksum error. If e is still healthy enough & can access the files where checksum error happens on c, you should still be able to have the parity to continue (even though e is actively being replaced by x)… unless I’m misunderstanding something critically.

This’d require you to have enough sata ports.
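To illustrate the parity reasoning above, here is a deliberately simplified model: single parity treated as a plain XOR across one stripe. Real RAID-Z uses variable-width stripes and block checksums, so this is only a sketch of the principle, and the chunk contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data chunks of one stripe plus one parity chunk (five drives in total).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# One chunk unreadable (the third one): XOR of the survivors and the parity
# chunk rebuilds it exactly.
survivors = [data[0], data[1], data[3], parity]
rebuilt_c = xor_blocks(survivors)

# Two chunks unreadable (third and fourth): the survivors only yield the XOR
# of the two missing chunks, which is not enough to recover either one.
partial = xor_blocks([data[0], data[1], parity])
```

This is why keeping the old, half-working drive connected during the replace can matter: any chunk it can still deliver counts as a survivor for stripes where another drive returns bad data.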

Hey guys, thanks for the replies. I will look through that guide after I run these commands (thanks for the link!).

To answer the Qs, yes they are HDDs. I actually didn’t hear any bad noise coming from either drive, and the really weird thing is, I can still browse the data. I have turned off all services and am trying not to access the disks at all. I’m assuming this is some kind of browse cache? I was expecting the share to be offline completely. So am I to understand that all the data is lost/needing restore from backup then?

I have 6 SATA ports, with 1 free one. If it was just a cable that would be amazing. It very well could be, I know so little. I just assumed the drive was bad and ordered a new one when TN booted it from the pool. I guess I’ll check when this new drive is done resilvering? In the case, the drives do have a fan directly on them, and I am monitoring the temperature constantly throughout the day. They never get above 45 deg c and are usually below 40.

Regarding the S.M.A.R.T tests, how do I get at them when the drives are offline? I’m looking now, but I will probably need some guidance on that one.

Here’s the output of zpool status -v:

  pool: PLEX
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Nov 26 11:51:20 2024
	3.75T scanned at 35.7M/s, 3.56T issued at 33.9M/s, 14.1T total
	51.0G resilvered, 25.20% done, 3 days 18:51:04 to go
config:

	NAME                                              STATE     READ WRITE CKSUM
	PLEX                                              DEGRADED     0     0     0
	  raidz1-0                                        DEGRADED     0     0     0
	    ada1p2                                        DEGRADED     0     0 2.23M  too many errors
	    ada4p2                                        REMOVED      0     0     0
	    ada2p2                                        DEGRADED     0     0 2.23M  too many errors
	    replacing-3                                   DEGRADED     0     0 2.23M
	      gptid/e48d5170-df72-11ed-b9cd-002500f1258a  REMOVED      0     0     0
	      gptid/00745c37-a8d1-11ef-8947-0cc47a710b82  ONLINE       0     0     0  (resilvering)

errors: Permanent errors have been detected in the following files:

... stuff n things ...

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:12 with 0 errors on Wed Nov 27 03:45:12 2024
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  ada0p2    ONLINE       0     0     0

errors: No known data errors
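For anyone who wants to pull the key facts out of a long `zpool status -v` paste, a small parsing sketch follows. The function names are made up for illustration, the sample is a trimmed excerpt of the output above, and the file path in it is a hypothetical placeholder because the real file list was elided in the post.

```python
import re

# Matches rows of the config table: NAME STATE READ WRITE CKSUM.
DEVICE_LINE = re.compile(
    r"^\s+(?P<name>\S+)\s+"
    r"(?P<state>ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)\s+"
    r"(?P<read>\S+)\s+(?P<write>\S+)\s+(?P<cksum>\S+)"
)
MARKER = "errors: Permanent errors have been detected in the following files:"


def unhealthy_devices(status_text):
    """Return (name, state, cksum) for rows that are not ONLINE or have errors."""
    rows = []
    for line in status_text.splitlines():
        match = DEVICE_LINE.match(line)
        if not match:
            continue
        counters = (match["read"], match["write"], match["cksum"])
        if match["state"] != "ONLINE" or any(c != "0" for c in counters):
            rows.append((match["name"], match["state"], match["cksum"]))
    return rows


def damaged_files(status_text):
    """Return the paths listed under the 'Permanent errors' heading, if any."""
    _, found, tail = status_text.partition(MARKER)
    if not found:
        return []
    files = []
    for line in tail.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            break  # the next pool's section starts here
        if stripped:
            files.append(stripped)
    return files


SAMPLE = """\
  pool: PLEX
 state: DEGRADED
config:

\tNAME          STATE     READ WRITE CKSUM
\tPLEX          DEGRADED     0     0     0
\t  raidz1-0    DEGRADED     0     0     0
\t    ada1p2    DEGRADED     0     0 2.23M  too many errors
\t    ada4p2    REMOVED      0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/PLEX/hypothetical/file.mkv
"""

for name, state, cksum in unhealthy_devices(SAMPLE):
    print(f"{name}: {state}, checksum errors: {cksum}")
print(damaged_files(SAMPLE))
```

The list under "Permanent errors" is ZFS's own account of which files it could not reconstruct, so it is probably the most direct answer to what needs restoring.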

I tried lsblk but that doesn’t seem to be installed? There’s an lsb? I didn’t run it.

I’ll post dmesg in another post, due to post size limit.

dmesg is just too dig-dang big (10x limit). Here’s top and tail (eyeballed):

(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 20 c8 fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 f0 cd fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 60 ce fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 b8 ce fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 18 89 89 40 02 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 61 eb 40 fe 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 01 97 40 0c 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
ahcich4: Timeout on slot 12 port 0
ahcich4: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd c0 serr 00000000 cmd 0004cc17
ahcich4: Error while READ LOG EXT
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 18 18 22 22 40 79 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 e8 20 22 40 79 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 90 d0 38 40 0b 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 60 00 c5 fe 40 ff 00 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c0 c6 fe 40 ff 00 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 20 c8 fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 f0 cd fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 60 ce fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 b8 ce fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 18 89 89 40 02 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 61 eb 40 fe 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 01 97 40 0c 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
ahcich4: Timeout on slot 31 port 0
ahcich4: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd c0 serr 00000000 cmd 0004df17
ahcich4: Error while READ LOG EXT
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 18 18 22 22 40 79 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 e8 20 22 40 79 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 90 d0 38 40 0b 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 60 00 c5 fe 40 ff 00 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c0 c6 fe 40 ff 00 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 20 c8 fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 f0 cd fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 60 ce fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 b8 ce fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 18 89 89 40 02 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 61 eb 40 fe 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 01 97 40 0c 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 17 port 0
ahcich4: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd c0 serr 00000000 cmd 0004d117
ahcich4: Error while READ LOG EXT

...

(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 18 c7 fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 58 c5 fe 40 ff 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 60 72 44 40 e0 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 48 30 77 44 40 e0 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 60 90 78 44 40 e0 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 d0 40 19 40 77 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 db 70 40 e0 01 00 08 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 58 5c f9 40 34 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 40 95 2b 40 05 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 90 02 40 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 90 d4 30 40 46 02 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 90 d6 30 40 46 02 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 00 ()
(ada4:ahcich4:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
vm_fault: pager read error, pid 19015 (pkg)
pid 19015 (pkg), jid 1, uid 0: exited on signal 10
vm_fault: pager read error, pid 20847 (pkg)
pid 20847 (pkg), jid 1, uid 0: exited on signal 10
vm_fault: pager read error, pid 20848 (pkg)
pid 20848 (pkg), jid 1, uid 0: exited on signal 10
(ada4:ahcich4:0:0:0): READ_DMA48. ACB: 25 00 f0 cd fe 40 ff 00 00 00 c8 01
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 51 40 f0 cd fe 00 ff 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
(ada4:ahcich4:0:0:0): READ_DMA48. ACB: 25 00 f0 cd fe 40 ff 00 00 00 c8 01
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 51 40 f0 cd fe 00 ff 00 00 00 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 7e 6a 40 84 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 58 7e 6a 00 84 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 7e 6a 40 84 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 58 7e 6a 00 84 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 7e 6a 40 84 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 58 7e 6a 00 84 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 7e 6a 40 84 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 58 7e 6a 00 84 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 7e 6a 40 84 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 58 7e 6a 00 84 01 00 08 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1e d3 40 86 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 a8 1e d3 00 86 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1e d3 40 86 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 a8 1e d3 00 86 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1e d3 40 86 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 a8 1e d3 00 86 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1e d3 40 86 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 a8 1e d3 00 86 01 00 08 00
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1e d3 40 86 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 a8 1e d3 00 86 01 00 08 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
ahcich4: Timeout on slot 20 port 0
ahcich4: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd c0 serr 00000000 cmd 0004d417
(ada4:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Command timeout
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 0 port 0
ahcich4: is 00000000 cs 00000003 ss 00000000 rs 00000003 tfd c0 serr 00000000 cmd 0004c017
(ada4:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Command timeout
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 23 port 0
ahcich4: is 00000000 cs 01800000 ss 00000000 rs 01800000 tfd c0 serr 00000000 cmd 0004d717
(ada4:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Command timeout
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 1 port 0
ahcich4: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd c0 serr 00000000 cmd 0004c117
(ada4:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Command timeout
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 3 port 0
ahcich4: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd d0 serr 00000000 cmd 0004c317
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 6 port 0
ahcich4: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd c0 serr 00000000 cmd 0004c617
(ada4:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Command timeout
(ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 8 port 0
ahcich4: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd d0 serr 00000000 cmd 0004c817
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 9 port 0
ahcich4: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 1d0 serr 00000000 cmd 0004c917
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Error 5, Retries exhausted
ahcich4: Timeout on slot 12 port 0
ahcich4: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd d0 serr 00000000 cmd 0004cc17
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 13 port 0
ahcich4: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 1d0 serr 00000000 cmd 0004cd17
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Error 5, Retries exhausted
ada4 at ahcich4 bus 0 scbus4 target 0 lun 0
ada4: <ST5000LM000-2U8170 0001> s/n WCJ6PFAB detached
GEOM_MIRROR: Device swap0: provider ada4p1 disconnected.
ahcich4: Timeout on slot 15 port 0
ahcich4: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd d0 serr 00000000 cmd 0004cf17
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Retrying command, 0 more tries remain
ahcich4: Timeout on slot 16 port 0
ahcich4: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 1d0 serr 00000000 cmd 0004d017
(aprobe0:ahcich4:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Error 5, Retries exhausted
ahcich4: Poll timeout on slot 18 port 0
ahcich4: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 1d0 serr 00000000 cmd 0004d217
(aprobe0:ahcich4:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich4:0:0:0): CAM status: Command timeout
(aprobe0:ahcich4:0:0:0): Error 5, Retries exhausted
(ada4:ahcich4:0:0:0): Periph destroyed
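Since the full dmesg is too big to post, a summary may be more useful than a top-and-tail. The sketch below counts a few kinds of CAM messages per device. The function name and categories are my own choices, and the sample lines are copied from the log above. `UNC` is the ATA "uncorrectable" error bit, which usually points at unreadable sectors on the drive itself, whereas bare command timeouts are less specific.

```python
import re
from collections import Counter

# Lines look like: (ada4:ahcich4:0:0:0): <message>
CAM_LINE = re.compile(r"^\((?P<dev>\w+):(?P<chan>\w+):[\d:]+\): (?P<msg>.*)$")


def summarize_cam_errors(dmesg_text):
    """Count selected CAM error messages per device."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CAM_LINE.match(line.strip())
        if not match:
            continue  # e.g. controller-level "ahcich4: Timeout on slot ..." lines
        dev, msg = match["dev"], match["msg"]
        if "Retries exhausted" in msg:
            counts[(dev, "retries exhausted")] += 1
        elif "(UNC" in msg:
            counts[(dev, "uncorrectable read (UNC)")] += 1
        elif msg == "CAM status: Command timeout":
            counts[(dev, "command timeout")] += 1
    return counts


SAMPLE = """\
(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 7e 6a 40 84 01 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 58 7e 6a 00 84 01 00 08 00
(ada4:ahcich4:0:0:0): Error 5, Retries exhausted
ahcich4: Timeout on slot 20 port 0
(ada4:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Command timeout
"""

for (dev, kind), count in sorted(summarize_cam_errors(SAMPLE).items()):
    print(f"{dev}: {kind} x{count}")
```

Feeding it the complete dmesg text would show whether any device other than ada4 is logging errors, which the eyeballed excerpt cannot tell us.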

Did you unplug the failed drives or leave them connected during the replace/resilver? If you left them plugged in, it would explain why pool is just degraded instead of dead.

The first one came out when I replaced it. The 2nd failed drive I left it in there. I didn’t want to interrupt the resilver. So I guess on the next reboot, the pool will just be unavail. Is there anything we can do while it’s like this?

Very good! Both failed drives left plugged in is best!

Ah, it appears you’re running TrueNAS Core. Sadly I don’t remember the commands for that, other people will have to help here.

ZFS hasn’t FAULTED any of your disks, which is why you can still access data. You have a ton of checksum errors though.

When I was getting random read errors, it was cable related for me. In theory it could also be the chipset overheating, since you aren’t running an HBA.

You would need to run SMART long tests after the resilver is done. In regards to:

Wait for the resilver to complete & see if anything was actually reported as lost - you might even be very lucky & after a scrub things magically get back into parity.

camcontrol devlist
gpart list for (much!) more details
and, same as in SCALE,
smartctl -a /dev/adaN for all relevant disks (no sudo)
sas2flash -list or sas3flash -list where applicable (not here, it seems)
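Once you have SMART output, a few attributes help separate "disk is dying" from "cable or link problem". The sketch below scans the attribute table as printed by `smartctl -A` and flags nonzero raw values. The attribute selection and the media/cable hints are a common rule of thumb rather than anything authoritative, and the sample values are invented for illustration, not taken from the OP's drives.

```python
# Attribute name -> rough hint about where a nonzero raw value points.
WATCHED = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Offline_Uncorrectable": "media",
    "UDMA_CRC_Error_Count": "cable/link",
}


def flag_attributes(smartctl_text):
    """Return {attribute: (hint, raw_value)} for watched attributes above zero."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Table rows have ten columns; the name is the 2nd, RAW_VALUE the 10th.
        if len(fields) < 10 or fields[1] not in WATCHED:
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # skip raw values that are not a plain integer
        if raw > 0:
            flagged[fields[1]] = (WATCHED[fields[1]], raw)
    return flagged


# Hypothetical sample in the layout smartctl prints; the numbers are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(flag_attributes(SAMPLE))
```

Pending or reallocated sectors suggest the drive itself, while a climbing CRC error count with clean media attributes would fit the loose-cable theory raised earlier in the thread.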

Ok friends. After a few fits and starts, and despite early promise, I have lost hope that this pool will be rebuilt/resilvered successfully. The resilver process keeps crapping out at varying points around 30% and there are tons of errors.

So, I’ve never had to rebuild a pool before. This one was actually the first and only ZFS pool I’ve ever built. After 5 years, it’s been a good run I guess? What is the recommended course of action from here to proceed with recreating it? I imagine I should test all drives first. But apart from that, I am unsure what’s next?

Thanks for any advice.