How bad is this? Pool repeatedly resilvering

I changed my HBA from a 9300-8i to a 9305-24i. The day before, I ran a scrub, which reported no errors. I also changed the GPU at the same time, but nothing else; no drives were changed. I am running 7 drives in RAIDZ2.

Since then, a resilver job has been running intermittently without appearing to complete, with heavy writes to the drive labeled ‘sdd’. The alerts (below) are unhelpful.

Did I lose a drive? Or 2? Cabling issue? Is the HBA a dud? Something else?

Critical

Device: /dev/sdd [SAT], not capable of SMART self-check.

2024-10-23 23:34:19

Critical

Device: /dev/sdf [SAT], failed to read SMART Attribute Data.

2024-10-25 09:25:31

Critical

Pool main state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

2024-10-25 08:52:44 (Australia/Sydney)

pool: main
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri Oct 25 20:47:34 2024
27.5T / 27.5T scanned, 1.52T / 7.11T issued at 1.03G/s
223G resilvered, 21.42% done, 01:32:29 to go
config:

    NAME                                      STATE     READ WRITE CKSUM
    main                                      ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        638aa963-3608-4de3-bdab-bbcf91048c55  ONLINE       0     0     0
        a14aa6cb-8b88-4f15-85c3-369da778ba9e  ONLINE       0     0     0
        fabc7921-abfa-455e-a59c-f5c73ce8b9cd  ONLINE       0     0     0
        1f87e00e-4a6a-40a6-a87c-5fa1ac7285f1  ONLINE       0     0     0
        378c080e-2b10-4cdb-a8e9-780b1227ca93  ONLINE       0     0     0
        020c7fd5-bc56-448c-a485-6205a787855b  ONLINE       0     0     0
        33ae7d9e-d3f7-403a-b91a-91e9896b67cb  ONLINE       0     0     0  (resilvering)
    cache
      c463aca6-ae1f-4113-ab6a-e4586b3442ba    ONLINE       0     0     0

errors: No known data errors

The resilver appears to be repeatedly failing. I’m also seeing all of the reporting logs intermittently record zero values (disk temp, CPU load, total active processes).
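One way to confirm that the resilver is restarting, rather than just running slowly, is to record the "resilver in progress since" timestamp from `zpool status` at intervals and see whether it moves forward. The following is only a minimal sketch, assuming the output format shown above; the function names (`read_status`, `parse_resilver`, `resilver_restarted`) are made up for illustration and are not part of any ZFS or TrueNAS tooling.

```python
import re
import subprocess
from datetime import datetime

# Patterns assume the `zpool status` layout pasted above (an assumption;
# other OpenZFS versions may word these lines differently).
SCAN_RE = re.compile(r"resilver in progress since (?P<since>.+)")
DONE_RE = re.compile(r"(?P<pct>\d+(?:\.\d+)?)% done")
DEVICE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\(resilvering\)", re.M
)


def read_status(pool="main"):
    """Return the text of `zpool status <pool>` (needs the zpool CLI)."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_resilver(status_text):
    """Extract resilver start time, percent done, and resilvering devices.

    Returns None when no resilver is in progress.
    """
    since = SCAN_RE.search(status_text)
    if since is None:
        return None
    done = DONE_RE.search(status_text)
    started = datetime.strptime(
        since.group("since").strip(), "%a %b %d %H:%M:%S %Y"
    )
    return {
        "started": started,
        "percent_done": float(done.group("pct")) if done else None,
        "devices": DEVICE_RE.findall(status_text),
    }


def resilver_restarted(previous, current):
    """True if the current snapshot shows a later start time than the previous one."""
    if previous is None or current is None:
        return False
    return current["started"] > previous["started"]
```

Calling `parse_resilver(read_status("main"))` every few minutes and comparing consecutive results with `resilver_restarted` would show whether the start time keeps resetting. If it does, that points at the device dropping out and rejoining (cabling, power, HBA) rather than at a slow but healthy resilver.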

And I’m getting disk.sync_all tasks every few minutes… I think I’ve not connected something properly.

I have reseated everything and fixed a fan that had somehow shifted. The frequent “disk.sync_all” tasks have stopped, as has the weird logging behavior. Now to leave it be for a day with only the resilver running.

Everything is working now. Resilvered fine, and no errors on SMART.

Sometimes it’s that simple. 🙂