Reset degraded HD status

Hello,
I have a pool of 12 HDs of various capacities. ZFS reports some of them as degraded, but currently none of them show read, write, or checksum errors.
I would rather not replace 5 HDs that apparently have no errors. I tried changing controllers, but without success.
Is it possible to “reset” the status?
Attached is the pool status.
Using the “-xv” option I get the list of affected files; if I delete them, can I “recover” the pool status?
Thanks

truenas_admin@lp-truenas-ge[~]$ sudo zpool status -v
[sudo] password for truenas_admin:
  pool: HD8
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub paused since Sun Nov  3 13:57:56 2024
        scrub started on Sun Nov  3 08:18:09 2024
        0B / 52.7T scanned, 0B / 52.7T issued
        0B repaired, 0.00% done
remove: Removal of /dev/disk/by-partuuid/1f0c5b1d-05a8-4a3b-b44b-bb87774bddb2 canceled on Thu Oct 31 11:59:52 2024
config:

    NAME                                    STATE     READ WRITE CKSUM
    HD8                                     DEGRADED     0     0     0
      58393e90-f55e-4781-9d42-438e189d5297  DEGRADED     0     0     0  too many errors
      93ba1c8b-7665-4001-87c2-e8956352d3a2  ONLINE       0     0     0
      a66b0299-7679-4547-89bb-cc41f717d5d4  ONLINE       0     0     0
      b65b6595-e676-45c5-b0ed-e238186029ee  ONLINE       0     0     0
      ea333c0c-45f5-4844-9059-269265d23197  ONLINE       0     0     0
      a9de848e-53e2-4da8-97e7-0eb5efd4ca9e  DEGRADED     0     0     0  too many errors
      971229d2-a7be-4624-bc93-5e95b7f388b8  DEGRADED     0     0     0  too many errors
      02ba4089-43fe-4019-ba0e-fdfc440bbff9  DEGRADED     0     0     0  too many errors
      3d8aee2e-5e2d-4bcc-ba19-6605d188a887  ONLINE       0     0     0
      623a217e-25e3-4553-b44e-d20c87720740  ONLINE       0     0     0
      c4b394cc-0c65-48cb-be5f-8aafa1455fba  DEGRADED     0     0     0  too many errors
      1f0c5b1d-05a8-4a3b-b44b-bb87774bddb2  ONLINE       0     0     0

What type of controllers do you use?
Is this a bare metal install or virtualized?

You can clear errors with zpool clear poolname.
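
For example, with the pool name from your paste:

    sudo zpool clear HD8

Note this only resets the error counters; it does not repair any data.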

You only provided part of the entire printout. Why? The other parts matter as well, especially for the error message you have.

The zpool clear will likely not clear the error as these are different from Read/Write/Cksum errors. This error message is telling you that you are missing too much data in those drives.

Run zpool scrub HD8 and let it finish. Then run zpool status -v HD8 and report the entire output of the command. You can also check the status of the scrub with that same second command.
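
In other words, something like this (the scrub runs in the background, so you can re-run the second command periodically to check progress):

    sudo zpool scrub HD8
    sudo zpool status -v HD8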

Also, are you running a STRIPE of all those drives?

Thanks,
I ran this, and the errors on the HDs are now cleared, but I still get the message “Pool is not healthy”.
All HDs show no errors.
It is a bare-metal install; the controller can manage 16 SATA HDs.

Run a scrub now.

What type of controller?
And as mentioned, are you really running a stripe?

Hello, thanks.
This is the command:
sudo zpool status -v

truenas_admin@lp-truenas-ge[~]$ sudo zpool status -v
[sudo] password for truenas_admin: 
  pool: HD8
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub paused since Sun Nov  3 13:57:56 2024
        scrub started on Sun Nov  3 08:18:09 2024
        0B / 52.7T scanned, 0B / 52.7T issued
        0B repaired, 0.00% done
remove: Removal of /dev/disk/by-partuuid/1f0c5b1d-05a8-4a3b-b44b-bb87774bddb2 canceled on Thu Oct 31 11:59:52 2024
config:

        NAME                                    STATE     READ WRITE CKSUM
        HD8                                     DEGRADED     0     0     0
          58393e90-f55e-4781-9d42-438e189d5297  DEGRADED     0     0     0  too many errors
          93ba1c8b-7665-4001-87c2-e8956352d3a2  ONLINE       0     0     0
          a66b0299-7679-4547-89bb-cc41f717d5d4  ONLINE       0     0     0
          b65b6595-e676-45c5-b0ed-e238186029ee  ONLINE       0     0     0
          ea333c0c-45f5-4844-9059-269265d23197  ONLINE       0     0     0
          a9de848e-53e2-4da8-97e7-0eb5efd4ca9e  DEGRADED     0     0     0  too many errors
          971229d2-a7be-4624-bc93-5e95b7f388b8  DEGRADED     0     0     0  too many errors
          02ba4089-43fe-4019-ba0e-fdfc440bbff9  DEGRADED     0     0     0  too many errors
          3d8aee2e-5e2d-4bcc-ba19-6605d188a887  ONLINE       0     0     0
          623a217e-25e3-4553-b44e-d20c87720740  ONLINE       0     0     0
          c4b394cc-0c65-48cb-be5f-8aafa1455fba  DEGRADED     0     0     0  too many errors
          1f0c5b1d-05a8-4a3b-b44b-bb87774bddb2  ONLINE       0     0     0

Scrubbing HD8 would keep the system running for about 56 hours (!!!), so I will try it next time…
All drives are striped, with no RAID / redundancy.

Damn… I see you like to live dangerously.

Again, what exact model of controller?

Please use the code tags “</>” when inserting any screen captures. I changed the format to make it look correct; that is what was confusing me before.

Since the last scrub said no bytes were repaired, you can give zpool clear HD8 a try. The problem with no redundancy is that if one drive goes bad, the entire pool is gone.

In the meantime you can run a SMART long self-test on all of the drives. The command is smartctl -t long /dev/sda, for example (I’m assuming you are using SCALE). Run this command for each of your drives; the smaller drives should finish faster than the larger ones. When you issue the command, you will get a message telling you to wait XXX minutes for the test to complete. After that time has passed, run smartctl -a /dev/sda and look at the self-test log for the extended test: you want to see “Completed without error”. If there is a failure, that is not good, as your drive is failing.
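
As a rough sketch, using /dev/sda as a placeholder for one of your drives (repeat for each /dev/sdX):

    sudo smartctl -t long /dev/sda   # starts the extended self-test and prints the estimated duration
    # ...wait the indicated time, then:
    sudo smartctl -a /dev/sda        # check the self-test log for “Completed without error”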

Hello,
I am running a scrub now; it started last night and will run for about 50 hours.
Some disks have errors (one of them, at 5% so far, has 156 errors).
At the end I will post the result here.
Thank you very much!

Maybe now would be a good time to post your complete hardware, including the exact type of controller and the HDs used.

Since you have a STRIPE (highly advised against), I would caution you to back up whatever data you have before a complete failure. If one drive drops out, your data is for the most part gone. You may be able to recover some of it, but that is a process for the folks who really need that data back. Not something I would ever want to try.

Posting your hardware would be a good idea; it has been asked for before, and it allows us to help you better instead of making assumptions.

This is my controller (arriving next week):
YABOANG PCI-E 16X to SATA 3.0 expansion card, 24-port SATA Gen III controller, 6 Gbps, compatible with WIN11/WIN10/WIN8/LINUX 2.6.x — Amazon.it: Informatica

Right now I have the same card, but with 16 ports.
The HDs are:

S/N           Model    Capacity  Error  Online
8HGL3XDH      HGST     12 TB     No     x
ZCT28TJ8      Seagate  8 TB      No     x
ZCT26GY1      Seagate  8 TB      No     x
AAGGLA7H      HGST     12 TB     No     x
ZV708WDS      Seagate  10 TB     No     x
ZJV1PJKZ      Seagate  12 TB     No     x
ZR50VM0A      Seagate  18 TB     No     x
8DGVYPYH      HGST     12 TB     No     x
X1D0A161FVGG  Toshiba  16 TB     No     x
Y1X0A1U2FJDH  Toshiba  16 TB     No     x
8DHAM5ZH      HGST     12 TB     No     x
43JN80BGS     Toshiba  3 TB      No     x

Did you provide the drive serial numbers or the model numbers?

They are serial numbers.

Need the model numbers. The serial numbers are good to see if the drive is under warranty, but the model numbers allow us to see what drive type you have.

It is a minor concern at this point, since you first need to save any data you want to retain somewhere; but if any of the drive models are SMR, for example, we can warn you about those.

Your drives may be failing, your PSU may be failing. Your cabling may be flaky. Your SATA card may be overheating, or just flaky. Your drives may be overheating. Or it could be memory errors.

But you have no redundancy.

So any error may result in a failure, corruption or total loss of the pool.

Suggest backing up what you can save and rebuilding your pool with at least one drive of redundancy, and then you can try and work out the cause of your errors.
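
As an illustration only (hypothetical pool name and /dev/sdX device names; on TrueNAS you would normally build the pool from the GUI rather than the shell), the difference is roughly:

    # stripe: losing any single disk loses the whole pool
    zpool create tank /dev/sda /dev/sdb /dev/sdc /dev/sdd
    # raidz1: the pool survives the loss of one disk
    zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd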

6 ASM1064 SATA controllers (PCIe 3.0 x1) behind a 24-lane PCIe 2.0 switch…
This may not be as bad as I thought when I saw the contraption, but it’s certainly not a recommendable or supportable configuration. Do not use that!

Edit. Corrected specifications.

Here is the complete configuration:

[Screenshot: 2024-11-05 125447]

Right now, after a scrub of 2 days, there are no errors, but ZFS Health is “bad” (see the attached screenshot), and I don’t understand why.
So next I will transfer all data to another external NAS (58 TB!), reconfigure the pool with a small change, and re-copy all data to the new pool.
About 1 week of “work”… :grinning:
[Screenshot: 2024-11-05 124040]

The system has 5 pools with a total of 29 HDs / SSDs / NVMe drives, 32 GB RAM, an old AMD 9590 CPU :grinning:, and a Crosshair Formula motherboard.
Thanks

Please post the output of zpool status -v HD8
If there are any errors at all, run zpool clear HD8 and see if that clears the status alert.
Also, what does the TrueNAS GUI ‘Alert’ say?

He did, and then ran the scrub. The errors are now back.