Reset degraded HD status

:face_with_spiral_eyes: That’s a lot of HDDs on this old system and possibly not quite enough memory to deal with it.
The mix of different sizes (3 to 18 TB) is not quite suitable for ZFS.

The only good part is that you have another NAS to serve as backup.
What does 58 TB represent? The size of actual data in HD8, or the capacity of the external NAS?

You have to sanitise the layout somewhat before you end up losing all your data. The first step would be to get a proper SAS HBA instead of these SATA cards. At least no port multiplier seems to be involved, but the card you linked to puts several 4-port PCIe 3.0 x1 controllers (already not great) behind a PCIe 2.0 switch, so the whole thing operates at PCIe 2.0 speed and each group of four drives shares the bandwidth of a single PCIe 2.0 lane (500 MB/s). No wonder the whole thing is slow and generates errors when ZFS tries to access all drives at the same time, as it is designed to do.
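
To put rough numbers on it (assuming all drives on the card are active at once, as during a scrub): each group of four drives shares one PCIe 2.0 lane at about 500 MB/s, i.e. roughly 500 / 4 ≈ 125 MB/s per drive, well under the 200–270 MB/s a modern large HDD can sustain sequentially.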

I would have expected him to run the clear, run a SMART long test, run a scrub, and now I’d like to see what zpool status shows. Depending on what that says, he could possibly run a clear again to see if the alarm goes away. I don’t think for a second that it will fix things for any length of time; I personally think he has a failing system (many hard drive failures).
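
From the shell, that sequence would look roughly like this (pool name HD8 as used in this thread; /dev/sdX is a placeholder for each individual drive):

sudo zpool clear HD8             # reset the pool's error counters / alarm
sudo smartctl -t long /dev/sdX   # start a long self-test on each drive
sudo smartctl -a /dev/sdX        # review the self-test result when it finishes
sudo zpool scrub HD8             # verify all data in the pool
sudo zpool status -v HD8         # see what the scrub found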

I will try another “scrub” soon. All HDs are now without errors, but ZFS still shows an unhealthy status … with a “red cross”.
The folders / data are available, apparently with no loss.
Thanks

No errors reported by sudo zpool status HD8?

I would not do that. Your pool is fragile; copy the data off first, and then you can do any testing you like. A scrub puts stress on the drives.

Check the health of the drives (long SMART test, then smartctl -a /dev/XXX).
Copy the data out and/or remove some drives. Make another pool with some redundancy (at least raidz1, but raidz2 would be better) and transfer the data there (rough commands sketched below).
Potentially, you have the basis for a 5-wide vdev of 12 TB drives (Z1/Z2) and two 3-wide vdevs (18-18-16 and 8-8-10; not great, but better than no redundancy).

And replace this weird SATA card with a SAS HBA.
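
A very rough sketch, assuming the new pool can be built from drives that are not (or no longer) part of HD8, so that HD8 stays importable while the data is replicated; “tank” and the disk names here are placeholders, and on TrueNAS the pool would normally be created from the web UI rather than the shell:

sudo smartctl -t long /dev/sdX                        # long self-test on every candidate drive first
sudo zpool create tank raidz2 sdb sdc sdd sde sdf     # example: one 5-wide raidz2 vdev
sudo zfs snapshot -r HD8@migrate                      # snapshot the old pool recursively
sudo zfs send -R HD8@migrate | sudo zfs recv -F tank  # replicate the datasets to the new pool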

truenas_admin@lp-truenas-ge[~]$ sudo zpool status HD8
[sudo] password for truenas_admin:
  pool: HD8
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: Message ID: ZFS-8000-8A — OpenZFS documentation
  scan: scrub in progress since Wed Nov 20 05:17:34 2024
        4.70T / 52.8T scanned at 1.93G/s, 3.18T / 52.8T issued at 1.31G/s
        0B repaired, 6.01% done, 10:47:19 to go
remove: Removal of vdev 11 copied 1.35T in 2h21m, completed on Tue Nov 19 15:13:50 2024
        2.25M memory used for removed device mappings
config:

    NAME                                    STATE     READ WRITE CKSUM
    HD8                                     ONLINE       0     0     0
      58393e90-f55e-4781-9d42-438e189d5297  ONLINE       0     0     0
      93ba1c8b-7665-4001-87c2-e8956352d3a2  ONLINE       0     0     0
      a66b0299-7679-4547-89bb-cc41f717d5d4  ONLINE       0     0     0
      b65b6595-e676-45c5-b0ed-e238186029ee  ONLINE       0     0     0
      ea333c0c-45f5-4844-9059-269265d23197  ONLINE       0     0     0
      a9de848e-53e2-4da8-97e7-0eb5efd4ca9e  ONLINE       0     0     0
      971229d2-a7be-4624-bc93-5e95b7f388b8  ONLINE       0     0     0
      02ba4089-43fe-4019-ba0e-fdfc440bbff9  ONLINE       0     0     0
      3d8aee2e-5e2d-4bcc-ba19-6605d188a887  ONLINE       0     0     0
      623a217e-25e3-4553-b44e-d20c87720740  ONLINE       0     0     0
      c4b394cc-0c65-48cb-be5f-8aafa1455fba  ONLINE       0     0     0
      9a95ae98-48f9-4150-88d4-b87bd89fc237  ONLINE       0     0     0
    cache
      a393a044-ffa7-4fc6-96b0-fce8900c4cdd  ONLINE       0     0     0

errors: 165 data errors, use '-v' for a list
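
(The list of affected files behind that last line comes from sudo zpool status -v HD8; once those files have been restored or deleted and a subsequent scrub completes, a zpool clear HD8 should allow the error count to reset.)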