Pool status "Unhealthy"

larrybg · August 25, 2024, 5:08pm

So, this day came - got the alert email from my TrueNAS Core server:

Pool data-pool state is ONLINE: One or more devices has experienced an
unrecoverable error. An attempt was made to correct the error. Applications
are unaffected.

I ran zpool status -v data-pool and got these results:

truenas% zpool status -v data-pool
  pool: data-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 88K in 01:08:10 with 0 errors on Sun Aug 25 01:08:10 2024
config:

	NAME                                            STATE     READ WRITE CKSUM
	data-pool                                       ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/4935d2cc-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     1
	    gptid/4942b051-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     1
	    gptid/494aa7b4-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
	    gptid/49545bdd-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
	    gptid/484cba63-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0

errors: No known data errors

I’m guessing that this is related to the checksum errors… Do I need to do something about it? The disks are 2TB WD Reds, about 3 years old.
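With larger pools it can help to filter the status output down to the members whose counters are not zero. The following is only a sketch: the function name is made up, and it assumes the column layout shown above (NAME, STATE, READ, WRITE, CKSUM). It reads the status text on stdin, so it can be tried against saved output.

```shell
# Hypothetical helper: print pool members with a non-zero READ, WRITE or CKSUM counter.
# Input: the text of `zpool status` on stdin (layout assumed to match the output above).
nonzero_error_devices() {
    awk '
        # The config table starts at the header row.
        $1 == "NAME" && $NF == "CKSUM" { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Report any row where at least one counter is not "0".
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Usage on a live system (pool name taken from this thread):
#   zpool status -v data-pool | nonzero_error_devices
```

Run against the output above, this should list only the two gptid entries that show a CKSUM count of 1.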

Redcoat · August 25, 2024, 5:37pm

Checksum errors are often associated with cable issues. The first thing I would do is shut down and unplug/replug the data cables to the drives in question, examining the cables for any obvious issues. Replace them if you see obvious problems. If not, restart and repeat the status check. Replace the cables if the checksum issues persist.

Another potential issue - have you established whether your WD Reds are SMR drives or not? I don’t think of SMR as causing checksum faults, but it is worth ruling out. Look at the model number of your drives when you do the cable checks and compare it with the forum resource on the SMR topic: List of known SMR drives | TrueNAS Community
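The model number can also be read without opening the case, since smartctl -i prints a "Device Model:" line for ATA drives. A minimal sketch, with a made-up function name, that pulls just that line out of the output on stdin:

```shell
# Hypothetical helper: extract the model string from `smartctl -i` output on stdin.
# Assumes an ATA drive, where the line starts with "Device Model:".
drive_model() {
    sed -n 's/^Device Model:[[:space:]]*//p'
}

# Usage on a live system (device name is a placeholder):
#   smartctl -i /dev/ada0 | drive_model
```

The printed model still has to be compared by hand against the SMR list; this sketch does not try to classify drives.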

joeschmuck · August 25, 2024, 8:47pm

Once you have done what @Redcoat has suggested, you will still have the chksum errors so…

Run a zpool scrub data-pool and then check it once complete. You will still have the errors but what you want to see is “errors: No known data errors” like you had above.

If that works out, next clear the errors with zpool clear data-pool and your errors should be gone.

If you didn’t find anything wrong (such as SMR drives), you are not powering the system up/down all the time, and your system isn’t freezing, then I recommend you provide us your hardware listing and then start RAM and CPU testing.

Oh yes, for the two drives that had the errors, please post the smartctl -x /dev/drive output here. The drives may be faulty, though I doubt it, and it is an easy thing to look at.
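The scrub, check, clear order above can be guarded with a small test so that zpool clear only runs once the scrub has finished and the status still says "errors: No known data errors". This is a sketch under assumptions: the function name is invented, and it relies on zpool status printing "scrub in progress" while a scrub is running.

```shell
# Hypothetical helper: succeed only if the status text on stdin shows
# no scrub still running and no known data errors.
safe_to_clear() {
    status_text=$(cat)
    case $status_text in
        *"scrub in progress"*) return 1 ;;   # scrub not finished yet
    esac
    printf '%s\n' "$status_text" | grep -q '^errors: No known data errors$'
}

# Usage on a live system (pool name taken from this thread):
#   zpool scrub data-pool
#   ...wait for the scrub to complete, then:
#   zpool status -v data-pool | safe_to_clear && zpool clear data-pool
```

If the check fails, the counters are left alone so the evidence is still there for further troubleshooting.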

larrybg · August 26, 2024, 2:50pm

Checked the cables - all look good, reattached them. All drives are the WD20EFZX model, so no SMR drives.

Redcoat · August 26, 2024, 4:27pm

OK - sounds good - so please follow Joe’s recommended path and post the SMART test results for the two drives.
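The pool lists its members by gptid, while smartctl wants a device node such as /dev/ada0. On TrueNAS Core (FreeBSD), glabel status prints that mapping. A minimal sketch, assuming the usual three-column Name / Status / Components layout; the function name is made up for illustration:

```shell
# Hypothetical helper: look up which partition backs a given gptid label.
# Input: `glabel status` output on stdin; $1 is the label as shown by zpool status.
# Assumes the columns are Name, Status, Components, in that order.
gptid_to_partition() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Usage on a live system (label taken from the zpool status output in this thread):
#   glabel status | gptid_to_partition gptid/4935d2cc-42be-11ef-a6db-b8ca3a875545
```

The result is a partition name rather than a whole disk, so the partition suffix needs to be dropped before passing the device to smartctl.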

larrybg · August 26, 2024, 5:55pm

Here is the zpool status after I ran zpool scrub:

truenas% sudo zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:22 with 0 errors on Mon Aug 26 06:45:22 2024
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  da4p2     ONLINE       0     0     0

errors: No known data errors

  pool: data-pool
 state: ONLINE
  scan: scrub repaired 0B in 01:08:09 with 0 errors on Mon Aug 26 12:29:36 2024
config:

	NAME                                            STATE     READ WRITE CKSUM
	data-pool                                       ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/4935d2cc-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
	    gptid/4942b051-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
	    gptid/494aa7b4-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
	    gptid/49545bdd-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
	    gptid/484cba63-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0

errors: No known data errors

Redcoat · August 26, 2024, 6:47pm

Sounds like replugging the data connectors may have been the fix you needed. Suggest that you watch your system carefully for further problems with those two drives.

Stux · August 27, 2024, 4:11am

Full smart results for the drives with the checksum issues would be interesting.

May show either UDMA issues (indicating cabling/power issues) or reallocated blocks etc… which may indicate failing surfaces.

Or that long tests are not being run.
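As a rough way to separate those two cases, the attribute table from smartctl -A can be scanned for the relevant raw values. This is a sketch only: the function name is invented, it assumes the standard ATA attribute table with the raw value in the tenth column, and the hints it prints are rules of thumb rather than a diagnosis.

```shell
# Hypothetical helper: flag SMART attributes that point at cabling vs. surface problems.
# Input: `smartctl -A` output on stdin (ATA attribute table, RAW_VALUE in column 10).
smart_health_hints() {
    awk '
        $2 == "UDMA_CRC_Error_Count" && $10 + 0 > 0 {
            print $2 "=" $10 ": check cabling/power"
        }
        ($2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector") && $10 + 0 > 0 {
            print $2 "=" $10 ": possible failing surface"
        }
    '
}

# Usage on a live system (device name is a placeholder):
#   smartctl -A /dev/ada0 | smart_health_hints
```

No output means none of the three attributes has a non-zero raw value. It does not mean the drive is healthy.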

larrybg · August 27, 2024, 1:03pm

This is what I could find:

truenas% sudo smartctl -l selftest /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15063         -
# 2  Extended offline    Completed without error       00%     15046         -
# 3  Extended offline    Interrupted (host reset)      90%      1145         -

truenas% sudo smartctl -l selftest /dev/ada1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5274         -
# 2  Extended offline    Interrupted (host reset)      90%     56904         -
# 3  Extended offline    Completed without error       00%     55689         -

truenas% sudo smartctl -l selftest /dev/ada2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15049         -
# 2  Extended offline    Interrupted (host reset)      90%      1145         -

truenas% sudo smartctl -l selftest /dev/ada3
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     36562         -
# 2  Extended offline    Interrupted (host reset)      90%     22670         -

truenas% sudo smartctl -l selftest /dev/ada4
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20879         -
# 2  Extended offline    Interrupted (host reset)      90%      6974         -
# 3  Extended offline    Completed without error       00%      5757         -
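To judge whether long tests are being run regularly, the useful number in each log is the LifeTime(hours) value of the most recent extended test that completed. A small sketch with a made-up function name, assuming the log layout shown above, where entry # 1 is the newest:

```shell
# Hypothetical helper: print the LifeTime(hours) of the newest completed extended test.
# Input: `smartctl -l selftest` output on stdin; the log lists the newest entry first.
last_completed_long_test_hours() {
    awk '/^#[ ]*[0-9]+[ ]+Extended offline[ ]+Completed without error/ {
        print $(NF - 1)   # second-to-last column is LifeTime(hours)
        exit              # stop at the first (newest) match
    }'
}

# Usage on a live system:
#   smartctl -l selftest /dev/ada0 | last_completed_long_test_hours
```

Comparing that value with the drive's current Power_On_Hours from smartctl -A gives the number of hours since the last completed long test.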

Redcoat · August 28, 2024, 8:57am

Thanks for that response.

We were hoping for the results of smartctl -t long /dev/daX for each drive - which you can view, once the test has finished, with the command smartctl -a /dev/daX.
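For several drives, the two commands can be generated as a list and reviewed before anything is run. This is a dry-run sketch with an invented function name: it only prints the commands, and the disk names passed in are placeholders (the data drives in this thread show up as ada0 through ada4).

```shell
# Hypothetical dry-run helper: print the smartctl commands for each disk
# instead of executing them, so the list can be checked first.
smart_long_test_plan() {
    for disk in "$@"; do
        printf 'smartctl -t long /dev/%s\n' "$disk"   # start the long self-test
        printf 'smartctl -a /dev/%s\n' "$disk"        # view results once it has finished
    done
}

# Example (disk names are placeholders):
#   smart_long_test_plan ada0 ada1
```

The long test runs inside the drive in the background, so the smartctl -a step only shows the final result after the test has had time to complete.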

Stux · August 28, 2024, 9:59am

Output before test is finished would work

StormRider · September 1, 2024, 9:40am

Thank you for sharing. It’s helpful.