So, this day came - got the alert email from my TrueNAS Core server:
Pool data-pool state is ONLINE: One or more devices has experienced an
unrecoverable error. An attempt was made to correct the error. Applications
are unaffected.
I ran zpool status -v data-pool and got these results:
truenas% zpool status -v data-pool
pool: data-pool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 88K in 01:08:10 with 0 errors on Sun Aug 25 01:08:10 2024
config:
        NAME                                            STATE     READ WRITE CKSUM
        data-pool                                       ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/4935d2cc-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     1
            gptid/4942b051-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     1
            gptid/494aa7b4-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
            gptid/49545bdd-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
            gptid/484cba63-42be-11ef-a6db-b8ca3a875545  ONLINE       0     0     0
errors: No known data errors
I’m guessing that this is related to the checksum errors… Do I need to do something about it? The disks are WD Red, 2TB each, and about 3 years old.
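For anyone who would rather not eyeball the counters, here is a minimal Python sketch that pulls the rows with non-zero READ/WRITE/CKSUM values out of zpool status text. The function name devices_with_errors is made up for this example, and it assumes plain integer counters (zpool status can abbreviate large counts, e.g. with a K suffix, which this sketch simply skips).

```python
import re

# One row of the config table: device name, state, then the READ, WRITE and
# CKSUM counters. Assumption: counters are plain integers; abbreviated values
# are not matched by this sketch.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    found = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, pool metadata, or anything that is not a device row
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            found[name] = counts
    return found
```

Fed the output above, this should report only the two gptid entries that show a CKSUM count of 1.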
Checksum errors are often associated with cable issues. The first thing I would do is shut down and unplug/replug the data cables to the drives in question, examining the cables for any obvious issues as you go. Replace any cable with an obvious problem. If you find nothing, restart and repeat the status check. Replace the cables if the checksum errors persist.
Another potential issue: have you established whether your WD Reds are SMR drives or not? I don’t think of SMR as a cause of checksum faults, but it is worth ruling out. Look at the model number of your drives when you do the cable checks and compare it with the forum resource on the SMR topic: List of known SMR drives | TrueNAS Community
Once you have done what @Redcoat has suggested, you will still have the checksum errors, so…
Run zpool scrub data-pool and check the status once it completes. The error counts will still be there, but what you want to see is “errors: No known data errors” like you had above.
If that works out, clear the errors next with zpool clear data-pool and the error counts should be gone.
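The scrub, check, clear sequence above could be scripted roughly like this. It is only a sketch: the helper names (zpool_status, safe_to_clear, scrub_then_clear) are invented for the example, it has to run as root on the NAS, and the check is a plain text match on the zpool status wording shown earlier in this thread.

```python
import subprocess
import time


def zpool_status(pool):
    """Return the text of `zpool status -v <pool>`."""
    result = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def safe_to_clear(status_text):
    """True when a scrub has finished and ZFS reports no known data errors."""
    scrub_running = "scrub in progress" in status_text
    scrub_done = "scrub repaired" in status_text
    no_data_errors = "errors: No known data errors" in status_text
    return scrub_done and not scrub_running and no_data_errors


def scrub_then_clear(pool, poll_seconds=60):
    """Start a scrub, wait for it to finish, then clear counters if the pool is clean."""
    subprocess.run(["zpool", "scrub", pool], check=True)  # returns right away
    while "scrub in progress" in zpool_status(pool):
        time.sleep(poll_seconds)
    if safe_to_clear(zpool_status(pool)):
        subprocess.run(["zpool", "clear", pool], check=True)
        return True
    return False  # leave the counters in place so the problem stays visible
```

Calling scrub_then_clear("data-pool") would do the whole sequence. The point of the False branch is that you should not clear anything if the scrub turns up data errors.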
If you don’t find anything wrong (no SMR drives, you are not powering the system up and down all the time, and the system is not freezing), then I recommend you post your hardware listing and start RAM and CPU testing.
Oh yes, for the two drives that had the errors, consider posting the smartctl -x /dev/drive output here. The drives may be faulty, though I doubt it, and it is an easy thing to look at.
Sounds like replugging the data connectors may have been the fix you needed. Suggest that you watch your system carefully for further problems with those two drives.
We were hoping for the results of smartctl -t long /dev/daX for each drive, which you can view with smartctl -a /dev/daX once the test has finished.
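If you want a quick first pass over the smartctl -a output before posting it, something like the following Python sketch picks out a few attributes that tend to matter here. The attribute list is my own choice, not an official one, and the function name flagged_attributes is made up. UDMA_CRC_Error_Count is the one most often tied to cabling, which fits the replug advice earlier in the thread.

```python
# Attributes worth a look when chasing checksum errors. This selection is an
# assumption for illustration; attribute names also vary between vendors.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def flagged_attributes(smartctl_text):
    """Return {attribute: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        name, raw = fields[1], fields[9]
        if name not in WATCHED:
            continue
        try:
            value = int(raw)
        except ValueError:
            continue  # raw value in a format this sketch does not handle
        if value:
            flagged[name] = value
    return flagged
```

An empty result does not prove the drive is healthy, so still post the full output, including the self-test log.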