Permanent Error

Both drives are good at this moment, given the data provided.

You need to run a SMART extended (long) test on each drive. From the command line, enter smartctl -t long /dev/sdd for the SSD, and run the same command on all your drives, replacing sdd with the appropriate drive identifier. If the test passes, you are good; if it fails, odds are the drive must be replaced. You can always post the drive’s SMART output like you did above if you need to ask a question about it.
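Repeating that command per drive is easy to script. A minimal sketch, assuming device names sda through sdd (substitute your own, e.g. from lsblk); it is written as a dry run that only prints each command, so nothing starts by accident:

```shell
#!/bin/sh
# Kick off a SMART long (extended) self-test on each drive.
# DRIVES is an assumption -- replace with your actual device names.
DRIVES="sda sdb sdc sdd"
for d in $DRIVES; do
  # Dry run: print each command. Remove 'echo' to really start the tests (needs root).
  echo "smartctl -t long /dev/$d"
done
```

The long tests run in the background inside the drive; check on them later with smartctl -a on each device.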

The SSD has 3 “Pending” sector errors (ID 197). That becomes a more notable problem when they turn into Reallocated Sectors (ID 5), but even that alone would not warrant replacement. Also, Pending sector errors can simply go away, and often do, though not always. You have a Wear Level of 100% (higher is better) and no other warning indications. What I find odd is that all the SMART tests were conducted at hour 20881. I have no idea what happened there; it looks like the drive started a self-test and then restarted for some unknown reason.

Your HDD looks perfectly fine.

Run those SMART long tests on all your drives.

Now to address your boot-pool issue. Your original post lists 12 CKSUM errors. Here is what to do about those:

  1. Backup your TrueNAS configuration file.
  2. Run the SMART Long test (if not already accomplished).
  3. If the SMART test passes, continue; if it fails, exit this procedure and replace the drive.
  4. If the drive passes, run zpool status -v boot-pool; this gives you a starting reference for the status of your pool.
  5. Run a scrub on the boot-pool with zpool scrub boot-pool, wait a couple of minutes, then run the command from step 4 again. You should see something like “scan: scrub repaired 0B in xx with 0 errors”. This is good; it means no new errors. If it shows more errors, post that result here, but still go on to the next step.
  6. To clear the errors, run zpool clear boot-pool, then repeat the step 4 command. Your CKSUM errors should be gone. If they come back in a few days or weeks, the drive may need to be replaced. I’m assuming you leave the NAS on all the time and are not doing anything strange to it.
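Steps 4 through 6 boil down to a short command sequence. A dry-run sketch that only prints each command in order (drop the echo loop and run them directly on the NAS when you are ready):

```shell
#!/bin/sh
# Check status, scrub, re-check, clear, and verify the boot-pool (steps 4-6 above).
# Printed as a dry run so nothing runs by accident.
for cmd in \
  "zpool status -v boot-pool" \
  "zpool scrub boot-pool" \
  "zpool status -v boot-pool" \
  "zpool clear boot-pool" \
  "zpool status -v boot-pool"
do
  echo "$cmd"
done
```

Remember to let the scrub finish (re-check step 4's output) before judging whether new errors appeared.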

All these troubleshooting steps are in my Drive Troubleshooting Flowcharts. I recommend you download the few pages of steps and use them for this or any future problems you have. It is much faster than waiting for someone in a forum to respond.

The only true problem I see is that no routine SMART testing is scheduled. I’m not sure how comfortable you would be installing Multi-Report; it isn’t very difficult, but if you don’t know your way around Linux, it would be a learning process. I recommend a daily SMART Short test on all drives and a weekly SMART Long test on all drives. Of course, if a drive has a Long test scheduled, the Short test does not need to run on that same day.

Multi-Report handles all of that for you: just install it, run the configuration (where you basically enter the email address the report should go to, and select Automatic Compensation if you have any drives with an error value you want to monitor), then set the script to run once a day in Cron Jobs. Every day you get a report emailed to you whose subject line ends in “All is Good”, “Warning”, or “Critical”. Of course you want “All is Good”. Quite a few people use this script.
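As an illustration, the daily Cron Jobs entry might look like the line below (the TrueNAS Cron Jobs UI builds this for you). The script path is an assumption; use wherever you saved the Multi-Report script:

```shell
# Hypothetical cron entry: run the Multi-Report script daily at 02:00.
# /mnt/tank/scripts/multi_report.sh is an assumed path -- adjust to your install.
0 2 * * * /mnt/tank/scripts/multi_report.sh
```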

If you have any questions, ask. Also, take a look at the troubleshooting flowcharts; they may answer a few questions as well.

EDIT: One last thing, as it seems to come up often lately: this statement is not a result for the entire drive. It is a quick power-on self-assessment with very little actual diagnostic coverage:

It only means something if the result is “FAILED”. I rarely see “FAILED”, because most drives fail in other ways and are replaced before this status is ever displayed.
