False SMART errors?

Hi,
Since rebooting my TrueNAS SCALE yesterday, I’m getting these errors:

New alerts:

* Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors.

Current alerts:

* Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors.
* Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors.

I logged in to the dashboard and I see these alerts under the bell icon, but the storage panel doesn’t show any errors. I’ve checked with zpool status - same, no errors. I checked the journal and dmesg - same, no errors at all. I also checked using smartctl -H /dev/sdb - same, no errors.

So, what’s up with those messages and how can I “fix” this issue?

-H is close to useless. You should be looking at the output of -A at a bare minimum, -a for a full overview, and -x for the nitty-gritty.
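To make that concrete, here is a minimal sketch of pulling the relevant counters out of smartctl -A output. It assumes the usual ATA attribute table layout and the standard IDs 5, 197 and 198. The SAMPLE text is illustrative only, not the poster's actual output, and the function name is made up for this example.

```python
import re

# Standard ATA SMART attribute IDs related to the two alert types.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# Illustrative sample in the layout printed by `smartctl -A` (not real output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
"""


def watched_raw_values(smartctl_a_text):
    """Return {attribute_id: raw_value} for the watched attributes."""
    found = {}
    for line in smartctl_a_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED:
                # The raw value may carry trailing text; keep the leading integer.
                match = re.match(r"\d+", fields[9])
                if match:
                    found[attr_id] = int(match.group())
    return found


for attr_id, raw in watched_raw_values(SAMPLE).items():
    print(attr_id, WATCHED[attr_id], raw)
```

Any non-zero raw value for 197 or 198 is what the alerts are reporting, regardless of what -H says.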

If smartctl -a confirms what was reported, you should look into replacing the drive. This is valid grounds for an RMA. And if the drive is too old for RMA, the recycling bin awaits…

Well, looking at the short test output (smartctl -l selftest /dev/sdb) I get:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10722         -
# 2  Short offline       Completed without error       00%     10611         -

So … I don’t see any errors.

Well, it helps to not look at where the errors would be found.

  1. Relocated sectors do not necessarily lead to SMART test failures (but they would be logged, including in the SMART parameter TrueNAS is reporting to you)
  2. Short tests are not great, which is fine because they’re cheap. Short tests just don’t fail all that often, even on pretty bad disks.

There’s really no reason to expect a correlation between SMART errors and ZFS errors.

Your output shows that the drive has never had a long SMART self-test. That isn’t good.
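As a rough sketch of how to read that from the log: smartctl labels long tests as "Extended" in the Test_Description column, so a log that only lists "Short offline" entries has no long test on record. The LOG text below is the table posted above; the function names are just for this example.

```python
import re

# The self-test log table as posted earlier in the thread.
LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10722         -
# 2  Short offline       Completed without error       00%     10611         -
"""


def self_test_types(selftest_log_text):
    """Return the Test_Description of each log entry, in the order listed."""
    types = []
    for line in selftest_log_text.splitlines():
        if line.startswith("#"):
            # Columns are separated by runs of two or more spaces;
            # the second column is the test description.
            parts = re.split(r"\s{2,}", line.strip())
            if len(parts) >= 2:
                types.append(parts[1])
    return types


def has_extended_test(selftest_log_text):
    """True if any entry in the log is an extended (long) self-test."""
    return any("Extended" in t for t in self_test_types(selftest_log_text))


print(has_extended_test(LOG))
```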

Start here:

We told you to look at smartctl -a (or -A or -x). There you should see the bad and the dubious sectors.

Look at the long results… run a long test if you haven’t.

Pending sectors don’t always cause long tests to fail.

If the drive is in warranty, RMA it.

If not, you could try wiping it with zeros. That may make the problem go away for a bit, but at the end of the day, a pending sector is a sector that contains data that can no longer be read.

It’s safer to replace it before doing that.

But it’s pending being rewritten. Hence the wipe with zeros.

With luck, the sector is not in use by ZFS.

  1. tmux new.
  2. smartctl -t long /dev/sdb in order to run a long test.
  3. close the terminal and do something else, it will take hours.
  4. tmux attach
  5. Post the output of smartctl -a /dev/sdb.

You don’t need tmux for that, SMART tests are run by the disks themselves.
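Because the drive runs the test on its own, starting it and reading the result later are two separate, short-lived commands. A minimal sketch, assuming smartctl is on the PATH and the script runs as root; the helper names are hypothetical.

```python
import subprocess


def long_test_command(device):
    """Command that asks the drive to start an extended self-test."""
    return ["smartctl", "-t", "long", device]


def report_command(device):
    """Command that prints the full SMART report, including the self-test log."""
    return ["smartctl", "-a", device]


def run(command):
    """Run a smartctl command and return its text output.

    smartctl uses its exit status as a bitmask of findings, so a non-zero
    status is not treated as a Python error here.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout
```

Call run(long_test_command("/dev/sdb")) once, which returns right away, then call run(report_command("/dev/sdb")) hours later from any session.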
