False SMART errors?

Dorin_Xevious · April 14, 2024, 11:21am

Hi,
Since rebooting my TrueNAS SCALE yesterday, I’m getting these errors:

New alerts:

* Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors.

Current alerts:

* Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors.
* Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors.

I logged in to the dashboard and I see these errors under the bell icon, but the storage panel doesn’t show any errors. I checked zpool status: same, no errors. I checked the journal and dmesg: no errors at all. I also ran smartctl -H /dev/sdb: same, no errors.

So, what’s up with those messages and how can I “fix” this issue?

ericloewe · April 14, 2024, 11:26am

-H is close to useless. You should be looking at the output of -A at a bare minimum, -a for a full overview, and -x for the nitty-gritty.
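As a rough sketch of where to look in the -A table: the two alerts correspond to attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable), and 5 (Reallocated_Sector_Ct) is worth watching alongside them. The helper name smart_sector_counts is made up for illustration, and it assumes the usual ATA attribute layout where the raw value is the tenth column.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# ID, name and raw value of the sector-health attributes (5, 197, 198).
# Assumes the standard ATA table layout with RAW_VALUE in column 10.
smart_sector_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}

# Usage (needs root and a real disk):
#   smartctl -A /dev/sdb | smart_sector_counts
```

Any non-zero raw value on 197 or 198 is what the alert is reporting, whatever -H says.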

etorix · April 14, 2024, 11:31am

If smartctl -a confirms what was reported, you should look into replacing the drive. This is valid grounds for an RMA. And if the drive is too old for an RMA, the recycling bin awaits…

Dorin_Xevious · April 14, 2024, 12:29pm

Well, looking at the short test output (smartctl -l selftest /dev/sdb) I get:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10722         -
# 2  Short offline       Completed without error       00%     10611         -

So … I don’t see any errors.

ericloewe · April 14, 2024, 1:54pm

Well, it helps to not look at where the errors would be found.

* Relocated sectors do not necessarily lead to SMART test failures (but they would be logged, including in the SMART parameter TrueNAS is reporting to you).
* Short tests are not great, which is fine because they’re cheap. Short tests just don’t fail all that often, even on pretty bad disks.

dan · April 14, 2024, 1:58pm

There’s really no reason to expect a correlation between SMART errors and ZFS errors.

Your output shows that the drive has never had a long SMART self-test. That isn’t good.
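One way to check this from the self-test log, as a minimal sketch: it assumes smartctl labels long tests "Extended offline" in the ATA self-test log, and the helper name has_long_test is hypothetical.

```shell
# Hypothetical helper: succeed if the self-test log on stdin lists a long test.
# Assumes long tests appear as "Extended offline" in `smartctl -l selftest`.
has_long_test() {
    grep -q 'Extended offline'
}

# Usage (needs root and a real disk):
#   smartctl -l selftest /dev/sdb | has_long_test || echo "no long test on record"
```

Run against the log posted above, which lists only "Short offline" entries, this would report no long test on record.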

Start here:

etorix · April 14, 2024, 3:15pm

We told you to look at smartctl -a (or -A or -x). There you should see the bad and the dubious sectors.

Stux · April 14, 2024, 6:19pm

Look at the long results… run a long test if you haven’t.

Pending sectors don’t always cause long failures.

If the drive is in warranty, RMA it.

If not, you could try wiping with zeros. May make the problem go away for a bit, but at the end of the day, a pending sector is a sector that contains data that can no longer be read.

It’s safer to replace it before doing that.

But it’s pending being rewritten. Hence the wipe with zeros.

With luck, the sector is not in use by ZFS.

Davvo · April 17, 2024, 7:22pm

tmux new
smartctl -t long /dev/sdb in order to run a long test.
Close the terminal and do something else; it will take hours.
tmux attach
Post the output of smartctl -a /dev/sdb.

ericloewe · April 17, 2024, 8:25pm

You don’t need tmux for that; SMART tests are run by the disks themselves.
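A minimal sketch of the tmux-free version, assuming the disk is /dev/sdb as in the alerts. The wrapper name start_long_test and the SMARTCTL override are hypothetical, added only so the command can be dry-run without touching a disk.

```shell
# Hypothetical wrapper: ask the drive to start a long self-test.
# smartctl returns right away; the drive runs the test in the background.
# SMARTCTL is an assumed override for dry runs (e.g. SMARTCTL=echo).
start_long_test() {
    "${SMARTCTL:-smartctl}" -t long "$1"
}

# Usage (needs root):
#   start_long_test /dev/sdb        # returns at once, safe to log out
#   smartctl -l selftest /dev/sdb   # check back hours later for the result
#   smartctl -a /dev/sdb            # full output to post
```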