Disk errors - but SMART seems fine, how come (new disk!)

maeffjus · August 22, 2024, 10:22am

Hey, I just installed a new disk (see here: Truenas Core exchange pool with 4+8tb against a 14tb drive)

So far nothing unusual, but after it was installed, I let the pool resilver the first disk (/dev/sdb).

After this had finished, I did the same with the second disk (/dev/sda) in the pool / mirror.

Once this was resilvered too, I saw that there are 1468 errors on /dev/sdb, which is brand new! It was never dropped or anything like that; it was installed straight out of the package and has remained untouched since then.

See here:

This is what the disk says in smartctl /dev/sdb -a:

Any idea what is the cause and how to correct the errors? Somehow it seems that SMART and the errors do not match in any way. (I also wonder how the G-sensor could note anything!)

Regards

FYI, the disks both are Toshiba Enterprise Capacity MG07ACA 14TB.

dan · August 22, 2024, 10:33am

You really should test drives before adding them to your pool. There are many guides for doing so; I typically follow the one here:
https://wiki.familybrown.org/fester/build-hardware/validate-disks
…or this script also works well and automates much of the process:

Maybe your disk is DOA; maybe the problem is somewhere else. It’s running way too hot, so you should investigate your drive cooling.

maeffjus · August 22, 2024, 12:32pm

I know it is pretty hot - but this is for a reason:

This is my backup device, which is idle, with the disks stopped, 99% of the day.
Only at 0:00 it has about 15min of work to do.

So there is not a very powerful fan cooling the drives. Usually they are in the mid to low 40s, but since I resilvered about 4 TB, they got a bit warm.

But from the SMART data I’d say the disk is generally fine, or am I mistaken?

maeffjus · August 22, 2024, 12:37pm

They are checksum errors. Does this make a difference?

etorix · August 22, 2024, 1:50pm

Could be a cable issue, or an overheating controller. Being 99% idle is no excuse for poor cooling, though it should not result in these errors.

Davvo · August 22, 2024, 2:07pm

You are looking at a SMART page where NO testing has been done: the data you see is only what the drive is able to gather by itself, and it does not show you the complete picture.

Over an SSH connection, run tmux new, then smartctl -t long /dev/sdb. Wait for the test to finish (depending on how big the drive is, it can take over 12 hours), then check the SMART data page again.

etorix · August 22, 2024, 3:17pm

There’s no need to use tmux: Once initiated, the test is carried out by the drive itself.

Davvo · August 22, 2024, 3:53pm

Used the chance to introduce OP to tmux.

joeschmuck · August 22, 2024, 8:34pm

My two cents…

While your cooling methods are not good, I do think the heat is causing your issue, just not where you think it is.

First, let’s attack the HDD, since you are moving in that direction already. You have no SMART errors, but you have not run a SMART Long test; you were advised to run one, so start it. When you start the test, odds are it tells you to poll the drive again for results after XX minutes. If not, or if you do not recall that value, run smartctl -a /dev/sdb and, near the top of the output, look for the extended self-test entry ("Extended self-test routine recommended polling time"); the value is likely within parentheses and could be well over 600. This is how many minutes the Long test should take, assuming no real activity on the drive, but I’m going to ask you to give it some activity. You can also check the test status with that same command; just read the output and you will find it. And one more thing: if your drive hits that 54C value, you do realize that you have likely voided your warranty, right?
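If you would rather not hunt for that number by eye, the lookup can be sketched as a small shell function. This is only a sketch under assumptions: the name extended_test_minutes is made up here, and it assumes the ATA-style layout in which smartctl prints "Extended self-test routine" on one line and "recommended polling time: ( NNN) minutes." on the next. The 600 in the comment is a placeholder, not a value read from this drive.

```shell
# Hypothetical helper: print the recommended polling time (in minutes) for
# the extended self-test, given smartctl output on stdin.
# Assumed layout (placeholder value):
#   Extended self-test routine
#   recommended polling time:        ( 600) minutes.
extended_test_minutes() {
    awk '
        # Remember that the extended self-test label was just seen.
        /Extended self-test routine/ { want = 1; next }
        # The very next line should carry the value; print its first number.
        want && /recommended polling time/ {
            if (match($0, /[0-9]+/)) print substr($0, RSTART, RLENGTH)
            exit
        }
        # Any other line resets the state (e.g. the short self-test entry).
        { want = 0 }
    '
}
```

Usage would be something like smartctl -c /dev/sdb | extended_test_minutes, which prints the number of minutes, or nothing at all if the drive reports the value in a different layout.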

I suspect the drive will test just fine, so going off that premise and looking at the actual error messages you are having, these are likely ZFS errors.

Go ahead and enter zpool status -v and provide all of the output; do not assume that a trimmed-down version is good enough. It often isn’t, and that delays fixing the problem. What I am looking for in particular is the pool name, so you can use it in the next step below; however, the output may also provide other information that will narrow down the problem.

Let’s say this status output shows some errors (I expect it will). The next step is to run a scrub by typing zpool scrub poolname, where poolname is the name of the pool with the suspect drive.

The goal here is for the scrub to pass with zero file errors. You may have checksum errors or other errors, but so long as the scrub reports no file errors, your data should be fully intact. We can discuss the next step if things work out as I expect.
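The "no file errors" check above can be sketched as a one-line shell function. It is a sketch, not an official tool: the name no_data_errors is made up, and it relies on zpool status ending its report with the summary line "errors: No known data errors" when no files are affected. As before, poolname is a placeholder for your actual pool name.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the `zpool status`
# text on stdin contains the summary line reporting no known data errors.
no_data_errors() {
    grep -q '^errors: No known data errors'
}
```

Once the scrub has finished, zpool status -v poolname | no_data_errors && echo "no file errors reported" gives a quick yes/no. Still post the full zpool status -v output, since the per-device READ, WRITE, and CKSUM counters are not covered by this check.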

Do not confuse ZFS errors with drive errors; they are not the same, and the majority of the time the drive is not at fault. It could be the controller/HBA, an unstable system (too hot), or a bad power supply. I’m leaning towards the too hot myself.

Oh yes, you can run the scrub commands (I highly advise it) while the drive is performing the SMART test. The SMART test has a lower priority, so the scrub will finish first. Just do not power off or reboot the system, or the SMART test will abort and you will need to start all over again. You could use tmux here with the zpool commands, but since you are only dealing with a single drive, it will not really benefit you.

If you have any questions about the instructions, ask before you do anything. Be smart, it is your data and we suspect you will want to retain it. If I don’t answer, there are many others here who are smarter than me that will help out.

And a question: How did you get that screen of the drive errors? I must be blind today as I cannot locate it on my TrueNAS 24.04.2 system.

Cheers!