Unclear Status of SMART tests on Western Digital Red Pro HDDs

BackDatNASUp · July 31, 2024, 6:40pm

Hello,

I am having trouble getting SMART tests to run on my Western Digital Red Pro HDDs. My system is running TrueNAS Scale Dragonfish-24.04.2. All of the HDDs in question have SMART enabled and the SMART service is enabled and running.

I scheduled a long SMART test on all disks to run at 1:00 AM today. After I saved the task, the UI correctly stated that the test would run in X hours. This morning I checked for results but did not see any. When I click S.M.A.R.T. Test Results on any of these disks, the dialog says “No S.M.A.R.T. tests have been performed on this disk yet.”

I then updated the scheduled task to run the same test at noon today. The UI said it would run in 7 minutes. After noon came around, the test did not run and the UI said it would run again in 24 hours. The disks still say “No S.M.A.R.T. tests have been performed on this disk yet.”

I then launched a manual, long SMART test on an individual HDD. A dialog opened, stating:

sda

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Can't start self-test without aborting current test (50% remaining),
add '-t force' option to override, or run 'smartctl -X' to abort test.

Questions:

This dialog implies to me that there is already a SMART test running (perhaps the initial 1:00 AM test) - is this the case?
Where/how can I check the status of running SMART tests? I notice that the TrueNAS UI does not show any jobs in the queue. Should a running SMART test appear in the jobs queue?
I have 4x 16TB HDDs in RAIDZ-1. Does anyone have a ballpark estimate for how long a long SMART test should take to complete?
When I run smartctl -i /dev/sda, it states that SMART support is available and enabled, but also says “Device is: Not in smartctl database 7.3/5528”. What does this mean?

dan · July 31, 2024, 6:41pm

I’d expect about 24 hours.

Glorious1 · July 31, 2024, 7:14pm

Sure sounds like it.
If you’re wondering what the status of a test is, in console or SSH run smartctl -a /dev/sda Near the top will be a section ‘Self-test execution status:’ that will tell you if a test is running and roughly how far along it is.
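If you would rather script that check than read the output by eye, here is a minimal Python sketch. It assumes the usual ATA wording "NN% of test remaining" in the smartctl -a output; the sample text is illustrative, not captured from the drives in this thread, and the function name is my own.

```python
import re


def self_test_remaining(smartctl_output: str):
    """Return the percent remaining if a self-test is in progress, else None."""
    match = re.search(r"(\d+)% of test remaining", smartctl_output)
    return int(match.group(1)) if match else None


# Illustrative excerpt of the 'Self-test execution status' section (assumed layout).
STATUS_SAMPLE = """\
Self-test execution status:      ( 245) Self-test routine in progress...
                                        50% of test remaining.
"""

if __name__ == "__main__":
    print(self_test_remaining(STATUS_SAMPLE))
```

On a real system you would feed it the text from something like subprocess.run(["smartctl", "-a", "/dev/sda"], capture_output=True, text=True).stdout, run as root.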

Protopia · July 31, 2024, 8:24pm

A SMART Long Test elapsed time depends on the size of the drive - it reads every sector so the more sectors, the longer it can take.

So I suspect that the issue is simply that the Long test has not completed yet.

I would start with a Short test and then a Conveyance test - both of which are quick to complete. Once you have seen the results of these, you will have more confidence that the Long test will run.

Also, if you implement @joeschmuck 's Multi-Report script, and ask for the full SMART results to be included in the email, you will get to see the log of tests that were run and whether they succeeded or failed.

Protopia · July 31, 2024, 8:36pm

SMART tests are run on a single drive by the drive’s firmware - you can run them simultaneously on all drives in parallel if you wish.

Scrubs run on the pool (rather than on individual drives) and so run across all disks in the pool at the same time.

The “Not in smartctl database” message means that (for some reason) your specific WD Red Pro 16TB is not in the global database of drives, so smartctl doesn’t know any specifics about this drive (like manufacturer-specific SMART values). But this should not impact its functioning to any noticeable extent.

Okedokey · July 31, 2024, 10:04pm

How often do you guys recommend running smart checks? I read weekly, but this seems excessive if it is going to take a full day to complete. Hopefully I’m not hijacking this thread, but I think it is in line with the OPs interest.

joeschmuck · July 31, 2024, 10:08pm

Your WD Red Pro 16TB drive should take about 22 hours if uninterrupted, meaning no reads or writes from the NAS. While a SMART test runs, the drive is still fully operational. When the computer needs to read or write, the test gives priority to the data request and the self-test pauses briefly. If you perform a lot of operations, those fractions of a second can add up to an hour or even more.
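As a rough sanity check on that figure, here is a back-of-the-envelope sketch in Python. The 200 MB/s average read rate is purely an assumption for illustration, not a specification for this drive, and the function name is hypothetical.

```python
def long_test_hours(capacity_tb: float, avg_mb_per_s: float) -> float:
    """Estimate hours needed to read the whole drive once at a steady rate."""
    capacity_bytes = capacity_tb * 1e12          # drive vendors use decimal terabytes
    seconds = capacity_bytes / (avg_mb_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # 16 TB at an assumed 200 MB/s average
    print(round(long_test_hours(16, 200), 1))
```

With those assumed numbers the estimate is a little over 22 hours, and any host I/O during the test pushes it higher.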

Rebooting can abort the test, and turning off power will. As @Glorious1 said, check your status by running the command smartctl -a /dev/sda; however, there are two places to look, the top and the bottom. At the bottom is the Self-test log. The very first entry is the most recent test conducted. It will tell you how much of the test remains (as a percentage).
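For anyone who wants to pull that log into a script, here is a small Python sketch that parses the self-test log rows. It assumes the common ATA table layout in which rows start with "# 1", "# 2", and so on; the sample rows are made up for illustration, and the names are my own.

```python
import re

# One log row: "# N  <test>    <status>  NN%  <hours>  <LBA or ->"
LOG_ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_self_test_log(smartctl_output: str):
    """Return the self-test log rows as dicts, most recent test first."""
    rows = []
    for line in smartctl_output.splitlines():
        m = LOG_ROW.match(line)
        if m:
            rows.append({
                "num": int(m.group(1)),
                "test": m.group(2),
                "status": m.group(3),
                "remaining_pct": int(m.group(4)),
                "lifetime_hours": int(m.group(5)),
            })
    return rows


# Made-up sample in the assumed layout, not real output from this thread.
LOG_SAMPLE = """\
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 50%       312         -
# 2  Short offline       Completed without error       00%       290         -
"""

if __name__ == "__main__":
    for row in parse_self_test_log(LOG_SAMPLE):
        print(row["num"], row["test"], row["status"], row["remaining_pct"])
```

The first dict in the returned list corresponds to the most recent test, matching the order smartctl prints.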

I’m kind of surprised the drive isn’t in the drivedb.h file. What is the drive model? You can get that from the top of the SMART output as well.

joeschmuck · July 31, 2024, 10:11pm

I recommend a Daily Short test and a Weekly Long test. Some people run the long tests much further apart. The short test takes 2 minutes and is a very basic test. The long test reads all the surface area of the drive to ensure it can read all the sectors on the platters.

Okedokey · July 31, 2024, 10:13pm

Thanks, yeah, that’s basically what I have now. Half my drives long tested on one day, the other half a few days later with all drives being short tested on the first day of the week. I have Z-3 so I think this is a balanced approach.

Stux · July 31, 2024, 10:32pm

The issue with the daily shorts is that they can push the long results off the list of tests.

I run 4 tests a month. 2 shorts. 2 longs.

Long tests have a measurable impact on array performance while they are running.

BackDatNASUp · July 31, 2024, 10:59pm

The model # is: WDC_WD161KFGX.

I also find this strange because iXsystems recommends the WD Red Pro drives in their TrueNAS Mini series. IIRC, they recommended several of the sizes, including 14TB and 18TB, but no mention of the 16TB. I’m not sure why, as I assume they are all the same except for the size. I just figured iXsystems didn’t want to spend the time to test each drive. I bought the 16TBs because they were on a great sale that I couldn’t refuse.

BackDatNASUp · July 31, 2024, 11:12pm

Thank you for the tip. I see now that SMART tests are running on all 4 disks in the pool, which is good.

Remaining thoughts (not directed at anyone in particular):

I wish it were more obvious that SMART tests are running, especially since they can impact performance and using the system can slow down the SMART tests. Short of a shell command, my only clue that a test is running is that I can recognize the sound pattern of the drives during the tests.
One odd thing I see is that 3 of the drives are at 20% remaining, whereas the 4th is at 30% remaining. They are all the same drive model, so I find this odd.

BackDatNASUp · July 31, 2024, 11:46pm

This must just be rounding. It appears the status is reported in 10% increments, because now the 4th disk has 20% remaining as well.
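To illustrate why the steps are 10% wide: as I understand it, ATA drives report self-test progress in a single status byte, where the high four bits hold the state (0xF meaning in progress) and the low four bits hold the remaining amount in tenths. The number in parentheses in smartctl's 'Self-test execution status' line is that byte. A minimal Python sketch of the decoding, with a function name of my own choosing:

```python
def decode_self_test_status(status_byte: int):
    """Split the ATA self-test execution status byte into (in_progress, remaining_pct)."""
    in_progress = (status_byte >> 4) == 0xF       # high nibble 0xF: test is running
    remaining_pct = (status_byte & 0x0F) * 10     # low nibble counts tenths remaining
    return in_progress, remaining_pct


if __name__ == "__main__":
    for value in (249, 243, 242, 0):
        print(value, decode_self_test_status(value))
```

Two drives that are only a few percentage points apart can therefore sit in different 10% buckets for a while.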

joeschmuck · August 1, 2024, 9:15am

Generally, if a Short test starts at, say, 1:00 AM, it takes about 2 minutes to run. If a Long test then starts at 1:05 AM, the drive has almost 24 hours to complete it. If the Long test has not completed when the next Short test is requested, the Short test is simply ignored since the Long test is still running. I don’t think TrueNAS will force the Long test to terminate, but to be honest with you, I haven’t tested that in a very long time.

Additionally, I would recommend spacing out the Long tests, for example testing sda on Mondays, sdb on Tuesdays, etc. Whatever works for the end user, as I agree this would impact performance.
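A quick way to plan that kind of staggering is to assign drives to weekdays round-robin. This Python sketch only produces a plan to enter into the scheduler by hand; the device names are examples and the function is hypothetical, not part of TrueNAS.

```python
from itertools import cycle

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def stagger_long_tests(drives):
    """Map each drive to a weekday, wrapping around after Sunday."""
    return {drive: day for drive, day in zip(drives, cycle(WEEKDAYS))}


if __name__ == "__main__":
    for drive, day in stagger_long_tests(["sda", "sdb", "sdc", "sdd"]).items():
        print(f"{drive}: Long test on {day}")
```

With more than seven drives, some days get two drives, which still keeps most of the pool free of test load on any given day.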

joeschmuck · August 1, 2024, 9:37am

This is not uncommon actually and could be the result of two things (off the top of my head):

The drive completes the test more slowly. When you look at the drive data, it lists how many minutes the test takes to run. Each drive could be different; it isn’t one value for all drives of a model.
The slower drive had more activity on it, so the test progresses more slowly. If that one drive were a single stripe holding, say, all your VMs, that would cause it.
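To compare the advertised test duration across drives, you can pull the "recommended polling time" for the extended test out of the smartctl capabilities section. A minimal Python sketch, assuming the usual two-line wording; the 1440-minute sample value is invented for illustration and the function name is my own.

```python
import re

POLLING_TIME = re.compile(
    r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def extended_test_minutes(smartctl_output: str):
    """Return the drive's advertised extended self-test duration in minutes, or None."""
    match = POLLING_TIME.search(smartctl_output)
    return int(match.group(1)) if match else None


# Invented sample in the assumed layout, not taken from the drives in this thread.
CAPABILITIES_SAMPLE = """\
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1440) minutes.
"""

if __name__ == "__main__":
    minutes = extended_test_minutes(CAPABILITIES_SAMPLE)
    print(minutes, "minutes, about", round(minutes / 60, 1), "hours")
```

Running this against each disk's output shows quickly whether one drive simply advertises a longer test than the others.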

But yes, it would be nice to have some flashing status stating the drive is in self-test. Hum… Nope, that is beyond my abilities right now but maybe I could write a little change to TrueNAS and submit it where it would provide some sort of status on the desktop. Don’t hold your breath, I only think I’m that good, but not in reality.

Stux · August 1, 2024, 10:21am

I like this idea

I think I will switch to short followed by long on a weekly basis

(Ideally over the weekend!)