I am having trouble getting SMART tests to run on my Western Digital Red Pro HDDs. My system is running TrueNAS Scale Dragonfish-24.04.2. All of the HDDs in question have SMART enabled and the SMART service is enabled and running.
I scheduled a long SMART test on all disks to run at 1:00 AM today. After I scheduled it, the UI correctly stated that the test would run in X hours. This morning I checked for results, but I am not seeing any. When I click S.M.A.R.T. Test Results on any of these disks, the dialog says "No S.M.A.R.T. tests have been performed on this disk yet."
I then updated the scheduled task to run the same test at noon today. The UI said it would run in 7 minutes. After noon came around, the test did not run and the UI said it would run again in 24 hours. The disks still say "No S.M.A.R.T. tests have been performed on this disk yet."
I then launched a manual, long SMART test on an individual HDD. A dialog opened, stating:
sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Can't start self-test without aborting current test (50% remaining),
add '-t force' option to override, or run 'smartctl -X' to abort test.
Questions:
This dialog implies to me that there is already a SMART test running (perhaps the initial 1:00 AM test) - is this the case?
Where/how can I check the status of running SMART tests? I notice that the TrueNAS UI does not show any jobs in the queue. Should a running SMART test appear in the jobs queue?
I have x4 16TB HDDs in RAIDZ-1. Anyone have a ballpark estimate for how long a long SMART test should take to complete?
When I run smartctl -i /dev/sda, it states that SMART support is available and enabled, but also says "Device is: Not in smartctl database 7.3/5528". What does this mean?
If you're wondering what the status of a test is, run smartctl -a /dev/sda in the console or over SSH. Near the top is a section, "Self-test execution status:", that tells you whether a test is running and roughly how far along it is.
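If you would rather not eyeball the output, here is a minimal Python sketch that pulls the "percent remaining" figure out of text captured from smartctl -a. The function name and the sample text are my own illustration (the sample imitates the usual ATA layout and is not real output from the OP's drive), so treat it as a starting point rather than a finished tool.

```python
import re
from typing import Optional


def selftest_percent_remaining(smartctl_output: str) -> Optional[int]:
    """Return the percent of a running self-test that remains.

    Returns None when the output does not report a test in progress.
    Assumes the usual ATA layout, where the status text may wrap onto
    the line(s) following "Self-test execution status:".
    """
    lines = smartctl_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("Self-test execution status:"):
            # The status text usually spans the header line plus one or two more.
            window = " ".join(lines[i:i + 3])
            if "in progress" not in window:
                return None
            match = re.search(r"(\d+)% of test remaining", window)
            return int(match.group(1)) if match else None
    return None


# Illustrative sample only, not captured from a real drive.
SAMPLE = (
    "Self-test execution status:      ( 245)\tSelf-test routine in progress...\n"
    "\t\t\t\t\t50% of test remaining.\n"
)

print(selftest_percent_remaining(SAMPLE))
```

To use it on a live system you would pass in the text produced by smartctl -a /dev/sda, for example by redirecting the command's output to a file and reading that file.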
A SMART Long Test elapsed time depends on the size of the drive - it reads every sector so the more sectors, the longer it can take.
So I suspect that the issue is simply that the Long test has not completed yet.
I would start with a Short test and then a Conveyance test - both of which are quick to complete. Once you have seen the results of these, you will have more confidence that the Long test will run.
Also, if you implement @joeschmuck's Multi-Report script and ask for the full SMART results to be included in the email, you will get to see the log of tests that were run and whether they succeeded or failed.
SMART tests are run on a single drive by the drive's firmware - you can run them simultaneously on all drives in parallel if you wish.
Scrubs run on the pool (rather than on individual drives) and so run across all disks in the pool at the same time.
It means that (for some reason) your specific WD Red Pro 16TB is not in the global database of drives - so smartctl doesn't know any specifics about this drive (like manufacturer-specific SMART values). But this should not impact its functioning to any noticeable extent.
How often do you guys recommend running SMART checks? I read weekly, but this seems excessive if it is going to take a full day to complete. Hopefully I'm not hijacking this thread, but I think it is in line with the OP's interest.
Your WD Red Pro 16TB drive should take about 22 hours if uninterrupted, meaning no reads and no writes from the NAS. While a SMART test runs, the drive is still fully operational. When the computer needs to read or write, the test gives priority to the data request and the self-test pauses briefly. If you perform a lot of operations, those fractions of a second can add up to an hour or even more.
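As a rough sanity check on that 22 hour figure, the arithmetic can be sketched in Python. The only inputs are the capacity and an assumed average sequential read rate; the 200 MB/s value below is my assumption for illustration, not a number from the drive's datasheet, and real drives read faster on the outer tracks than the inner ones.

```python
def long_test_hours(capacity_tb: float, avg_mb_per_s: float) -> float:
    """Estimate hours to read the whole drive once at a steady average rate.

    Uses decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes), which is how
    drive vendors quote capacity.
    """
    seconds = (capacity_tb * 1e12) / (avg_mb_per_s * 1e6)
    return seconds / 3600


def implied_mb_per_s(capacity_tb: float, hours: float) -> float:
    """Work backwards: the average rate implied by a given test duration."""
    return (capacity_tb * 1e12) / (hours * 3600) / 1e6


# 16 TB at an assumed 200 MB/s average
print(round(long_test_hours(16, 200), 1))   # 22.2
# Average rate implied by a 22 hour test on a 16 TB drive
print(round(implied_mb_per_s(16, 22)))      # 202
```

So a 22 hour estimate corresponds to an average of roughly 200 MB/s across the whole surface, before any delays from normal pool activity.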
Rebooting can, and turning off power will, abort the test. As @Glorious1 said, check your status by running the command smartctl -a /dev/sda. You have two places to look, though: the top and the bottom. At the bottom is the Self-test log. The very first entry is the most recent test conducted. It will tell you how much of the test remains (as a percentage).
I'm kind of surprised the drive isn't in the drivedb.h file. What is the drive model? You can get that from the top of the SMART output as well.
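For anyone who wants to read that Self-test log programmatically, here is a small Python sketch that extracts the most recent entry. It assumes the common ATA log layout (Num, Test_Description, Status, Remaining, LifeTime(hours), LBA_of_first_error); the sample rows are made up for illustration and the function name is hypothetical.

```python
import re
from typing import Optional

# One log row: "# 1  <description>  <status> NN%  <hours>  <lba or ->"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def latest_selftest(smartctl_output: str) -> Optional[dict]:
    """Return the first (most recent) self-test log entry, or None if absent."""
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if match:
            num, desc, status, remaining, hours, lba = match.groups()
            return {
                "num": int(num),
                "test": desc,
                "status": status,
                "remaining_pct": int(remaining),
                "lifetime_hours": int(hours),
                "lba_of_first_error": lba,
            }
    return None


# Illustrative rows only, not taken from a real drive.
SAMPLE_LOG = (
    "Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error\n"
    "# 1  Extended offline    Self-test routine in progress 50%      1234         -\n"
    "# 2  Short offline       Completed without error       00%      1200         -\n"
)

entry = latest_selftest(SAMPLE_LOG)
print(entry["test"], entry["status"], entry["remaining_pct"])
```

SCSI/SAS and NVMe drives print their logs in a different layout, so this pattern would need adjusting for those.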
I recommend a Daily Short test and a Weekly Long test. Some people run the long tests much further apart. The short test takes 2 minutes and is a very basic test. The long test reads all the surface area of the drive to ensure it can read all the sectors on the platters.
Thanks, yeah, that's basically what I have now. Half my drives are long tested on one day and the other half a few days later, with all drives being short tested on the first day of the week. I have Z-3, so I think this is a balanced approach.
I also find this strange because iXsystems recommends the WD Red Pro drives in their TrueNAS Mini series. IIRC, they recommended several of the sizes, including 14TB and 18TB, but made no mention of the 16TB. I'm not sure why, as I assume they are all the same except for the size. I just figured iXsystems didn't want to spend the time to test each drive. I bought the 16TBs because they were on a great sale that I couldn't refuse.
Thank you for the tip. I see now that SMART tests are running on all 4 disks in the pool, which is good.
Remaining thoughts (not directed at anyone in particular):
I wish it was more obvious that SMART tests are running, especially since they can impact performance and using the system can slow down the SMART tests. Short of a shell command, my only clue that a test is running is that I can recognize the sound pattern of the drives while the tests are running.
One odd thing I see is that 3 of the drives are at 20% remaining, whereas the 4th is at 30% remaining. They are all the same drive model, so I find this odd.
Generally, what I see is this: if a Short test starts at, say, 1 AM, it takes 2 minutes to run. If a Long test then starts at 1:05 AM, the drive has almost 24 hours to complete it. If the Long test has not completed when the next Short test is requested, the Short test is simply ignored since the Long test is still running. I don't think TrueNAS will force the Long test to terminate, but to be honest with you, I haven't tested that in a very long time.
Additionally I would recommend that a person space out the Long test, for example on Mondays test drive sda, Tuesdays sdb, etc. Whatever works for the end user as I agree, this would impact performance.
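To picture the spacing idea, here is a tiny Python sketch that spreads drives across weekdays round-robin. It only produces a plan; it does not touch TrueNAS, and you would still create the matching S.M.A.R.T. test tasks yourself. The function name and drive names are just for illustration.

```python
from typing import Dict, List

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def stagger_long_tests(drives: List[str],
                       days: List[str] = WEEKDAYS) -> Dict[str, List[str]]:
    """Assign each drive to a day in turn so Long tests do not overlap.

    With more drives than days, the assignment wraps around and some
    days get more than one drive.
    """
    schedule: Dict[str, List[str]] = {day: [] for day in days}
    for index, drive in enumerate(drives):
        schedule[days[index % len(days)]].append(drive)
    return schedule


plan = stagger_long_tests(["sda", "sdb", "sdc", "sdd"])
for day, assigned in plan.items():
    print(day, assigned)
```

With four drives this puts one Long test on each of Monday through Thursday and leaves the rest of the week free, so at most one drive in the pool is slowed by a test at any time.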
This is not uncommon actually and could be the result of two things (off the top of my head):
The drive completes the test more slowly. When you look at the drive data, it lists how many minutes the test takes to run. Each drive can be different; it isn't one value for all drives of a model.
The slower drive had more activity on it thus the test increments slower. If the one drive were a single stripe for maybe all your VMs, for example, that would cause it.
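To compare the first possibility across drives, the "how many minutes" figure can be pulled from the capabilities section that smartctl -a (or smartctl -c) prints for ATA drives. Below is a hedged Python sketch; the sample text imitates the usual layout and the 1320 minute value is invented to match the 22 hour estimate in this thread, not read from a real drive.

```python
import re
from typing import Optional


def extended_test_minutes(smartctl_output: str) -> Optional[int]:
    """Return the drive's own estimate, in minutes, for the Extended (Long) test.

    Assumes the ATA capabilities layout where the value follows
    "Extended self-test routine" on the next line.
    """
    match = re.search(
        r"Extended self-test routine\s*\n\s*recommended polling time:\s*"
        r"\(\s*(\d+)\)\s*minutes",
        smartctl_output,
    )
    return int(match.group(1)) if match else None


# Illustrative sample only; the numbers are assumptions.
SAMPLE_CAPS = (
    "Short self-test routine \n"
    "recommended polling time: \t (   2) minutes.\n"
    "Extended self-test routine\n"
    "recommended polling time: \t (1320) minutes.\n"
)

minutes = extended_test_minutes(SAMPLE_CAPS)
print(minutes, "minutes =", minutes / 60, "hours")
```

Running this against the output from each of the four drives would show whether the slower one simply reports a longer estimate, or whether extra activity on that disk is the more likely cause.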
But yes, it would be nice to have some flashing status stating the drive is in self-test. Hum… Nope, that is beyond my abilities right now, but maybe I could write a little change to TrueNAS and submit it so that it provides some sort of status on the desktop. Don't hold your breath. I only think I'm that good, but not in reality.