Questions re: Optimizing SCRUB and SMART Test Schedules for HDD and NVMe SSD Pools (Home/Small Office)

There’s at least one similar question here (no replies): Newbie Q: Frequency of scrub and long SMART tests on large HDDs (20TB)

I have two pools:

  1. HDD Pool: 4x Mirror VDEVs with 14TB enterprise HDDs; and
  2. SSD Pool: 1x Mirror VDEV with 2x 4TB NVMe SSDs.

I’m trying to optimize my SMART test and SCRUB schedules.
Currently, they look like this:

  1. SCRUB (per pool): Sunday at midnight, every 35 days.
  2. LONG (per pool): Once a week, Wednesday, at 7 PM.
  3. SHORT (per pool): Daily at 4 PM.

Some questions:

  1. Given the size of my disks, do I have the SCRUB set far enough apart from the LONG HDD tests to minimize the possibility of them running at the same time? I guessed at this scheduling, to be honest.
  2. I’ve seen suggestions that it’s sufficient for home/small office use to do the LONG HDD tests once a month instead of once a week, especially since ZFS adds another layer of health checks. Good idea/bad idea?
  3. I’ve also seen suggestions that running a LONG test on an NVMe isn’t really beneficial enough to be worth it. I’m (vaguely) assuming that if they’re going to have problems, a SHORT test combined with ZFS’ health checks is sufficient to detect them. (Also, every NVMe vendor seems to implement SMART their own special way, so maybe that has something to do with it.)

I’d really appreciate some advice so I could settle on a strategy and stop thinking about it. :stuck_out_tongue: Thanks!

One note: there is no short/long S.M.A.R.T. test per pool. S.M.A.R.T. tests are performed on physical disks, while a scrub is performed on a pool.

Given the size of my disks, do I have the SCRUB set far enough apart from the LONG HDD tests to minimize the possibility of them running at the same time? I guessed at this scheduling, to be honest.

A long S.M.A.R.T. test on a 14TB HDD may take more than 24 hours under normal, idle conditions. If the disk is busy or otherwise in use, it may take several days.
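
If you want a per-drive estimate, smartctl can print the drive’s own predicted duration for a long test. A quick check, assuming /dev/sda stands in for one of the 14TB disks:

# Print the drive's self-test capabilities; the line
# "Extended self-test routine recommended polling time" is the
# drive's own estimate for a long test, in minutes:
smartctl -c /dev/sda | grep -A 1 "Extended self-test"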

It is not advisable to mix schedules that run on a fixed date of the month, at an interval of days, and on a day of the week, as they may eventually overlap. Your strategy is sound: one on Wednesday, the other on Sunday, so they will not run at the same time (assuming the long test started on Wednesday completes before Sunday).

I’ve seen suggestions that it’s sufficient for home/small office use to do the LONG HDD tests once a month instead of once a week, especially since ZFS adds another layer of health checks. Good idea/bad idea?

That’s true. There’s no need to stress your disks that often. You can safely schedule the tests monthly. For example, a S.M.A.R.T. test on the 1st day of each month and a scrub on the 15th.
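
In TrueNAS you would set this cadence in the GUI scheduler, but expressed as a plain cron sketch (device and pool names are placeholders, and a real crontab should use full command paths):

# Long S.M.A.R.T. test on the 1st of each month at 19:00:
0 19 1 * * smartctl -t long /dev/sda
# Scrub on the 15th of each month at midnight ("tank" is an example pool):
0 0 15 * * zpool scrub tank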

3 Likes

That could be because there is no single definitive answer.
But with long SMART tests taking over 24 hours on large HDDs, weekly long tests certainly look excessive.
Conversely, with SMART tests on NVMe drives being rather quick, it is the short test that is questionable: just go with long.

True for the moment. I had a suggestion to change my script so it only skips drives that are in a pool with an active scrub. I have expanded this to include resilvers as well.

TrueNAS (all versions) is currently (as of this writing) not capable of scheduling and running SMART tests. I have Multi-Report, which does perform this operation; well, to be specific, the Drive-Selftest script does. It can and will handle all your scheduling needs for all your drives. You tell it which days of the week it is allowed to run, and how often to run a Short or Long test (Daily/Weekly/Monthly). And you tell it that if a scrub is going on and xx minutes remain, it can run a Long test if the schedule supports it; if the scrub will run longer, it runs a Short test instead.

You will not find a manufacturer specification for how often a SMART test should be run, or at least I have never seen one. With that said, I would change your setup to use Multi-Report so you can test your NVMe drives, and remove the scheduled testing from the TrueNAS GUI.

Otherwise your schedule is sort of close to correct. The 14TB drives are scheduled differently than what I’d do.

If I had to schedule my hard drives using TrueNAS, I’d do it this way.
The Short testing is fine.
The Long testing should be broken up into individual days: drive sda on Monday, sdb on Tuesday, etc. Do not run Long tests on multiple drives on any given day if possible. This keeps the pool more responsive and generates less heat in the system, especially if your drives are stacked together (see the sketch just below).
The Scrub is fine as well.
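
To illustrate that staggering outside the GUI, a hypothetical cron sketch (device names and times are examples, and a real crontab should use full command paths):

# One Long test per day, one drive at a time:
0 19 * * 1 smartctl -t long /dev/sda   # Monday
0 19 * * 2 smartctl -t long /dev/sdb   # Tuesday
0 19 * * 3 smartctl -t long /dev/sdc   # Wednesday
# ...and so on through the remaining drives.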

I do need to clarify two things. First, the Multi-Report script needs to be run every day that it would have been authorized to run, or you run the risk of a drive skipping a Long test. Second, the script runs as root. This can be an issue for some folks; if you can find a way to run it without being root, tell me how, as I would really like to know. That is a serious comment. I’m not all-knowing (my wife thinks that I think I am).

And please do not think I’m peddling Multi-Report; that is not my purpose. Also, you can perform all the drive testing using just the Drive-Selftest script, so you do not need to install Multi-Report itself.

Why did I create Drive-Selftest? Because I know people who have dozens and even hundreds of drives and this greatly simplifies scheduling these tests.

Drive-Selftest is currently at version 1.05, and I started working on version 1.06 a few days ago, as time permits. The big change: currently, any SCRUB on any pool causes all Long tests to be changed to Short tests for that time period; it will be changed to be pool-specific. Again, someone with a lot of drives uses this, and they are the ones who asked if it could be implemented.

If you want to read the User Guide, here is my GitHub link where the files are located. Look for the Drive-Selftest User Guide with respect to this topic. The Multi-Report User Guide is there too, along with the Quick Start Guide.

Given that the ZFS pool is scrubbed regularly and SMART SHORT tests also run regularly,
(imho) SMART LONG runs are a waste of energy and generate unnecessary wear on the drives.

SMART is enabled by default on drives, so it is active during normal operation. If an error is noticed during operation, it gets logged by SMART.
You will be alerted either immediately or after the next SMART SHORT run.

The only benefit of a LONG run in that scenario is that regions of space not (yet) used by ZFS get checked too. If such a region is faulty, you would notice earlier. If you run SMART SHORT daily, that time benefit would be <24h at best.

Is potentially knowing about a fault <24h earlier (if a fault ever happens) worth the energy consumption and additional wear? I highly doubt it!

A short test only samples a very small random part of the drive and is unlikely to catch a defect which would be revealed by a long test.

1 Like

??? :thinking:
I am using it in Core normally. I don’t have access to Scale/CE right now, but I’d bet it has it as well.

Fangtooth 25.04.1

The Data Protection tab in the GUI seems to have something for SMART.

Yes. Thanks for clarifying that. The risk of making long-winded “how do I TrueNAS?” posts so late at night is that I start mixing up my terms. Again.

On mostly idle disks, it seems to take ~26 hours. These were the biggest disks I’d ever owned, and I was very excited to get them … and then I did my first LONG SMART test on them and realized managing massive disks is quite a bit different than managing <= 8 TB ones. I’m dreading how long it’ll take once I’m using significant space.

Thanks for giving me some confidence on my current schedule. I’ll go ahead and move the LONG tests to a monthly schedule for now. I was pretty confident up until yesterday, but it’s pretty easy to start second-guessing yourself if you read too many tutorials on the Internet. :stuck_out_tongue:

@joeschmuck Thanks for the info about Multi-Report and Drive-Selftest. I’m still on Electric Eel. I’m planning to take a closer look at implementing those scripts on my NAS after I upgrade to Fangtooth. I’d rather not add new components and distract myself setting up something new until I’ve managed to get that done. :wink:

Thanks for this suggestion. I’ve got 8 drives (in 4 mirrors), so I’ll have to think about how to space out the testing if I’m breaking them up over a week. Any suggestions?

At least, having been brute-forcing all of them with weekly LONG tests, I know the system and the drives aren’t going to overheat anytime soon. I have a love-hate relationship with this physical server hardware, but the airflow is excellent, if nothing else.

A few people were wondering about this, so here are a couple of clarification points from my own experience before @joeschmuck is able to chime in again.

  1. TrueNAS Scale will let you schedule SMART tests on NVMe disks, but it never actually runs them. Manual tests on NVMe work fine and show up in the TrueNAS GUI’s SMART history for the disks. This is a known issue that has not been fixed for … reasons I don’t understand, honestly. So, yeah, there’s currently no way to schedule NVMe tests at all.
  2. The TrueNAS scheduler is a GUI front end for cron, which may or may not be flexible enough for your needs, and will probably be some users’ first introduction to cron.
  3. The in-GUI logging for SMART tests (including whether they’ve run, what the results were, etc.) is very basic and confusing.
    3.1 It tells you the drive’s lifetime hour count at the time the test ran, but not the date/time, so you have to use smartctl and do the math yourself if you want to know (see the sketch after this list). And you probably should want to know that a scheduled SMART test actually completed successfully. Once you know the schedule is working, then you can rely on the notification system to send you warnings and errors. (If the SMART tests are failing and the notifications are misconfigured, you’d never notice.)

    3.2 The Job History/Running Job status for SMART tests, in my experience, is limited to telling you that a SMART test has been started. (It’s TrueNAS’s job to start the test, not to run the test itself. So, from TrueNAS’s point of view, the Job succeeds if the test starts, even if the test fails. This is confusing.)
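
As a rough sketch of that lifetime-hours math, assuming an ATA drive at /dev/sda and more or less continuous uptime (field positions can vary between drives and smartctl versions):

# Current power-on hours from the SMART attributes:
now=$(smartctl -A /dev/sda | awk '/Power_On_Hours/ {print $10}')
# Power-on hours stamped on the most recent self-test log entry:
last=$(smartctl -l selftest /dev/sda | awk '/^# 1/ {print $(NF-1)}')
echo "last self-test finished roughly $(( now - last )) hours ago"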

This reminded me that I need to put in a feature request to improve the UI on the SMART test history and logging. It was really confusing when I first started.

1 Like

There is not much benefit in running long SMART tests on array members: the results don’t add anything actionable to what you get with a zfs scrub (which you have to run periodically anyway). The latter can heal the issues; the former cannot.

If your long SMART test detects issues, what are you going to do? Regardless of whether the bad sector is in used or free space, you would run a scrub anyway. But you are already running scrubs. So… why bother with the long SMART test?

1 Like

Replace the drive, with the failed test being grounds for RMA if the drive is still under warranty.
A scrub detects software defects. SMART tests detect hardware defects. Two different, and complementary, things.

2 Likes

A scrub run can also indirectly lead a drive to identify hardware defects.
In fact, any operation a drive performs is a kind of hardware defect test too. If something fails, it will be logged and possibly an error returned.

Returning the drive is an option, of course.
But continuing to use a drive with reallocated sectors is possible too.
It depends on warranty, available budget, data criticality, available backups, risk appetite, etc.

1 Like

For the non-believers: TrueNAS does not actually test NVMe drives, yet. The GUI looks great, but no actual tests are being performed, on CORE or SCALE.

Examine your SMART data: run smartctl -a /dev/nvme0 and note the value for Power_On_Hours. Next, look at the bottom of the report and compare the hours recorded for the last time a SMART test was conducted (it will be the first one listed).

If you find out that the value has been changing as you planned the testing to occur, please reach out to me. I want to know how, why, what is going on.

There is a reason the tests are not being run, and I opened a ticket on it quite a while ago. iXsystems has responded with why, and what needs to happen to fix it. I was hoping 25.04 would have the fix, but nope. Maybe 25.10 in a few weeks? I will certainly test it when it comes out.

An easy way to test whether it works: set up an hourly short test, let it run for half a day, then check the SMART data to see if it is in fact testing.
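
For example, something like this (assuming /dev/nvme0 and a reasonably recent smartmontools):

# Note the current power-on hours:
smartctl -a /dev/nvme0 | grep -i "power on hours"
# The self-test log lists the newest entry first; if its hours stamp
# never advances across your scheduled runs, nothing is being tested:
smartctl -l selftest /dev/nvme0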

And I do not take offense at anyone calling me on it. I certainly understand, because it is listed in the GUI.

Um, nope.
A SCRUB only reads the data on the drive, where data is actually located. A Long test reads the entire drive, regardless of whether data exists there or not.

A SCRUB cannot tell you that the head armature is having issues, or that there are repeated data transfer errors. At least TrueNAS will report some errors via smartd, but only once they become errors, not while they are the precursor to an error.

Imagine, if you will: little Johnny has all 500 of his bitcoins stored on his hard drive. He doesn’t run any tests on the drive. One day, the click of death. If it were an SSD, you would have other failure indicators. SMART might have caught the issue before the total failure occurred.

The catch is, to find a pre-failure you need to read the SMART data after the test completes.

One very good thing a LONG test and TrueNAS will do is tell you if there are any LONG or SHORT test failures. The long test can tell you if your platters are flaking off. Catch it early and you can back up all your data, order a replacement drive, and do the replacement, all before catastrophe.

Joe climbs down off his soapbox

1 Like

If you use Drive-Selftest, I would set up the parameters like this:

### SHORT SETTINGS
Short_Test_Mode=2                           # 1 = Use Short_Drives_to_Test_Per_Day value, 2 = All Drives Tested (Ignores other options), 3 = No Drives Tested.
Short_Time_Delay_Between_Drives=1           # Tests will have a XX second delay between the drives starting testing.
Short_SMART_Testing_Order="DriveID"         # Test order is for Test Mode 1 ONLY, select "Serial" or "DriveID" for sort order.  Default = "Serial"
Short_Drives_to_Test_Per_Day=1              # For Test_Mode 1) How many drives to run each day minimum?
Short_Drives_Test_Period="Week"             # "Week" (7 days) or "Month" (28 days)
Short_Drives_Tested_Days_of_the_Week="1,2,3,4,5,6,7"    # Days of the week to run, 1=Mon, 2=Tue, 3=Wed, 4=Thu, 5=Fri, 6=Sat, 7=Sun.
Short_Drives_Test_Delay=130                 # How long to delay when running Short tests, before exiting to the controlling procedure.
                                            # The default of 130 seconds should allow Short tests to complete before continuing.  If using without Multi-Report, set this value to 1.
### LONG SETTINGS
Long_Test_Mode=1                            # 1 = Use Long_Drives_to_Test_Per_Day value, 2 = All Drives Tested (Ignores other options), 3 = No Drives Tested.
Long_Time_Delay_Between_Drives=1            # Tests will have a XX second delay between the drives starting the next test.
Long_SMART_Testing_Order="Serial"           # Test order is either "Serial" or "DriveID".  Default = "Serial"
Long_Drives_to_Test_Per_Day=1               # For Test_Mode 1) How many drives to run each day minimum?
Long_Drives_Test_Period="Week"              # "Week" (7 days) or "Month" (28 days)
Long_Drives_Tested_Days_of_the_Week="1,2,3,4,5,6,7"     # Days of the week to run, 1=Mon, 2=Tue, 3=Wed, 4=Thu, 5=Fri, 6=Sat, 7=Sun.

Now let me tell you what this does. First, you must set up a cron job to run the script once every day. With Short_Test_Mode=2, it will run a SHORT test on all drives every day.
It will test Monday through Sunday. You could change Short_Drives_Tested_Days_of_the_Week="1,2,3,4,5,6,7" to “1,2,3,4,5” if you do not want to test Saturday and Sunday. You still want the cron job to run daily.
There is a 130 second delay; its purpose is to let any Short tests complete so they can be reported in Multi-Report. If you are using Drive-Selftest stand-alone, you can change this value to “0”.
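
As a minimal sketch, that daily cron job could look like this (the script path and time are examples; point it at wherever you installed Drive-Selftest):

# Run Drive-Selftest once a day; the script itself decides which
# drives get a Short or Long test on any given day:
0 18 * * * /root/scripts/drive_selftest.sh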

Here is the beauty of this script: you don’t have to tell it how many drives you have; it will do them all. And if a SCRUB is going on, any drives scheduled for a LONG test will be changed to a SHORT test to cause minimal impact to the system. If a RESILVER is going on, no drives are tested.

To answer your question above: SHORT every day, LONG once a week, but space the drives out at two a day. If using the GUI, that is one entry for the SHORT tests plus four more entries for the LONG tests. With my script, it’s one cron job.

You have 8 drives, which means my script would automatically run a SHORT test on all drives every day, and automatically run a LONG test on two drives a day until all drives were tested: Mon (2 drives), Tue (2 drives), Wed (2 drives), Thu (the last 2 drives).
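
For comparison, the GUI/cron-only equivalent of that plan might look like this (device names are examples, and a real crontab should use full command paths):

# One entry: daily SHORT test on all eight drives:
0 16 * * * for d in /dev/sd[a-h]; do smartctl -t short "$d"; done
# Four entries: LONG tests, two drives per day, Monday through Thursday:
0 19 * * 1 smartctl -t long /dev/sda; smartctl -t long /dev/sdb
0 19 * * 2 smartctl -t long /dev/sdc; smartctl -t long /dev/sdd
0 19 * * 3 smartctl -t long /dev/sde; smartctl -t long /dev/sdf
0 19 * * 4 smartctl -t long /dev/sdg; smartctl -t long /dev/sdh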

If you decide to use this script, I am more than happy to help you configure it, but the default settings are Short every day, Long once a week.

Hopefully that more than answers your question. I know it was overkill, sorry about that. I know what it is like to lose a lot of data, several times and all due to hardware failure.

1 Like

@joeschmuck I think it was the way you wrote it in the reply. It didn’t say NVMe specifically, and we thought you were talking about HDDs also.

Thanks for understanding. I misread as well.

Could you link me to the Jira ticket? I’d like to track that one, too.

Given how pleased I am with how my pants are fitting at this exact moment in time, I’m willing to call myself Little Johnny for the purposes of this thread.

I am intrigued to learn that I have 500 bitcoins and that they are apparently on one of my HDDs somewhere, and would like to subscribe to your newsletter. (But no, seriously, which HDD are we talking about?)

Thanks for the sample config and explanation. I’m definitely going to be setting this up soon.

I also just noticed the config backup script in your signature. I’ve been looking for a way to do exactly this. Awesome. :slight_smile:

Correct. Why should I care about sectors that hold nothing?

Again: does the disk write and read data? Yes? I keep using it. No? I replace it. What SMART thinks about it, and what the error rate is, is irrelevant. Unless it affects performance; then you will also notice and replace it. And if you don’t notice? What could be better? Keep using it.

Johnny needs to do backups. That has nothing to do with SMART, RAID, or scrubs. All of those exist to keep uptime; to safeguard data you need backups.

Precisely. I buy exclusively used and refurbished enterprise disks, both for my business and for personal use. I despise warranties (why should I pay someone to carry risk for me when I am capable of carrying that risk myself?), and by buying used I completely avoid the left edge of the bathtub curve. I also ignore SMART. I use disks until ZFS offlines them. Why should I replace them any earlier? I want to get every last cent of value from them. If there are 200 reallocated sectors, so what? The filesystem either works or it does not. I have zero interest in knowing what SMART thinks about it.

My pools universally comprise several raidz1 vdevs plus a special device, and I replicate snapshots between (remote) servers as a backup. Smooth sailing for the past 15 years and counting: no pool losses, and I’ve replaced only a handful of disks over those years.

1 Like

Multi-Report does this as well: it sends you the configuration and sums up with “All is Good”, or a warning, in the email subject line. Here is an example from my SCALE system last night; it is only four NVMe drives.

And the drive data is displayed for all drives.

I was not aware for a long time that the GUI was not running, and never has run, any S.M.A.R.T. tests. I had short tests configured for all drives on a schedule, and initially thought the Multi-Report script running the tests was redundant when S.M.A.R.T. tests were added to Multi-Report. Since scheduled S.M.A.R.T. tests are apparently more or less permanently broken, perhaps iX should remove the ability to schedule them in the GUI and only offer a manual run on a selected set of drives. The S.M.A.R.T. tests, including long tests, do work if run manually from the GUI, but what fun is that?

For those on the thread or reading the thread that are not using the multi-report script. You should. It is customizable to your needs and system(s) and the S.M.A.R.T. tests scheduling is well thought out and actually works. There are also a host of other things the script tests and/or gathers/tracks. All of this info can be sent to you (even drive partition info) and the script initial setup is guided by the script. The script defaults are sane and well thought out with input and suggestions from many people in the forums.