Drive BMS in lieu of SMART Tests - is iXsystems malicious or incompetent?

Pre-History

Anecdote

As documented (a bit) in “Is my drive running background media scans (BMS)? How often?” (TrueNAS General, TrueNAS Community Forums), I had a brand-new drive with minimal history, and it showed that the BMS feature was not enabled.

That initial investigation was on my main 25.04 system. For fun, earlier today I slapped together a 25.10 installation and immediately confirmed a few things:

  1. smartd was not running in the background. Its process was not found when running (as root) systemctl status | grep smartd.
  2. When smartd isn’t running, smartctl -c /dev/sdX outputs “SMART Disabled. Use option -s with argument 'on' to enable it.”
  3. Running smartd and then repeating the -c command shows the same output I saw on TN 25.04: BMS/Automatic Offline Data Collection is Disabled.
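For anyone who wants to script the check above rather than eyeball it, here is a minimal Python sketch. It only classifies the text that smartctl -c prints, using the two strings quoted in this thread ("SMART Disabled" and "Auto Offline Data Collection: Enabled/Disabled"). The function names and the returned labels are made up for illustration, and this is not how the TrueNAS middleware does it.

```python
import re
import subprocess


def auto_offline_state(smartctl_c_output: str) -> str:
    """Classify `smartctl -c` text as 'smart-off', 'enabled', 'disabled' or 'unknown'.

    The labels are hypothetical; only the matched strings come from smartctl.
    """
    if "SMART Disabled" in smartctl_c_output:
        return "smart-off"
    match = re.search(
        r"Auto Offline Data Collection:\s*(Enabled|Disabled)", smartctl_c_output
    )
    return match.group(1).lower() if match else "unknown"


def read_capabilities(device: str) -> str:
    """Run `smartctl -c <device>` (needs root and smartmontools) and return its stdout."""
    result = subprocess.run(
        ["smartctl", "-c", device], capture_output=True, text=True, check=False
    )
    return result.stdout


# Sample text copied from the smartctl output posted later in this thread.
SAMPLE = """Offline data collection status:  (0x82)\tOffline data collection activity
\t\t\t\t\twas completed without error.
\t\t\t\t\tAuto Offline Data Collection: Enabled.
"""

print(auto_offline_state(SAMPLE))
```

On a live system you would call auto_offline_state(read_capabilities("/dev/sdX")) instead of using the sample text.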

That’s not a good look but hey, benefit of the doubt. Kris says this runs every 90 minutes. I reboot the system so that all the daemons are back to their default states and then … do nothing for over three hours.

When I take another look, these are the results (in no particular order):

  • smartd isn’t running
  • smartctl -c /dev/sdX still outputs SMART Disabled until I manually run smartd.
  • After that, smartctl -c /dev/sdX still shows Automatic Offline Data Collection is Disabled.
  • journalctl -xe | grep smartd output only shows smartd commands directly caused by my activities on the system.

What the actual farfignewton?
Edited by Joe as we don’t use that language here.

You guys have said this is supposed to be automatic. The admin doesn’t need to do anything. That you’re not taking a “trust me bro” approach to our data.

Well, if that’s the case and it’s supposed to be automatic, why doesn’t your latest release auto-enable BMS on drives so that the drives can scan during idle time? Is my disk unique? How many disks come from the factory with BMS enabled by default? How many don’t?

What technology (in lieu of SMART and BMS) is going to automatically/routinely locate bad sectors in unallocated regions of drives? How is that going to be fed into TN so that it can alert the condition to administrators?

Once again, this whole problem stems from the fact you guys never fully documented your rationale for the SMART changes in a central location.

Your post did not get better by having expletives in it.
Perhaps you could have avoided it by writing it up and then holding off on posting it for a day, giving yourself time to cool down and rethink some of the wording. Nothing in your post was in any way, shape, or form time-critical.

Why do I mention this instead of any of the actual content in your post?
Because how you present your thoughts and how you word your questions matters.


Expletives? Plural? OK mate.

SMART monitoring != SMARTD. Our middleware runs SMART queries against drives directly. If you are going to go to this much trouble, you may as well go look at the code repo itself. It is open, after all. 🙂 I’d prefer to have real conversations based on the code, where we can debate and disagree from a common set of information.


I’m detecting a bit of Hanlon’s Razor and the Dunning-Kruger effect here 😬 We’re open source, and the code that runs these checks is in the open. You’re free to check it out and make a logical argument based on real evidence instead of attributing your perceived reality to malice on our part.

I am pretty new to GitHub, I am not fluent in Python, but I had the look that was suggested.

I searched inside the middleware repo for mentions of SMART, as well as associated variables, such as middlewared.alert.schedule, smart_interval, etc. I also tried looking into the issue via Jira tickets, but some of them were marked private and not accessible to me (which is totally fine).

Bottom line: for a non-CS outsider, reviewing the code is not as simple as cruising over to GitHub and having a look at what SmartMon, SmartCTL, etc. might be doing, what process schedules them, or even where the variables for the intervals are defined. That is not a criticism of the repo or the code; it just speaks to my familiarity with GitHub, Python, and so on.

Given the continued community interest around this subject, I wonder if it wouldn’t make more sense to cover in the TrueNAS documentation what SMART operations are and are not done? Call it a knowledgebase article as part of documentation or the resource section.


I agree with you. I’m not a programmer and muddle through it very slowly, not this specific code but other code in the past. If you are not a programmer, there is no need to waste your time trying to understand it. And I’m not asking for a detailed description of how it works, although I would like to see a white paper on it, but iX doesn’t need to produce it for us. Maybe I will take a look at the code to see what it looks like. Maybe I can follow it, maybe not.

The following comments are about the title of this thread.

My current stance is this:

  1. TrueNAS does report to you whenever there is a SMART attribute problem. It isn’t smartd, iX has said this in the past.
  2. TrueNAS is bringing back the GUI that allows easy SMART testing setup.
  3. Before I throw any mud, I will wait for the next version of TrueNAS 26.x that supports the new changes, then I will figure out if I agree or not. If I do not agree, I will talk about it, not throw mud. That gets us nowhere.
  4. If a person does not like how iX is doing this, they can always implement SMART testing using CRON TAB. Or you could use a tool like Multi-Report (yes it was shameless of me to mention it.) But there are options.

Also the first post boils down to:
If TrueNAS now relies on BMS rather than on explicit SMART tests, what happens when BMS is NOT enabled on the drive?

I think that this is a fair question, and deserves a better answer than “look at the source”.


One should also keep in mind that SMART monitors a lot of values even without running explicit tests (more than 80 values, according to this source):

https://techterms.com/definition/smart

These offline tests are more like data collection of values, which are not updated constantly.

As someone who owns SAS drives that do BMS (by default, out of the box), I can confidently say that it’s not a feature anyone who spends much of their time near the server would want.

Any time the drives are “idle” they start doing BMS. In other words, you get constant noise from the drives doing head movements and the like; the drives are constantly active.

In my case the drives are in my backup system. That system is only on when I’m doing my scheduled backup and is otherwise powered down. I schedule backups to happen when I am away from home as best as I can, and absolutely not overnight.


BMS is primarily (I think only) a SAS drive operation, if the drive allows it. It does not carry over to SATA drives. I don’t know what NVMe does, but I don’t recall reading in the NVMe specs that NVMe drives are required to do their own background scans. This means that BMS is only for SAS drives.

I won’t defend how iX got to the point of not using smartd, removing the SMART test GUI section, or whatever. They had reasons and it was a corporate decision. Due to a large number of concerns voiced here in the forums, they are bringing back the GUI interface. They might have even had some pushback from some customers, I don’t know.

I know iX stated earlier that TrueNAS monitors for drive issues and reports them as they occur, in real time. This means they do monitor the SMART data on the drives, just not using smartd.

The only thing missing from my perspective is the setting up of routine SMART tests. Today, the 25.x GUI does not support it; however, anyone can manually set up these tests if they desire. Of course, my position is to perform a daily Short test and a weekly Long test (subject to how long a Long test takes; for larger drives it might be once a month). But that is my position alone. I have no statistics to back up my rationale. It is just what works for me.
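As a rough illustration of that daily Short / weekly Long idea, here is a minimal Python sketch that picks the test type for a given date and builds the matching smartctl -t command. The choice of Sunday for the Long test, the function names, and the /dev/sda device path are assumptions for the example only; a cron job or other scheduler would still have to call it once a day as root.

```python
import datetime
from typing import List


def pick_test(day: datetime.date, long_weekday: int = 6) -> str:
    """Return 'long' on the chosen weekday (6 = Sunday, an assumption), else 'short'."""
    return "long" if day.weekday() == long_weekday else "short"


def smartctl_command(device: str, day: datetime.date) -> List[str]:
    """Build the `smartctl -t short|long <device>` argument list for that day."""
    return ["smartctl", "-t", pick_test(day), device]


if __name__ == "__main__":
    # Prints the command instead of running it, so this is safe to try anywhere.
    print(" ".join(smartctl_command("/dev/sda", datetime.date.today())))
```

For very large drives, where a Long test takes most of a day, the weekday check could be replaced with a day-of-month check to get a monthly schedule.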

I still say, let’s see what happens in TrueNAS 26.x. I suspect the SMART GUI will return, and that is it. That is all they took away functionally. If I missed something, please say so. I’m not on this forum all the time and read only a fraction of the postings. I’m saying, why sling mud when iX has said they are bringing back the GUI? To me it feels like a dead horse that might need to be revived after 26.x comes out.

And of course, I can be wrong on some or all of these points, I am, oh my gosh, fallible. I hope I didn’t offend anyone, of course you can speak your mind, I’m not censoring it in any way, except the colorful language of course. We all have opinions and I try my best to be level headed and see things from both sides of the fence.


Using smartctl, BMS appears to be enabled on my SATA HDDs:

# camcontrol devlist
<ST20000NM007D-3DJ103 SN06>        at scbus0 target 0 lun 0 (ada0,pass0)
<TOSHIBA MG10ACA20TEY GG03>        at scbus2 target 0 lun 0 (ada1,pass1)
...
# smartctl -c /dev/ada0
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.

This is Seagate Exos X20 20 TB; same result from Toshiba MG10 20 TB at ada1, or from WD Red 12 TB (WD12EFAX) in another system.
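The (0x82) value in that output can be decoded by hand. Below is a minimal Python sketch, based on my understanding of how smartmontools interprets this byte: the top bit (0x80) reports whether automatic offline data collection is enabled, and the low seven bits describe the last collection activity. Only the two activity codes I am sure of are listed; anything else is reported as unknown. The function and table names are hypothetical.

```python
from typing import Tuple

# Activity codes for the low 7 bits. Deliberately incomplete: only the
# values I am confident about are listed (assumption, not a full table).
ACTIVITY = {
    0x00: "never started",
    0x02: "completed without error",
}


def decode_offline_status(status: int) -> Tuple[bool, str]:
    """Split the offline data collection status byte into (auto_enabled, activity)."""
    auto_enabled = bool(status & 0x80)  # bit 7: automatic collection enabled
    activity = ACTIVITY.get(status & 0x7F, "unknown or vendor specific")
    return auto_enabled, activity


print(decode_offline_status(0x82))
```

For 0x82 this gives an enabled flag together with "completed without error", which matches the text smartctl printed above. A drive reporting 0x02 would show the same activity with automatic collection disabled.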

This smartmontools wiki entry makes me question whether “Offline data collection” really is the equivalent of Background Media Scans.

My read of that entry is that they are vastly different in scope and function.

@etorix Thanks for making that statement, it is helpful and it pushed me to look a little harder. I need that periodically, even if I prove myself wrong.

I have not located the reference, and I will never completely believe AI, but it appears this is a manufacturer-specific feature, which used to be driven by the SFF-8035i spec. According to AI, this spec is obsolete and no longer maintained.

Now, as I said, I don’t flat-out believe AI. With that said, I looked further…
This is from the Ubuntu Manpage (note the second paragraph):

Basically, you cannot rely on the fact that this feature will be there, it is not required.


The argument here is usually “SMART Long will spot a bad sector before ZFS writes to it” - but only if that sector is bad on read, not bad on write. ZFS will spot that bad sector as soon as it writes to it when it does the checksum operation. Adding onto this, the concept of “Continuous Background Defect Scanning” was introduced back in SATA spec 2.5. This turns “write” into “write and verify” at the firmware level - so it’s an always-on layer. Similarly, Background Media Scan in SAS drives.

Sharing my experience, as I have a pool made of four IronWolf Pro 4 TB drives that have been slowly failing over the last couple of months.

Before the update, TrueNAS used to alert me when a SMART test failed.

After the update, I am no longer alerted as the disk is reporting healthy but the long test is failing at 10% from a read error.
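A drive can report a healthy overall status while its self-test log still records a failed test, so the log is worth checking on its own. Here is a minimal Python sketch that scans the text of smartctl -l selftest for entries whose status mentions a failure. The sample log is made-up illustrative data in the usual ATA self-test log layout, not output from my drives, and the function name and dictionary keys are hypothetical.

```python
import re
from typing import Dict, List

# Made-up sample in the usual `smartctl -l selftest` layout (illustrative only).
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     12345         123456789
# 2  Short offline       Completed without error       00%     12340         -
"""

# Columns are separated by two or more spaces; single spaces occur inside a column.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)")


def failed_self_tests(log_text: str) -> List[Dict[str, object]]:
    """Return self-test log entries whose status text contains 'failure'."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and "failure" in match.group(3).lower():
            failures.append(
                {
                    "test": match.group(2),
                    "status": match.group(3),
                    # 90% remaining means the test stopped about 10% of the way in.
                    "remaining_percent": int(match.group(4)),
                    "first_error_lba": match.group(6),
                }
            )
    return failures


for entry in failed_self_tests(SAMPLE_LOG):
    print(entry)
```

Matching on the word "failure" is a simplification; other non-passing statuses would need their own handling.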

I remember hearing from the podcast that those who already had scheduled CRON tests will carry over after the update. I had assumed that error reporting would too but I am getting the impression it is not.

Having heard and understood all the arguments on both sides, I am glad iX is returning the smart UI feature as I find it a crucial NAS feature.

OP, I wish you had given iX the benefit of the doubt and asked for some clarification instead of blatantly accusing them of malice or incompetence. That said, I agree that the SMART feature removal was slightly confusing and needed many clarifications.

The entire community has been asking for clarification for months. I don’t see any reason to give them any more benefit of the doubt.

Please note that my entire post was about how iX has claimed certain things are happening and I can more or less prove that they’re not.