Responding to your Feedback on 25.10, SMART, NVIDIA, and more | TrueNAS Tech Talk (T3) E045

no, not a nightly.

TrueNAS 25.10.0.1 is Now Available! - Announcements - TrueNAS Community Forums

1 Like

We are discussing how to collect all the info together. In the meantime, the release notes are a good summary.

It’s a change… our experience is that ZFS is now better than SMART, which was developed for generic non-ZFS servers. We are looking for examples where real drive-failure issues are not being handled well. Those are the top priority.

We do have hundreds of drive failure examples where SMART provided false positives that caused unnecessary drive replacements.

1 Like

You really need to clarify your messaging here. What I’m seeing here, and in Kris’ posts on the subject, is “SMART is unreliable, so we’re deprecating it; ZFS is better anyway.” What Constantin is seeing in the video up-topic is that you’re still going to be running SMART self-tests on attached drives and factoring those results into whatever black box you’ve set up for drive health monitoring. Which of these is the case? Is TrueNAS itself going to be running SMART self-tests on its own going forward? I’m not asking about monitoring SMART attributes; I’m asking about running the drive self-tests.

ZFS is, of course, very good at detecting when data on disk can’t be read or has become corrupted. But you know better than I that it doesn’t have any way of monitoring the unused blocks on disk. SMART attributes will give you some of that information, but the results of the drive self-tests are also useful here: they tell you things that ZFS simply can’t. And if it’s your position that the results of the SMART self-tests just can’t be trusted, I think that’s going to need quite a bit of support and defense.
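
For anyone who wants to keep doing this by hand: a self-test can still be started and read back with smartctl, regardless of what the middleware schedules. A minimal sketch, where /dev/sda is only an example device:

smartctl -t long /dev/sda       # start an extended (long) self-test; it runs inside the drive in the background
smartctl -l selftest /dev/sda   # later, read the self-test log for pass/fail and the LBA of the first error, if any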

3 Likes

I can give a different kind of example: SMART noted that for one of my drives the helium level was dropping. Per ZFS, the drive was fine. Per SMART, it was on a downward spiral.

The correct action was to replace the drive before excessive air resistance / Bernoulli effects caused problems inside the drive. With the help of SMART, I was able to identify the issue and plan the replacement on my own schedule.

I have had other SMART issues get flagged that, with the help of demi-gods like @joeschmuck and many others, were identified as interesting but not fatal, so I could ignore those. Part of being a sysadmin is applying that kind of measured response.

I agree with @dan that the messaging here should be improved dramatically. For example, it’s somewhat incongruous that GUI scheduling of scans is supposedly too broken to fix, yet the same scheduling can apparently be done reliably when the system handles it automatically.

Similarly, I suggest a better explanation of how relying on an abandonware project that runs in an app (i.e. Scrutiny) is a great idea if TrueNAS 25.10+ is also doing this work internally. That is, I suggest highlighting what TrueNAS will do differently and what Scrutiny does better: what is the use case for each?

2 Likes

Plus, you could record and display the wear indicator for SSDs. Works with both SATA and NVMe:

This is the percentage of the vendor-specified TBW value that has already been used. It’s easily available with smartctl; ignore the InfluxDB formatting in the final echo statement:

#! /bin/sh
# Report SSD wear (percentage of rated endurance used) for every SATA/NVMe drive,
# one Graphite-style line per drive.

PREFIX='servers.'
SMARTCTL='/usr/local/sbin/smartctl -x'

time=$(/bin/date +%s)
hostname=$(/bin/hostname | /usr/bin/tr '.' '_')
# enumerate ada*, da* and nvme* device nodes
drives=$(/bin/ls /dev | /usr/bin/egrep '^(a?da|nvme)[0-9]+$')

for drive in ${drives}
do
	case ${drive} in
		nvme*)
			# NVMe drives report wear in the health log as "Percentage Used:"
			wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used:/ { printf "%d", $3 }')
		;;

		da*|ada*)
			# SATA/SAS SSDs expose "Percentage Used Endurance Indicator" in the device statistics
			wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used Endurance Indicator/ { printf "%d", $4 }')
		;;
	esac

	# catch the case that $drive is not an SSD ...
	if [ "x${wear}" != 'x' ]
	then
		echo "${PREFIX}${hostname}.diskwear.${drive}.wear-percent ${wear} ${time}"
	fi
done
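
Each run prints one line per SSD in the Graphite plaintext format (metric path, value, Unix timestamp), so it can be fed straight to a metrics collector. For a hypothetical host nas1 with one SSD at ada0, a line would look roughly like:

servers.nas1.diskwear.ada0.wear-percent 12 1732000000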
4 Likes

I think a lot of people are missing that two major things were removed from TrueNAS.

  1. The removal of the SMART test scheduling feature
  2. The almost complete removal of smart attribute monitoring (removal of smartd monitoring)

While 1) is documented in the release notes and talked about in videos, the bigger change 2) seems to be flying under the radar. It’s not mentioned in the release notes.

The SMART monitoring done in 25.10 is very minimal. There is basically one attribute that is monitored (uncorrected errors). There is some additional monitoring done on specific models of SSD. I don’t see how users can be happy with this.

In comparison, smartd checks all attributes against the thresholds defined by the manufacturer. Of course this is more aggressive, more prone to false positives, and very dependent on the manufacturer. But it is a more thorough and better-tested approach than what is currently in 25.10.
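
To make that comparison concrete: on an ATA/SATA drive the manufacturer thresholds are visible in the VALUE and THRESH columns of smartctl -A, and the check smartd performs amounts roughly to the sketch below (the device name is only an example, and NVMe drives report health differently):

# flag any attribute whose normalized VALUE has fallen to or below its manufacturer THRESH
smartctl -A /dev/sda | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0 {
	printf "attribute %s (%s): value %s is at/below threshold %s\n", $1, $2, $4, $6 }'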

3 Likes

Is this a removal, or is it something TrueNAS never had? smartd is still there, of course. It may or may not be running SMART self-tests on your drives. TrueNAS used to monitor drive temps via SMART; now, according to the release notes, a different method is used. Leaving aside whether that other method is better, worse, or a wash, it’s there. Did TrueNAS ever monitor any other attributes?

Pre-25.10, TrueNAS launches the smartd daemon, which monitors all attributes. It is pretty much the default, go-to solution on Linux and FreeBSD for monitoring SMART attributes.

If you want some more information about how the pre-25.10 setup works:

TrueNAS middleware generates /etc/smartd.conf

$ cat /etc/smartd.conf
/dev/sda -a -d removable -n never -W 0,0,0 -m root -M exec /usr/local/libexec/smart_alert.py\
-s S/../../(7)/(00)\
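
For anyone decoding that line, a rough reading of those directives (per man smartd.conf; the schedule regex is just what this particular config generated):

#  -a                    monitor everything: health status, attributes, error log and self-test log
#  -d removable          keep going instead of exiting if the device disappears
#  -n never              check the drive regardless of its power mode
#  -W 0,0,0              disable smartd's own temperature warnings (temperature was handled elsewhere)
#  -m root -M exec ...   on a problem, run smart_alert.py rather than sending mail
#  -s S/../../(7)/(00)   self-test schedule regex T/MM/DD/d/HH: a Short test every Sunday (d=7) during hour 00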

smartd runs in the background and checks SMART attributes every 30 minutes.

$ ps auxf | grep smartd
root        2642  0.0  0.0  12260  4520 ?        Ss   Sep27   0:14 /usr/sbin/smartd -n --interval=1800 -q never

If any alerts occur, /usr/local/libexec/smart_alert.py is called, which creates a TrueNAS alert.

Further Information:

1 Like

Thank you! For those of us who are not as conversant re: SMART, will the transitioned SMART cron jobs trigger a smartd check as part of each run or does smartd have to be invoked separately?

@joeschmuck, isn’t a smartd review part of your amazing multi-report script?

That is not what I heard. They are “monitoring” smartd. That is not the same as actively running a SMART Short or Extended test. I don’t see smartd ever going away (fingers crossed).

And I know I’m late to the party, been busy.

While I understand what Kris and Chris are saying, and I understand why they are saying it, that doesn’t mean I completely agree with it. Case in point: a person has a “drive failure” as recognized by ZFS checksum errors. So the poor soul tells himself/herself that the drive has failed and must be replaced. In reality, not all of these ZFS checksum errors are the result of a failing drive. Using smartctl and zpool can help diagnose the problem and determine the proper action to correct the situation. With that said, we do rely on using SMART.
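
As a rough sketch of that kind of triage (the pool name tank and device /dev/sdX are only placeholders):

zpool status -v tank     # which device is racking up READ/WRITE/CKSUM errors, and which files are affected
smartctl -a /dev/sdX     # full SMART picture for the suspect disk: reallocated/pending sectors, error log, self-test log
zpool clear tank         # if the disk itself looks healthy, suspect cabling/controller/RAM, clear the counters...
zpool scrub tank         # ...and scrub again to see whether the errors come back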

Do we NEED to test our drives all the time? Maybe not, but being proactive is what my experience has taught me to do. To use the car-maintenance metaphor: if I didn’t routinely change the oil, rotate the tires, replace the air filter, clean the battery terminals, and make sure the brakes are working… then I’m looking at a very costly and inconvenient breakdown. I don’t really enjoy vehicle maintenance, but it needs to be done, unless you have the money to buy a new vehicle every 6 months.

During the TTT there was a reference to the drive indicating it failed. I’m not sure if that is talking about the SMART PASSED/FAILED result, but if it says FAILED, it has failed. If it says PASSED, the drive may still be a failed or failing drive, and this is where a person analyses the SMART data. The PASSED/FAILED indication is not a summary of all the SMART data, as many people think it is.

This is going to be a sore topic for quite a while, as based on this TTT I get the feeling a GUI to manage SMART testing will not be coming back. There are things people can do to schedule SMART testing via a third-party script or via a cron job (crontab). There is help out here if you want it and are willing to put forth a little effort.
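
As a rough example of the cron route (device, timing, and the /usr/sbin/smartctl path are only placeholders to adapt; the same lines can go into a root crontab or a Cron Job task in the UI):

# m h dom mon dow   command
0 3 * * 0    /usr/sbin/smartctl -t long /dev/sda    # extended self-test, Sundays at 03:00
0 3 * * 1-6  /usr/sbin/smartctl -t short /dev/sda   # short self-test the other days

Keep in mind that smartctl -t only starts the test inside the drive and returns immediately; something still has to read the results back (smartctl -l selftest, or a script like Multi-Report) and alert on them.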

No sir. smartd doesn’t work like that. It runs in the background monitoring all the drives and when it sees a possible problem, it passes that data to the OS, or whatever invoked it. That is what I understand happens. I could be slightly off on how it communicates the error back to the main program. I only poll the drives when the script is run. The Multi-Report script collects all the data and generates the report, while Drive-Selftest (part of Multi-Report) will run all the testing. This means that TrueNAS will still monitor your drives for critical failures or alarms, you would get an immediate email from TrueNAS, and you could then address the issue. You could use Multi-Report to collect the SMART data if desired at that point.

If I were to make Multi-Report a Docker container, maybe I could run smartd or possibly hook into an already-running smartd, but honestly, that is outside my comfort zone at the moment. I am not a programmer, but I played one on TV. :clown_face:

1 Like

I have some bad news for you…

Lay it on me. What is the bad news?

1 Like

Noted just up-thread. Apparently smartd isn’t running any more in 25.10.

No kidding. You know I’m going to now install 25.10 again and look at it. Maybe I can do it in a VM this time and save myself the work. I’m no longer running 25.10 as I had a strange problem with it and some error messages. I was waiting on 25.10.1 before I tried it again.

Thanks for letting me know, I still find it hard to believe so I must prove it to myself.

1 Like

What does this command return? (trying to save myself some grief)

systemctl status smartd

That’s the content of my smartd.conf file on 25.10.0.1:

# Sample configuration file for smartd.  See man smartd.conf.

# Home page is: https://www.smartmontools.org

# smartd will re-read the configuration file if it receives a HUP
# signal

# The file gives a list of devices to monitor using smartd, with one
# device per line. Text after a hash (#) is ignored, and you may use
# spaces and tabs for white space. You may use '\' to continue lines.

# The word DEVICESCAN will cause any remaining lines in this
# configuration file to be ignored: it tells smartd to scan for all
# ATA and SCSI devices.  DEVICESCAN may be followed by any of the
# Directives listed below, which will be applied to all devices that
# are found.  Most users should comment out DEVICESCAN and explicitly
# list the devices that they wish to monitor.

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/sma>

# Alternative setting to ignore temperature and power-on hours reports

Which to me reads as if smartd is still running and monitoring all devices it finds :confused:

Edit: fixed wrong version number

1 Like
truenas_admin@nas2[~]$ sudo systemctl status smartd
Unit smartd.service could not be found.
truenas_admin@nas2[~]$ sudo ps auxf | grep smartd
truenas+ 2719694  0.0  0.0   3748  1448 pts/4    S+   07:31   0:00          \_ grep smartd

This is on a fresh install of 25.10.0, not upgraded from a prior version.

1 Like

I assume you mean 25.10.0.1, since 25.10.1 doesn’t exist yet. But that’s just the default config, and it won’t do anything if smartd itself isn’t running.

1 Like

yes, my mistake.

1 Like

That makes me sad to see that result. It does look like smartd is not running.

@HoneyBadger, can you please explain how we can check whether smartd is running? It should have shown up with systemctl status smartd, but apparently it does not. Are we looking in the wrong place, or is this a bug in 25.10.0?

1 Like