At present, neither the TrueNAS SCALE nor CORE GUI supports configuring SMART self-tests (short or long) for NVMe drives.
To be honest, in my opinion this shouldn’t be a feature request; it should be a bug fix. But I don’t see this issue gaining any ground.
NVMe drives are becoming more popular and affordable, and while they may not be common in data centers due to lower capacity and higher cost, we home users do use them.
While I’m glad that I can fill the gap with the little script I maintain, the correct long-term fix is for TrueNAS to incorporate this into the GUI. They have smartmontools 7.4 now (not that they really needed it for this), so I don’t understand why it hasn’t been incorporated already.
I watch all my drives with SMART and Scrutiny, as well as a Grafana dashboard that shows the expected remaining lifetime according to vendor specs and SMART values.
I also expect TrueNAS to be able to schedule regular SMART tests for all drives in the system.
Whenever a user shows up with a “help, my pool failed” thread, the first answer is usually “run a long self-test on the disk drives”.
You are asking to prove a negative; that is not fair. The question should be “How many times has SMART identified problems before complete failure in an SSD?” After all, an NVMe drive is an SSD, just with a slightly different interface technology.
Let me ask this question: Would iXsystems include SMART testing if they sold full NVMe solutions? What would the customer want?
To @pmh’s point, we do ask for SMART statistics when someone is having problems, and this data can help diagnose a problem before or after a failure. We often find that a person with a serious failure had never performed routine SMART testing, and by the time TrueNAS sends them an email it is sometimes an utter disaster, while someone who did perform routine testing would find issues before complete failure. This isn’t always the case, of course, but it is very common. The forums are riddled with examples (both the old one and the current one).
SMART is an extra level of testing and notification. I would have to say that more people would want it than not, and even for NVMe (SSD) drives.
Lastly, SCALE has smartmontools v7.4, and CORE 13.3 (Beta) does as well. It can’t be that much effort to add a few lines of code to also schedule SMART testing for “nvme0” (SCALE) or “nvd0” (CORE).
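Until the GUI catches up, here is a sketch of the kind of workaround I have in mind (untested here, and the device name and smartctl path are examples for one system): a root crontab entry that kicks off a long self-test on nvme0 every Sunday at 03:00. This needs a smartmontools build with NVMe self-test support, and you would repeat the line per drive (nvme1, …, or nvd0 on CORE).

```
# m h dom mon dow  command
0 3 * * 0  /usr/sbin/smartctl -t long /dev/nvme0
```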
@pmh I will try out Scrutiny; it looks interesting. That won’t get me to stop working on my little script, but it exposes me to other options I could recommend to people. And if iXsystems is not going to include NVMe testing, I could make a significantly smaller script that just runs NVMe tests, should that be what someone wants, rather than the whole Multi-Report script.
I’m sure I can dig up a few examples of SMART data being a useful predictor of impending failure, though it may not be immediate failure.
I agree; Wear Level and Bad Blocks are both great pieces of data for determining end of life, and I would expect the end user would rather find out sooner rather than later.
There was a reason SMART became part of the NVMe 1.4 standard. I honestly don’t know that reason, and if I find out, I’ll include it here. Maybe it was to track wear levels and temperatures (NVMe drives can get pretty hot).
Anyway, it is only a feature request. I will not lose much sleep if it is not included in the GUI, as I can and do test mine every day for peace of mind and to look for any trends.
In the past, I did not bother with SMART on my Linux desktop’s two NVMe drives; I don’t recall SMART working with NVMe back then. However, I did have an early failure, pretty annoying, so I made sure I bought a different brand as the replacement. (They are in a ZFS mirror… so no real problem.)
Today, checking, I can see a pre-failure indicator. (The desktop has been up 5 years…) Here are the relevant parts from the two drives. The first one is the original, and may be lying about its spares. The second is the replacement for the early failure.
root:~# smartctl -x /dev/nvme0 | egrep "Model|Spare|Percent|Power On"
Model Number: HP SSD EX950 512GB
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 16%
Power On Hours: 43,436
root:~# smartctl -x /dev/nvme1 | egrep "Model|Spare|Percent|Power On"
Model Number: Lexar SSD
Available Spare: 83%
Available Spare Threshold: 10%
Percentage Used: 4%
Power On Hours: 41,692
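Numbers like these are easy to turn into an automated check. Below is a minimal bash sketch of the idea, not anyone’s actual script: `parse_pct` is a hypothetical helper, and `$sample` is canned text standing in for real `smartctl` output like the above. It warns when Available Spare drops within 10 points of the drive’s own threshold.

```shell
# Extract the percentage value for a given field from smartctl-style text.
parse_pct() {
  awk -F: -v key="$2" '$1 ~ key { gsub(/[ %]/, "", $2); print $2 }' <<<"$1"
}

# Sample text mimicking "smartctl -x" output (assumption: real output
# would be captured with: smartctl -x /dev/nvme1).
sample="Available Spare:                    83%
Available Spare Threshold:          10%"

spare=$(parse_pct "$sample" "^Available Spare$")
thresh=$(parse_pct "$sample" "Threshold")

# Warn while there is still some margin left, not only at the threshold.
if (( spare <= thresh + 10 )); then
  echo "WARNING: spare blocks running low ($spare% vs threshold $thresh%)"
else
  echo "OK: $spare% spare remaining (threshold $thresh%)"
fi
```

The 10-point margin is arbitrary; the point is to get an alert before the drive itself raises the critical-warning flag at the threshold.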
It is fair to question how useful SMART really is, for NVMe or even in general.
But on the other hand, how many resources would it take to update the SMART-related packages and allow the middleware to monitor nvme/nvd devices the way it already does for ada/da/sd devices?
Let’s not trivialize the process or work involved. Each feature requires:
- Development of additional software
- Validation of that software (with an appropriate range of hardware)
- Integration with TrueNAS middleware and UI (in the case of NVMe, it’s potentially different data to collect and display)
- QA of TrueNAS across a range of systems
- Development of ongoing QA tests for future releases
- Support and bug fixing of the new software
All of this adds up to a reasonable cost.
There are literally thousands of features we could add to TrueNAS, so we have to prioritize allocation of staff to specific projects. We base prioritization on cost-benefit, so we do ask what the benefit is. Is it saving data, time, or money, and for how many systems? For our commercial users, we can assess the benefits easily; they can make it a requirement of their next purchase.
We run TrueNAS as an Open Source project. If it’s worth someone’s time to contribute the validated software, that’s a good indication of the benefit seen. If users have horror stories of time wasted or data almost lost… we’d like to hear them.
So these new “Feature Request” pages are really requests for help with prioritization. Provide code or share your stories. We can’t prioritize “advice”, only “problems & solutions”.
Love the contribution of Scrutiny as a temporary solution. If people see real benefits there, that’s a great sign it needs better integration.
Below is a good, fairly recent article from a very trusted resource that explains the current state of the art for SMART as it relates to both mechanical HDDs and solid state drives.
I am contributing by offering up the suggestion; however, if I were a programmer, I would have already made those changes in my own version, and I would be more than happy to help if I could.
As for the difficulty of adding the change, I realize we are no longer talking about “ada”, “da”, or “sda”, so the naming convention is different; however, my little script handles it, and it is written in Bash.
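To illustrate the naming-convention point, here is a hypothetical sketch (not the actual Multi-Report code): picking the NVMe devices out of a mixed device list is a one-liner in bash. The `scan_output` text is canned sample data standing in for what `smartctl --scan` prints on a system with one SATA and two NVMe drives.

```shell
# Sample stand-in for: scan_output=$(smartctl --scan)
scan_output='/dev/sda -d scsi # /dev/sda, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device'

# Field 3 of each scan line is the device type; keep only the nvme entries.
nvme_devs=$(awk '$3 == "nvme" { print $1 }' <<<"$scan_output")
echo "$nvme_devs"
```

The same filter with `$3 == "scsi"` (or matching on ada/da names) would cover the SATA/SAS side, which is roughly all the “different naming convention” amounts to.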
TrueNAS already monitors some data from NVMe drives; it appears to only be missing the ability to schedule SMART tests. Okay, there is other stuff missing too, like the charts. I have no idea what is involved here either. Maybe this is a 2-hour fix, maybe 6 hours.
I have not compiled FreeNAS since version 9.x, and back then I would contribute, though I’m fairly certain the language has changed since then, and I was winging it. I have not looked to see where the source code is so I could grab a copy and compile it. Once done, maybe I could make the needed changes.
And I also agree that a third-party script should not be required for a basic test; however, I will provide it as long as it is needed.
Speaking in non-technical terms, and simply as a SCALE user who recently had an NVMe drive drop out with zero indicators that it might be failing, I feel that SMART (or something similar, if not SMART) should be added to SCALE.
Drive failure indicators are hugely important; that is why we have SMART in the first place, and with NVMe being a popular drive type, it should get the same love as other drive types. Granted, I understand that how the drive is monitored matters to making any monitoring tool as useful as possible, but I would imagine there is a “best fit” metric, or perhaps a combination of metrics, that could be decided upon for monitoring an NVMe drive; again, I’m not familiar with the technical aspect.
Fortunately, when my NVMe failed on me I was using a mirror, like everyone should, BUT I should still have had some form of heads-up that the drive was coming to the end of its life. Again, I know there are plenty of factors to consider, and sometimes a drive will just up and die with no warning that no amount of monitoring will catch, but those cases should be the minority.
As a user, I would expect NVMe monitoring to be considered a “basic” feature of SCALE; it is a mass storage system, after all, and NVMe adoption is only going to increase over time as cost per GB falls.
I’m keeping my fingers crossed that this is being given serious consideration. Ultimately, SCALE users should not have to adopt a third-party solution for NVMe monitoring.
Because I’m not a programmer and all that Python stuff makes my brain numb: what does that mean? I thought the ticket was closed months ago as not to be implemented. Hopefully I’m reading this incorrectly.