Added a new NVMe pool (single mirror vdev, 2x NVMe); existing SMART tasks set for "All Disks" and the auto-generated SCRUB task both ignore the new drives

Associated Bug Report: Jira

Hello! I’m posting this both to make y’all aware of this potential bug (as it doesn’t throw any sort of SMART/SCRUB failure message), and also to see if anyone else has run into this and has any information that might be worth adding to the ticket.

Here’s the text of the bug report (with some additions), explaining the issue.

I have a system that has always been set up with a single 4-way mirror pool (8 HDDs). I added Periodic SMART Tests configured for All Disks (short every 24 hours; long once a week). This arrangement has worked perfectly for scheduled SMART tests since I set it up. These tests are recorded in the web GUI’s SMART test history and also visible in smartctl for each disk.
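(In case it’s useful, this is the kind of thing I’ve been using to spot-check the test history outside the GUI. It’s only a rough sketch: the device list is a placeholder, and whether NVMe self-test history shows up here depends on the smartmontools version and the drive.)

```python
#!/usr/bin/env python3
# Rough sketch: print the recorded SMART self-test history for a list of disks.
# The device paths are placeholders; substitute the ones on your system.
import subprocess

DISKS = ["/dev/sda", "/dev/sdb", "/dev/nvme0", "/dev/nvme1"]  # placeholders

for dev in DISKS:
    print(f"=== {dev} ===")
    # "smartctl -l selftest" dumps the device's self-test log.
    result = subprocess.run(["smartctl", "-l", "selftest", dev],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)
```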

I just added a new pool with a single mirror vdev (two M.2 NVMe drives). I didn’t change any SMART scheduled task settings, and a new scheduled SCRUB task for the new pool was auto-generated.

Expected behavior: the existing SMART jobs would automatically start testing the NVMe drives, since they fall under “All Disks.”

Observed behavior:

  1. Scheduled SMART tests ignore the NVMe drives (both the long and the short scheduled tests).
  2. Both NVMe drives can run manual SMART long tests without a problem (see the sketch after this list).
  3. zpool status reports no errors for the new pool, but there is also no record of a scrub ever having run.
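For reference, here’s roughly how I’ve been kicking off the manual tests and checking the pool by hand. This is just a minimal sketch; the device names and the pool name are placeholders, not necessarily what TrueNAS assigned on my system:

```python
#!/usr/bin/env python3
# Sketch: manually start a long SMART self-test on each NVMe drive, then print
# the pool status (whose "scan:" line shows the last scrub, or "none requested").
# The device paths and pool name are placeholders.
import subprocess

NVME_DEVICES = ["/dev/nvme0", "/dev/nvme1"]
POOL = "nvme-pool"

for dev in NVME_DEVICES:
    # Start an extended (long) self-test; smartctl returns immediately
    # and the test runs in the background on the drive.
    subprocess.run(["smartctl", "-t", "long", dev], check=True)

# zpool status reports pool health and the last scrub, if any.
subprocess.run(["zpool", "status", POOL], check=True)
```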

This looks like it must be a bug: the scheduled SMART tasks apparently didn’t update their disk lists either when the M.2 disks were added or when I created the NVMe pool. A new SCRUB task was added for the NVMe pool, but it never appears to run, even though it’s scheduled for 12:00 AM every Sunday.

I suspect everything would work as expected if I deleted the SMART jobs and re-added them, but I don’t think that’s how it’s meant to work.

I’m not at all comfortable using this pool while it refuses to run scheduled data-integrity and health checks, so I haven’t actually used the NVMe pool in the ~10 days I’ve had it set up.

I’ve not attempted destroying and re-creating the SMART and SCRUB tasks, because I suspect this is a bug and don’t want to lose access to potential diagnostic data that iX might ask for. (Also, TrueNAS itself created the SCRUB task, so that one not working correctly really confuses me.)

Any further suggestions for troubleshooting or information gathering would be greatly appreciated. Has anyone seen anything like this?

It’s been known for quite a while that TrueNAS does not support NVMe SMART testing, in spite of the GUI looking like it supports it. This is still true on 25.04-RC.1. I have filed several Jira tickets over the years (yes, I think it has been years).

A scrub schedule should be automatically created for the new pool. A scrub doesn’t care whether the pool is NVMe, HDD, SSD, or flash drive; it is a zpool function. The default scrub schedule is once a month, run on a Sunday. The boot-pool is typically scrubbed once a week, but that task shows up in its own location.
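If you want to convince yourself of that, a scrub can always be started by hand. Here’s a minimal sketch (the pool name is just an example):

```python
#!/usr/bin/env python3
# Minimal sketch: start a scrub by hand and report the pool's "scan:" line.
# The pool name is an example; scrubbing is the same zpool operation
# regardless of whether the vdevs are NVMe, SSD, or HDD.
import subprocess

POOL = "tank"  # example pool name

# Start the scrub (zpool errors out if one is already in progress).
subprocess.run(["zpool", "scrub", POOL], check=True)

# The "scan:" line in zpool status shows scrub progress and the last completion.
status = subprocess.run(["zpool", "status", POOL], capture_output=True, text=True)
for line in status.stdout.splitlines():
    if line.strip().startswith("scan:"):
        print(line.strip())
```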

You can use the pool. If a failure is detected, TrueNAS will notify you of it; that does work.

Check out my link for Multi-Report. My script does test NVMe drives (provided they support SMART self-tests); it tests all drives except some USB-attached drives, but I’m working to resolve that, just for fun.

FYI, Multi-Report does not test drives by default, because many people still use the TrueNAS GUI to set up a schedule. It is very easy to change that setting, and if you have questions, toss me an email; my address is in the Multi-Report thread.

If you have another related question, ask it; I’m easy about sharing information, and if you need help setting up Multi-Report, just ask. It is easy to set up. Since you are using SCALE, pay attention to the Quick Start Guide, as the command line is unique.

Cheers

A known problem with S.M.A.R.T. on NVMe in general is that, even in fairly recent versions, smartmontools needed to be pointed at the root NVMe controller device instead of one of the namespace devices, or some of the commands would break. That means that even though you may have nvme0n1p# partitions, you can’t resolve the smartctl target to nvme0n1 the way you would for a SATA device by removing the partition marker. You’ve got to remove the namespace too, resulting in a query against the nvme0 device.
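To illustrate (example names only; exact device naming can vary by system), the resolution looks something like this:

```python
#!/usr/bin/env python3
# Illustration of the naming difference described above (example names only).
# SATA:  /dev/sda1      -> strip the partition number                -> /dev/sda
# NVMe:  /dev/nvme0n1p2 -> strip the partition AND namespace suffix  -> /dev/nvme0
import re

def smart_target(dev: str) -> str:
    """Best-guess whole-disk/controller device to hand to smartctl."""
    name = dev.removeprefix("/dev/")
    if name.startswith("nvme"):
        # nvme<ctrl>n<namespace>p<partition> -> nvme<ctrl>
        m = re.match(r"(nvme\d+)", name)
        return "/dev/" + m.group(1) if m else dev
    # sdXN -> sdX (classic SCSI/SATA naming)
    return "/dev/" + re.sub(r"\d+$", "", name)

assert smart_target("/dev/nvme0n1p2") == "/dev/nvme0"
assert smart_target("/dev/sda1") == "/dev/sda"
print(smart_target("/dev/nvme0n1p2"), smart_target("/dev/sda1"))
```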

Assuming they actually fix this, that is, given your account of literal years of bug reports going unanswered.