Added a new NVME pool (single mirror vdev, 2xNVME); Existing SMART tasks set for "All Disks" and Auto-Generated SCRUB Task Ignore Them

SinisterPisces · March 13, 2025, 10:41pm

Associated Bug Report: Jira

Hello! I’m posting this both to make y’all aware of this potential bug (as it doesn’t throw any sort of SMART/SCRUB failure message), and also to see if anyone else has run into this and has any information that might be worth adding to the ticket.

Here’s the text of the bug report (with some additions), explaining the issue.

I have a system that has always been set up with a single pool of four 2-way mirror vdevs (8 HDDs). I added Periodic SMART Tests configured for All Disks (short every 24 hours; long once a week). This arrangement has worked perfectly for scheduled SMART tests since I set it up. These tests are recorded in the web GUI’s SMART test history and also visible in smartctl for each disk.

I just added a new pool, with a single mirror VDEV (m.2 NVME). I didn’t change any SMART scheduled task settings, and a new scheduled SCRUB task for the new pool was auto-generated.

Expected behavior: the existing SMART jobs would automatically start testing the NVME drives, as they are members of “All Disks.”

Observed behavior:

Scheduled SMART tests ignore the NVME (both long and short scheduled tests).

Both NVME are able to run manual SMART long tests without a problem.

No errors reported in zpool status, but also no record of a SCRUB being run.

This seems like it must be a bug. It seems like the scheduled SMART tasks didn’t update their disk lists either when the m.2 disks were added or when I created the NVME pool. A new SCRUB task was added for the NVME pool, but it doesn’t appear to ever run even though it’s scheduled for 12:00am every Sunday.

I suspect everything would work as expected if I deleted the SMART jobs and re-added them, but I don’t think that’s how it’s meant to work.

I’m not comfortable at all using this pool with it refusing to run scheduled data integrity and health checks, so I haven’t actually been able to use the NVME pool for the ~10 days or so I’ve had it set up.

I’ve not attempted destroying and re-creating the SMART and SCRUB tasks, because I suspect this is a bug and don’t want to lose access to potential diagnostic data that iX might ask for. (Also, TrueNAS itself created the SCRUB task, so that one not working correctly really confuses me.)

Any further suggestions for troubleshooting or information gathering would be greatly appreciated. Has anyone seen anything like this?

joeschmuck · March 14, 2025, 1:48am

It’s been known for quite a while that TrueNAS does not support NVMe SMART testing, in spite of the GUI looking like it supports it. This is still true on 25.04 RC.1. I have filed several Jira tickets over the years; yes, I think it has been years.

A Scrub schedule should be automatically created for the new pool. A Scrub doesn’t care whether the pool is NVMe, HDD, SSD, or Flash Drive; it is a zpool function. The scrub schedule by default is once every month, only on Sunday. The boot-pool is typically once a week but it shows up in its own location.

You can use the pool. If a failure is noticed then TrueNAS will notify you of it. That does work.

Check out my link for Multi-Report. My script does test NVMe drives (provided they support SMART Self-tests). It tests all drives except some USB attached drives, but I’m working to resolve that, just for fun.

FYI, Multi-Report by default does not test drives and this is because many people still use TrueNAS GUI to setup a schedule. It is very easy to change the setup and if you have questions, toss me an email, it is in the Multi-Report thread.

If you have another related question, ask it. I’m easy about sharing information, and if you need help setting up Multi-Report, just ask. It is easy to set up. Since you are using SCALE, pay attention to the Quick Start Guide; the command line is unique.

Cheers

kode54 · March 14, 2025, 7:09am

A known problem with S.M.A.R.T. on NVMe in general is that even up until recent versions, smartmontools didn’t know it needed to be pointed at the root nvme device instead of one of the namespace devices, or else some of the commands would break. Meaning that even though you may have nvme0n1p# partitions, you can’t resolve the smartmon target to nvme0n1 like you would a sata device by removing the partition marker. Gotta remove the namespace, too, resulting in a query against the nvme0 device.
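
The name resolution described above can be sketched in a few lines of Python. This is only an illustration of the string handling, assuming the usual Linux naming pattern (nvme<controller>n<namespace>p<partition>); the function name nvme_controller is hypothetical and is not part of smartmontools or TrueNAS.

```python
import re

# Linux NVMe names: controller (nvme0), optional namespace (n1), optional partition (p3).
_NVME_NAME = re.compile(r"^(?:/dev/)?(nvme\d+)(?:n\d+(?:p\d+)?)?$")


def nvme_controller(device: str) -> str:
    """Return the controller device path for a controller, namespace, or partition name.

    Hypothetical helper: strips both the partition marker and the namespace.
    """
    match = _NVME_NAME.match(device)
    if match is None:
        raise ValueError(f"not an NVMe device name: {device!r}")
    return f"/dev/{match.group(1)}"


print(nvme_controller("nvme0n1p3"))     # /dev/nvme0
print(nvme_controller("/dev/nvme2n1"))  # /dev/nvme2
```

Stripping only the trailing partition digits, as one would for sda1, would leave nvme0n1, which is the namespace device rather than the controller.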

Assuming they actually fix this, considering your attestation to literal years in bug reporting being unanswered.

SinisterPisces · March 15, 2025, 6:59pm

Thanks for the replies. It’s … aggravating … that this is a well-known issue, but it’s also not surprising. My bug report’s been moved to Backlog after initially being tagged for 25.10, so … alas.

SMART was never meant for NVMEs, from my understanding, and it’s always been a bit of a kludge. Some NVME still don’t do it correctly/at all, so I suppose I can understand TrueNAS not even trying by default. Though, I’d still rather it make the attempt and fail before asking me if I wanted to exclude the drive from the test schedule because it’s not cooperating. And it doesn’t surprise me to read @kode54’s note about smartmontools not really being on the ball, either.

@joeschmuck Thanks for the link to your script. I’ll definitely check it out and get in touch if I have questions. Does it still work on Fangtooth? It seems like CLI access is a bit more restricted there.

(Before I saw your message, I was going to put a recurring task on my calendar to run a manual LONG test once a week.)

You can use the pool. If a failure is noticed then TrueNAS will notify you of it. That does work.

Which failures does it still report? ZFS errors?

I’m still learning to read crontab entries. This is the auto-generated Scrub job’s schedule. It looks like it’s every week?

Or, does it run a weekly task to see if it’s been a month, and then if it’s been a month, it runs the scrub?

I’ll have to go back and see exactly when I created this pool, but it probably hasn’t been a month yet.
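
The second reading (a weekly check that only starts a scrub once enough time has passed) can be sketched as below. This is a guess at the logic, not TrueNAS's actual code. It assumes a "threshold days" setting of 35 and assumes that the pool's creation time stands in for the last scrub when none has run yet. The creation date used in the example is hypothetical.

```python
from datetime import datetime, timedelta


def scrub_due(last_scrub_or_creation: datetime, now: datetime,
              threshold_days: int = 35) -> bool:
    """Return True if the weekly scheduled run should actually start a scrub.

    Assumption: the scheduler fires every Sunday, but a scrub only starts
    when at least threshold_days have passed since the last scrub (or since
    pool creation, if the pool has never been scrubbed).
    """
    return now - last_scrub_or_creation >= timedelta(days=threshold_days)


created = datetime(2025, 3, 3)  # hypothetical creation date for the new pool

# Sundays at 12:00am, matching the schedule shown for the task.
for sunday in (datetime(2025, 3, 16), datetime(2025, 4, 6), datetime(2025, 4, 13)):
    print(sunday.date(), scrub_due(created, sunday))
# 2025-03-16 False
# 2025-04-06 False
# 2025-04-13 True
```

Under these assumptions a pool that is only about 10 days old would not have been scrubbed yet, even though the task itself is scheduled weekly.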

vectorsigma ~% sudo zpool status
[sudo] password for johntdavis: 
Sorry, try again.
[sudo] password for johntdavis: 
  pool: QuickDrawer
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        QuickDrawer                               ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            8067abbd-eaed-47bd-a5fd-36e74e1902e5  ONLINE       0     0     0
            6149c16e-c592-4ac9-8681-f82e89b77d9a  ONLINE       0     0     0

errors: No known data errors

  pool: Tank
 state: ONLINE
  scan: scrub repaired 0B in 01:12:09 with 0 errors on Sun Mar  9 01:12:10 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        Tank                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            51c64836-6173-499a-adce-c3878825a1a4  ONLINE       0     0     0
            65a2bb92-efc7-4fda-9a6a-cb692e51cd42  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            0394f980-f868-4362-90c9-74dfb14e05e9  ONLINE       0     0     0
            309cd2cb-acfd-4fe3-bb09-1a8931d6fab6  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            f7e81ec3-66fc-4e70-9d3f-16caed970912  ONLINE       0     0     0
            4ebe785b-bd92-4759-9a6b-5823068e10fe  ONLINE       0     0     0
          mirror-3                                ONLINE       0     0     0
            651abeda-fc1d-4ccd-9732-2007cd0264df  ONLINE       0     0     0
            ac80babe-f35f-49b2-a9c3-9380bb325ebc  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:12 with 0 errors on Fri Mar 14 03:45:14 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme2n1p3  ONLINE       0     0     0
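
In the output above, Tank and boot-pool each have a "scan:" line while QuickDrawer has none. A minimal Python sketch for spotting that automatically follows. It assumes the plain-text layout shown above (a "pool:" line per pool, and a "scan:" line only when a scrub or resilver has been recorded); the function name pools_without_scan is hypothetical, and the sample text is abbreviated from the output above.

```python
from typing import Dict, List


def pools_without_scan(status_text: str) -> List[str]:
    """Return the names of pools whose status section has no 'scan:' line."""
    has_scan: Dict[str, bool] = {}
    current = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            # Start of a new pool section.
            current = stripped.split(":", 1)[1].strip()
            has_scan[current] = False
        elif stripped.startswith("scan:") and current is not None:
            has_scan[current] = True
    return [name for name, seen in has_scan.items() if not seen]


sample = """\
  pool: QuickDrawer
 state: ONLINE
config:

  pool: Tank
 state: ONLINE
  scan: scrub repaired 0B in 01:12:09 with 0 errors on Sun Mar  9 01:12:10 2025
config:

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:12 with 0 errors on Fri Mar 14 03:45:14 2025
config:
"""

print(pools_without_scan(sample))  # ['QuickDrawer']
```

The real text could be captured from zpool status with root privileges and passed to the same function.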

joeschmuck · March 15, 2025, 9:25pm

Good question, and the answer is yes and no. It will run, but it may not send you an email. iXsystems did something to break the email function again; however, I already have that fixed in Multi-Report v3.17, which I have not released yet. I have all the problems from v3.16 fixed (that I am aware of), but I am restructuring all of the -config questions and answers to make configuration significantly easier. Aside from that, the script is out being tested by a few folks who volunteered to help troubleshoot any problems they may run into, and I do have one person with an issue. I suspect it is non-standard SMART data reporting messing up the script. It rarely happens now, but it happens.

It will be released well before 25.04 Stable is released. There is another change with email where I understand you can use MS Office OAuth now so that will be addressed in the sendemail.py script once the author gets to it, which could be this weekend.

FYI, you can run a SMART Short Self-test from the CLI smartctl -t short /dev/nvme0 to run a short test. Use long for a long test.
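
For anyone who wants to script this until scheduled NVMe tests work, here is a minimal Python sketch around the same smartctl command. The helper names selftest_command and start_selftest are hypothetical, the sketch only allows the short and long tests mentioned above, and actually starting a test assumes smartmontools is installed and the script runs as root.

```python
import subprocess
from typing import List

VALID_TESTS = ("short", "long")


def selftest_command(device: str, test_type: str = "short") -> List[str]:
    """Build the smartctl argument list for starting a self-test."""
    if test_type not in VALID_TESTS:
        raise ValueError(f"unsupported test type: {test_type!r}")
    return ["smartctl", "-t", test_type, device]


def start_selftest(device: str, test_type: str = "short") -> int:
    """Start the self-test and return smartctl's exit status (needs root)."""
    return subprocess.run(selftest_command(device, test_type), check=False).returncode


# Only prints the command; call start_selftest() to actually run it.
print(" ".join(selftest_command("/dev/nvme0", "long")))  # smartctl -t long /dev/nvme0
```

A cron job or a TrueNAS scheduled task could call start_selftest for each NVMe controller device on whatever schedule is wanted.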

SinisterPisces · March 28, 2025, 12:54am

Thanks for the additional info on this. I’m going to try to set it up this weekend. There’s 500 replies on that thread and I’d like to read through all of them. Life has been happening at the usual pace and I haven’t had a chance to sit down and dig into it yet. Soon.

FYI, you can run a SMART Short Self-test from the CLI smartctl -t short /dev/nvme0 to run a short test. Use long for a long test.

Thanks for mentioning this. I’d already successfully run a long test on both NVME before I posted. Strangely enough, the tests I’ve manually run show up in TrueNAS’s GUI.

So, TrueNAS can successfully execute read commands via smartctl to get the SMART test record, but can’t use the same tool to actually run the tests?

That … didn’t make a huge amount of sense to me at first–smartctl needs root permissions to fetch the SMART test records, too, so it’s not like it doesn’t have enough permissions to run the tests.

Then I realized that it’s most likely a feature that hasn’t been implemented in TrueNAS’ middleware, not so much an issue with smartctl itself. (I’m guessing.)

joeschmuck · March 28, 2025, 1:59am

There is more involved than just smartctl. You also have smartd which is what TrueNAS uses to obtain routine status. It is part of their API.

Yes, you need to have a privileged account to run quite a few commands. I think it sucks but I fully understand why.

SinisterPisces · March 28, 2025, 2:01am

That makes sense. It’s just frustrating because it feels so close to working…but doesn’t.

joeschmuck · March 28, 2025, 2:40am

It has unfortunately been like that for quite a while, which is why my script tests NVMe drives. I have recently expanded it to test USB connected drives, and this is an ongoing process, so if someone has a USB connected drive that does not work, I will put in the work to see if the drive can be tested at all. Some interfaces fall short, but a lot of the time it is a configuration issue. This does not apply to USB Flash drives; I will not go down that road.

Don’t torture yourself reading 500+ posts. Read the first posting and the last few pages and that should be more than sufficient. If you have a question, just ask. I or someone else familiar with this stuff will generally answer your question fairly quickly. I try to be very clear and have no issues helping out. Okay, I don’t care to help out people who are lazy and don’t even try to do something on their own, but if it is apparent they are trying, I will bend over backwards to help. I enjoy sharing knowledge and I also learn in the process. Whoever thinks they know everything is a fool. I know a lot, but nowhere close to enough.