False 'failed a SMART selftest' alerts after upgrading from Electric Eel to Goldeye

Last month I made the jump from Electric Eel to Goldeye following the upgrade guide in the documentation, and it went buttery smooth. About a week later, I received a ‘failed a SMART selftest’ alert for one of my mirrored boot drives. Before I had even noticed the alert, it had cleared itself. As soon as I got home, I pulled the smartctl log, and it had no record of any failures. Odd, I thought, and went about my day. ~36 hours later, the same alert fired for the other mirrored boot drive. Both times, the alert cleared itself exactly 90 minutes later. This behavior has continued, as shown in the table below. What’s going on here?

| Date/time | Event | SMART errors |
|---|---|---|
| 15-Mar ~00:00 | On Electric Eel, latest | 0 SMART errors |
| 15-Mar ~00:30 | Upgraded to Fangtooth | 0 SMART errors |
| 15-Mar 00:57 | Upgraded to Goldeye | Suddenly many…that have NO RECORD per smartctl |

| DateTime | Alert | Drive | Count | Duration [min] | Since last [min] | Since last, per drive [min] |
|---|---|---|---|---|---|---|
| 22-Mar 08:06 | failed a SMART selftest | nvme1n1 | 1 | | | |
| 22-Mar 09:36 | cleared | nvme1n1 | | 90 | | |
| 23-Mar 18:37 | failed a SMART selftest | nvme0n1 | 1 | | 2071 | |
| 23-Mar 20:07 | cleared | nvme0n1 | | 90 | | |
| 26-Mar 03:39 | failed a SMART selftest | nvme0n1 | 2 | | 3422 | 3422 |
| 26-Mar 05:09 | cleared | nvme0n1 | | 90 | | |
| 31-Mar 11:13 | failed a SMART selftest | nvme1n1 | 2 | | 7654 | 13147 |
| 31-Mar 12:43 | cleared | nvme1n1 | | 90 | | |
| 05-Apr 21:50 | failed a SMART selftest | nvme0n1 | 3 | | 7837 | 15491 |
| 05-Apr 23:20 | cleared | nvme0n1 | | 90 | | |
| 06-Apr 21:51 | failed a SMART selftest | nvme1n1 | 3 | | 1441 | 9278 |
| 06-Apr 23:21 | cleared | nvme1n1 | | 90 | | |
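
For anyone who wants to double-check the interval columns, here is a minimal sketch of how the minute counts can be recomputed from the timestamps. It assumes GNU date (the `-d` option) and works in UTC so a daylight-saving change cannot skew the result. The function name `minutes_between` is made up for this example, and the year 2026 is an assumption, since the table only lists day and month.

```shell
# Minutes between two timestamps, computed in UTC.
# Requires a date implementation that supports -d (GNU date on Debian-based systems).
minutes_between() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( (end - start) / 60 ))
}

# The year is an assumption; only day and month appear in the table.
minutes_between "2026-03-22 08:06" "2026-03-22 09:36"   # alert raised -> cleared
minutes_between "2026-03-22 08:06" "2026-03-23 18:37"   # nvme1n1 alert -> nvme0n1 alert
```

With the timestamps from the first three rows, this should print 90 and 2071, matching the duration and "since last" values above.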

Run these 4 commands:

  1. smartctl -x /dev/nvme0 > nvme0_smartctl.txt
  2. nvme self-test-log --output-format=json /dev/nvme0 > nvme0_stl.txt
  3. echo "==================================" >> nvme0_stl.txt
  4. nvme error-log /dev/nvme0 >> nvme0_stl.txt

This will generate two files: nvme0_smartctl.txt, which is what smartctl reports, and nvme0_stl.txt, which is what the nvme command reports (both the self-test log and the error log).

Post both files here or feel free to send them to my email joeschmuck2023@hotmail.com.

These should shed some light on the issue, or possibly non-issue.

nvme0_smartctl.txt (2.9 KB)

nvme0_stl.txt (6.5 KB)

I modified the commands to get what I think you wanted, since the above threw an error:

nvme self-test-log /dev/nvme0 -o json > nvme0_stl.txt

nvme error-log /dev/nvme0 >> nvme0_stl.txt
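
Since both boot drives raise the alert, the same collection can be repeated per controller. The following is only a sketch that wraps the commands from this thread in a shell function, using the `-o json` form that worked here. The function name `collect_nvme_logs` is made up for illustration, and `/dev/nvme1` as the controller for the second drive is an assumption based on the `nvme1n1` name in the alerts.

```shell
# Collect the smartctl report plus the nvme self-test and error logs for one controller.
# Writes <name>_smartctl.txt and <name>_stl.txt into the current directory.
collect_nvme_logs() {
    local dev=$1
    local name=${dev##*/}            # /dev/nvme0 -> nvme0

    smartctl -x "$dev" > "${name}_smartctl.txt"
    nvme self-test-log "$dev" -o json > "${name}_stl.txt"
    echo "==================================" >> "${name}_stl.txt"
    nvme error-log "$dev" >> "${name}_stl.txt"
}

# Example (run as root), once per boot drive:
#   collect_nvme_logs /dev/nvme0
#   collect_nvme_logs /dev/nvme1   # assumed controller name for nvme1n1
```

Deriving the file names from the device path keeps the output for each drive separate, so one drive's logs are not appended to the other's file.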

@joeschmuck I went through your Drive Troubleshooting guide again and still did not find any cause for concern. If you agree, I’ll raise this as a bug report for TrueNAS SCALE 25.10.

Sorry, I’ve been very busy now that I’m retired. I see nothing concerning in the data provided.

Questions: How is the NVMe drive connected to your system?

As crazy as it sounds, I do see two completed SMART tests: one 16 days ago and another 6 days ago.

Did you manually run a SMART Extended/Long test and Short test?

If you say that you didn’t, then it is likely the new “SMART” testing in Goldeye. If this is the case, I’d report it as a bug.
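
One way such ages can be estimated: each NVMe self-test log entry records the drive's power-on hours at the time of the test, so subtracting that from the current power-on hours gives the elapsed running time. On a NAS that stays powered on, this is close to wall-clock time. The sketch below shows only the arithmetic; the function name `selftest_age_days` and the hour values are hypothetical and are not taken from the attached logs.

```shell
# Approximate age of a self-test in days, from power-on hours (POH).
# Only meaningful for a machine that stays powered on, such as a NAS.
selftest_age_days() {
    local current_poh=$1 test_poh=$2
    echo $(( (current_poh - test_poh) / 24 ))
}

# Hypothetical numbers, not taken from the attached logs:
selftest_age_days 10000 9616   # 384 hours of runtime since the test
selftest_age_days 10000 9856   # 144 hours of runtime since the test
```

With these made-up values, the two calls print 16 and 6.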


🙂 Great time of year to be doubly retired in the state I grew up in.

They are mounted directly to the motherboard; CWWK N5105 NAS.

I don’t think I’ve run a manual SMART test on that machine since I first built it years ago. I’ll report it as a bug. Hopefully the devs can glean something from it.

Thank you for your help!

If that is the case, then the two SMART self-tests that were run on that drive were from Goldeye. I don’t see anything wrong with the drive, and it could be that a new untested drive is giving the software an alarm. Maybe Goldeye was looking for the last SMART test and said “Holy Cow! I need to run a SMART test, signal the alarm and run the test.” Once for the Extended test first, then the Short test second. I’m just guessing; I have not looked at the code to see exactly what it is doing.

If they tell you it is a problem, please post the problem report number here.

It was confirmed to be a bug.
