False 'failed a SMART selftest' alerts after upgrading from Electric Eel to Goldeye

Last month I made the jump from Electric Eel to Goldeye, following the upgrade guide in the documentation. It went buttery smooth. About a week later, I received a ‘failed a SMART selftest’ alert for one of my mirrored boot drives. Before I had even noticed the alert, it had cleared itself. As soon as I got home, I pulled the smartctl log, and it had no record of any failures. Odd, I thought, and went about my day. About 36 hours later, the same alert appeared for the other mirrored boot drive. Both times, the alert cleared itself exactly 90 minutes later. This behavior has continued, as shown in the table below. What’s going on here?

15-Mar ~00:00  On Electric Eel, latest  0 SMART errors
15-Mar ~00:30  Upgraded to Fangtooth    0 SMART errors
15-Mar 00:57   Upgraded to Goldeye      Suddenly many…that have NO RECORD per smartctl

DateTime      Alert                    Drive    Count  Duration [min]  Since last [min]  …per drive [min]
22-Mar 08:06  failed a SMART selftest  nvme1n1  1      -               -                 -
22-Mar 09:36  cleared                  nvme1n1  -      90              -                 -
23-Mar 18:37  failed a SMART selftest  nvme0n1  1      -               2071              -
23-Mar 20:07  cleared                  nvme0n1  -      90              -                 -
26-Mar 03:39  failed a SMART selftest  nvme0n1  2      -               3422              3422
26-Mar 05:09  cleared                  nvme0n1  -      90              -                 -
31-Mar 11:13  failed a SMART selftest  nvme1n1  2      -               7654              13147
31-Mar 12:43  cleared                  nvme1n1  -      90              -                 -
05-Apr 21:50  failed a SMART selftest  nvme0n1  3      -               7837              15491
05-Apr 23:20  cleared                  nvme0n1  -      90              -                 -
06-Apr 21:51  failed a SMART selftest  nvme1n1  3      -               1441              9278
06-Apr 23:21  cleared                  nvme1n1  -      90              -                 -

Run these 4 commands:

  1. smartctl -x /dev/nvme0 > nvme0_smartctl.txt
  2. nvme self-test-log --output-format=json /dev/nvme0 > nvme0_stl.txt
  3. echo "==================================" >> nvme0_stl.txt
  4. nvme error-log /dev/nvme0 >> nvme0_stl.txt

This will generate two files: nvme0_smartctl.txt, which is what smartctl reports, and nvme0_stl.txt, which is what the nvme command reports (both the self-test log and the error log).

Post both files here or feel free to send them to my email joeschmuck2023@hotmail.com.

These should shed some light on the issue, or possibly non-issue.

nvme0_smartctl.txt (2.9 KB)

nvme0_stl.txt (6.5 KB)

I modified the commands to get what I think you wanted, since the ones above threw an error:

nvme self-test-log /dev/nvme0 -o json > nvme0_stl.txt

nvme error-log /dev/nvme0 >> nvme0_stl.txt
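
Since both boot drives raise the alert, the same collection could be repeated per controller. The following is only a sketch built from the commands that worked above. The function names log_prefix and collect_nvme_logs are hypothetical, /dev/nvme1 is assumed to be the controller behind nvme1n1, and the script assumes smartctl and nvme-cli are installed and that it runs as root.

```shell
#!/usr/bin/env bash
# Derive the output file prefix from a device path, e.g. /dev/nvme0 -> nvme0.
log_prefix() {
    basename "$1"
}

# Collect the smartctl report plus the NVMe self-test and error logs for one
# controller. Hypothetical helper; needs root, smartctl and nvme-cli.
collect_nvme_logs() {
    local dev="$1" prefix
    prefix=$(log_prefix "$dev")
    smartctl -x "$dev" > "${prefix}_smartctl.txt"
    nvme self-test-log "$dev" -o json > "${prefix}_stl.txt"
    echo "==================================" >> "${prefix}_stl.txt"
    nvme error-log "$dev" >> "${prefix}_stl.txt"
}

# Usage (not run automatically; /dev/nvme1 is an assumption):
#   for dev in /dev/nvme0 /dev/nvme1; do collect_nvme_logs "$dev"; done
```

This writes one pair of files per controller into the current directory, named the same way as the two files attached above.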

@joeschmuck I went through your Drive Troubleshooting guide again and still did not find any cause for concern. If you agree, I’ll raise this as a bug report for TrueNAS SCALE 25.10.

Sorry, I’ve been very busy now that I’m retired. I see nothing wrong in the data provided.

Questions: How is the NVMe drive connected to your system?

As crazy as it sounds, I do see two SMART tests in the log: one completed 16 days ago and another 6 days ago.

Did you manually run a SMART Extended/Long test and Short test?

If you say that you didn’t, then it is likely the new “SMART” testing in Goldeye. If this is the case, I’d report it as a bug.
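
One way to arrive at "days ago" figures like the ones above: NVMe self-test log entries record the power-on hours at which each test ran, so the age of a test is the drive's current power-on hours minus the logged value. A rough sketch, assuming the drive has been powered continuously since the test; test_age_days is a hypothetical helper and the hour values are made up for illustration, not taken from the posted files.

```shell
# Age of a self-test in whole days, from power-on hours (POH).
# test_age_days is a hypothetical helper; it assumes the drive has been
# powered continuously since the test ran.
test_age_days() {
    local current_poh="$1" test_poh="$2"
    echo $(( (current_poh - test_poh) / 24 ))
}

# Made-up POH values for illustration only, not taken from the posted logs.
test_age_days 5000 4616
test_age_days 5000 4856
```

With these example values the two calls print 16 and 6. If the resulting dates line up with alert times in the table, that would point toward scheduled tests rather than manual ones.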