Multi-Report

joeschmuck · August 2, 2024, 5:04pm

Glad that worked, now I don’t have to test it again.

BarefootWoodworker · August 26, 2024, 10:07pm

Running into an odd (and somewhat random, I guess) issue where the last test time on my SAS drives just goes bonkers.

I’ve also noticed this error getting sent along with the report at midnight:

nvmecontrol: get log page request returned error
nvmecontrol: get log page request returned error

Side note: is there any way (other than buying new drives) to get the “last test time” past 65,535 hours? A number of my 2TB drives might just be a tad over the magic 16-bit number for the last run time.
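
For what it’s worth, the 65,535 ceiling is consistent with the self-test log keeping its timestamp in a 16-bit field, which starts again from 0 once the drive passes that many hours. Below is a minimal Python sketch of how a wrapped value could be estimated back to a real hour count, assuming the drive reports its power-on hours correctly. The function name and the sample numbers are hypothetical and are not taken from Multi-Report.

```python
WRAP = 1 << 16  # a 16-bit hour counter holds 0..65,535 and then rolls over


def unwrap_selftest_hours(logged_hours: int, power_on_hours: int) -> int:
    """Estimate the real lifetime hour of a self-test from a wrapped 16-bit value.

    Assumes the test ran at or before the current power-on hours, and picks the
    latest hour that equals the logged value modulo 65,536.
    """
    if not 0 <= logged_hours < WRAP:
        raise ValueError("logged_hours must fit in 16 bits")
    if power_on_hours < logged_hours:
        # No consistent unwrapping exists, so hand back the logged value as-is.
        return logged_hours
    wraps = (power_on_hours - logged_hours) // WRAP
    return logged_hours + wraps * WRAP


# Hypothetical drive: 70,000 power-on hours, log shows the last test at hour 4,400.
estimate = unwrap_selftest_hours(4400, 70000)
print("estimated test hour:", estimate)
print("hours since that test:", 70000 - estimate)
```

The estimate is only as good as the assumption that the most recent possible hour is the right one; a drive that has not been tested for more than 65,536 hours would be indistinguishable from one tested recently.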

dak180 · August 26, 2024, 11:06pm

I would be interested to see if you have similar issues with TrueNAS Report as well.

joeschmuck · August 27, 2024, 1:11am

@BarefootWoodworker
I suspect that drive da12 is not having SMART tests run on it; the last test was 217 days ago (+2). Check and make sure you have SMART testing enabled and the drive actually selected. You can first try to run the command smartctl -t short /dev/da12, wait 3 minutes, then run the Multi-Report script again. See if that alarm goes away.

The other drive looks to have a single Raw Read Error Rate count, and that is likely valid. However, it is a “Rate”, which is an average over the number of reads the drive performs, so this could go back to zero, or it could go up. There is a variable you can change to allow this drive to have a value of 1 without generating the alarm, but it will alarm if the value increments. Send me the dump requested below and I’m certain I can make it better. I always provide quick service with a smile, and typically will provide you a solution in less than 24 hours. I still have a day job for 35 days, then retirement.

The next time, post the column titles so others can understand what they are reading.

As for the nvmecontrol error, odds are your NVMe drives do not support some portion of requesting data. Earlier models didn’t need to.

I would appreciate it if you would run the script and add -dump email to the end. This will generate an email to me and provide me all the data I need in order to give you solid positive answers and also figure out what I may need to change for the NVMe drive data. It may be a drive that does not respond and I have to block it.

Dumping me the data provides no personal information except your email address and I am a vault, a 62 year old vault but still, I don’t share.

Hope to be hearing from you.
-Mark (AKA Joe)

Tha_reaper · August 27, 2024, 8:22am

I set up Multi-Report a couple of days ago, and it ran great. But yesterday I added an M.2 SATA disk in an external USB enclosure, and now when I run Multi-Report I get the following error:

NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)

I have reconfigured the script, but the error remains. Any thoughts?

Davvo · August 27, 2024, 9:13am

USB often doesn’t work well with SMART, but joe will surely have a more detailed answer. I would start by running the script with the -dump email parameter to help troubleshoot.

Tha_reaper · August 27, 2024, 9:20am

I think the USB is the problem indeed, but I would not care if it just gave me a warning and skipped that drive. But the whole script stopped working, and that’s an issue.
I have done the dump, so I hope Joe is able to help out. Thanks.

EDIT: OK, weirdly, when I used the -dump email parameter, the report does get created and sent to my mail. I don’t think it’s an option to add that to my cron job though, unless Joe likes spam, lol

joeschmuck · August 27, 2024, 9:31am

This is not you, and I doubt it is the enclosure. I also had those messages, so Multi-Report does have a built-in fix. This has to do with TrueNAS and the NVMe. In a nutshell, TrueNAS sends the NVMe a command that the NVMe does not recognize, and the NVMe sends back the response you see.

In the multi_report_config.txt file you will find the option called:

NVMe_Ignore_Invalid_Errors="disable"	# Set to "enable" to ignore "Invalid Field in Command" messages.  Google this message to see if you are comfortable ignoring it.

Change the value “disable” to “enable”.
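
If you would rather script that edit than open the file by hand, here is a minimal Python sketch. It assumes the option sits on a single line in the KEY="value" form shown above. set_config_value is a hypothetical helper, and it works on text rather than on the file itself because the location of multi_report_config.txt depends on your install.

```python
import re


def set_config_value(config_text: str, key: str, value: str) -> str:
    """Return config_text with key="value" set, leaving any trailing comment alone."""
    pattern = re.compile(rf'^({re.escape(key)}=)"[^"]*"', flags=re.MULTILINE)
    # A function replacement avoids surprises if value contains backslashes.
    new_text, count = pattern.subn(lambda m: f'{m.group(1)}"{value}"', config_text)
    if count == 0:
        raise KeyError(f"{key} not found in config")
    return new_text


# Sample line modelled on the option quoted above (comment shortened).
sample = 'NVMe_Ignore_Invalid_Errors="disable"\t# Set to "enable" to ignore the messages.\n'
updated = set_config_value(sample, "NVMe_Ignore_Invalid_Errors", "enable")
print(updated, end="")
```

Keep a copy of the original config before writing anything back, since the sketch does not validate the value it is given.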

Let me know how it goes. And as @Davvo said, if you would run the script using the -dump email and put in the message something to indicate this is a dump for nvme data info, or something, then I can add the nvme data (just the SMART data, not personal data) to my group of drive test data. I run the script against a large variety of drive SMART data and I have very few NVMe drives in the group. These will become more prevalent so more strange things will pop up.

Cheers

EDIT: Just found your data in the SPAM folder. I will examine it in about 10 hours when I return home from work. I think the answer I gave above will still fix the issue.

Tha_reaper · August 27, 2024, 9:43am

Thanks. I will try this out. By the way, as I said, it’s an M.2 SATA disk, so an NGFF disk; maybe that’s why it doesn’t respond as an NVMe is expected to?

EDIT: With the value changed in the config file the script still refuses to run from the cron job.

joeschmuck · August 27, 2024, 9:06pm

Please be clear as this is not the original problem. Does the script run from the CRON or not? If not, did you follow the special instructions for SCALE 24.04 to run from the CRON? The Quick Start Guide has it also in the instructions. If you are just trying to run from the command line, use the Shell window in the SCALE GUI, not the CRON Job. No thank you, no need to SPAM me.

If you are saying the “Invalid Field in Command:” is still happening, let me know. I will need to contact you via email to obtain some additional data. Unfortunately the -dump email option does not include that fail code, which is odd but when it comes to nvme drives, we are still learning. Looks like possibly another modification to capture debug information. Until recently smartmontools didn’t even support nvme drives.

Which drive is reporting the invalid field? nvme0 (serial ending in 403) does not support self-tests so I did not receive a self-test log for that drive, but that makes sense. nvme0 drive also looks good in spite of having no self-test accomplished. nvme1 (serial ending in 022) looks good and the self-test passed and is current.

I just want you to be aware, until NVMe standard 2.0 came out, SMART Self-tests were not required on any nvme drive. I just realized standard 2.1 is out now. I’m sure it will be some time before we see those materialize in nvme drives, but some light reading for me tomorrow.

You know, there is one more thing I should clear up… The “Invalid Field in Command:” has absolutely nothing to do with the Multi-Report script. The nvme drive recorded an error in which TrueNAS sent it an invalid command. Multi-Report just reads the error log and reports what it read. I have enabled an option to ignore these invalid errors in Multi-Report, which is what that configuration change does. Don’t believe me? (I’m sure you do but it is nice to have people verify I’m not misinforming them) then Google the error message and read some of the information out there. It is not just TrueNAS.

I hope you are still awake after reading this long posting.

-Joe

Tha_reaper · August 28, 2024, 8:24am

Sorry, to be clear: this morning I received the report from the cron job, just as when I ran it over an SSH session, so the reports are working again and changing that configuration setting seems to have worked. I still get the error messages by mail that accompany the report:

NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)

That’s the only thing that I see, so it doesn’t point to a specific drive, but the only drive missing from the reports is my /dev/sdg disk, which is the latest drive that I added: the SATA M.2 NGFF drive.
I assume that’s the one causing the errors.

I’m willing to provide more information if that can help you fine-tune the script further, but I’m not super technical, so I’ll probably need some clear instructions on how I can help, lol.

EDIT: I can do a SMART test on that /dev/sdg drive using the TrueNAS GUI though, and it seems fine. I also see the short tests that the script prompts the drive to do, and they also passed fine.

joeschmuck · August 28, 2024, 9:29am

I’m not sure what your USB adapter is capable of; however, if TrueNAS sees it as an NVMe, that option is not listed in the GUI, which is one reason for using my little script until it is supported. You can try to run a SMART test manually for the heck of it with smartctl -t short /dev/sdg.

I am curious why sdg does not show up in the report, and I have a theory. sdg is on the report, listed in the Text portion of the report. The serial number is "\x03D (part of your apps pool), which is definitely not normal, and I suspect this is the USB adapter doing things to make it not compatible, or you are using TrueNAS on a VM and passing through is a problem.

I will contact you via email soon (after I return from my day job) to ask some specific questions and ask for some test outputs from the command line. One output you should look at is smartctl -x /dev/sdg, so if you could send that to me it would speed this up. If this is the USB drive adapter, you may not have a fix for it with Multi-Report. But let’s see where it goes.

Tha_reaper · August 28, 2024, 9:34am

Well, the output of that command is very short:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdg: Unknown USB bridge [0x152d:0x0581 (0x4204)]
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

I run TrueNAS on bare metal, no VM.

Davvo · August 28, 2024, 10:43am

That’s USB of Doom.

joeschmuck · August 28, 2024, 10:36pm

@Tha_reaper
That USB adapter is the issue, as @Davvo has indicated. The NVMe drive should be presented as ‘nvme2’, for example, but it is not. I will send you an email; maybe I can make it work, however it will require more work on your part than on mine. I can’t simulate what you have, so you have to do all the leg work, if you choose to. I would like to get the script to work for the USB drive, but I leave that up to you, if you want to spend days going back and forth to figure out what needs to happen and then test a few Beta versions. I have a Beta in testing now for the next release, and I was hoping it would come out many weeks ago, but life happens, so playing with the script always gets last priority over family.

To be honest with you, I would not use that USB drive except maybe as a boot drive. Use Multi-Report in its default configuration to get an email every Monday which contains your TrueNAS config data to help you restore the boot drive if you ever need to. But if TrueNAS is working for you and Multi-Report seems to be working with that one exception, that is perfectly fine.

Chat soon,
-Mark (aka Joe the Schmuck)

BarefootWoodworker · August 28, 2024, 11:43pm

@joeschmuck

It runs on a daily basis. The issue just pops up randomly; for whatever reason, either Multi-Report is puking, the drive won’t return the time properly, or something.

Email dump should be on its way. If you need statistical data or anything that comes along with the report, just let me know.

joeschmuck · August 29, 2024, 1:28am

Exactly what pops up? Please do not let me assume anything; it’s better for the both of us. My assumption would be that you meant drive da15 randomly lists Reallocated Events of “1” and then “0” and keeps doing that. I would be surprised if that happens, as I have never seen that. Normally once a sector is reallocated, it remains that way forever. That isn’t the same with Current Pending Sectors, as that can go away and come back. The same goes for any “Rate” value.

As of this message I have not received a dump. Make sure you used -dump email as that will send you and me the email, if you just use -dump then only you get the email. Bed time for me so if I see it in the morning, then I will look at it while at work. Hopefully it is something obvious.

Tha_reaper · August 29, 2024, 9:04am

To be clear, it is not an NVMe drive. It’s a SATA drive and doesn’t support the NVMe protocol, but it uses an M.2 connection.

Getting a working result was easy by adding the -d sat parameter:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTENSO SSD
Serial Number:    1802403003001824
Firmware Version: W0704A0
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Aug 29 11:00:19 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   050    -    0
  5 Reallocated_Sector_Ct   -O--CK   100   100   050    -    0
  9 Power_On_Hours          -O--CK   100   100   050    -    62
 12 Power_Cycle_Count       -O--CK   100   100   050    -    22
160 Unknown_Attribute       -O--CK   100   100   050    -    0
161 Unknown_Attribute       PO--CK   100   100   050    -    100
163 Unknown_Attribute       -O--CK   100   100   050    -    4
164 Unknown_Attribute       -O--CK   100   100   050    -    1258
165 Unknown_Attribute       -O--CK   100   100   050    -    2
166 Unknown_Attribute       -O--CK   100   100   050    -    1
167 Unknown_Attribute       -O--CK   100   100   050    -    1
168 Unknown_Attribute       -O--CK   100   100   050    -    5050
169 Unknown_Attribute       -O--CK   100   100   050    -    100
175 Program_Fail_Count_Chip -O--CK   100   100   050    -    0
176 Erase_Fail_Count_Chip   -O--CK   100   100   050    -    0
177 Wear_Leveling_Count     -O--CK   100   100   050    -    0
178 Used_Rsvd_Blk_Cnt_Chip  -O--CK   100   100   050    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   050    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   050    -    0
192 Power-Off_Retract_Count -O--CK   100   100   050    -    17
194 Temperature_Celsius     -O---K   100   100   050    -    43
195 Hardware_ECC_Recovered  -O--CK   100   100   050    -    0
196 Reallocated_Event_Count -O--CK   100   100   050    -    0
197 Current_Pending_Sector  -O--CK   100   100   050    -    0
198 Offline_Uncorrectable   -O--CK   100   100   050    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    0
232 Available_Reservd_Space -O--CK   100   100   050    -    100
241 Total_LBAs_Written      ----CK   100   100   050    -    6641
242 Total_LBAs_Read         ----CK   100   100   050    -    1534
245 Unknown_Attribute       -O--CK   100   100   050    -    3780
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O     88  Current Device Internal Status Data log
0x25       GPL     R/O     32  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        57         -
# 2  Extended offline    Completed without error       00%        38         -
# 3  Short offline       Completed without error       00%        33         -
# 4  Short offline       Completed without error       00%        14         -

Selective Self-tests/Logging not supported

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
Device State:                        Active (0)
Current Temperature:                    43 Celsius
Power Cycle Min/Max Temperature:     43/43 Celsius
Lifetime    Min/Max Temperature:     25/54 Celsius
Specified Max Operating Temperature:   100 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              22  ---  Lifetime Power-On Resets
0x01  0x010  4              62  ---  Power-on Hours
0x01  0x018  6       435262825  ---  Logical Sectors Written
0x01  0x020  6         8450722  ---  Number of Write Commands
0x01  0x028  6       100566767  ---  Logical Sectors Read
0x01  0x030  6         2146091  ---  Number of Read Commands
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               0  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x000a  4            1  Device-to-host register FISes sent due to a COMRESET
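
For anyone who wants to check the “last test time” from output like this themselves, here is a minimal Python sketch. It assumes the self-test rows keep the layout shown above (remaining percentage, lifetime hours and LBA as the last three columns) and uses the rows and the Power_On_Hours value of 62 from this output. The function names are hypothetical and this is not how Multi-Report does it.

```python
def parse_selftest_row(line: str) -> dict:
    """Split one '# N ...' row of the smartctl self-test log into fields."""
    text = line.strip()
    if not text.startswith("#"):
        raise ValueError(f"not a self-test log row: {line!r}")
    parts = text[1:].split()
    if len(parts) < 5:
        raise ValueError(f"too few columns: {line!r}")
    return {
        "num": int(parts[0]),
        # Description and status both contain spaces, so keep them together.
        "description_and_status": " ".join(parts[1:-3]),
        "remaining": parts[-3],
        "lifetime_hours": int(parts[-2]),
        "lba_of_first_error": parts[-1],
    }


def hours_since_last_test(rows, power_on_hours: int) -> int:
    """Power-on hours elapsed since the most recent logged self-test."""
    latest = max(parse_selftest_row(r)["lifetime_hours"] for r in rows)
    return power_on_hours - latest


ROWS = [
    "# 1  Short offline       Completed without error       00%        57         -",
    "# 2  Extended offline    Completed without error       00%        38         -",
    "# 3  Short offline       Completed without error       00%        33         -",
    "# 4  Short offline       Completed without error       00%        14         -",
]
print("hours since last self-test:", hours_since_last_test(ROWS, 62))
```

On a drive with more than 65,535 power-on hours the lifetime column may already have wrapped, so the subtraction would need adjusting first.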

joeschmuck · August 29, 2024, 9:31am

I was glad the email I sent you led you to the solution. Now I have a path to go down and I may be able to make Multi-Report allow for this. I will be examining the script again, as I am pretty sure I have the -d sat parameter in the script already (it’s been a while since I added it), to see how to make it work. I will be sending you a few things to try so the script will know to try the -d sat command, then I will send you a Beta to test. Pretty easy now, well, for you. It may take me several hours to make sure I do not introduce a problem while updating the script.
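
The general idea of “try plain smartctl, then fall back to -d sat” can be sketched in a few lines of Python. This is only an illustration under assumptions: Multi-Report itself is a shell script, the helper names here are hypothetical, and the detection relies on the two message lines smartctl printed earlier in this thread.

```python
import subprocess


def needs_device_type(output: str) -> bool:
    """True when smartctl could not identify the USB bridge on its own."""
    return "Unknown USB bridge" in output or "Please specify device type" in output


def smartctl_command(device: str, device_type=None) -> list:
    """Build the smartctl -x argument list, optionally forcing a device type."""
    cmd = ["smartctl", "-x"]
    if device_type:
        cmd += ["-d", device_type]
    cmd.append(device)
    return cmd


def _run_smartctl(cmd: list) -> str:
    # smartctl uses non-zero exit codes as status bits, so do not raise on them.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def read_smart(device: str, run=None) -> str:
    """Read SMART data, retrying once with -d sat if the bridge is unknown."""
    run = run or _run_smartctl
    output = run(smartctl_command(device))
    if needs_device_type(output):
        output = run(smartctl_command(device, "sat"))
    return output


# Usage on a real system (needs root and smartmontools installed):
#   print(read_smart("/dev/sdg"))
```

The run parameter exists so the retry logic can be exercised with a stand-in function instead of real hardware. Whether -d sat is the right type depends on the bridge chip, so other adapters may need a different value.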

And thanks for clarifying the drive type as well.

dak180 · August 29, 2024, 11:51am

disk-burnin-and-testing may interest you for this.