Multi-Report

You’ve been scammed. The drive had been in use for 21k hours more than advertised, roughly two and a half years.


Thanks folks
I will contact Seagate

Now to answer your specific question… Once you have set up the script, it will not automatically change anything. If the drives listed change, that is because you added or removed drives on your system.

Thanks for the info. So now that I have made some adjustments to the SMART Tasks in TrueNAS, Multi-Report won’t change them?

Multi-Report will manage its own testing fine. The Drive_Selftest script (part of Multi-Report) scans every drive in the system and tests based on its settings. For example, if you use the default to perform a Short test each time the script is run (preferably daily), then a Short test will be run on all drives, regardless of whether they were recently added or not. And if you use the default of once-a-week Long testing, it will test each drive once a week, AGAIN provided the script is run daily. Running this script daily is important; otherwise a drive whose test was scheduled for that day may be skipped.
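If you schedule the daily run yourself rather than through the TrueNAS Cron Jobs screen, a crontab-style entry would look something like this (the path is only an example, point it at wherever you actually put the script):

# Run Multi-Report every day at 02:00 (example path, adjust to your install)
0 2 * * * /bin/bash /root/scripts/multi_report.sh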

My comment was for people who use TrueNAS to schedule SMART testing. If you select All Disks in the TrueNAS setup then it “should” test all drives. However, if you have NVMe drives, TrueNAS currently does not actually support SMART testing of them even though it is listed in the GUI. Maybe 25.04 final will support it, but as of now it does not. Multi-Report/Drive_Selftest does support NVMe drive testing, provided the NVMe drives themselves support self-tests. Some drives built prior to the NVMe v1.4 specification do not.
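If you are not sure whether a particular NVMe drive supports self-tests at all, smartctl should be able to show you, along these lines (the device name is just an example):

# List controller capabilities; "Self_Test" under Optional Admin Commands
# means the drive implements NVMe self-tests (example device name)
smartctl -c /dev/nvme0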

Hope that clears everything up.

I plan to release a new version of both scripts soon. If you have any USB drives (not “Flash” drives) that previously could not be tested, you might be able to test those now. If not, then we can work together to see if it is even possible, and if it is, I might be able to update the script to add that one. It was something I kind of accidentally did to help out one person, and it has now become an option. With that said, if the USB interface does not pass the commands, I cannot magically make it work. Right now I’m figuring out how to make this a user-updatable function so I do not need to push out an update for every variation. I could tell you to edit line XYZ, but with over 10,000 lines, why risk it.
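If you want to poke at a USB drive yourself to see whether the bridge passes the commands through, you can try smartctl with an explicit device type, roughly like this (the device name and type are only examples, and not every bridge honors them):

# Ask for SMART data through the USB bridge using SAT pass-through;
# if the bridge does not translate the command, smartctl reports an error
smartctl -d sat -a /dev/sdX
# Some bridges need a different type, for example -d usbjmicron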

No worries, thanks for the explanation.

At the moment I am running Multi-Report daily, so in that case should I remove these tasks from TrueNAS:

@joeschmuck I’m unsure if you will check GitHub, so I also put the link here.

edit: I can’t add a link. Please take a look at “Multi-Report/issues/20”.

Yes, I get an email from GitHub and then I go check. I would have answered sooner, but I just woke up.

P.S. Welcome to the TrueNAS forums. I hope your time here is positive.


I’m a little puzzled. My SSDs are SMART tested by TrueNAS (CORE 13.1). Yesterday, the mail I got from Multi-Report showed this:


Today’s mail showed this:

Unless the laws of time changed while I was asleep, I can’t really see how “last test age” could go from zero to 49 overnight. The two SSDs in question are my boot pool, organised as a mirrored pair.

@joeschmuck is there a straightforward way to disable the SMART tests in Multi-Report if I already have them scheduled in TrueNAS?

Yes there is. Oh, you want me to tell you how.

Run the script using multi_report.sh -config and select option D. Then select Option E and move forward from there.

Or you could manually edit the multi_report_config.txt file and change the line External_SMART_Testing="true" to false.

This will disable all SMART testing by Multi-Report.
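If you prefer a one-liner over opening an editor, something along these lines should work from the directory where the script and its config live (keep a backup first; on TrueNAS CORE/FreeBSD, sed -i needs an empty suffix argument, i.e. sed -i ''):

# Keep a backup, then flip the External_SMART_Testing switch to false
cp multi_report_config.txt multi_report_config.txt.bak
sed -i 's/External_SMART_Testing="true"/External_SMART_Testing="false"/' multi_report_config.txt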

@unseen That is very odd. Do you see what I see? It’s very obvious: the Power On Hours value does not match the hours used for the Last Test Age. What I don’t understand is why it dropped 1192 hours. There is apparently some calculation error.

Could you run the script again to verify whether you still have this problem? If you do, please send me a dump using -dump emailextra and hopefully I will see the cause. If you cannot recreate the error, please forward the problematic email to joeschmuck2023@hotmail.com so I can figure out what went wrong and fix it before the next version comes out.
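In other words, run the script with that switch from wherever you keep it, something like:

# Example invocation from the script's directory; it gathers the extended data I need
./multi_report.sh -dump emailextra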

Thanks!

Very interesting. Without me changing anything, today’s report was normal.
Let’s just assume this was a random cosmic-ray problem. (Although my ECC DRAM should rule that out…)

If I see it again, I’ll try reproducing the problem.

If you still have the text, would you send it my way? Cut and paste in a message here is fine. I need all the text as well. I will try to duplicate it by faking out the drive values and the computer date and time. It usually happens when certain things align. It is crazy hard to track these things down. I blame BASH.

@Deeda Thanks for the data. Investigating.

I forwarded the e-mail to you yesterday. Hopefully it arrived?

Today’s weekly report has, for the first time ever, a title of

WARNING SMART Testing Results for TrueNAS WARNING

and it refers to my temporary Crucial CT240BX500 SSD.

This single SSD, mounted in an external SATA/USB caddy, contains non-critical data. Last week (Thurs 20th) it accidentally lost power during a short power cut, while my UPS kept my TrueNAS system itself running. Whilst I expect this caused the problems shown below, the Multi-Report output for Fri 21st showed no problems and the wear level for this SSD said 16.

However, a week on, today’s Multi-Report says

**WARNING LOG FILE**
**Drive: 2402E88E5F48 - Wear Level = 6%**

and the SMART assessment for this drive says


########## SMART status report for sdd drive (CT240BX500SSD1 : 2402E88E5F48) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5209
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   006   006   000    Old_age   Always       -       941
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       24
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       12
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   058   053   000    Old_age   Always       -       42 (Min/Max 23/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   006   006   001    Old_age   Offline      -       94
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27022695455
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       844459232
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       15362457072
249 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
250 Read_Error_Retry_Rate   0x0032   100   100   000    Old_age   Always       -       0
251 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       3100253548
252 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       84
253 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
254 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
223 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       2

Last week’s multi-report for the same SSD said


SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5045
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   016   016   000    Old_age   Always       -       848
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       24
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       12
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   058   053   000    Old_age   Always       -       42 (Min/Max 23/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   016   016   001    Old_age   Offline      -       84
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       24937024703
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       779282021
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       13844826432
249 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
250 Read_Error_Retry_Rate   0x0032   100   100   000    Old_age   Always       -       0
251 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       1484866081
252 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       73
253 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
254 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
223 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       2

My assessment, then, is that the outage did something to the SSD which has increased the wear level but seems not to have perturbed anything else.

Is there anything I should now be doing to the SSD in terms of running a script on it, or stress testing it, or replacing it (whilst remembering that it is not a critical part of my system, it exists merely for experimental/hobby reasons, and I am fully aware that single, USB-connected drives are a bad idea)?

(My main backing store for my actual “valuable” data is in a 4 drive RAIDZ2 configuration with a 3-2-1 backup protocol in place).

@E_B The drive appears to be fine. 94% life remaining.

I am sending you a separate message to collect a little data to find out what the script hit on to make it think you only have 6% left. I have an idea, but I need to prove it and then provide a possible solution.

Great! I will try to do anything I can, just ask.

Are you sure?
In the second SMART report, the one from a week earlier, the raw value was 84. If this number is meant to be the % remaining, why is it going up and not down over time?

Got the data and just now had time to look at it.

Here is what is happening:
Examine ID 202 “Percent_Lifetime_Remain”

202 Percent_Lifetime_Remain 0x0030   006   006   001    Old_age   Offline      -       94

Raw value is 94 and with a title of “Percent Lifetime Remaining” we believe it is how much life remains. Sounds good, right? Based on this and the other values (lack of errors) I have to make the judgement call that the drive is actually 94% good and 6% used.

But then you look at VALUE=006, WORST=006, and THRESH=001 (which is what the script looks at and has been fairly reliable over the years).

THRESH is the point at which a failure is expected. The closer VALUE gets to it, the worse we are.

WORST is the worst value recorded.

VALUE is the current value.

If THRESH is a low number, then VALUE being a low number means we are on the wrong side of good. Typically I see THRESH for Wear Level set at 25 or 75, never 100 or 1, so it is a bit surprising to think this is a valid piece of data for this drive.

The end result is the data contradicts itself.

As for the script, there are many different ways an SSD can report wear level, too many, and I had to choose an order in which to check them. If one works, that is the value we use. Sometimes it is wrong, but that is why I have the Custom Drive Configuration in the script: the end user can make an adjustment for an individual drive.

In this situation, you have one option that will fix this: reverse the VALUE (100 - VALUE = result).
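To put numbers on it for this drive: the normalized VALUE is 006, so 100 - 6 = 94, which lines up with the 94% life remaining I believe the raw value is showing. A trivial shell check if you want to see it spelled out:

# Normalized VALUE from ID 202 (the VALUE column, not the raw value)
VALUE=6
# Reversed wear level, i.e. percent of life remaining
echo $((100 - VALUE))   # prints 94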

This is actually very easy to do using the advanced configuration page, but to offer a very fast way I will just provide the line that needs to be inserted into the multi_report_config.txt file:

Custom_Drives_List="2402E88E5F48:55:65:0:9:0:0:5:5:100:5:100:2:0:100:r:d"

My advice: use the above value and watch the wear level. If it moves oddly, let me know. I cannot account for every odd decision a manufacturer makes, but I try to provide a workaround.

Great work - thanks! I have implemented it and I’ll watch to see what happens over the next few weeks.

It’s a puzzle to know why the value seems to have changed so rapidly during this last week (and I note what you said about the power cut being an unlikely culprit). I have a couple of other Crucial SSDs in a different machine and I can run smartctl on them to see what they say, if that would help in any way. Crucial drives are made by Micron, which is a well-known make, so I would presume commonality between my various SSDs.

Thanks for this help and, again, let me know if I can do anything to help you.