Multi-Report

The Spencer script does still work, or at least it does for me. I didn't get an email this morning from one system for Multi-Report. At lunch, when I had time, I manually ran the Multi_report script, which reported that the script was already running. After removing the lock directory per the instructions in Multi-Report, this is the command-line result:

Multi-Report v3.0.1 dtd:2024-04-13 (TrueNAS Scale 23.10.2)
Checking for Updates
Current Version 3.0.1 – GitHub Version 3.0.1
No Update Required
2024-04-17 13:24:36.954454 - Spencer V2.0 - BETA - 8/22/23.
2024-04-17 13:24:36.954524 - Writen by NickF with Contributions from JoeSchmuck
2024-04-17 13:24:36.954540 - Spencer is checking log files for errors and pushing output to Multi-Report.
2024-04-17 13:24:36.957831 - Operating System version determined: TrueNAS SCALE, Version: 23.10.2 - Debian GNU/Linux 12 (bookworm) (TrueNAS SCALE)
2024-04-17 13:24:37.020029 - Found 4 new matching errors in the logs
2024-04-17 13:24:37.020299 - Loaded 31 previous errors from the error file
2024-04-17 13:24:37.020612 - Wrote the current error counts to the error file
2024-04-17 13:24:37.021544 - Generated the error table for the email content
2024-04-17 13:24:37.021610 - Selected email subject: [SPENCER] [PREVIOUS ERROR] No new Errors Found, Previous Errors Exist in Log for owen
2024-04-17 13:24:37.021645 - Email sending is skipped due to USE_WITH_MULTI_REPORT setting.

I've used this script on my Core server for a long time with no issues. I just built a SCALE server, and when I try to run the ./multi_report.sh -config option, it gives this error:

‘fdisk’ is missing, Contact Joe referencing this message.

I’m running TrueNAS-SCALE-23.10.2.

Any ideas?

Every version of SCALE should have fdisk in the Debian distribution, so I'm perplexed why you would get that message, and I know 23.10.2 does have it. The only thing I can think of at this time is that you are not running as user 'root'. If you run fdisk by itself in the GUI → Shell, you should see a message that states "fdisk: bad usage", which indicates fdisk ran even though we did not provide all the parameters it requires. If you are logged in as 'root', that should work; if you are a different, unprivileged user, I expect it to fail.
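
If it helps, here is a rough check (just a quick sketch, not part of Multi-Report) you can paste into the GUI → Shell to confirm both points at once:

#!/bin/bash
# Report whether we are root and whether fdisk is reachable for this user.
if [ "$(id -u)" -ne 0 ]; then
  echo "Not running as root (uid $(id -u)); privileged tools like fdisk may not be found or may fail."
fi
if command -v fdisk >/dev/null 2>&1; then
  echo "fdisk found at $(command -v fdisk)"
else
  echo "fdisk is not in this user's PATH"
fi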

1 Like

Switching to root solved it. Thanks!

2 Likes

I will add a note to the error message telling the user to make sure they are running from a privileged account; that should help someone else in the future.

1 Like

You may still need to use sudo in some cases on a Cobia install to execute commands when using the admin account, or re-enable the root account (enable it and set a password for the root user) that was disabled during install and log in as root. I got tired of the mumbo jumbo dice roll of whether the admin account was going to work correctly or not and just re-enabled the root account. Anytime I need to install or edit an app, or use the command line, I log in as root and have a lot more success.
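
For example, from the admin account you would prefix the call with sudo; the path below is only a placeholder for wherever you keep the script:

sudo /mnt/tank/scripts/multi_report.sh -config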

2 Likes

Hi Joe-

BLUF:
Would you be able to easily add the serial number of each drive in the pool to the "drives in this pool" report? Thanks.

Details on my setup and reason for need:
I appreciate the effort you put into Multi-Report. I still use it to send me emails every day. I recently upgraded my main system to a Supermicro SC-847 4U 36-bay double-sided chassis with a BPN-SAS2-846EL1 24-port 4U SAS2 6Gbps single-expander backplane, a BPN-SAS2-826EL1 12-port 2U SAS2 6Gbps single-expander backplane, and two LSI 9300-8i HBAs.
It is fully populated with 36 SATA drives and is working very well.
In doing testing and setup, I noticed that it became tedious to identify which disk belonged to which pool without the serial numbers. Your Multi-Report helped greatly with this. However, I think it would have helped even more if the serial numbers were referenced in the drives report for each pool along with the sdxx reference. I did pull this out from other areas in the report and matched them up, but that was a little tedious with 36 drives. (It was a lot easier when I only had 8 or 10!)
Example excerpt from report:
Drives for this pool are listed below:
a8489f55-6d95-4315-b329-101e44cf327b → sdc1
7d58ab47-aecb-4241-83c8-f13e63340eb2 → sdd1
16ab36b4-cdc8-4e54-b72f-6f7a0a20314f → sde1
1d93e89a-891a-4328-822a-125628b3b49e → sdf1
8e2145be-414d-4b3f-b17e-2277ac1c9d10 → sdi1
59ef8ecf-72ab-4b18-8225-d20327c08124 → sdo1
dde28b41-c609-4d32-8e48-52d49fbbfdcf → sdp1
9447fab4-01ac-44b0-afac-c4a8a755038d → sdu1
3db08bee-8420-401b-93a4-c3fcaef02569 → sdv1
5322755d-830d-4a1f-9268-c220147afaf5 → sdai1

1 Like

For lock-file scripting, look into flock. It automatically releases the lock even on failure, which avoids the old "if the lock file exists, please delete it" style of scripting.

Here’s an example of a cron job:

/usr/bin/flock -n /var/lock/BackupToServarica.lock /mnt/tank/Scripts/BackupToServarica.sh
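
If you would rather keep the locking inside the script itself instead of on the cron line, flock can also be used this way (a sketch reusing the same lock path; adjust as needed):

#!/bin/bash
# Take the lock on file descriptor 200; exit if another copy already holds it.
exec 200>/var/lock/BackupToServarica.lock
flock -n 200 || { echo "Another instance is already running, exiting."; exit 1; }

# ... the actual backup work goes here ...

# The lock is released automatically when the script exits, even if it crashes.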

In my CORE system I have the serial number listed for each drive. SCALE should hardly be different.

The serials are listed per disk elsewhere in the report; it just does not list them in a way that associates them with each pool. I have to go back and forth to manually look up the serials for each drive in a particular pool, which is tedious with 36 drives spread over 4 pools. There are 18 10TB drives, 10 2TB, and 8 4TB. It's just a convenience thing that I found I needed while building and testing.

1 Like

You should have seen it when the gptid was not listed; you had to do all the work manually. That was fun!

I am trying to finish up 3.0.2, which is a bit different. I have a problem I'm trying to fix that seems to occur with a lot of drives, or something along those lines. I was hopeful for last weekend, but maybe this weekend.

2 Likes

That would be tougher! I look forward to seeing 3.0.2! Thank you.

Multi-Report does list drive serial numbers in the report, and that's what I used to help build an Excel sheet mapping each drive to a slot in the 36-drive backplane. When the GUI shows a device ID as degraded/offline/etc., I know what serial it is, where in the system that drive resides, what pool and dataset it belongs to, and what level of protection (RAIDZ level) it's in.

This is a partial first line of the drive table in the Multi-Report email sent every morning.

Spinning Rust Summary Report
Device ID   Serial Number   Model Number           HDD Capacity   RPM
/dev/sda    NAHWDPDX        HGST HDN726060ALE610   6.00T          7200
etc.

The other IDs are also listed further down in the report for the datasets.

NAME                                      STATE     READ WRITE CKSUM
Pool1                                     ONLINE       0     0     0
  raidz2-0                                ONLINE       0     0     0
    6fb89273-9dee-4e57-9b26-e5a41cdea694  ONLINE       0     0     0
etc.

And Drives for the pool

Drives for this pool are listed below:
7a4b6a6d-e076-487a-9e63-dc4d8d30c10f → sda2
6fb89273-9dee-4e57-9b26-e5a41cdea694 → sdc2
etc.

Not sure what more you would want, as it's all in the report. There are similar reports for SSD drives, but I didn't include examples of those as they are pretty much the same. There is also a weekly CSV statistics report that gets sent in the email along with a couple of other files. Those reports are also available on the server in the directory where the script is stored and run from.

I did make the minor change, only because it is minor. But I agree, all the data is there, and if you use the statisticalsmartdata.csv file, it contains the drive ID and S/N; it is easy to open and look there and at the pool. Now in my mind I want to add some way to identify which pool each of the drives belongs to (kidding); I have bigger problems I'm trying to fix in 3.0.2. If someone wants a Beta copy, you need only toss me an email (listed in many places) and I can send it your way. Trust me, I'm abusing you as a Beta tester. I really dislike putting out a fix to a fix. 3.0.2 was originally a bug fix, but while waiting on verification that the fixes worked, I started adding something I hope will be valuable. It should be cool for the home users and good statistical data for the IT folks.

Here is what the new output section looks like for the serial number request:

########## ZPool status report for farm ##########
pool: farm
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using ‘zpool upgrade’. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:12:14 with 0 errors on Fri Apr 19 18:19:10 2024
config:

NAME           STATE     READ WRITE CKSUM
farm           ONLINE       0     0     0
  raidz1-0     ONLINE       0     0     0
    nvme2n1p2  ONLINE       0     0     0
    nvme1n1p2  ONLINE       0     0     0
    nvme0n1p2  ONLINE       0     0     0
    nvme3n1p2  ONLINE       0     0     0

errors: No known data errors

Drives for this pool are listed below:
91accc34-9f4d-11ee-b703-000c29c58878 → nvme3n1p2 → S/N:511230818150000089
91ab3355-9f4d-11ee-b703-000c29c58878 → nvme0n1p2 → S/N:511230818150000096
91ae7448-9f4d-11ee-b703-000c29c58878 → nvme2n1p2 → S/N:511230818150000088
91b0659e-9f4d-11ee-b703-000c29c58878 → nvme1n1p2 → S/N:511230818150000051

Cheers,
-Joe

2 Likes

Version 3.0.2 is out on Github.

If you are using automatic updates, you may still see an error message, but that problem should be fixed once the update has been applied.

To manually update, run the script with the '-update' switch and it will download and validate the new file.

New things:

  • Quick Start Guide
  • A few fixes
  • SMR Drive Checking
  • GPT Partition Checking
  • Total Data Read/Written for the entire Pool (up to 9.2 YB)
  • Total Data Written for individual drives (By last 30 days or current month)
  • Minor update for TrueNAS 13.3 Beta 1 for nvmecontrol

Note: Total Data Written (30 days) for the individual drives will at first match the current lifetime value; the second run of the script shows the actual figure. This is because a new set of values is added to the statistical data file, and the first entry becomes the baseline that gets subtracted out after the first run.
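
With made-up numbers, just to illustrate how the note works out:

First run:  lifetime data written = 48.20 TB → stored as the baseline, report shows 48.20 TB
Second run: lifetime data written = 48.35 TB → report shows 48.35 − 48.20 = 0.15 TB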

The Bad:

  • Dragonfish (24.04.0) does not run Spencer (now with pineapple) via the CRON. Waiting for a solution. Recommend renaming your spencer.py script to “spencer1.py” until the fix is discovered. Suspect it is a permissions issue within the CRON.
  • TrueNAS 13.3 and 24.04.0 both have smartmontools 7.4, however the GUI is still not configured to setup SMART Self-tests for NVMe drives. BUT Multi-Report will still run the Self-tests for you (some good in there).

As usual, please feel free to ask questions. I currently have no plans to release another Multi-Report update for a while. I am rewriting it from the ground up and will post it properly on GitHub. I'd like to do a survey on what features are and are not being used so I can clean up the script a whole lot.

2 Likes

I don’t know if any of the following will help any.

I checked my systems, as both run Multi_report from cron. Both systems have the scripts residing in an SMB dataset within a folder called scripts, so they are accessible from Windows (my laptop). One system enables the Spencer script and the other does not. The one that does run the script records an error but otherwise gives no indication. I think it might even be running the Spencer script, just not recording results, as the script cannot operate on spencer.log.

Along with some other output showing which lines in the script are generating the issue, the base error shown in /var/log/jobs is:
OSError: [Errno 30] Read-only file system: ‘/spencer.log’

Looking at the permissions of some of the files where the scripts reside:

admin@owen[~]$ ls -l /mnt/–>/scripts/spencer.log
-rw-rw-rw-+ 1 root root 280 Apr 17 13:24 /mnt/–>/scripts/spencer.log

admin@owen[~]$ ls -l /mnt/–>/scripts/multi_report.sh
-rwxr-xr-x+ 1 root root 420795 Apr 14 00:00 /mnt/–>/scripts/multi_report.sh

admin@owen[~]$ ls -l /mnt/–>/statisticalsmartdata.csv
-rwxrwxrwx+ 1 nobody builtin_guests 264537 May 12 00:04 /mnt/–>/scripts/statisticalsmartdata.csv

It appears that there is no execute bit set on the spencer.log file, so if the script is trying to do something that requires execute permission on that file, it will fail.

Spencer enable is set to true in the multi_report config file:

Spencer Integration

spencer_enable="true" # To call the Spencer.py script if "true" or "false" to not run the Spencer.py script.

Just so others are aware, @PhilD13 is trying to run Spencer.py on Dragonfish. If that is an incorrect statement, please toss me a message or send me an email.

@PhilD13 The spencer.log file is not meant to be executable; it is a log of what spencer.py does. Good hunting!

@joeschmuck is correct. Work just keeps getting in the way of digging into it more, and multi_report works fine without it.

So far I have found that the spencer.log file cannot be modified, as it is apparently read-only to the spencer.py script, as shown in the /var/log/jobs log, which reports the lines in the script where the errors occur.

Sorry for the bad news but v3.0.2 got stuck in a loop downloading the SMR script. It has been fixed in v3.0.3. If you did not upgrade to v3.0.2, then you will go directly to v3.0.3.

However, it is very possible you have a v3.0.2 job stuck in a loop. Run the script with the '-update' switch from the CLI (./multi_report.sh -update); this will grab the SMR file and stop the looping. Or you could reboot, but many people would not like to do that. Then, when the script runs again, it will pull down the good version of the script (v3.0.3) and fix the issue.

I wish I could blame this on something else, but now I know to wipe the test computer clean of all files instead of leaving one or two laying around; in this case, the SMR file was left because I had already downloaded it. I didn't see this issue before posting, and strangely enough, the error only presents itself when running from CRON. I guess it could be the solar flares messing with my brain?

Thanks Joe! I appreciate you putting those in!