System reboots randomly, could use some help desperately please!

techdan91 · June 3, 2024, 8:37pm

so ive made a few posts on issues ive been dealing with since i started my new journey installing truenas scale on my spare pc…so sorry for being new and asking a lot of questions, but i figured out a lot of my issues myself this past week, working like 10hrs a day for the past 7 days at least lol, im very dedicated…

ive finally got my windows 11 vm to be stable with my gpu passed through to host plex, im almost done transferring my 10tb of movie data to my smb share to use for the plex app so i dont have to mess with using the vm, and ive got all my configs set pretty close to perfect for how i want it…so i really dont want to have to wipe all my progress and start all over again for like the 5th time this week…

please ask for any info you need to help me diagnose my issue or if you can point me in the right direction…but ill try and post what i think is needed…

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvMAIN ISSUEvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
so my main issue is random reboots, but usually theyve been spread a few hours apart, so ive been able to get by and finish downloading the bulk of my media before tackling this issue more head-on…

during bootup, ive noticed an error that appears in the boot command phase, and it says something like “vfio: module verification failed signature and/or required key missing - tainting kernel”…and then the boot will freeze at that error line, but i can still access the gui as normal, until it decides to reboot a few hours later (sometimes sooner)…

after ive logged in to gui i also see in the jobs section some error where it failed to clone the truenas apps catalog and the truecharts catalog (those are the only two i have)…so idk whats going on there…

I also noticed in the audit log that after my system reboots, theres a failed token authentication, guessing its related but have no idea where to go from there

and during this latest reboot lol, ive just noticed as well on the configure setup screen theres a line at the top that says something about a missing VPD tag and assume eeprom or something…

any help will be appreciated, i will post my specs below and some screenshots…

nvidia gpu is isolated…no apps installed or running, one windows 11 vm with 16gb ram allotted…

version- Dragonfish-24.04.1.1

etorix · June 3, 2024, 9:10pm

A complete hardware list would help.

Is there an UPS? Is the PSU properly sized?
Was the system burned in? Memory checked? Cooling under load?

techdan91 · June 3, 2024, 9:18pm

well damn lol, yes there is a ups…the psu is 850w gold, not entirely sure what you mean by burnt in but i flashed a boot drive on a usb and installed on a 1tb nvme…i dont think ive done any ram test but they are 16gb sticks of corsair vengeance pro that i got new a year ago…under heavy load it seems to be topping out at 60c, i have a 240mm aio and pretty good air flow

etorix · June 3, 2024, 9:42pm

I mean thoroughly testing the system before deploying “in production”.
A few days of running MemTest would be a good start.

And listing the whole system. Motherboard? How many drives? How are they connected? (Any “PCIe SATA card” by any chance?) NIC?

techdan91 · June 3, 2024, 10:00pm

oh alright, yeah ill give it a shot soon then…

mobo: msi b660 tomahawk ddr4
drives: (4) 4tb sata hdds (3 wd: one red, 2 blue; plus one seagate) in a raidz1 zfs pool…an 18tb usb WD easystore, a 1tb WD m.2 as boot pool, and a 2tb 980 pro that im not using but its in there
nic: 2.5gb realtek

etorix · June 4, 2024, 6:24am

Is the 18tb WD easystore attached by USB? USB is not reliable enough for a permanently attached data drive.

Realtek NICs are known to collapse under high load.

But none of that is an obvious candidate for crashing the system and causing a reboot.

Stux · June 4, 2024, 7:00am

do you mean a water cooler radiator with a pump on the CPU?

sometimes, these can result in low airflow around the cpu where the VRMs and other devices are. These can overheat…

Also the chipsets.

Also, try turning off some sleep settings in the BIOS, sometimes there can be BIOS bugs when entering deeper idles.

Also, try the memory test, one reason why random reboots can happen is memory failures…

And another reason can be the power supply failing… sometimes they just begin to spontaneously cut the power.

techdan91 · June 4, 2024, 7:18am

huh, interesting…yeah shoulda went with air cooled from the start…

but okay ill check bios, didnt know they had sleep settings there too, thanks.

but alright will do…i feel like ive had an issue with this ram when i first got it, but it was in a different system…but yea now definitely gonna test it since i remembered that lol…

but i hope its not the psu…ill keep an eye out…would my ups have any kind of interference with the psu or power? its a brand new apc backup pro 1500

but can any of this explain the token authentication failure? im curious as to whats going on there if anyone knows about it?..like i said it always seems to be in the audit log right before the next authenticated sign in after the random reboots

but thanks again guys, ill def check my ram asap…

Stux · June 4, 2024, 7:22am

I assume that’s your login token expiring.

I set mine to 24 hours so I don’t have to continuously re-login.

Settings → Advanced → Access → Configure → Token Lifetime = 86400
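
The 86400 above is just 24 hours expressed in seconds. A tiny sketch of the conversion, assuming the Token Lifetime field takes seconds (which the 24 hours = 86400 pairing implies); the helper name is made up for illustration:

```python
def hours_to_seconds(hours: float) -> int:
    """Convert a lifetime in hours to whole seconds (hypothetical helper)."""
    return int(hours * 60 * 60)


# 24 hours, the value used in the post above
print(hours_to_seconds(24))  # 86400
```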

techdan91 · June 4, 2024, 7:25am

yeah that was like the first thing i changed day one lol…but i have it at the max-ish…maybe ill lower it?..but alrighty thanks though

techdan91 · June 7, 2024, 5:16pm

so im almost certain i figured out the cause of the problem…it was the RAM like you guys mentioned…

so i had 4 16gb corsair vengeance pros in…same speeds/voltage/clock…the only thing different about them is the version, one pair is 3.x and the other 4.x…so i guess the system didnt like that inconsistency, because i took the two 3.x’s out and havent had a reboot in 20hrs, which is the longest ive been able to run continuously…

so sucks i cant use 64gb but for now i really dont need that much…only have plex and one vm running but want to add another vm and a couple more apps, so hopefully it should be enough…only issue i have now is a disk holding my media has a lot of errors according to zfs, so i gotta do some learning on all that to understand it better…but it is an oooold data center refurbish lol…so im definitely going to save for a new 12tb to keep my data safer.

hope this helps someone else in the future
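
For anyone landing here with the same "disk has a lot of errors according to zfs" situation: the per-disk counters come from the READ/WRITE/CKSUM columns of `zpool status`. Below is a rough sketch that picks out the devices with non-zero counters from that output. The sample text, the pool name `tank`, the device names, and the function name are all made up for illustration; real output can have extra sections (scan status, spares, and so on) that this does not try to handle.

```python
# Made-up sample of `zpool status` output (not from the system in this thread).
SAMPLE = """
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdb     ONLINE       3     0    12
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text: str) -> dict[str, tuple[str, ...]]:
    """Return {name: (read, write, cksum)} for rows with any non-zero counter.

    Counters are kept as strings because large values may be abbreviated
    in the output (e.g. with a K suffix).
    """
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # device rows start after this header
            continue
        if not in_config:
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) < 5:
            continue
        name, _state, *counts = fields[:5]
        if any(count != "0" for count in counts):
            flagged[name] = tuple(counts)
    return flagged


print(devices_with_errors(SAMPLE))  # {'sdb': ('3', '0', '12')}
```

On a live system the text would come from running `zpool status` in the shell; the parsing above is only meant to show which columns to look at.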

neofusion · June 7, 2024, 6:12pm

You should still run memtest if you haven’t already. Multiple full passes.
Just removing a stick of RAM isn’t going to tell you if you have lingering memory issues or not.

What RAM profile do you use, an XMP one? If so, that’s typically discouraged in server use cases.

techdan91 · June 8, 2024, 10:54pm

yea im going to try and test all 4 sticks tomorrow…

and i was actually wondering if xmp might have been the issue as well…just weird cause its still on with just 32gb in dual channel and its been running 2 days straight…but i will feel 100% after i run all of them through memtest and hopefully find no errors…

but thank you for the tip, ill definitely turn it off regardless