Extremely unstable system

Hey all, I’ve been experiencing some major issues with my server over the past month and cannot figure out why the system keeps crashing/shutting down completely at random, sometimes after a few days, other times within 5 minutes. The fans run and the network lights flash, but I cannot open the web UI or shut down through a command, and I need to hard restart the computer. I made a post similar to this in the past, and it was solved by simply reseating the RAM; I had no issues for months afterwards and almost 40 days of uptime, only shutting down to move equipment and update. Initially I thought it was the memory again, so I tried reseating it again with no success, then ran memtest86 and it passed. Then I remembered that people had trouble with cheap Ethernet cards; mine was from eBay (some unknown brand). Removing that and running the motherboard Ethernet still brought no success. I’ve downloaded a few of the current debug reports but have no idea what I’m looking for or where to start. I also ran a BIOS drive test on the only drive I could select, which was the cache, but knew that was probably going to pass.

System
System Version: TrueNAS-Scale-22.12.3.2
AMD Ryzen 5 5600
32GB 2x16GB Kingston Fury DDR4-3200
MSI B550 Gaming Gen 3 Mobo
gt 710
2 x 2TB WD Red (Mirror)
240GB WD Green Sata SSD (Boot)
PNY 500GB M.2 (Cache)
EBay 2.5gbe Network (wgetech W2511-SR)
Seasonic X series 650W


First I’d like to say that it could be many different things causing the issues so you will need to be patient while testing your system.

Please clarify. Update what exactly?

If you updated TrueNAS and the system started to fail, have you rolled back to the previous version to rule that out as the problem?

System stability is very difficult to troubleshoot if you have nothing really to start with. Post some of the debug reports you have. Maybe we can find a clue in the data.

If you are running VMs, turn them off. Try to identify if there is any common thing going on such as a file transfer, a scrub, you get the point.
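One way to pin down when each crash happens, so it can be lined up against scrubs, transfers or scheduled tests, is a small heartbeat file. The following is only a rough sketch, not a TrueNAS feature: the function names, the file path you pass in, and the 60 second interval are all assumptions made for illustration.

```python
import os
import time
from datetime import datetime


def append_heartbeat(path):
    """Append the current local time to the heartbeat file and force it to disk."""
    with open(path, "a") as f:
        f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())  # so the line survives a sudden power loss


def run_heartbeat(path, interval_seconds=60):
    """Write one heartbeat per interval until the process (or the machine) stops."""
    while True:
        append_heartbeat(path)
        time.sleep(interval_seconds)


def load_heartbeats(path):
    """Read the heartbeat file back, skipping blank or half-written lines."""
    beats = []
    with open(path) as f:
        for line in f:
            try:
                beats.append(datetime.fromisoformat(line.strip()))
            except ValueError:
                continue  # e.g. a line truncated by the crash itself
    return beats


def find_gaps(beats, max_gap_seconds=180):
    """Return (last_seen, next_seen) pairs where the heartbeat went silent."""
    return [
        (a, b)
        for a, b in zip(beats, beats[1:])
        if (b - a).total_seconds() > max_gap_seconds
    ]
```

After a crash, the first timestamp of each pair returned by `find_gaps(load_heartbeats(path))` is the last moment the system was known to be alive, which can then be compared with whatever was scheduled around that time.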

While moving a component, it is very possible that a screw is rolling around or the motherboard is not fully secured and a short happens. Wires become loose, card edge connectors wiggle just a little too much.

Really inspect the entire computer chassis. Look for hardware mounting issues for example.

While retesting your system, make sure the case is fully closed up so the airflow remains the same. Sometimes the problem is airflow/cooling. Also ensure all the same hardware is completely plugged in; for example, the hard drives should be drawing power. You want to duplicate the failing conditions to the best of your ability. The power supply could be failing.
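If you want numbers for the cooling side, Linux exposes sensor readings under /sys/class/hwmon, with temperatures stored in millidegrees Celsius. The sketch below is a minimal example only; it assumes your board's sensors actually show up there, and the function names are made up for illustration.

```python
from pathlib import Path


def millideg_to_celsius(raw):
    """Convert a hwmon temperature string (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/tempN_input": degrees_c} for every readable sensor."""
    base = Path(root)
    if not base.is_dir():
        return {}  # not Linux, or no hwmon support
    readings = {}
    for sensor in sorted(base.glob("hwmon*/temp*_input")):
        try:
            name_file = sensor.parent / "name"
            chip = name_file.read_text().strip() if name_file.exists() else sensor.parent.name
            readings[f"{chip}/{sensor.name}"] = millideg_to_celsius(sensor.read_text())
        except (OSError, ValueError):
            continue  # some sensors refuse reads or return garbage
    return readings


if __name__ == "__main__":
    for label, temp in read_hwmon_temps().items():
        print(f"{label}: {temp:.1f} C")
```

Running it with the case closed and under load, then comparing against an idle reading, gives a rough idea whether heat is building up before a crash.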

If all of that is good, run an extended CPU stress test such as prime95, or one of the several others out there. Run it for a good 6 hours so it all gets good and hot. If it passes, then you have 1 test complete.

Did you run the Memtest86 with the case closed? If yes, two tests completed.

When the power is on, if the power supply has a fan, is it spinning?

This being a gaming board, have you overclocked anything? If yes, run the board using the BIOS default setup. If you think it is the CPU, you could underclock it slightly to see if that solves the problem. I have had to do this for some people.

These are a few of many things you may need to try. You could also try to bootstrap Ubuntu Live and just let that run for a few days, see if it fails. This may help isolate the problem to software.

Is this a different issue to “TrueNAS Scale Crashing every few days - #4 by Davvo”?

Do you have the latest mobo bios? https://www.msi.com/Motherboard/B550-GAMING-GEN3/support

Disable anything that you’re not using in the BIOS, remove any overclocks, ensure you have no temperature issues and disconnect any non-essential hardware while trying to determine what is wrong.

What chipset is the EBay 2.5gbe Network (wgetech W2511-SR)?

I shut down the system twice before the issue started: once to move the system to the side so a contractor could access a section of a wall, and once to reboot after I updated Plex when I thought I had issues, but it turned out to be my TV. I don’t have many applications or VMs running, only Plex, a Terraria server and Netdata, all through the native applications. I’ve stopped Plex and the Terraria server and the system still shuts down. Looking at the BIOS, everything is the same as in my initial setup: no XMP profile for the RAM, and untouched CPU settings, which are all set to standard or off where available with regard to overclocking the CPU. I’ve completely removed the eBay network card and current server uptime is 1 day. There is a S.M.A.R.T. test scheduled for this coming Monday (4 days away for me), and if my system lasts that long before shutting down, I will see if that is causing my issue.

How would I go about uploading the debug logs for you to look at? Do I just dump the entire file or is there a specific way?

In the meantime I might run a CPU stress test and post the results.

Yeah, I think it’s a different issue, though it’s definitely presenting similarly. I initially thought it was the same, which is why I followed the steps from the previous thread: reseat, memtest, disable XMP. None of that has solved my issues, which is why I am thinking it’s something new. Unless the reseating of the memory didn’t actually fix anything and only made me think it was the solution; but then again, I never had any problems with the machine until recently, and I cannot think of any major changes that have been made.

I think the eBay network card is a Realtek RTL8125BG, judging by the small writing on the chip under the heatsink.

I had a similar issue where the server would reboot on a regular basis, sometimes after 30 minutes, and the max runtime was 4 hours. I connected a monitor and it kept showing fatal errors with Ethernet.

I had a PCIe 2.5GbE card installed. This had a Realtek chipset, and the general consensus is that TrueNAS doesn’t like Realtek Ethernet chipsets, so I bought a replacement from eBay with an Intel chipset. Problem has gone.

Do one thing at a time. You are attempting to do two things if you run the stress test as well. If you removed the NIC and the system appears to be running fine, let it run. If it fails then you know the NIC was not the issue, if it passes then the NIC was the issue.

What motherboard BIOS is installed?

Update: I’ve left the system running and I haven’t lost access to the UI, but the system says it has been running since 0700 and it’s currently 2200 where I live, so 15 hours. The system should have been running continuously for 2+ days now. Could it be that when whatever is causing the crash occurs, the network card could not ‘reset’ properly when the system automatically booted back up, causing my access issues? That answers part of the puzzle, but does not explain why the system is crashing to begin with.

AMI BIOS
7B86vP6

You are focused on the wrong thing for the troubleshooting effort. My first question would be “Why is the system saying it has been running for 15 hours when it should be over 48 hours?” To me that indicates the system was rebooted. Maybe you rebooted it and didn’t tell us, but at face value, that would be my concern, not whether the NIC was causing the UI to be inaccessible. That question comes after finding out why the system rebooted, even if it rebooted and the system was still available.
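The uptime arithmetic is easy to check by hand or with a few lines of code. This is only a sketch: on Linux the first field of /proc/uptime is the number of seconds since boot, while the function names, the sample text and the calendar date below are hypothetical and chosen to mirror the 0700/2200 figures reported above.

```python
from datetime import datetime, timedelta


def parse_uptime_seconds(proc_uptime_text):
    """Return the first field of /proc/uptime: seconds since boot."""
    return float(proc_uptime_text.split()[0])


def boot_time(now, uptime_seconds):
    """Work backwards from the current time to the moment of the last boot."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_since(last_known_boot, now, uptime_seconds, tolerance_seconds=120):
    """True if the last boot is later than the boot you actually know about."""
    drift = boot_time(now, uptime_seconds) - last_known_boot
    return drift > timedelta(seconds=tolerance_seconds)


# Hypothetical sample: 54000 s (15 hours) of uptime, read at 22:00.
# On the server itself you would use open("/proc/uptime").read() instead.
sample = "54000.00 100000.00"
now = datetime(2025, 1, 3, 22, 0)  # made-up date, only the clock time matters
print(boot_time(now, parse_uptime_seconds(sample)))  # 2025-01-03 07:00:00
```

If the computed boot time is later than the last boot you performed yourself, the machine restarted on its own, and that restart is the event to investigate first.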

So a two-year-old BIOS, multiple NICs, no communication of what you have disabled, etc. Update your system, disable non-essential hardware and start being systematic about your troubleshooting, otherwise you’re just throwing darts.

Just checked this morning and the system has crashed and is unreachable again.

It’s possible I had a power outage or something, as it was stormy not far from where I live, and the system is set to boot to last state, so if there was a power outage the system would have rebooted. But I can assure you I did not reboot the system.

So far this is what has been tested:

  • Memory with memtest86 and it passed
  • stopping all services, and the system still crashed
  • removal of Realtek NIC, system still crashed

The next step I am going to take is to update the BIOS, at the suggestion of @Okedokey. I am trying to remember to take one step at a time. I will give you an update on what BIOS settings are running so I’m not

and can be more helpful to the people that are trying to help me.

Update:

I’ve installed the newest BIOS, AMI BIOS 7B86vP9, released 2025-04-11.

The only BIOS setting I have changed was disabling Secure Boot so TrueNAS could boot up.

  • ECO mode is disabled
  • precision boost overdrive is disabled
  • XMP profiles are also disabled
  • All other settings are disabled or automatic.

They were all set like this by default. If there is any other BIOS setting I should change, let me know. But for now, I will wait and monitor the system to see if it crashes/shuts down.

I hope that the BIOS update fixes the problem. If the system remains stable and running for 1 un-interrupted month (no power off or reboots), you can consider you discovered the BIOS was the issue and it is now fixed. Why 1 month? Some problems take time to manifest such as memory leaks. I know that isn’t your problem here but it is a rule I use.

Best of luck to you!

Thank you and all that helped. :grinning:

I would:

  • Download and save your TN configuration file. Ensure you have it somewhere accessible and not on the TN Server.
  • Follow pages 8 - 13 of the manual to ensure all power and other cables are firmly installed. Make sure there are no unreasonable mechanical forces on any cables.
  • Check you have a serviceable CMOS battery, or replace it.

BIOS SETTINGS

  • Save current BIOS Settings (use the Save Overclocking Profile).
  • Update the BIOS via the “Flash BIOS Button” function* (p. 45). This will ensure you’ve practiced and have the means to recover from any BIOS corruption. You’ll be covering two things this way, an update and a recovery solution in one.
  • Reset all settings to default (F6), save and restart
  • Enter the UEFI settings, check the boot settings are correct, save and restart.

Enter TN GUI and check you are still with a working system.

FURTHER TROUBLESHOOTING:

The following steps will disable any peripheral settings (hardware usually) that may be either malfunctioning or incompatible:

  • Disable the Parallel Port
  • Enable the Serial Port if you plan on SSH or similar via a serial cable, otherwise disable
  • Enable Smart Fan Mode (BIOS>HARDWARE MONITOR)
  • Remove one of the RAM modules and clean the edge connectors with IPA or a damp tissue. Be gentle; IPA or DeoxIT works best. If you have the latter, spray a small amount in both RAM slots.
  • Insert a single RAM stick, remove it, and reinstall it. This helps remove any oxides or dirt on the conductive parts. Make sure you have inserted the RAM modules in the correct bank (see p. 26 of the manual).
  • Make sure your GPU, if you have one, or other full-bandwidth PCIe adapter card is in PCI_E1 wherever possible.
  • Ensure the CPU fan is connected to CPU_FAN1 (top of board, left of the RAM modules).
  • Consider physically disconnecting any LED strips you may have connected.

Log in to the TN web GUI and use TN for a while to see how the minimum settings and hardware operate.
If you get a restart, swap the RAM. If it remains stable after a few days, add the other RAM module and continue.
If it is still unstable, you need to check the hard disk drives and follow other procedures.

*Please note your motherboard does have a “Flash BIOS Button” for easy BIOS recovery. This button allows you to flash a new BIOS version or revert to a previous one using a USB drive, even without a CPU, memory, or graphics card installed. If this is your only computer, I’d recommend working out how to use this function and preparing a USB stick for this purpose.

The motherboard, of which you helpfully linked the manual, doesn’t have a parallel port or a serial port. In fact I doubt any motherboard released in the last 15 years has had a parallel port. Serial ports are a thing on server motherboards; this isn’t one of them.

Edit, the above was incorrect, the board has internal connectors.

Even if you have one, the serial port does not need to be enabled (or indeed exist) for one to be able to ssh in. ssh is also not required when you do want to use a serial port for console access.

The board has both serial and parallel headers. See p. 35. I also don’t assume what the OP may be using.

I don’t have any Ryzen systems and am not that up on the series, but I see a lot of “Ryzen 5 5600x systems random crash/reboot/freezing/stuttering” type of results when searching on the processor. Isn’t this one of the Ryzen chips that just did not work well?

Internal headers, I missed those, my mistake.
You are quite right.

This is typically fixed by changing the way the PSU handles idle current states. You set “Power Supply Idle Control” in the BIOS to “Typical Current Idle”.

That fixes many, but not all instability issues and is an obvious first step.

After that comes the usual battery of tests. Bad RAM, bad PSU, bad CPU (bent pins?) and so on.