So, I rarely post on forums as I’m usually able to figure out my issues through a thorough web search, but this one is beyond me.
Basically, TrueNAS SCALE install crashes halfway through, but you can read a quick recap below.
I hope someone will be able to provide some guidance, I feel kind of lost right now…
(first post here, please don’t hesitate to tell me if I overlooked some good practice)
Quick history:
1. The installation was flawless the very first time (~6 months ago). I did some very basic initial config on TrueNAS, then had to reorganise some of my drives. When I tried to reinstall TrueNAS, I ran into weird ACPI errors (see attached picture) and the install process, depending on the flashing method, either could not get past them at all or proceeded anyway but crashed during the actual install.
2. I suspected BIOS compatibility issues, so I contacted Beelink to update my BIOS. After that, despite the ACPI bugs still showing up, the install successfully ran to the end, so I shut down via the GUI to put the mini PC back on its shelf.
3. However, upon restart, I now drop straight into the GRUB shell. No error shows up apart from "load kernel first" when I type the boot command.
4. I tried to reinstall again, but now I’m back to 1. as the install process crashes again.
The ACPI BIOS Errors I see when I boot the install media:
Nope, I assumed it was stable as this is an unmodified product (apart from the drives) from a well-known brand. Good point, I’ll run memtest86.
EDIT: @neofusion I just tried several times to run memtest86 (booting from Ventoy). Although no errors are displayed, it also seems to cut out halfway through: it reaches 50% of the tests every time, then the screen goes black shortly after.
I would think you are having an overheating issue with the RAM and/or the NVMe drives. Inside the system there is the 2.5″ drive mounting bay. If this bay is populated with an SSD, it could block airflow to the bottom memory and the SSD fan.
Any way to test that? I mean apart from removing the 2.5″ SSD?
During memtest86 tests the RAM temp was around 62°C.
Because if this is really the case, I have to go back to the drawing board for my hardware choice…
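One possible way to watch temperatures outside of memtest86 is to read the kernel's hwmon sensors from a Linux shell (a live USB or the installed TrueNAS system). The sketch below is only an illustration: the function names `read_hwmon_temps` and `over_threshold` are made up for this example, and whether the RAM sensor shows up under `/sys/class/hwmon` at all depends on the kernel and the hardware.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_celsius} for every temp*_input file under root."""
    temps = {}
    for hwmon in sorted(Path(root).glob("hwmon*")):
        name_file = hwmon / "name"
        # Fall back to the directory name (e.g. "hwmon0") if the chip has no name file.
        chip = name_file.read_text().strip() if name_file.exists() else hwmon.name
        for temp_file in sorted(hwmon.glob("temp*_input")):
            try:
                # The kernel reports temperatures in millidegrees Celsius.
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
            label_file = hwmon / temp_file.name.replace("_input", "_label")
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = temp_file.name[: -len("_input")]  # e.g. "temp1"
            temps[f"{chip}/{label}"] = millideg / 1000.0
    return temps


def over_threshold(temps, limit_c):
    """Keep only the sensors at or above limit_c degrees Celsius."""
    return {name: value for name, value in temps.items() if value >= limit_c}
```

Calling `over_threshold(read_hwmon_temps(), 60.0)` every few seconds while a stress test runs would show which sensors climb first. The 60 °C limit is just a placeholder taken from the readings mentioned in this thread, not a known safe limit for this RAM.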
Thank you guys for the support so far. Although not everything is working correctly yet, I have more insights to share.
Update
@PhilD13 I tried running memtest86 again with the 2.5″ SATA SSD removed and the case opened for better airflow. It reached test #7 this time (58%), the RAM reached 62°C, and it cut off. So I may indeed have a thermal issue.
@Fleshmauler I could reinstall successfully using a different device (different ACPI errors and bugs, plus the systemd failures from my very first picture, still showed up though). TrueNAS booted normally and I uploaded my previous configuration, but upon reboot I drop into the UEFI interactive shell.
Would 100% focus on RAM based on your memtest results. At worst see if you can somehow fit a fan to simply blow on the memory. Otherwise consider the option of hating yourself & manually configuring memory speed, timings, and voltage, if the BIOS lets you, until you find something stable. Always FUN.
I’ll try that tomorrow: opening the case completely and blowing cool air with a hairdryer on the RAM stick during another memtest.
But I’m wondering, how could this be related, given that everything works with no issue on the first boot of a freshly reinstalled system? Only rebooting doesn’t work.
No clue - but with memtest failing we at least have something to troubleshoot until we can get it to work. Until memory is proven stable all bets are out the window imo.
If it still doesn’t work after that? No clue, we’ll burn that bridge when we get there.
So, no underlying memory issue but a clear thermal constraint. I guess the question is now stress test vs. real life situation (like how often my home server would reach the RAM thermal threshold, and if there is anything to do about it).
Wow, that is beautiful! I guess next step is boot her up & see if everything is stable with your temporary solution; if it is - figure out how to make it permanent.
Ideally, replace it with a computer fan that uses a fraction of the W.
In theory, you could buy a big 200mm variant (or similar) and just rest it on top of the case.
Noctua sells versions that can be powered using USB, I’m sure other brands do as well.
Or just solder a USB connector onto an old 3-pin fan yourself! I love Noctua, but this is likely much cheaper than whatever they’ll sell it for. Another benefit: running at 5 V will make the crappy old fan even quieter.
I would say that system memory should be able to withstand much higher temperatures, and my strong suggestion would be to try another DIMM module (if nothing else, just to be sure that your DIMM is OK).
See:
P.S. I also hope that you have updated your BIOS/UEFI to the latest available version (it might be that your DIMM requires a BIOS update) and that your RAM uses optimal settings (if you overclocked the memory, there is a good chance that those settings are creating your problem). So if you have anything except default settings, reset that first and then test (without this innovative cooling technique).
Thanks for the feedback @neofusion and @Fleshmauler. I don’t see why the impossibility to reboot would be linked to a temperature issue, as the system doesn’t even have time to warm up, but if that really solves the problem I’ll definitely look into a nice DIY cooling solution (that would lead to an even more interesting-looking server).
You could play around with manually setting speeds, sub timings, and voltages instead… but I hate that so much that I’d rather keep your hot rod cooling solution than bother doing that.
But yeah, hard to say where the thermal prob is on the memory. Maybe a specific section of it is running way too hot & by the time the sensor hits 60°C (for example) it means that something was already running way too hot for way too long.
Thanks for your insight! This is what I thought as well, 62°C doesn’t seem too hot for a stressed computer. Guess I’m good to take apart an old laptop to salvage the RAM!
Regarding the BIOS, everything is default and I updated it last week after talking to Beelink support: they provided me with the latest version which is from 2023-12. But I’ll double check and reinitialise everything just to be extra sure.
I’ll be a bit busier in the coming days, but I’ll try this ASAP and keep you posted.
So, I randomly found where the reboot problem comes from: in the TrueNAS GUI, if I click Reboot, the system shuts down but, upon restart, the BIOS doesn’t detect the boot drive anymore, so it boots to the only remaining option, which is the UEFI shell. To get the drive back, I have to unplug it and plug it in again. However, if I click Shutdown and then manually boot the system by pressing the power button, there is no issue and the boot drive is correctly detected.
I have literally no clue why. If someone is able to shed some light on that, or give me a trail I can explore to understand and try to fix it, I’d appreciate it.
To do:
Swap the DDR5 RAM stick for an old DDR4 one, which apparently generates less heat (given the memtest86 results there is clearly an overheating issue; I don’t know if that’s common on mini PCs, but I may ask Beelink directly).
Would it be useful if I sent screenshots of the boot / shutdown / reboot logs? I can see some errors being mentioned, but the text flashes by pretty fast. Not sure this is relevant though.
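Instead of screenshots, one option might be to pull the messages from the systemd journal once the system is up. The following is a rough sketch, assuming `journalctl` is available and that the journal from the previous boot was kept on disk (this may not be the case on every install). The helper names and the keyword list are hypothetical and only there for illustration.

```python
import subprocess


def journal_command(boot_offset=-1, priority="err"):
    """Build a journalctl call: -b -1 selects the previous boot, -p err errors and worse."""
    return ["journalctl", "--no-pager", "-b", str(boot_offset), "-p", priority]


def collect_errors(boot_offset=-1, priority="err"):
    """Run journalctl and return its output lines, or [] if journalctl is not installed."""
    try:
        result = subprocess.run(
            journal_command(boot_offset, priority),
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return []
    return result.stdout.splitlines()


def filter_lines(lines, keywords=("acpi", "nvme", "usb")):
    """Keep lines mentioning any keyword (case-insensitive); the keywords are a guess."""
    lowered = tuple(k.lower() for k in keywords)
    return [line for line in lines if any(k in line.lower() for k in lowered)]
```

Something like `print("\n".join(filter_lines(collect_errors())))` would then give text that can be pasted into a post. Passing `boot_offset=0` would look at the current boot instead of the previous one.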