This morning, my TrueNAS SCALE server was really slow. I was able to connect to the web UI, but most of the information usually displayed was not showing, and I was unable to connect over SSH to see which process was using all the CPU (a 4-core Xeon at 3.2 GHz). So I decided to reboot the server using the reboot button in the GUI. The server never came back. It always stops at the GRUB screen. From there, I rebooted into the TrueNAS ISO, opened a shell and looked for the pools. All the pools are there and I can import them. There is no warning or error on any of the pools. If I leave the shell and try to do the upgrade from the ISO and select only one drive of the boot-pool, TrueNAS says that it won’t work as long as the other drive of the pool is not erased. If I select both disks of the pool, the message I get is that everything will be erased, which doesn’t seem like an upgrade. So, even though the pool is there, it seems the ISO somewhat recognizes it when I select only one disk for installation, but if I select both, it just wants to reinstall.
So, I have two questions. Are there commands I can run to fix the issue and allow GRUB to continue the boot process? The other question is about the ISO upgrade process. Is it normal that it says it will erase everything, or should it be gentler and say that it recognizes the boot-pool and will just do the update? I’ve always updated from the web GUI before, so I don’t know how the ISO is supposed to behave.
Any assistance or hint would be appreciated. Thank you.
What version of SCALE were you on, and what version of the ISO are you trying to use?
We really need hardware details to know what your current setup is. Don’t overwrite or install anything at this point.
Post screenshots of the error or where you are stuck.
If you can’t post images, browse some other threads and do the tutorial by the bot to get your forum trust level up. Then post images.
TrueNAS-Bot
Type this in a new reply and send to bring up the tutorial, if you haven’t done it already.
I think I was at version 24.10.1 but can’t really check now. The ISO is the one for 24.10.2.
It’s a good old HPE ML310e Gen8 v2 with an MR9271-8i adapter added. Here is a screenshot that shows the disks and pools that are in there. The boot-pool is on two SSDs (Crucial and WD) mirrored by ZFS. The setup has been running for more than a year and has been updated regularly. I just don’t know why it now stops at the GRUB welcome screen with no error message. I’ve imported the boot-pool just to see the status, and it’s as you can see below.
I’ve also attached captures of what the installer says, which seem to show that it detects the pool but still wants to erase everything. It may be because it is the same version (as I said, I can’t remember for sure, but I’m really under the impression that it was not up to 24.10.2).
Do you get a boot device choice with BIOS or UEFI? If your primary boot device fails, it doesn’t automatically boot the other in the mirror. Try selecting the second boot drive. Skip the ISO for now.
Up to now, I’ve not found a way to select a specific drive to boot from. So I’ve swapped the drives and got the same result. There is one last scenario to test, but I don’t expect much from it.
I have had failed boot drives in the past, and I was getting proper information about them. Removing the failed drive always fixed the issue. Now, I have no logs and no reason to think that one of the drives has failed (both are less than 12 months old), but I’m still inclined to test as it remains a possibility.
Any way of getting to the logs that are somewhere on the system?
You have to find the documentation for your server and look for something like IPMI or whatever HP calls the out of band BMC, Baseboard Management Controller. The BMC / IPMI may give clues as to the current hardware status and faults.
Most likely, that controller card is not fit for use with ZFS and TrueNAS unless it can be flashed to IT Mode.
Do you have a current backup of all your data on your TrueNAS setup?
What’s all the noise about HBA’s, and why can’t I use a RAID controller?
The boot disks are not on that card, so it has nothing to do with that. As I said, this server has been up and running with TrueNAS SCALE for over two years (I just checked the purchase date of the NLSAS drives) without issue. The RAID card has always been used in IT mode.
iLO is the IPMI on an HPE server. I can access it, and the hardware is telling me that everything is in order. The internal NLSAS controller also tells me the disks are OK. And even ZFS is telling me that the boot-pool has a normal status with no errors. So the issue seems to be something like the GRUB config file just not being there anymore. I’m not saying that is the case, but it feels like it.
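For anyone who wants to repeat the "is the boot-pool really clean?" check without reading the output by eye, here is a minimal Python sketch that parses the text printed by `zpool status`. It is only an illustration: the function names and the `SAMPLE` text (including the `sda3`/`sdb3` device names) are hypothetical and not taken from this server, and the parser assumes the usual `state:` / `NAME STATE READ WRITE CKSUM` / `errors:` layout. A clean result only confirms what ZFS reports; it says nothing about GRUB or the bootloader itself.

```python
# Hypothetical sample of `zpool status boot-pool` output, for illustration only.
SAMPLE = """\
  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0

errors: No known data errors
"""


def summarize_zpool_status(text):
    """Return (pool_state, errors_text, suspect_devices) from `zpool status` text."""
    state = None
    errors = None
    suspects = []
    in_config = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("state:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("errors:"):
            errors = line.split(":", 1)[1].strip()
            in_config = False
        elif line.startswith("NAME"):
            # Header row of the device table; device rows follow.
            in_config = True
        elif in_config and line:
            parts = line.split()
            if len(parts) >= 5:
                name, dev_state, counters = parts[0], parts[1], parts[2:5]
                # Flag any row that is not ONLINE or has non-zero error counters.
                if dev_state != "ONLINE" or any(c != "0" for c in counters):
                    suspects.append(name)
    return state, errors, suspects


def looks_healthy(text):
    """True only if the pool is ONLINE, reports no data errors, and no row is flagged."""
    state, errors, suspects = summarize_zpool_status(text)
    return state == "ONLINE" and errors == "No known data errors" and not suspects


if __name__ == "__main__":
    print(looks_healthy(SAMPLE))
```

In practice you would paste or pipe the real `zpool status` output from the installer shell into `looks_healthy` instead of using `SAMPLE`.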
I have a backup of version 24.10.0. I guess I can start with reinstallation and restoring the backup.
Possible. I checked and I was at 24.10.0.2. I’ve installed 24.10.0, done an update to 24.10.2, and am now looking to restore the backup. Hopefully everything will be in order. It is a backup from 4 days ago, so it’s not too bad.
As a follow-up, the NAS itself is back online. There was a disconnect for some apps, which was fixed by updating those that had been updated between the backup and the problem. The only thing is that I had a VM running. The config for this VM is gone, so I’ve recreated it and reattached the zvol. This works fine. However, I lost the config for the network bridge, which left the VM unable to talk to the TrueNAS SCALE server itself. I’m now trying to find the recipe to fix this without losing connectivity, as I’m remote from the server and can’t reach it any other way. I did this about a month ago but can’t remember the exact sequence of steps. I will continue to look for it.