So I have a TrueNAS Core system that has been running with no issues for years. It is a VM on an ESXi host, and I have passed my HBA through to the VM, so there is no middleman involved.
Today, out of the blue, TrueNAS hard crashed and now just keeps boot looping. I did some testing and found that when I physically disconnect the drives in A2 (array 2), it boots with no issue and I can get into A1 (array 1). But the second I reconnect 8 or more disks of A2, boom, it fails. It looks like it will accept the disks as long as it doesn't see them as a pool, but the moment it has enough of them to recognize the pool, it panics and goes back into the boot loop.
Currently I have A1 running and am still using the NAS with no issues, but I have no idea what is up with A2 or how to get it back. Any help would be very much appreciated.
Below are pictures of the errors; I haven't found anything like this on the forum.
It might be a corrupted file system… space map specifically.
Could be caused by hardware or software…
The general approach we have recommended previously is to import the pool read-only (which does not require space map access), evacuate the data, and recreate the pool.
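From the CLI, that looks roughly like this (the pool name "tank" is just an example; substitute your own):

    # import read-only so ZFS never has to write to the damaged space maps
    zpool import -f -o readonly=on tank

    # then copy the data off to other storage; only destroy and recreate the
    # pool once you have verified the copy is complete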
As I pointed out earlier, going through the process listed above is exactly what triggers the issue; that is why you advised me to mount it read-only…
I have already imported the zpool read-only through the command line as you advised, and according to the CLI it is mounted and there. But I cannot see it under Pools, and I don't know how to access it to pull the data off.
When I go through the add/import pool section it only shows my other volume, since it sees that this one is already imported. That is why I am asking: how do I access a read-only pool that was imported through the CLI?
Not sure… I’ve never had the problem. Perhaps someone else has?
You might try exporting via the CLI and then reimporting via the web UI now that the pool is OK.
Otherwise, it's down to using the CLI tools.
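For example, assuming the pool is currently imported read-only from the CLI (use your own pool name):

    # cleanly export the pool from the command line
    zpool export A2

Then in the web UI (on Core it should be under Storage → Pools → Add → Import an existing pool) you can try importing it from there.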
Currently reviewing a Reddit thread; it seems someone else had the same issue.
I am curious, though, since none of the drives logged an error in TrueNAS and all of them report healthy, yet the array is dead. What on earth would have caused this?!?! What would cause ZFS to just go "Nah mate, I had a good life, see ya" and then die like this?!?
So I did some poking around Reddit and found out how to access it.
I will be writing this next part a bit oddly, just so others who face this issue can find it through Google and save themselves some headache.
If you suddenly have TrueNAS enter a boot loop due to a panic over the space map, read below.
From my analysis, the issue was caused by an overheating HBA card; the overheating HBA corrupted the space map. The fix is active cooling for the HBA, so just a PC fan attached to the card should suffice. It seems an overheating HBA can write bad metadata, resulting in this issue; it's a known problem according to others I found on Reddit.
To mount your pool read-only: zpool import -f -o readonly=on (POOL NAME HERE)
If you cannot find the name of your pool, run zpool import with no arguments.
This will list your pools; you can use either the name or the ID (the ID is a huge pile of numbers).
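Putting those two steps together, it looks roughly like this (the pool name A2 and the ID are just placeholders; use whatever zpool import shows for yours):

    # list pools that are available for import (shows name, ID and state)
    zpool import

    # import the pool read-only; -f forces it if ZFS thinks it is still in use
    zpool import -f -o readonly=on A2

    # or the same thing using the numeric ID instead of the name
    zpool import -f -o readonly=on 1234567890123456789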
Now, importing it over the CLI mounts it differently than through the GUI. It won't mount under /mnt (the GUI imports pools with an altroot of /mnt; a plain CLI import does not); instead it just drops it in the root directory. So in the console do cd / then ls, and boom, it will be right there under the name of the pool.
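If you want to double-check where it ended up, or you would rather have it under /mnt like the GUI-managed pools, something like this should work (A2 is again a placeholder name):

    # show where the pool's root dataset is mounted
    zfs get mountpoint A2

    # or re-import it with an altroot so it lands under /mnt like the GUI does
    zpool export A2
    zpool import -f -o readonly=on -R /mnt A2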
From there you need to do the recovery over the CLI. I just plugged in a 4 TB external HDD, made it a pool, and then used rsync to pull all the data changed since my most recent backup.
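For anyone who has not done this before, the rsync call looks roughly like this (the paths are placeholders: /A2 is the read-only pool from above, and "rescue" is whatever you named the external pool):

    # -a preserves permissions/timestamps, -v is verbose, -h shows human-readable sizes
    # the trailing slash on the source copies its contents into the destination
    rsync -avh --progress /A2/ /mnt/rescue/A2-recovery/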
From my research I cannot, for the life of me, find any way to recover from this, and looking through GitHub this is a long-known problem with ZFS. Sadly, the OpenZFS devs refuse to acknowledge the issue and refuse to fix it. So if you came here looking for a way to fix this and get the array back up… I hate to break it to you, but you are SOL. The ZFS devs don't care; this is just a risk of using ZFS.
Let this be a reminder:
ZFS AND RAID ARE NOT BACKUPS. STILL RUN BACKUPS.
P.S.
My comments on the devs are NOT aimed at the TrueNAS devs or iXsystems; they both do an excellent job. My comments are aimed at the devs who maintain OpenZFS itself.
When bad data is written (e.g. by non-ECC RAM), it's very hard for a file system to cover for these issues. It's difficult to test for, and there is always the possibility of two or more corruptions. The solution is well-behaved hardware. It's OK not to write data, but don't corrupt it.
The ZFS advantage is that it does not overwrite data in place… so it can often be recovered.
Thanks for taking the time to raise our awareness of a problem. Please define what "the long-known ZFS problem" is that the devs refuse to acknowledge or fix, so that we can understand the matter.