Getting Bootloop on TrueNAS CORE

Demon4231 · April 4, 2024, 3:44am

A few days ago I got an alert that the system had an unexpected restart but nothing was wrong so I just kind of ignored it. The system is behind a UPS so the random restart wasn’t due to power loss. The next day I realized it was stuck in a boot loop with a message before restarting

fault code = supervisor read data, page not present
processor eflags = interrupt enabled, resume, IOPL = 8
curret process = 5 (txg_thread_enter)
trap number = 12
panic: page fault
cpuid = 8

The system would no longer boot so I tried installing TrueNAS Core on a new partition as an upgrade, but the same issue happened. I then tried doing a clean install and while I could get to the GUI as soon as I tried to import the pools the system would restart and go back to the same error. I tried following a few different solutions found on the forum like importing as readonly but it did the same thing. After reinstalling TrueNAS when I try to import the pool it says poolname not found, and when I try to list the pools it only shows boot-pool, unless its a clean install I can see the pool through the GUI. So maybe I’m just using the wrongs commands since I’m not very familiar with this.

I’ve had the system running for over a year without any problems so I’m not sure what happened since I didn’t do any updates or anything like that when it started happening. Below are my system specs. Its been a while since set it up but I’m pretty sure its setup as RAIDZ2, but could be RAIDZ1.

Does anyone have any ideas how to fix this?

What I tried :
Ran S.M.A.R.T. test on all drives and it passed without any errors
Ran MemTest and it passed without errors
Upgrading TrueNAS Core from previous installation
Installing TrueNAS Core from scratch
Importing pools as readonly

System Specs
Motherboard : MSI PRO B660M-A Micro-ATX
RAM : 2x 16GB DDR4 3600 CORSAIR Vengeance
CPU : Intel Core i5-12600K
Storage : 8x Seagate 10TB IronWolf SATA III, 1x 500GB SSD (for running TrueNAS)
Expansion cards : 1x 10GB PCI-E Network Card (onboard had issues), 1x M.2 PCIe 6 x SATA Adapter Card (motherboard doesn’t have enough SATA ports)

Davvo · April 4, 2024, 8:32am

Please elaborate.

Could also be the boot drive dying, but I would guess a memory-related issue… which is a bit strange since memtest did not detect any error.

Demon4231 · April 4, 2024, 3:33pm

Here is the SATA adapter I am using ORICO M.2 PCIe M Key to 6 x SATA (Amazon link)

I only have 2 sticks of memory so I tried using just one at a time, and had the same issue.

I was messing around with it and noticed it booted if I had 3 or more drives unplugged, which I’m going to guess is because it couldn’t import the pool. After plugging all the drives back in it booted up but it says “ONLINE (Unhealthy)”. All the drives are there and seem to be working, could it be “unhealthy” because of the adapter?

mav · April 4, 2024, 4:01pm

@Demon4231 We’d need to see much more of the panic message to say anything.

Considering you have desktop-class system with non-ECC RAM it may be a pool corruptions, so make sure to at least run good long memory test.

Demon4231 · April 4, 2024, 5:05pm

Here’s a picture I took of the error. Its kind of blurry since I had to record the screen but it should be mostly legible.

I did run MemTest for about 5 hours, so not sure if that is long enough or not.

Davvo · April 4, 2024, 5:27pm

This does not look something you should use.

mav · April 7, 2024, 1:55am

As I see, ZFS runs pool scrub, and I suppose in process hits some metadata block with valid checksum, but with block pointer containing some incorrect values, that causes panic in dedup code. It still may be a result of a pool corruption. You should be able to import the pool read-only and evacuate the data.

Stux · April 7, 2024, 2:12am

There is talk in this Reddit discussion about the controller overheating if there is no heatsink. Other versions on Amazon seem to have a heatsink. Maybe add one if it is g there already.

Demon4231 · April 7, 2024, 7:45pm

As I see, ZFS runs pool scrub, and I suppose in process hits some metadata block with valid checksum, but with block pointer containing some incorrect values, that causes panic in dedup code. It still may be a result of a pool corruption. You should be able to import the pool read-only and evacuate the data.

When I ran zpool status two of the drives had a 1 under the “UM” column. I cleared the errors and they haven’t come back since the system has been running.

There is talk in this Reddit discussion about the controller overheating if there is no heatsink. Other versions on Amazon seem to have a heatsink. Maybe add one if it is g there already.

That’s interesting and I haven’t thought about it overheating. The system is in my basement which stays pretty cool and has constant airflow, but I do have a fairly large heatsink on the CPU which covers the M.2 slot. I might try to move it to the other M.2 slot to see if that makes it any better. The drives themselves sit around 30℃.

Stux · April 7, 2024, 9:44pm

Does the asmedia 1166 chip on your m2 sata adapter have a heat sink?

Demon4231 · April 8, 2024, 1:13am

Unfortunately it does not. The motherboard came with a heatsink but its for a nvme drive so it won’t work.

Stux · April 8, 2024, 2:42am

Something like this

Demon4231 · April 8, 2024, 11:56pm

I can try getting something like that to put on it. Couldn’t hurt.