Pool Offline After Every Reboot

mango · October 29, 2024, 1:03am

Hello, I installed a brand new TrueNAS Scale instance on Electric Eel 24.10-RC.2. Every time I reboot, one of my pools goes offline. I have reinstalled multiple times and tested different variations of disks, pools, and vdev layouts. The issue is not tied to any one specific disk, vdev type, or pool.

The only consistency I can find is that this happens to the ‘first pool’ seen by the system (excluding boot-pool). I say that because I noticed the offline pool has the ‘System Dataset Pool’ icon when it is in online. However, moving the System Dataset Pool to a separate pool does not cause the issue to follow to the new pool. For example, I changed this to the boot-pool since I have mirrored SSDs. But, it’s worth mentioning since the issue happens seemingly to what TN sees as the first pool.

When it happens, I click Export/Disconnect > only check confirm export > export. I am then able to import the pool and everything is fine until the next reboot.

I understand this is a Release Candidate version. I’m tempted to test it on a stable version, but the whole reason of me building this system is to leverage the native Docker functionality in EE.

I’m running bare metal with these specs:
Core i9-12900k
96GB DDR5 4000MT/s
ASUS Z790-V AX Prime
Intel Arc A380
LSI 9305-16i
Intel NC360T Dual 1G NIC
2xTB WD Reds CMR (beginning of large mirrored vdev pool)
2x1TB NVMe SSDs (SN770 / SN850x - app pool)
1x1TB SATA SSD (dump pool)
2x256GB SATA SSDs (boot pool)

Appreciate any suggestions!

sirTegyr · October 29, 2024, 5:17pm

I’d like to confirm that this is still the case for me after the update to ElectricEel-24.10.0

ABain · October 29, 2024, 5:17pm

have you filed a bug ticket? If not please do so and upload a debug using the private link which will be provided.

sirTegyr · October 29, 2024, 6:54pm

I have not personally done so, but I believe others have (I can’t post links, but there’s a reddit post with the title “Electric Eel - Pool Offline after Reboot” which is about the same issue)

neofusion · October 29, 2024, 6:59pm

That user replied with news about them trying a nightly and that everything was now fine.

I am not suggesting you try a nightly, just that there’s a fair risk of them never having posted that bug report.

wynter_ca · October 29, 2024, 11:16pm

Same issue for me, on a note I also have Optane drives as a special metadata vdev in a mirror.

sfatula · October 29, 2024, 11:28pm

Without a bug report, you likely won’t get an answer. It’s the process.

techdan91 · October 30, 2024, 12:06am

i was actually having this issue as well a few weeks ago or so…but it just magically stopped…it was happening to just one of my pools i believe at the time…strange but sorry idk what cause it to happen or how it stopped…hope you get a fix

mango · October 30, 2024, 2:47am

Thanks for all the replies! I’ll file a bug report.

FYI I installed the new Electric Eel stable update today and the issue persists.

Tassos_Tzezairlidis · November 5, 2024, 8:56pm

Same issue here. Brand new 24.10.0 installation on recommended hardware.
Pool 3 x MIRROR | 2 wide | 7.28 TiB goes offline after every reboot (either CLI, shell or GUI button.
I found a solution to bring it back online every time with no errors, but it is annoying. GUI pool export/disconnect. GUI pool import. Then all is good until next reboot.

mango · November 5, 2024, 10:21pm

I was just getting around to creating the bug report but it looks like you beat me to it. Do you know if there’s a way for me to add anything to that report? I don’t see an option to after creating an account.

Also I have the same thing. It is without fail on reboot, to the same pool (although it happened to different drives/pools in a previous install where the same drives still exist). As mentioned, it could be 1st pool seen issue. All I have to do is export WITHOUT deleting existing data, then import it. I’ve done this about 5 or 6 times now due to some new hardware installs and testing, and it works every time without issue.

Thankfully this is just an annoyance. Although, two things happen - the widget for this pool errors on out the Homepage even after re-import, and I had to modify the auto start behavior of some Docker containers that point to this pool. They go haywire if they can’t see the files in the pool.

James_Norman · November 6, 2024, 1:38pm

Same issue here my drives disappear after a reboot. I also created a bug report and uploaded debug logs. Trying nightlies did not work so I’m waiting for a fix.

Sgal · November 14, 2024, 10:25am

When I wipe the disk, it works for me!

donnyG · November 14, 2024, 9:26pm

Hi Guys, To cut to the chase here, I fixed this behavior on my rig with a little bios poking around. On initial setup, I limited my x10srh-cf mb to boot to efi os because that supposed to be newer and better and limited the mb to not boot to legacy os.
What turned out to be the problem is that the pci setting in mb bios should be oprom efi, and I had to turn off legacy bios in the attached sas3 3008 controller bios in the mb bios. So far, I’ve rebooted several times and it works like a beast.

The lesson here might be if you’re going legacy keep all bioses to legacy, and if you’re going efi, turn off legacy.

I was having this problem on truenas scale 24.10.0.2 that when I rebooted my pool got disconnected. I couldn’t add them back to their pool. I had to disconnect the dead pool in truenas and reimport the now exported pool. This stuff is very complicated. I have the impression this base firmware of the bios and sas3 was designed a long time ago and is set in stone, so you have to set it properly. I have the supermicro cse-847 case with two sas3 expanders, the 24 port one and the back 12 port one.
Hope this helps someone else. I noticed other users aren’t using sas.
Take care, Don.

cgfrost · November 27, 2024, 1:55pm

I’ve also been having this problem. I followed the advice in this bug report TrueNAS - Issues - iXsystems TrueNAS Jira

Rather than blowing it all away and starting again or spending a long time figuring out how to use the ‘blkid’ and ‘wipefs’ commands, I just detached the drives in my pool 1 at a time, fully wiped the drive to 0s, then reattached. Thankfully they are all SSDs so the re-silvering didn’t take long. This has fixed the issue for me. Thanks to those who raised the bugs on Jira. I hope my fix helps anyone else with this problem.

Initium · December 5, 2024, 3:46pm

Today installed on fresh Dell Poweredge X12 server with HBA355i and 4x 15,36 SAS SSD Drivers. and Intel Optane 900 as boot. Same issue here. Until Reboot/restart work flawless. but afterward is pool Offline and not usable.
When will be there solution?

bacon · December 5, 2024, 4:03pm

The problem is fixed in the next version. But I don’t think fix will cover pools that already have the issue - it prevents the issue from appearing in new pools. We can try to summon @HoneyBadger for more information

You have two options in my opinion:

Destroy the pool, wipe the disks (a “quick” wipe should be enough), and recreate the pool. And yes, this will destroy all data.
Carefully use wipefs to clear the file system markers from the offending disks. If done carefully, this will fix the pool without loosing any data.

If you want a guide on how to do Nr. 2), please post the output of the following commands for each of your disks. Replace YOURDISK with the first partition of each disk in the pool (for example /dev/sda1):

sudo blkid --probe /dev/YOURDISK
sudo wipefs --no-act /dev/YOURDISK

HoneyBadger · December 5, 2024, 4:31pm

To force a no-op, use sudo wipefs -n /dev/YOURDISK - while the default option for wipefs without parameters is currently “do nothing” it’s always better to explicitly specify that in case behavior changes upstream.

@Initium @mango Happy to help provide the details here.

bacon · December 5, 2024, 4:56pm

I edited the command to add the more explicit long option --no-act. Just to be super safe. Also I added that the commands must be done on partitions - not on the disks directly.

Initium · December 5, 2024, 5:12pm

What version of Truenas Scale you mean if this issue is here already 3 months?
Because it start to be backup server an have no real data then can test anything.
Can you please provide exact commands what need to be done for option 1 and 2?