Pools OFFLINE

Hi,

This is not the first time I've had this problem. The issue is fixed by a reboot where I shut the server down and then boot it again, but not by a “restart via TrueNAS”. What can I do to understand and prevent this problem? 🙂

First I got “The web interface could not be accessed”. Then the web interface became reachable again, but only via IPv6, not on the usual IPv4 address.

OS Version: TrueNAS-SCALE-23.10.2
HBA: LSI 9302-8i

Please add the rest of your hardware to your post, and post the output of zpool status.
Since you distinguish between the two kinds of reboot: are you running TrueNAS virtualized?

I agree: you haven't provided enough information about your configuration for us to be able to help you.

[screenshot]

No, TrueNAS is not virtualized.

zpool status output (but the pools are “working” now, since I rebooted the server):

Wait for it to fail again and save the zpool status output at that point. Also check what lsblk shows then.
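If it helps, here is a small shell sketch that captures all of that in one file while the pools are offline, before you reboot. It is a hypothetical helper: the function names run and collect_snapshot and the output file name are made up, and it only wraps standard commands (zpool status -v, lsblk, dmesg -T). I haven't tested it on SCALE 23.10, so treat it as a starting point. Run it as root from a directory that is not on the affected pools, for example root's home directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save the storage state to one text file while the
# pools are OFFLINE, before rebooting (the dmesg buffer is lost on reboot).

# Run one command, preceded by a header line. Errors (missing tool,
# permission denied) are written to the file instead of aborting.
run() {
  echo "### $*"
  "$@" 2>&1 || echo "(command failed: $*)"
  echo
}

collect_snapshot() {
  # Default file name is timestamped so repeated failures don't overwrite each other
  local out="${1:-pool-offline-$(date +%Y%m%d-%H%M%S).txt}"
  {
    run zpool status -v
    # TRAN shows the transport (e.g. sas, sata, nvme), which can help tell
    # HBA-attached disks apart from onboard SATA and NVMe devices
    run lsblk -o NAME,SIZE,MODEL,SERIAL,TRAN
    run dmesg -T
  } > "$out"
  echo "Saved to $out"
}

collect_snapshot "$@"
```

Comparing one of these files against the same output taken after a clean boot should show exactly which devices disappeared.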

HBAs are not my wheelhouse, but maybe the card needs more cooling and is overheating?

Judging from your screenshots, my best guess is that your HBA drops out and all of its disks are gone with it. However, your NVMe drive for the apps pool, which is not connected to the HBA, also seems to drop out.

It would be interesting to scan the logs (dmesg, perhaps?) for clues.
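A rough sketch for that: filter the kernel log for messages from mpt3sas (the Linux driver for LSI SAS3 HBAs such as the 9302) and nvme, plus some generic error keywords. The function name scan_kernel_log and the keyword list are my own choices and will produce some noise. Passing -1 reads the previous boot through journalctl, which only works if the journal is kept across reboots on your system; I'm not certain that is the case on SCALE, so check that first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show kernel messages mentioning the HBA driver
# (mpt3sas), NVMe, or common disk/link error keywords.
# Usage: scan_kernel_log [boot]   boot: 0 = current boot (default), -1 = previous boot
PATTERN='mpt3sas|nvme|I/O error|reset|timeout|offline|critical'

scan_kernel_log() {
  local boot="${1:-0}"
  if command -v journalctl >/dev/null 2>&1; then
    # -k: kernel messages only; -b: boot offset; -o short-iso: ISO timestamps
    journalctl -k -b "$boot" -o short-iso --no-pager 2>/dev/null | grep -iE "$PATTERN"
  else
    # Fallback: dmesg only covers the current boot
    dmesg -T 2>/dev/null | grep -iE "$PATTERN"
  fi
}

scan_kernel_log "$@"
```

What to look for is whether the mpt3sas and nvme messages start at the same timestamp. If both controllers report trouble at the same moment, that points more towards something shared (power, CPU/platform stability) than towards the HBA alone.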

Just to make sure I understand: does this sometimes happen after boot, or while the machine is running, out of the blue?

First off, I would suggest upgrading to the latest Dragonfish release before diagnosing anything further.

It happens “out of the blue”.

I’m running the latest Bluefin release. Do you think Dragonfish is a better option? I’m not familiar with the differences between the two.

Indeed, the 2 SSDs are connected to the motherboard, and the HBA handles only 8 HDDs (it has 8 ports); the other 8 HDDs are connected directly to the motherboard.

The only reason this would happen is the firmware resetting and the OS losing access to the disks. The cause could be environmental, hardware, or software.

Upgrading would address the software piece, obviously. Since we are not seeing reports like that, it's a starting point. It's also easier to diagnose problems and get help on up-to-date software.

OK, so I don’t need to worry about switching to Dragonfish (since it’s considered “BETA”)? Or should I back up some data first?

Dragonfish is not beta; it's a release version (24.04.1.1). Several thousand people are already running that version.

I would not upgrade to Dragonfish whilst you have a problem unless you have reasonable evidence that such an upgrade would fix the problem.

As a general rule, changing things whilst trying to diagnose a problem simply confuses matters and makes it more difficult to determine the root cause.


Is the 9302-8i on the latest firmware and in IT mode? There are some bugs in older firmware versions.

Since your CPU is fairly new, I would suggest not only upgrading to Dragonfish, but also looking into whether you might be affected by some of the potential causes of the i9-14900K stability issues. I can’t speak to these in great detail, but they may be a contributing factor. Also, make sure your HBA is getting sufficient cooling, especially since you seem to be using a closed-loop liquid cooler, so your case fans might not be pushing as much air as expected.

This is where I bought my card; I think it’s already in IT mode. I don’t know about the firmware. How can I check it, and should I?

Mode Informatique. Cute 🙂

Not part of your issue, but IT stands for Initiator Target in this case, not Information Technology.

Regarding not updating: pragmatically, you’re going to get better results if you can reproduce the issue on the current software. And if you can’t, which is quite likely, then your issue is fixed.

You should at least update to Cobia, which is 1) on the upgrade path to Dragonfish, and 2) quite stable, I believe.

And depending on your usage, it’s fairly easy to roll back to a previous boot environment.

Definitely check the cooling on your HBA. Passively cooled HBAs need strong airflow; make sure you have a fan blowing over the heatsink.

One of the guides is here:

I’ve seen quite a few systems in the forums whose problems were fixed once their firmware was brought up to date.


Hey,

I’m having the problem right now. Before rebooting, I ran the commands you asked for:

[screenshot]

[screenshot]

Update: I'm having the problem for the second time today…
[screenshot]
The first time was at 07:16.

OK, this is the third time today I've had this problem…
[screenshot]
It’s becoming kind of serious now! 😅

For your HBA, the command would be:

sas3flash -list

(though I might be mistaken and your LSI card might use sas2flash; give that a try if sas3flash returns nothing useful)