System shut down irregularly and won't boot properly anymore

dandrea · February 27, 2025, 3:02am

TrueNAS Scale 24.10.

The system’s was off this morning, seemingly shut down by itself or there was a power outtage. When I powered it on, i got the following message on monitor, instead of the typical numbered menu:

Websocket client error: ConnectionResetError(104, 'Connection reset by peer')

Unknown middleware error: ClientException('Websocket connection closed with code=None, reason=None') Press Enter to open shell

If I try to connect to the web admin, i get this message:

Connecting to TrueNAS … Make sure the TrueNAS system is powered on
and connected to the network.

I ssh’d the system and I can access it. Couldn’t figure much by myself from the logs in /var/log/

And I got a zpool error:

zpool status -v boot-pool                                                                                                                                        ok | root@ada | 09:34:36
  pool: boot-pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:08 with 2 errors on Wed Feb 26 09:27:56 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdb3      ONLINE       0     0 1.42K

errors: Permanent errors have been detected in the following files:

        /var/log/journal/011b222bfebe4950a7b3e97ba3bab50c/system@00062f0a21456006-8daad9cba5bfa62d.journal~

Pool data-pool seems unharmed.

In my layman view, that’s just a log error that wouldn’t justify the webadmin going haywire, but then again, that’s an error (and a permanent one) and system is behaving like that. The smartctl command shows nothing wrong with the SSD.

All I know is there a problem with the boot-pool in my SSD, but it’s pretty hard to pinpoint it’s exact extension and if there’s a fix for this.

Any tip is appreciated.

SmallBarky · February 27, 2025, 3:35am

Try booting to a previous Boot Environment, if you have one. You should see a list in GRUB when booting.

Protopia · February 27, 2025, 9:03am

Could be your boot device has reached end of life. What is your boot device?
If you get corruption on your boot device then simply reinstall TrueNAS of the right version and restore your configuration file. You do have a backup of your configuration file?
A temporary solution is to boot to a previous installation (if you have upgraded since you first installed) as a TEMPORARY solution to get you running for a short while and to allow you to backup your configuration file if you need to. But this is NOT a permanent solution, particularly if you have no idea why you have over 1400 files corrupted with a risk of more becoming corrupted at any future time.

dandrea · February 28, 2025, 5:11am

Hello all, thanks for the inputs.

After taking a step away from the problem, I figured it out.

No, the drive if far from end of life. I had a power outage and it compromised a single file in the /var/log/journal folder. I was kind of suspicious it could be the culprit because journal service was up and seemingly ok, and something else should be broken. Removing the ofending file — actually all the files in the folder — and scrubbing again the pool then restarting the journal service allowed middlewared to be restarted as well and the system was normal again.

Lesson learned: journal files can be big deal for TrueNAS.

Arwen · February 28, 2025, 7:58pm

Glad you were able to solve your problem.

One thing that people with less experience on ZFS confuse, is that ZFS can’t corrupt an existing file on power loss. (Or OS crash, or hardware failure unrelated to storage…)

Sun Microsystems purposefully designed ZFS with power loss protections in software, using Copy On Write. An update to an existing file either succeeds or does not, no partial changes. The writing program MIGHT need more written to make a complete update, but that would not show up in ZFS as a “Permanent error”.

Of course, hardware failures can cause existing data to be lost. Like single disk pools that have a bad block. Or Non-ECC memory causing a bit flip. But, ZFS was originally designed and used on Enterprise Data Center computers, (by Sun Microsystems). All of which include ECC memory and can have redundant storage.

What I am trying to say, is that ZFS detected a real bad block in the file. Your removal of the bad file and allowing the OS / Middleware to recreate a new version is an acceptable fix if the file contents was not needed. SATA devices do block sparing on write of bad blocks, so your fix might be permanent. Meaning that bad block won’t show up again because it was likely spared out.

Thinking about this a bit, I think I will start a thread on single boot device installations, that then use “copies=2” to reduce situations like this.

Thanks for following up. It is always nice to see the conclusion.

dandrea · February 28, 2025, 10:04pm

Hello! Thanks for taking interest in my case.

As a matter of further information to whoever may care:

I’ve said it was a power outage because 1) I had a power outage and 2) it was suggested as a possible cause by ChatGPT (and yes, I know it can lie happily so I tread carefully on its responses). We arrived at this conclusion after ruling out a failing device by long-testing it and getting zero errors and having a 92% remaining lifetime from SMART.

Based on your response, I decided to run further tests using webadmin tools and this:

/ # badblocks -sv /dev/sda
Checking blocks 0 to 234431063
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

I hope this might be of any use.

Arwen · February 28, 2025, 11:23pm

Great. It does look like the bad block was spared out.

Or, possibly that the block was written with a either a bad checksum or bad data block due to a bit flip caused by Non-ECC RAM. I say this because the error was checksum, and not read error. In which case, no block needed sparing out.

ZFS verifies the checksum of all blocks read. And will report such as needed. If no redundancy is available, then it likely ends up as a permanent error.

I say “likely ends up as …” because some checksum errors are due to data path problems. Bad cables, disk controller chips over heating, loose data cable connections, etc… So if the problem returns, you can check those items.

edisondotme · March 1, 2025, 12:19am

I’m interested in implementing this as I have a deployment that frequently suffers power outages.

Do you have more specific guidelines on implementing your suggestion?

joeschmuck · March 1, 2025, 1:02am

If you have ftequent power outages then you need a quality UPS and to ensure the setup does an orderly shutdown upon power loss, then powers on automatically when it comes back.

Just saying.

Arwen · March 1, 2025, 4:38am

There are NO situations where a power loss will lead ZFS to loose existing data.

As I said, it has to be a different cause, like hardware fault. With lots of TrueNAS SOHO users having consumer hardware, that is likely the cause. For example, Non-ECC RAM, storage path issues, (loose cables), or less than ideal storage devices, (cheap SSDs or using SMR HDDs).

All that said, power losses or OS crashes can cause data in flight in process to be written, to be lost. That applies to all file systems.

Log in as root, then use;
zfs set copies=2 boot-pool
Log out, you are done.

Next update should use 2 copies. But the existing files will remain with 1 copy.

Understand now, that means updates take up twice the space. So smaller boot devices, like 16GByte ones, will store less boot environments. Possibly even 16GB might not be suitable. That is why I mentioned larger boot devices in;

dandrea · March 4, 2025, 12:14am

I’ve just get an UPS. You should consider this.

dandrea · March 4, 2025, 12:16am

I ended up taking the absolutely over the top measure: I got an UPS and a pair of small SSDs to create a mirrored boot-pool. I’ll end up checking this all.

dandrea · March 11, 2025, 7:16pm

ZFS is weird (in a cool way). I still didn’t spared any time to read more deeply about it, but whenever I see people explaining its settings and what it can do, I get the feeling I’m looking at a liquid raw-disk database pretending to do what partitions usually do.

Arwen · March 11, 2025, 10:56pm

Yes, ZFS has taken some of the concepts from Databases, like transactions, and implemented them.

When Sun Microsystems was transitioning to larger storage, (late 1990s), they found that their main existing file system, UFS, was inadequate. So a couple of engineers started a new project, which ended up being ZFS. Fortunately they did not waste their efforts and we got something good out of their time and effort.