Middleware service not starting

Nervous_Cheesecake38 · January 7, 2025, 6:01am

Has anyone been able to resolve this issue on know what is actually happening? I’ve tried reloading old configs with no success.

I have also tried installing ubuntu on my boot drive and the system works flawlessly, so it doesn’t appear to be a hardware issue.

TheColin21 · January 7, 2025, 6:59am

My system still takes many minutes to boot up as it stays a long time at ‘setting up plugins (audit)’ but it does boot in the end.
I did look at some logs but I couldn’t find anything useful in syslog and middlewared.log.
This is probably rather due to me not knowing how middlewared and it’s plugins work and therefore not knowing that to actually look for than due to there being no useful information.

Is there anyone here (preferrably a TrueNAS dev) who could explain what exactly happens at this stage in the boot process and what to look for?

TheColin21 · January 7, 2025, 7:21am

I’ve taken a look at middleware/src/middlewared/middlewared/plugins/audit at master · truenas/middleware · GitHub which led me to the path /audit on my NAS (quiet obvious, yes).

I think my system might take so long to start because I have a multi-gigabyte SUDO.db there.
Nearly all entries are caused by my monitoring agent which is not part of TrueNAS.
At least for me the issue may be selfmade.

I sadly can’t find an option to exclude a certain user from sudo audit logging, this option only seems to exist for SMB.
I would like to empty or at least largely reduce the databse.
When reducing the retention days the database does not seem to be cleared.
It is also not possible to set the quota to a level which would trigger the warning state with the current size.

Although the issue may be selfmade I still think a large auditing database should not make TrueNAS take that long to boot (especially on mirrored SSDs) or even outright fail (if this is actually the cause for other people’s issues as well).
I would also like to see some more options for audit logging (ignoring users, freeing space…).

andrewww · January 12, 2025, 5:17pm

I got stuck on middlewared: setting up plugins (audit) after upgrading from 24.10.0.2 → 24.10.1. I was blocked from fully booting the console as a result.

I “fixed it” by removing all of the existing audit databases:

cd /audit
for f in *; do mv "$f" "$f.save"; done

and then rebooting. So, I lose out on the previous log databases, but my system works again.

I spent a tiny amount of time investigating some of the databases in a postmortem, and the .db files were able to be read and dumped by sqlite3, so I doubt there was any corruption there. Maybe it could’ve been related to the size of the databases (the biggest one, IIRC, was around 10GB with the WAL), but without diving into the audit plugin a little, that feels unlikely.

I didn’t spend too much time in postmortem, since I don’t really care about those databases in their previous state, and a reboot causes the audit plugin to make new databases.

andrewww · January 12, 2025, 5:22pm

Oh, also potentially relevant for someone wanting to do more digging on this. There weren’t any logs in /var/log/middlewared.log from before the upgrade that were particularly concerning. After the upgrade during startup, I had this:

[2025/01/12 08:23:37] (ERROR) middlewared.reader():708 - Error in ShellWorkerThread.reader
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 706, in reader
    ).result()
      ^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3/dist-packages/aiohttp/web_ws.py", line 315, in send_bytes
    await self._writer.send(data, binary=True, compress=compress)
  File "/usr/lib/python3/dist-packages/aiohttp/http_websocket.py", line 688, in send
    await self._send_frame(message, WSMsgType.BINARY, compress)
  File "/usr/lib/python3/dist-packages/aiohttp/http_websocket.py", line 601, in _send_frame
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

The only relevant log from journalctl -u middlewared was about the event loop exiting too soon.

Jan 12 08:27:14 localhost systemd[1]: Starting middlewared.service - TrueNAS Middleware...
Jan 12 08:27:17 localhost middlewared[1680]: /usr/lib/python3/dist-packages/pydantic/json_schema.py:2158: PydanticJsonSchemaWarning: Default value <object object at 0x7ff145f08d40> is not JSON serializable; excluding default>
Jan 12 08:27:17 localhost middlewared[1680]:   warnings.warn(message, PydanticJsonSchemaWarning)
Jan 12 08:31:19 localhost systemd[1]: middlewared.service: start operation timed out. Terminating.
Jan 12 08:31:19 localhost middlewared[1680]: Traceback (most recent call last):
Jan 12 08:31:19 localhost middlewared[1680]:   File "/usr/bin/middlewared", line 33, in <module>
Jan 12 08:31:19 localhost middlewared[1680]:     sys.exit(load_entry_point('middlewared==0.0.0', 'console_scripts', 'middlewared')())
Jan 12 08:31:19 localhost middlewared[1680]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 12 08:31:19 localhost middlewared[1680]:   File "/usr/lib/python3/dist-packages/middlewared/main.py", line 2220, in main
Jan 12 08:31:19 localhost middlewared[1680]:     ).run()
Jan 12 08:31:19 localhost middlewared[1680]:       ^^^^^
Jan 12 08:31:19 localhost middlewared[1680]:   File "/usr/lib/python3/dist-packages/middlewared/main.py", line 2034, in run
Jan 12 08:31:19 localhost middlewared[1680]:     self.loop.run_until_complete(self.__initialize())
Jan 12 08:31:19 localhost middlewared[1680]:   File "/usr/lib/python3.11/asyncio/base_events.py", line 652, in run_until_complete
Jan 12 08:31:19 localhost middlewared[1680]:     raise RuntimeError('Event loop stopped before Future completed.')
Jan 12 08:31:19 localhost middlewared[1680]: RuntimeError: Event loop stopped before Future completed.
Jan 12 08:32:50 localhost systemd[1]: middlewared.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Jan 12 08:34:20 localhost systemd[1]: middlewared.service: State 'final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Jan 12 08:34:20 localhost systemd[1]: middlewared.service: Failed with result 'timeout'.
Jan 12 08:34:20 localhost systemd[1]: middlewared.service: Unit process 1680 (asyncio_loop) remains running after unit stopped.
Jan 12 08:34:20 localhost systemd[1]: middlewared.service: Unit process 1692 (python3) remains running after unit stopped.
Jan 12 08:34:20 localhost systemd[1]: Failed to start middlewared.service - TrueNAS Middleware.
Jan 12 08:34:20 localhost systemd[1]: middlewared.service: Consumed 4min 40.497s CPU time.

bonoch · January 13, 2025, 12:07am

This fixed it it for me. Appears to have been caused by large db’s in the audit/ directory.

I have turned off SMB audit in my SMB shares. as that was the largest files

nka · January 16, 2025, 4:45pm

Thanks, this fixed my issue. Look like my SMB audit were huge (15GB) after syncing a lot of files. I just disabled SMB audit ! Booted pretty fast after renaming the log to .save

agent · February 1, 2025, 1:00am

Thank you for this! Ran into the same issue when upgrading from 24.10.1 to 24.10.2 and your solution worked.

katbyte · February 5, 2025, 3:26am

I just ran into this to going from 10.0.2 → 10.2

middleware never started and I had to drop in and delete /audit/* - it was 10gb total and totally preventing the middleware from starting even if I dropped back to the previous version (server had been up for 100+ days)

that said even when deleting audio in the new version the system was broken, I had to get into 10.0.2 and re-apply the upgrade for 10.2 to find my pool and boot successfully

hefty-merv · March 10, 2025, 12:38am

Awesome work Andrew, fixed it for me too.

Makes sense now as I had a system crash on my security system and moving over 2t of data would fill those logs pretty quickly.

Cheers.

William_Geiger · April 23, 2025, 5:14pm

Trying to upgrade from 24.05 to 25.04 and getting this error. Removed the existing audit databases even though they were not very big, but that did not resolve the problem. Not sure what else I can try other than doing a fresh install.

awalkerix · April 23, 2025, 6:17pm

We have seen some cases where some poorly designed applications log on and log off of SMB shares at rates of up to 100x or more per second. This can flood audit entries, but generally should be investigated by the server admin as it can generally cause performance issues on the server even without auditing.

Pierre1 · April 30, 2025, 11:26am

Had the same issue this morning after my Proxmox host experienced a power failure exceeding the UPS capacity overnight. The solution that worked was this one suggested by Brave browser AI :

One solution is to remove the existing audit databases to resolve the boot issue. You can do this by navigating to the /audit directory and moving the files to a .save extension:

cd /audit
for f in *; do mv "$f" "$f.save"; done

Started fine on the next reboot.

OnAir3 · May 1, 2025, 1:31pm

Thanks much. This worked for me too. I am surprised this issue has continued for so long.

sjohn · May 1, 2025, 6:00pm

I have used FreeBSD for over 30 years and Debian and FreeBSD are my 2 favorite OS. Our Dell r710 Windows file server went down due to 2 failed drives. Replaced the drives init th raid and downloaded the most recent TrueNAS-SCALE-25.04.0.iso and did a fresh install on the system partition (the data raid array is still ok). It failed to boot, removing or moving the audit did nothing to solve the issue so not very impressed at all. But will be trying the core version and see how it goes. Not really the best way to get new clients

William_Geiger · May 7, 2025, 1:24pm

Update. I tried upgrading from 24.05 to 24.10, and that worked without issues. I then upgraded from 24.10 to 25.04, and that was successful. Thankful I did not have to do a fresh install.

Stuart_Thomas · May 12, 2025, 9:29pm

So I have also run into th middleware service not running issue after upgrading to 25.04. Unfortunately for me the removal of the audit db’s did not resolve it. I have however been able to boot into the system by selecting the older version during bootup, so am going to stick with that for now.

On that note though, how I can remove the upgraded version and force it to stick to the version that is working?

sandwich · May 12, 2025, 10:30pm

I have run into this issue on a fresh install.

Kelly_Leavitt · May 15, 2025, 12:29pm

Under System->Boot
Click on the actrivate icon to make the older version active.
Once you reboot and verify everything works you can delete the failed upgrade.
I have been lead to believe the next maintenance update (due on 5/27/2025) should address the middlewared problem.