Daily server crashes (systemd-journald.service)

plssrs.be · February 6, 2025, 2:39pm

Hello everyone

I am having trouble with my home-built server and I can’t find the cause. The server runs fine for up to a day and then suddenly crashes. For example, it can run without issues for about 10 hours and then stop working while showing the following error messages (see the images below for more, since I cannot copy and paste this):

systemd [1]: systemd-journald.service: Found left-over process (systemd-journal) in control group while starting unit. Ignoring.
systemd [1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd-journald: File /var/log/journalsystem.journal corrupted or uncleanly shut down, renaming and replacing.
WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended.

I have already tried several things, none of which solved the problem:

Replaced the old RAM (I don’t know if it was any good) with a new stick of ECC RAM. Crashes still happen.
Installed a new SSD and SATA cable with a clean install of TrueNAS, then reloaded my config. Still crashing.
Tried a different power supply. Still crashing.

Anyone have an idea what could cause these crashes? I am afraid I don’t know where to look anymore.

Thank you in advance

Geert

plssrs.be · February 6, 2025, 2:46pm

PC Specs:
CPU: AMD Ryzen 5 3400G
RAM: 16 GB (ECC)
MOBO: MSI B550-A PRO
DISKS:
1 x Crucial BX500 (240GB) SATA SSD (boot-pool)
2 x M.2 SSD (500GB) (1 x Mirror for apps)
3 x SATA HDD (4TB) (1 x RAIDZ1 for data)

somethingweird · February 6, 2025, 3:54pm

I would check the ‘boot-pool’, since it is mentioned in the WARNING. Check the drive and make sure it is good.

Just want to mention, regarding ECC RAM on that motherboard: the board will accept it but won’t do any ECC checking, based on the specs at "MSI B550-A PRO ATX Motherboard with PCIe 4.0 for AMD Ryzen Processors - MSI-US Official Store".

plssrs.be · February 6, 2025, 4:00pm

Thank you for pointing me to the ECC support.

I did a full S.M.A.R.T. test on the drive and also replaced the SSD in the past. Both SSDs produced the error, so I assume the SSD itself is fine? I also replaced the SATA cable already. It still happens.

Protopia · February 6, 2025, 4:27pm

I agree that the boot pool hitting an I/O error and being suspended is the most likely cause.

You might want to check your BIOS settings.

As a bit of a workaround, there is a ZFS pool property, failmode, that defines what happens when a pool hits an I/O failure. The options are wait, continue or panic. IIRC the default for the boot pool is wait, and setting it to continue might help (assuming that you can get your system to boot far enough to allow you to set this).
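For reference, here is a minimal sketch of the commands involved, assuming the pool is named boot-pool as in the warning above. The helper name failmode_commands is made up for illustration. It only builds and prints the command lines and does not run anything, so the printed commands would still need to be run as root on the NAS.

```python
import shlex

# The three values the failmode property accepts.
VALID_FAILMODES = ("wait", "continue", "panic")


def failmode_commands(pool, mode):
    """Return the zpool commands (as argv lists) to inspect, then set, failmode.

    Hypothetical helper for illustration only: nothing is executed here.
    """
    if mode not in VALID_FAILMODES:
        raise ValueError(f"failmode must be one of {VALID_FAILMODES}, got {mode!r}")
    return [
        ["zpool", "get", "failmode", pool],          # show the current value
        ["zpool", "set", f"failmode={mode}", pool],  # change it
    ]


if __name__ == "__main__":
    # Assumption: the pool name is taken from the WARNING line in the first post.
    for argv in failmode_commands("boot-pool", "continue"):
        print(shlex.join(argv))
```

Running the get command again afterwards is a quick way to confirm that the new value was applied.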

(Also Crucial SSDs are getting increasing numbers of complaints. I have myself had an issue with a Crucial SSD and their tech support was terrible and they disclaimed any responsibility and accused / blamed me for running a bitcoin miner on it - which I don’t. I will never buy Crucial again and will always recommend to others that they boycott Crucial too.)

plssrs.be · February 7, 2025, 8:08am

Any BIOS settings in particular that I should be aware of?

Also, thank you for pointing out the reliability concerns with Crucial. I have ordered a Samsung SSD to try out. I have also set failmode to continue.

Thanks

plssrs.be · February 10, 2025, 6:52pm

Aaaand we’re back to zero…

As @Protopia suggested, I tried the failmode setting, but this wasn’t any help. It just kept spamming the same error and the server went down anyway.

I also tried a third SSD, this one from Samsung, but the crash happened again.

I have no idea where to look. Are there logs somewhere in which I can see the first messages from before the systemd-journald.service errors start spamming?
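To make the question concrete, this is roughly the filtering I have in mind: a small sketch that keeps the last few lines seen before the first systemd-journald message. It assumes the log text from the crashed boot is available at all, for example saved from `journalctl -b -1 --no-pager`, which may not hold here since the boot pool itself gets suspended. The function name lines_before_first_match and the placeholder sample lines are made up for illustration.

```python
from collections import deque


def lines_before_first_match(lines, needle="systemd-journald", context=5):
    """Return up to `context` lines that precede the first line containing `needle`."""
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    for line in lines:
        if needle in line:
            return list(recent)
        recent.append(line)
    return []  # needle never appeared


if __name__ == "__main__":
    # Placeholder input, not real log output; only the fourth line is
    # taken from the error messages quoted in the first post.
    sample = [
        "earlier message 1",
        "earlier message 2",
        "earlier message 3",
        "systemd[1]: systemd-journald.service: Found left-over process",
        "later message",
    ]
    for line in lines_before_first_match(sample, context=2):
        print(line)
```

With real input, the list passed in would be the lines of the saved log file instead of the placeholder sample.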

danno · March 18, 2025, 6:18pm

Not sure if you’re also the author of the post on Reddit, but I’ve had the same issue with TrueNAS ElectricEel-24.10.1 (also 24.10.2). In my case, it only happens intermittently: 2 different times, about a month apart. I’ve similarly ruled out the power supply, RAM (which is actual ECC, in a ThinkServer), SATA cable and port, and boot drive (SSD). I’ve tried looking for diagnostic data in the /var/log files (syslog, messages, error, debug) but haven’t found anything. I’m following this thread now in case someone answers your questions about which logs, etc. will prove helpful.