I am having some trouble with my home-built server and I can’t seem to find the problem. My server will run fine for a day and then suddenly crash. For example, it can run completely fine for about 10 hours and then stop working while showing the following error messages (see images below for more, since I cannot copy and paste this):
systemd [1]: systemd-journald.service: Found left-over process (systemd-journal) in control group while starting unit. Ignoring.
systemd [1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd-journald: File /var/log/journalsystem.journal corrupted or uncleanly shut down, renaming and replacing.
WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended.
I have already tried several things, none of which solved the problem.
I replaced the old RAM (I don’t know if it was any good) with a new stick of ECC RAM. Crashes still happen.
New SSD and SATA cable with a clean install of TrueNAS. Afterwards reloaded my config. Still crashing.
Tried a different power supply. Again still not working.
Anyone have an idea what could cause these crashes? I am afraid I don’t know where to look anymore.
PC Specs:
CPU: AMD Ryzen 5 3400G
RAM: 16 GB (ECC)
MOBO: MSI B550-A PRO
DISKS:
1 x Crucial BX500 (240GB) SATA SSD (boot-pool)
2 x M.2 SSD (500GB) (1 x Mirror for apps)
3 x SATA HDD (4TB) (1 x RAIDZ1 for data)
I did a full S.M.A.R.T. test on the drive and also replaced the SSD in the past. Both SSDs caused the error, so I think the SSD itself is probably fine? I also replaced the SATA cable already. It still happens.
I agree that the boot pool hitting an I/O error and being suspended is the most likely cause.
You might want to check your BIOS settings.
As a bit of a workaround there is a ZFS zpool setting failmode that defines what happens when you get an I/O error - and the options are wait, continue or panic. IIRC the default for the boot pool is wait, and setting it to continue might help (assuming that you can get your system to boot far enough to allow you to set this).
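For reference, a minimal sketch of checking and changing that property from a root shell. It assumes the pool is named boot-pool, as in the error message above; the `command -v` guard is only there so the snippet does nothing on a machine without ZFS installed.

```shell
# Pool name taken from the error message; adjust if yours differs.
POOL="boot-pool"

if command -v zpool >/dev/null 2>&1; then
    # Show the current value (the ZFS default is "wait").
    zpool get failmode "$POOL"

    # Return EIO to new writes instead of blocking all I/O on failure.
    # Needs root (or sudo).
    zpool set failmode=continue "$POOL"

    # Confirm the change and check the pool for reported errors.
    zpool get failmode "$POOL"
    zpool status -v "$POOL"
else
    echo "zpool not found; run this on the TrueNAS host itself"
fi
```

This only changes how the pool reacts to the I/O failure. It does not remove whatever is causing the failure, so the underlying hardware or BIOS issue still needs tracking down.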
(Also Crucial SSDs are getting increasing numbers of complaints. I have myself had an issue with a Crucial SSD and their tech support was terrible and they disclaimed any responsibility and accused / blamed me for running a bitcoin miner on it - which I don’t. I will never buy Crucial again and will always recommend to others that they boycott Crucial too.)
Not sure if you’re also the author of the post on Reddit, but I’ve had the same issue with TrueNAS ElectricEel-24.10.1 (also 24.10.2). In my case, it only happens intermittently: two times so far, about a month apart. I’ve similarly ruled out the power supply, RAM (which is actual ECC - Thinkserver), SATA cable and port, and boot drive (SSD). I’ve tried looking for diagnostic data in the /var/log files (syslog, messages, error, debug) but haven’t found anything. I’m following this thread now in case someone answers your questions about which logs, etc. will prove helpful.