25.04.2.6 keeps crashing every night after update from CORE

I upgraded from CORE to SCALE 25.04.2.6 a few days ago. At first I did not notice anything strange, but then I noticed an unusually low uptime. After investigating, I found that the system reboots/crashes every night between 0 and 5 am.

There are no containers/VMs/Apps.

MB: Supermicro SSG-520P-ACTR12L/X12SPI-TF

CPU: Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz

Memory: 128 GB

Storage:
Data VDEVs
2 x MIRROR | 2 wide | 16.37 TiB - HDD
Metadata VDEVs
1 x MIRROR | 3 wide | 894.25 GiB - SSD

and
1 x MIRROR | 2 wide | 2.91 TiB - NVME

and
boot root is mirrored SSD

Network:
2× Intel X710, bonded separately into 2 LACP groups

The system worked for 2+ years under CORE without any downtime, so a HW failure is in my opinion highly unlikely. IPMI also shows no HW abnormalities and no power failures.

The complete /var/log/messages is attached: log-messages.txt (4.1 MB)

The only thing I can see in the logs is that there is always oom_kill_process and out_of_memory.
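To see which processes the OOM killer actually terminated, the log can be tallied with a short script. This is only a sketch: it assumes the kernel logs victims in the mainline format "Out of memory: Killed process <pid> (<name>)", which has not been verified against the attached file, and the function name `oom_victims` is made up for illustration.

```python
import re
from collections import Counter

# Assumed kernel message format (mainline Linux), not verified against the attached log:
#   "... Out of memory: Killed process 1234 (middlewared) total-vm:..."
KILLED_RE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]*)\)")


def oom_victims(lines):
    """Count how often each process name was killed by the OOM killer."""
    counts = Counter()
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def report(path):
    """Print victims from a log file, most frequently killed first."""
    with open(path, errors="replace") as fh:
        for name, count in oom_victims(fh).most_common():
            print(f"{count:6d}  {name}")
```

Calling `report("log-messages.txt")` would list the most frequently killed processes first, which may hint at what is consuming the memory.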

Crashes can, in my opinion, be identified by the ‘Linux version 6.12.15-production+truenas’ line that is logged at each boot.

Here is the list of boot times:

Dec 28 22:16:31 (first boot of 25.04.2.6)
Dec 30 21:55:52
Jan 1 02:25:45 (crashed)
Jan 2 01:29:02 (crashed)
Jan 3 00:22:28 (crashed)
Jan 4 00:21:03 (crashed)
Jan 5 05:41:27 (crashed)
Jan 6 05:03:35 (crashed)
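A list like this can be pulled from the log automatically by looking for the kernel boot banner. The sketch below assumes traditional syslog timestamps at the start of each line (as in the times above); the function name `boot_times` is hypothetical.

```python
import re

# The kernel prints a "Linux version ..." banner once per boot.
BOOT_MARKER = "Linux version"
# Assumed traditional syslog timestamp, e.g. "Jan  1 02:25:45"
TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s")


def boot_times(lines):
    """Return the timestamp of every line that contains the boot banner."""
    times = []
    for line in lines:
        if BOOT_MARKER in line:
            match = TS_RE.match(line)
            if match:
                # Collapse the double space syslog uses for single-digit days.
                times.append(" ".join(match.group(1).split()))
    return times


def print_boot_times(path):
    """Print one boot timestamp per line for the given log file."""
    with open(path, errors="replace") as fh:
        for stamp in boot_times(fh):
            print(stamp)
```

Note that this lists every boot, including clean reboots, so planned restarts still have to be excluded by hand.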

I searched through cron and other tasks to see if there is any pattern in these times correlating with any tasks, but I found nothing. Only replications and backups run at these times, and those also run at other times of the day.

Can anybody please take a look at the logs and see if you can decode anything from them?

Looks to me as if your system was constantly OOM-killing before that. Maybe it killed something really important :sweat_smile: It’s very weird though, I’ve never seen anything like that. Can you take a look at the Memory available graph in Reporting and see if that’s actually close to 0 before the crashes?

I cannot say for sure. Right before crashing (when it shows 0 and then everything available again) it is usually hovering around 10-20 GiB free.
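In case the reporting graph's sampling interval hides a fast spike, available memory could also be logged at a finer interval straight from /proc/meminfo. This is a minimal sketch, not a TrueNAS tool: the 5-second interval and the output file name `memavail.log` are arbitrary assumptions, and the log should be written somewhere that survives the reboot.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def log_available(path="/proc/meminfo", interval_s=5, out="memavail.log"):
    """Append a timestamped MemAvailable sample (in GiB) to `out` until stopped."""
    while True:
        with open(path) as fh:
            info = parse_meminfo(fh.read())
        gib = info.get("MemAvailable", 0) / (1024 * 1024)  # kB -> GiB
        with open(out, "a") as log:
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {gib:.2f}\n")
        time.sleep(interval_s)
```

Running `log_available()` in the background overnight and checking the last lines of the file after the next crash would show how quickly the available memory dropped.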

This is the detail of today’s crash:

That does look to me like something actually uses all available RAM and since you say you don’t have any containers/Apps/VMs this might actually be a bug. I’d file a ticket on Jira…

OK, will do. Thank you for your time.

Sadly it seems I’m not able to report a bug, as I get an error when trying to create it while logged in to Jira: “You are not authorized to perform this operation. Please log in.”

Have you tried to file it from the web UI or directly in Jira? I remember getting that error from Jira but could file it from the web UI.

Thank you! Submitting the bug from the TrueNAS web UI actually gives you access to the space in Jira.
