System Crash during replication

JR87 · September 7, 2024, 7:40am

Hi there,

My question relates to my previous request:

I’ve upgraded to Dragonfish where the lru_gen setting is 0 by default.
So this solution might not apply in this case.

I’ve got the same behaviour. I needed to reinitialize the replication, so I am doing a full backup/replication to the backup NAS.
After a while (with 2 to 4 TB already transferred), the connection breaks and the target (backup) TrueNAS instance is no longer reachable (e.g. via the Web GUI).
After a reboot it works fine again.
This happens every time I try to do a full backup. Over the past months, delta runs worked fine, most likely due to the lower amount of data to be transferred.

I’ve got an ASUS XG-C100C network adapter in both servers which proved itself reliable over time.

I even upgraded to Electric Eel Beta 1 on the backup TrueNAS to see if this makes any difference, but the behaviour is the same.

Do you have any idea what could be causing these crashes and how I can fix the issue?

Thanks a lot and BR
Julian

Captain_Morgan · September 7, 2024, 4:51pm

Can you provide full hardware descriptions for each end (particularly the backup)?
I assume you are on 24.04.2.

While doing a backup, have you checked the memory state… is there anything that looks unhealthy?
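
A minimal sketch of one way to check this from a shell on the backup system, assuming a Linux-based TrueNAS release where `/proc/meminfo` is available. The helper names `parse_meminfo` and `memory_summary` are made up for illustration. Because the box becomes unreachable when it crashes, it may help to run this periodically and append the output to a file so the last readings survive.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Summarise total and available memory in GiB, plus the available percentage."""
    total_kb = info["MemTotal"]
    # MemAvailable is the kernel's estimate of memory usable without swapping;
    # fall back to MemFree if the field is missing.
    available_kb = info.get("MemAvailable", info.get("MemFree", 0))
    return {
        "total_gib": total_kb / 1024**2,
        "available_gib": available_kb / 1024**2,
        "available_pct": 100.0 * available_kb / total_kb,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        summary = memory_summary(parse_meminfo(f.read()))
    print(
        f"total {summary['total_gib']:.1f} GiB, "
        f"available {summary['available_gib']:.1f} GiB "
        f"({summary['available_pct']:.0f}%)"
    )
```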

JR87 · September 8, 2024, 1:50pm

Hi,

Yes, correct.

TrueNAS Target (Backup):

Intel Xeon CPU E3-1220 v5
32 GB ECC RAM
ASUS XG-C100C
8x 12 TB HDD in 1 Pool RAIDZ1
The data to be transferred is about 65 TB (3 datasets in source → 1 in target system)

The memory is not under pressure; it’s mostly free on the target system.
Highest used memory is around 4 GB, with around 3 GB cached, out of 32 GB total.

Captain_Morgan · September 10, 2024, 1:38pm

Hardware is vanilla.
It should be reliable.

3 datasets in source… 1 in target. Can you break those up and see if it’s a specific dataset that causes the crash?

After that, I suggest you report a bug.

JR87 · September 10, 2024, 2:04pm

Hi,

Actually, no. The very small (500 GB) dataset works fine, of course, due to the lower amount of data. The other two do not: the crash doesn’t occur at a specific time or after a specific amount of data in either dataset. It feels kind of random.

For testing purposes, I switched to the onboard 1GbE adapter. With it, replication has run smoothly for the past couple of days.

I don’t think the network adapter itself is the problem, but I’m not sure either.

joeschmuck · September 10, 2024, 2:21pm

It does sound like a NIC issue. Maybe a driver or maybe it is too hot? You should rule out the cooling as that should be simple. Can you roll back to the working version to verify it still works? That would verify or eliminate the NIC or driver.

Good luck

JR87 · September 10, 2024, 2:30pm

Hi,
Thanks for the hint. Is it somehow possible to read out the temperature of the NIC?
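
One possible way to check, sketched under assumptions: on a Linux-based TrueNAS release, drivers that support temperature reporting expose their sensors through the kernel's hwmon sysfs interface under `/sys/class/hwmon`, with values in millidegrees Celsius. Whether the XG-C100C's driver registers a sensor there depends on the driver and kernel version, so it may simply not show up. The function name `read_hwmon_temps` is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return (chip_name, sensor_label, degrees_celsius) for each hwmon temperature sensor."""
    readings = []
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for temp_file in sorted(chip.glob("temp*_input")):
            # An optional tempN_label file gives a human-readable sensor name.
            label_file = chip / temp_file.name.replace("_input", "_label")
            label = label_file.read_text().strip() if label_file.exists() else temp_file.name
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor is listed but could not be read or parsed
            readings.append((chip_name, label, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for chip_name, label, celsius in read_hwmon_temps():
        print(f"{chip_name:20s} {label:20s} {celsius:5.1f} °C")
```

Running this once while idle and again during a large replication would show whether any listed sensor climbs noticeably under load. If no entry for the NIC appears, the driver likely does not export one on this system.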

JR87 · October 9, 2024, 11:29am

I think it’s related to the NIC’s temperature. I saw that the thermal pad on top of the chipset wasn’t properly set up.
After remodeling it a bit, it seems to work properly now!

Thanks a lot for your help

joeschmuck · October 9, 2024, 12:54pm

Glad you solved the issue and hopefully it stays fixed.