I’ve got a problem with replication from my main TrueNAS to my backup TrueNAS.
This worked fine before I upgraded to TrueNAS SCALE Dragonfish, and it still works on Dragonfish for small delta replications.
But now I tried to replicate a dataset from scratch.
I get this message after about 3 hours of replication, and the target (backup) TrueNAS stops responding, even via the web interface.
It transferred a couple of TB up to that point but didn’t finish.
After a hard reboot of the target server, the system works fine again.
But it crashes again on every replication attempt.
I already did a fresh install of the target TrueNAS and restored the config, but that didn’t help.
Just as a side note:
I’ve reduced the throughput of the replication to around 256 MB/s (around 2 Gbit/s). This apparently helps, as the replication has now been running for 24 hours straight without crashing the target server.
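For reference, the unit conversion behind that figure can be sketched as below. This is only a back-of-the-envelope check, assuming decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits); the function name is made up for illustration.

```python
def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second (decimal units)."""
    # 1 byte = 8 bits, and 1 Gbit = 1000 Mbit in decimal units.
    return mb_per_s * 8 / 1000


if __name__ == "__main__":
    # The limit mentioned above: 256 MB/s is a little over 2 Gbit/s.
    print(f"{mb_per_s_to_gbit_per_s(256):.3f} Gbit/s")
```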
Could this be caused by a hardware issue? Memory, for example, that can’t handle the throughput?
Can you look at the logs on the destination server, particularly /var/log/messages, from just before it crashes? I suspect the clues we need will be there.
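If the log is large, something like the following might help narrow it down. It is only a rough sketch: the keyword list is my own guess at typical kernel messages for memory pressure and hung tasks, not something TrueNAS defines, and `suspicious_lines` is a made-up helper name. Run it on the target after the reboot (reading the log may need root).

```python
from collections import deque
from pathlib import Path

# Assumed keywords (lower-case); adjust to whatever your system actually logs.
KEYWORDS = ("out of memory", "oom-kill", "blocked for more than")


def suspicious_lines(lines, keywords=KEYWORDS, tail=2000):
    """Return the lines among the last `tail` lines that contain any keyword."""
    recent = deque(lines, maxlen=tail)  # keep only the end of the log
    return [
        ln.rstrip("\n")
        for ln in recent
        if any(k in ln.lower() for k in keywords)
    ]


if __name__ == "__main__":
    log = Path("/var/log/messages")
    try:
        with log.open(errors="replace") as fh:
            for line in suspicious_lines(fh):
                print(line)
    except OSError as exc:
        # Missing file or insufficient permissions.
        print(f"could not read {log}: {exc}")
```

Only the tail of the file is scanned because the interesting entries should be the ones written shortly before the hang.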
Then I suggest disabling lru_gen, at a minimum, on both sides if they are running Dragonfish.
The issue is most likely that lru_gen conflicts with the ARC during replication, causing a chronic swap situation that makes the system unresponsive.
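To see where a box currently stands, a read-only check along these lines could be used. It is a sketch under assumptions: it relies on the standard Linux locations `/sys/kernel/mm/lru_gen/enabled` (a hex bitmask, where `0x0000` means off) and `/proc/meminfo`, and the helper names are made up. It changes nothing on the system.

```python
from pathlib import Path

LRU_GEN = Path("/sys/kernel/mm/lru_gen/enabled")
MEMINFO = Path("/proc/meminfo")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(meminfo):
    """Swap currently in use, in kB (0 if swap is disabled)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def lru_gen_enabled(raw):
    """Interpret the hex bitmask from the lru_gen 'enabled' file."""
    return int(raw.strip(), 16) != 0


if __name__ == "__main__":
    try:
        print("lru_gen enabled:", lru_gen_enabled(LRU_GEN.read_text()))
        print("swap used (kB):", swap_used_kb(parse_meminfo(MEMINFO.read_text())))
    except OSError as exc:
        # Kernel without lru_gen, or not a Linux system.
        print(f"could not read system state: {exc}")
```

If swap usage keeps climbing on the target while the replication runs and lru_gen is enabled, that would be consistent with the theory above, though it does not prove it.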
I’ve disabled both swap and lru_gen and started the replication again.
If it crashes once more, I’ll look at the “messages” log and see what it says.
So, as mentioned, I tried the replication again with swap and lru_gen disabled.
That seems to have solved it: the replication went through without any interruption.
May I ask, since I couldn’t find useful info in my own search: what does disabling “lru_gen” actually do? Do I need to change it back in the next release or so?