Replication stopped with an error; restarting hangs, waiting for the failed one to stop?

Nono · September 4, 2024, 1:58pm

Hi there o/

I’m on TrueNAS-SCALE-24.04.2 and this morning the replication job failed with the following:

replication.run

Error: [EFAULT] Network connection timeout.

/var/log/zettarepl.log gave me this:

[2024/09/04 03:38:48] WARNING [replication_task__task_12] [zettarepl.replication.run] For task 'task_12' at attempt 5 recoverable replication error RecoverableReplicationError('Network connection timeout')
[2024/09/04 03:38:48] ERROR [replication_task__task_12] [zettarepl.replication.run] Failed replication task 'task_12' after 5 retries

From the UI, I then re-launched the replication, but it hangs with:

Updating

replication.run 0.00%

Fetching data…

It turns out it’s waiting for the previous job (task_12) to stop first?!

[MainThread] [zettarepl.zettarepl] Replication task <Replication Task 'task_12'> can't execute in parallel because 'Waiting for retention to complete', delaying it

My questions are now:

How can I (cleanly) interrupt the “failed” job/task so that the new one can run correctly? Is there a way to find the PID attached to a task name as shown in /var/log/zettarepl.log?
That log seems to be the only place with useful details, but I’d be happy to know if there are other places or log files that would give me more.
Why don’t I have such details in the UI directly? From there, the previous job is clearly “failed”, threw an error, and doesn’t seem to be running at all. Could this be a bug, and should the process/task have been interrupted?
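
For anyone wanting to look at the same log, here is a minimal sketch for pulling every entry that mentions one task out of /var/log/zettarepl.log. It assumes the line layout seen in the excerpts above ([timestamp] LEVEL [thread] [logger] message, with the timestamp and level possibly missing as in the "delaying it" line); the layout is inferred from those excerpts only, and the names `LogEntry` and `entries_for_task` are made up for illustration. It does not find a PID; it only gathers the relevant lines in one place.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed layout, inferred only from the excerpts in this post:
#   [timestamp] LEVEL [thread] [logger] message
# The "[timestamp] LEVEL" prefix is optional so that truncated excerpts
# such as "[MainThread] [zettarepl.zettarepl] ..." are parsed too.
LINE_RE = re.compile(
    r"^(?:\[(?P<timestamp>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+)?"
    r"\[(?P<thread>[^\]]+)\]\s+\[(?P<logger>[^\]]+)\]\s+(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: Optional[str]
    level: Optional[str]
    thread: str
    logger: str
    message: str


def entries_for_task(lines: Iterable[str], task: str) -> Iterator[LogEntry]:
    """Yield parsed entries whose thread name or message mentions `task`."""
    # The negative lookahead keeps "task_12" from also matching "task_120".
    task_re = re.compile(re.escape(task) + r"(?!\d)")
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation lines (e.g. tracebacks) or unknown format
        entry = LogEntry(**match.groupdict())
        if task_re.search(entry.thread) or task_re.search(entry.message):
            yield entry
```

Passing an open handle on /var/log/zettarepl.log together with "task_12" should list the retry warnings, the final error, and the "can't execute in parallel" messages in order, which makes it easier to see what the new run is waiting on.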