Hi there o/
I’m on TrueNAS-SCALE-24.04.2, and this morning the replication job failed with the following error:
replication.run
Error: [EFAULT] Network connection timeout.
The log file /var/log/zettarepl.log gave me this:
[2024/09/04 03:38:48] WARNING [replication_task__task_12] [zettarepl.replication.run] For task ‘task_12’ at attempt 5 recoverable replication error RecoverableReplicationError(‘Network connection timeout’)
[2024/09/04 03:38:48] ERROR [replication_task__task_12] [zettarepl.replication.run] Failed replication task ‘task_12’ after 5 retries
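Since /var/log/zettarepl.log is where the useful detail lives, a small parser can pull out everything a single task logged. This is a minimal Python sketch. It assumes the line layout seen in the two excerpts above (timestamp, level, thread tag, logger, message) and assumes the thread tag is always replication_task__<task id>. Neither assumption is confirmed zettarepl documentation. The helper names (parse_line, entries_for_task, print_task_history) are hypothetical. The sketch only reads the log: it does not reveal a PID or stop anything.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional

# Layout inferred only from the two excerpts above (an assumption, not a spec):
#   [YYYY/MM/DD HH:MM:SS] LEVEL [thread] [logger] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    ts: datetime
    level: str
    thread: str
    logger: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into fields; return None if it does not fit the layout."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return LogEntry(
        ts=datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S"),
        level=m["level"],
        thread=m["thread"],
        logger=m["logger"],
        message=m["message"],
    )


def entries_for_task(lines: Iterable[str], task: str) -> List[LogEntry]:
    """Keep entries whose thread tag is 'replication_task__<task>' (assumed naming)."""
    tag = f"replication_task__{task}"
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.thread == tag]


def print_task_history(path: str, task: str) -> None:
    """Print every entry logged by one task's thread in the given log file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for e in entries_for_task(f, task):
            print(e.ts.isoformat(sep=" "), e.level, e.message)
```

For example, print_task_history("/var/log/zettarepl.log", "task_12") would list the retry warnings and the final error for that task in order. Lines that do not match the assumed layout, such as the [MainThread] line without a timestamp, are simply skipped.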
From the UI, I then relaunched the replication, but it hangs with:
Updating
replication.run 0.00%
Fetching data…
It turns out it’s waiting for the previous job (task_12) to stop first?!
[MainThread] [zettarepl.zettarepl] Replication task <Replication Task ‘task_12’> can’t execute in parallel because ‘Waiting for retention to complete’, delaying it
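The "delaying it" message can also be picked out of the log mechanically, to see which task is blocked and for what reason. This is a minimal Python sketch that assumes the wording stays exactly as in the single line quoted above. It accepts either typographic (‘ ’) or plain (') quotes, on the assumption that the forum may have converted the log's original quotes. find_delays and count_delays are hypothetical helpers. Counting repeats per (task, reason) shows how many times the scheduler has deferred the task; it does not identify or stop whatever is holding it back.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Either a plain or a typographic quote/apostrophe (assumption: the raw log
# may use ' while the forum rendering shows ‘ ’).
Q = "['‘’]"

# Wording copied from the single "delaying it" line quoted above.
DELAY_RE = re.compile(
    rf"Replication task <Replication Task {Q}(?P<task>[^'‘’]+){Q}> "
    rf"can{Q}t execute in parallel because {Q}(?P<reason>[^'‘’]+){Q}"
)


def find_delays(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (task, reason) for every "can't execute in parallel" message."""
    for line in lines:
        m = DELAY_RE.search(line)
        if m:
            yield m["task"], m["reason"]


def count_delays(lines: Iterable[str]) -> Counter:
    """Count how often each (task, reason) pair was deferred."""
    return Counter(find_delays(lines))
```

Passing an open file handle for /var/log/zettarepl.log to count_delays would give, for instance, how many times task_12 was deferred with "Waiting for retention to complete".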
My questions are now:
- How can I (gracefully) interrupt the “failed” job/task so that the new one can run correctly? Is there a way to find the PID attached to a “task name” as shown in /var/log/zettarepl.log? That log seems to be the only place with useful details, but I’d be happy to know whether there are other places or log files that would give me more details.
- Why don’t I get such details in the UI directly? From there, the previous job is clearly marked “failed”, shows an error, and doesn’t seem to be running at all. Could this be a bug, i.e. should the process/task have been interrupted?