Good morning,
My primary NAS is struggling to run a replication to a remote NAS, and I am getting the following messages, which make little sense to me. The replication appears to start normally and all the changes are collected, but then it hangs on a Time Machine dataset at the 2025-02-06 snapshot.
Error: [2025/03/08 01:01:00] INFO [Thread-144] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_9.2p1)
…
[2025/03/08 02:04:08] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'local/Time Machine/TimeMachine_CvW' to 'remote/local_Backup/Time Machine/TimeMachine_CvW' of snapshot=None incremental_base=None include_intermediate=None receive_resume_token='[EDIT: token ID]' encryption=False
[2025/03/08 02:04:09] INFO [replication_task__task_1] [zettarepl.transport.ssh_netcat] Automatically chose connect address '[EDIT: local IP]'
[2025/03/08 02:04:09] WARNING [replication_task__task_1] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2025/03/08 02:04:09] ERROR [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' non-recoverable replication error ContainsPartiallyCompleteState()
I have seen posts in the old forum suggesting that the way to deal with this error is to either wait it out, kill the errant process, or reboot the remote machine. Surely there must be a better way?
Waiting it out does not seem to be an option, because my NASes fail nightly to get the task done. I have now rebooted the remote machine twice as part of trying to resolve this issue. What else can I try?
Does this mean that there is no common snapshot to facilitate the replication of the child dataset “TimeMachine_CvW”?
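In case it is useful for diagnosis: as I understand it, the receive side keeps a resume token on the target dataset while a partially received stream is parked there, and that can be checked from a shell on the remote NAS (dataset path taken from the log above; a value of "-" means no partial state is present):
zfs get -H -o value receive_resume_token "remote/local_Backup/Time Machine/TimeMachine_CvW"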
The curious thing is that I’ve had this system work for over a year. As best as I can tell, the current issue is an old-hat problem, i.e. that
- The pipe between the two systems gets broken, by some event I cannot control, before a given snapshot transfer has completed. This usually happens when a large snapshot has been generated (i.e. lots of changes on my laptop → Time Machine → local NAS → replication to remote NAS), so the transfer runs long enough for an interruption to hit it.
- The supposedly available replication resume function is not being invoked by 24.10.2. Instead, the replication keeps failing out, and only a restart of the remote NAS reliably kills the hanging process.
I doubt that iXsystems encounters this issue much, since their customers tend to have high-speed connections between local and remote systems; a resume function that never kicks in, or ZFS replication middleware with its hair on fire, is likely an edge case for the tech support team.
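For what it's worth, my reading of the OpenZFS docs is that a resumable stream can also be restarted by hand from that token, outside the middleware, roughly like this (a sketch only, with the replication task paused first; backup-host stands in for however the remote NAS is reached over SSH, and <token> is the receive_resume_token value read from the receive-side dataset):
zfs send -t <token> | ssh backup-host "zfs receive -s 'remote/local_Backup/Time Machine/TimeMachine_CvW'"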
I don't know your scheduling, but is it possible that snapshot pruning is occurring before a replication can complete?
Have you tried removing one or two of the most recent snapshots on the receive system to see if it can start again from an earlier point in time?
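If it helps, the snapshot lists on both sides can be compared from a shell to find the last snapshot they have in common; something like this, using the dataset paths from the log earlier in the thread (the first command on the primary, the second on the remote; -d 1 restricts the listing to that dataset's own snapshots):
zfs list -t snapshot -d 1 -o name,creation -s creation "local/Time Machine/TimeMachine_CvW"
zfs list -t snapshot -d 1 -o name,creation -s creation "remote/local_Backup/Time Machine/TimeMachine_CvW"
Anything on the receive side newer than the last common snapshot could then be removed with zfs destroy "remote/local_Backup/Time Machine/TimeMachine_CvW@<snapshot-name>" (the snapshot name is a placeholder).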
That’s a really good point. I doubt it, though; this error was allowed to fester for a while because replications usually fix themselves - e.g. the remote system has a power outage, so the task fails today but completes tomorrow.
In this case, the replications keep failing and the task backlog has grown to about 500 MB of content, which suggests a very long transfer time thanks to Comcast. Without the benefit of a working resume function, any interruption kills the entire snapshot transfer and appears to put the remote machine in a funk.
I can certainly do that. I did check on the receiving machine whether there were any registered 2025-02-06 snapshots for the particular Time Machine dataset in question, and there were none. Otherwise I would have deleted them, since this issue has affected me in the past.
Do I have to do anything special in TrueNAS to “legally” delete the 2025-02-05 Time Machine snapshot so that it knows to retransmit it, or can I just delete the remote snapshot and let ZFS replicate it again?
What’s your snapshot schedule on your primary system for the dataset in question?
Nightly with a two-week retention, weekly with a three-month retention.
The remote sets follow the local cadence.
Ok so can you see any snapshots on the receive system relating to this dataset?
I presume the receive side is not being limited by space/quota?
Sure can… and lookie here, the 2025-02-05 snapshot is showing zero bytes
while the local machine is currently trying to upload the snapshot for 2025-02-06.

28% full, oodles of space, no quota. I’m guessing the mangled 2025-02-05 snapshot is causing unhappiness. Can I simply
- stop the replication process on the local machine
- delete the 2025-02-05 snapshot on the remote system via the GUI
- restart the local replication process?
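Before step 2, I assume I should also confirm that nothing is pinning that snapshot on the remote side, since a hold or a clone would make the delete fail (the snapshot name below is only illustrative, following the usual auto naming):
zfs holds "remote/local_Backup/Time Machine/TimeMachine_CvW@auto-2025-02-05_02-00"
zfs get -H -o value clones "remote/local_Backup/Time Machine/TimeMachine_CvW@auto-2025-02-05_02-00"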
That only means that the snapshot occupies 0 bytes of unique space, i.e. everything it references is also referenced by another snapshot or by the live filesystem.
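A quick way to see that on the snapshot itself, if it helps (USED is space unique to the snapshot, REFER is everything it points at, so a fully transferred but unchanged snapshot can legitimately show zero USED; the snapshot name is illustrative):
zfs get -H used,referenced "remote/local_Backup/Time Machine/TimeMachine_CvW@auto-2025-02-05_02-00"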
But the local system shows more than zero bytes for this particular snapshot… or am I reading this wrong?
I’d be tempted to delete the most recent snap on the receive side and then let replication try to untwist its knickers. I find it’s normally very good at that given the chance.
If you are able to, a reboot on the receive side wouldn't hurt either; it forces the send to fail and start again.
Is this Core? I know that Core has a longstanding bug in the GUI where you cannot always trust the information about snapshots.
What is revealed if you use the command line?
zfs list -o name,used,refer <pool>/<dataset>/<child>@auto-2025-02-05_02-00
Nope, not Core. And now, when I try to delete that snapshot, it claims it has dependent datasets.
Interesting, so trying to delete that most recent snapshot on the receive side presents this message?
If so, I would disable replication on the primary, restart the receive system, and then try deleting it again.
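If rebooting yet again is unappealing, my understanding is that the partially received state itself can be discarded on the receive side; that saved state is usually what creates the hidden dependent blocking the snapshot delete, and aborting it lets the next replication start that snapshot from scratch (same dataset path as in the logs; note this throws away whatever was already transferred for it):
zfs receive -A "remote/local_Backup/Time Machine/TimeMachine_CvW"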