Interrupted SCALE Replication Tasks not resuming

Constantin · June 14, 2024, 3:11am

One of the cool promised features of SCALE is the ability to resume replication tasks after a network dropout or whatever interrupted them. I’m dealing with a large snapshot transfer task that got interrupted multiple times in CORE, and I believe it will never finish unless the replication task can resume, as is advertised for SCALE. The transfer has been running in the background for a week or so.

When I played with the network interfaces to allow SCALE jails (thank you, @stux!), the replication transfer got interrupted… and the task did not resume when network traffic worked again. Instead, like CORE, it started over.

Do the pool ZFS feature flags have to be upgraded to the latest version before replication resume is a possibility, or am I missing an important setting?

Stux · June 14, 2024, 3:13am

I did update my pools to Cobia’s version… which is the same as Dragonfish.

The replication task will start sending snapshots from the first dataset, but in my case, when it got to the dataset it was interrupted on, it did resume sending the same snapshot it was interrupted on…

i.e. the snapshot was 4TB or so… and each time it restarted there was significantly less remaining, until eventually it finished.

I did have issues when the destination and the source were running at different versions, or when I upgraded the pool on the destination during the replication.

In the end, I upgraded both pools, and it seemed to work as you would expect… i.e. magically
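
For anyone wanting to check whether a partial receive was actually kept on the destination: OpenZFS exposes a `receive_resume_token` property on a dataset whose resumable receive (`zfs receive -s`) was interrupted. Below is a minimal sketch, assuming the replication task uses resumable receives (not confirmed in this thread) and that it is run on the destination NAS with enough privileges. The helper names `parse_resume_token` and `get_resume_token` are made up for illustration.

```python
import subprocess
import sys


def parse_resume_token(raw):
    """Return the token string, or None when the property is unset.

    `zfs get -H -o value` prints "-" for a property that has no value.
    """
    value = raw.strip()
    return None if value in ("", "-") else value


def get_resume_token(dataset):
    """Ask ZFS for the receive_resume_token of a destination dataset."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-o", "value", "receive_resume_token", dataset],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_resume_token(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical dataset name): python check_token.py tank/backup/data
    token = get_resume_token(sys.argv[1])
    if token:
        print("partial receive state present, token:", token)
    else:
        print("no partial receive state on", sys.argv[1])
```

If a token is present, the interrupted stream can in principle be continued from the sending side with `zfs send -t <token>`. Note that `zfs receive -A` on the destination discards the saved partial state, so avoid it if the goal is to resume.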

Constantin · June 22, 2024, 1:22pm

Welp, another remote NAS power outage led to the replication task getting interrupted. When the NASes were able to speak to each other again, the remote dataset indicated it was busy and could not be modified / replicated to again (i.e. instant ERROR on the local NAS doing the pushing).

I restarted the remote NAS, which cleared the ‘busy’ issue but also resulted in the two NASes restarting the dataset replication from scratch rather than resuming it, as advertised for SCALE. Both datasets have all the latest SCALE pool feature flags enabled, so that’s not the cause.

The next time this happens, is there a way to clear the ‘busy’ signal at the remote NAS without causing the accumulated snapshot data to be lost?