Partial replication from Cobia to Dragonfish won't complete

I have two systems. One is running 23.10.2 (Cobia), the other Dragonfish 24.04 RC1.

My Cobia system is attempting to replicate a dataset (docker/volumes) to the Dragonfish system… and it's relatively large (86 GB), over a relatively slow 40 Mbps link.

This is usually a fine thing.

The issue is that the transfer failed, and now TrueNAS is attempting to use resumable replication, which is also a fine thing… but it is reliably failing to resume… which is not fine.

[2024/04/12 00:00:04] INFO     [Thread-4321] [zettarepl.paramiko.replication_task__task_9] Connected (version 2.0, client OpenSSH_9.2p1)
[2024/04/12 00:00:04] INFO     [Thread-4321] [zettarepl.paramiko.replication_task__task_9] Authentication (publickey) successful!
[2024/04/12 00:00:05] INFO     [replication_task__task_9] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/04/12 00:00:05] INFO     [replication_task__task_9] [zettarepl.replication.run] For replication task 'task_9': doing push from 'tank/bizadmin' to 'tank/replicas/titan/bizadmin' of snapshot='auto-daily-2024-04-12_00-00' incremental_base='auto-daily-2024-04-11_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2024/04/12 00:00:06] INFO     [replication_task__task_9] [zettarepl.paramiko.replication_task__task_9.sftp] [chan 5] Opened sftp connection (server version 3)
[2024/04/12 00:00:06] INFO     [replication_task__task_9] [zettarepl.transport.ssh_netcat] Automatically chose connect address 'chronus.<redacted>'
[2024/04/12 00:00:09] INFO     [replication_task__task_9] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/04/12 00:00:09] INFO     [replication_task__task_9] [zettarepl.replication.run] For replication task 'task_9': doing push from 'tank/certificates' to 'tank/replicas/titan/certificates' of snapshot='auto-daily-2024-04-12_00-00' incremental_base='auto-daily-2024-04-11_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2024/04/12 00:00:09] INFO     [replication_task__task_9] [zettarepl.transport.ssh_netcat] Automatically chose connect address 'chronus.<redacted>'
[2024/04/12 00:00:13] INFO     [replication_task__task_9] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
... 296 more lines ...
[2024/04/12 00:59:30] INFO     [replication_task__task_9] [zettarepl.replication.run] Resuming replication for destination dataset 'tank/replicas/titan/docker/volumes'
[2024/04/12 00:59:30] INFO     [replication_task__task_9] [zettarepl.replication.run] For replication task 'task_9': doing push from 'tank/docker/volumes' to 'tank/replicas/titan/docker/volumes' of snapshot=None incremental_base=None include_intermediate=None receive_resume_token='1-132c326b1c-f8-789c636064000310a500c4ec50360710e72765a526973030c4b23381d560c8a7a515a79630c001489e0d493ea9b224b518489f586aa883cdfc92fcf4d2cc140686ab8fd50a66b627d57820c97382e5f31273538174625eb67e4a7e72766a917e597e4e696e6ab143626949be6e4a62664ea5ae91819189ae81b1ae9171bc8181ae8101d81e6e0684bf92f3730b8a528b8bf3b3116e050058e6276a' encryption=False
[2024/04/12 00:59:30] INFO     [replication_task__task_9] [zettarepl.transport.ssh_netcat] Automatically chose connect address 'chronus.<redacted>'
[2024/04/12 00:59:31] WARNING  [replication_task__task_9] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2024/04/12 01:00:31] INFO     [replication_task__task_9] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/04/12 01:00:31] INFO     [replication_task__task_9] [zettarepl.replication.run] Resuming replication for destination dataset 'tank/replicas/titan/docker/volumes'
[2024/04/12 01:00:31] INFO     [replication_task__task_9] [zettarepl.replication.run] For replication task 'task_9': doing push from 'tank/docker/volumes' to 'tank/replicas/titan/docker/volumes' of snapshot=None incremental_base=None include_intermediate=None receive_resume_token='1-132c326b1c-f8-789c636064000310a500c4ec50360710e72765a526973030c4b23381d560c8a7a515a79630c001489e0d493ea9b224b518489f586aa883cdfc92fcf4d2cc140686ab8fd50a66b627d57820c97382e5f31273538174625eb67e4a7e72766a917e597e4e696e6ab143626949be6e4a62664ea5ae91819189ae81b1ae9171bc8181ae8101d81e6e0684bf92f3730b8a528b8bf3b3116e050058e6276a' encryption=False
[2024/04/12 01:00:31] INFO     [replication_task__task_9] [zettarepl.transport.ssh_netcat] Automatically chose connect address 'chronus.<redacted>'
[2024/04/12 01:00:32] WARNING  [replication_task__task_9] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2024/04/12 01:00:32] ERROR    [replication_task__task_9] [zettarepl.replication.run] For task 'task_9' non-recoverable replication error ContainsPartiallyCompleteState()

It appears to find the resume token, attempts to use it… fails… tries again… then fails for good.

While it's failing, the progress dialog just sits at 0%.

Any ideas? It seems like this should work… I can probably blow away the partial dataset on the destination, but I figure there is an underlying issue here.

This message is ambiguous, and likely unique to zettarepl:

Specified receive_resume_token, but received an error: contains partially-complete state.

It doesn’t even make sense. The point of a receive_resume_token is to resume from a partially-complete state…
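For reference, outside of zettarepl a manual resume would look roughly like this (a rough sketch only; the hostname and dataset paths below are taken from this thread, and <receive_resume_token> is whatever the destination reports):

# On the source, dry-run the token first to see what it would send:
zfs send -nv -t <receive_resume_token>

# Then actually resume, receiving with -s so a further interruption stays resumable:
zfs send -t <receive_resume_token> | ssh chronus zfs receive -s tank/replicas/titan/docker/volumes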

Can you confirm (while nothing is running on either end) that a token exists on the destination dataset?

zfs get receive_resume_token tank/replicas/titan/docker/volumes

Does it happen to match the same string as found in the error message?

root@chronus[~]# zfs get receive_resume_token tank/replicas/titan/docker/volumes
NAME                                PROPERTY              VALUE                                                                                                                                                                                                                                                                                                                                   SOURCE
tank/replicas/titan/docker/volumes  receive_resume_token  1-132c326b1c-f8-789c636064000310a500c4ec50360710e72765a526973030c4b23381d560c8a7a515a79630c001489e0d493ea9b224b518489f586aa883cdfc92fcf4d2cc140686ab8fd50a66b627d57820c97382e5f31273538174625eb67e4a7e72766a917e597e4e696e6ab143626949be6e4a62664ea5ae91819189ae81b1ae9171bc8181ae8101d81e6e0684bf92f3730b8a528b8bf3b3116e050058e6276a  -
root@chronus[~]# 

And yes, that matches.

Were any snapshots pruned on the source side at any timeframe during the replication process?
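(You can check with something like this on the source, to see whether the incremental base and newer snapshots are still there; dataset path taken from this thread:)

zfs list -t snapshot -o name,creation -s creation tank/docker/volumes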

The source is configured to “save” snapshots that haven't been replicated yet.

This is one for iX then, and possibly a Jira ticket.

Everything looks in order, and would be expected to work as intended.

I would say try the command line, but it might complicate things, since zettarepl operates differently than using straight zfs commands. (For starters, as I understand it, it does a “passing the baton” method, rather than a single stream of oldest → newest snapshot.)
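Roughly, the difference I mean is between one stream covering the whole range and a chain of single-snapshot incrementals (snapshot names below are made up for illustration; host and dataset paths are from this thread):

# One stream, oldest → newest, including all intermediate snapshots:
zfs send -I tank/docker/volumes@auto-daily-2024-04-01_00-00 tank/docker/volumes@auto-daily-2024-04-12_00-00 | ssh chronus zfs receive -s tank/replicas/titan/docker/volumes

# “Passing the baton”: one incremental per adjacent snapshot pair, repeated until caught up:
zfs send -i tank/docker/volumes@auto-daily-2024-04-01_00-00 tank/docker/volumes@auto-daily-2024-04-02_00-00 | ssh chronus zfs receive -s tank/replicas/titan/docker/volumes
zfs send -i tank/docker/volumes@auto-daily-2024-04-02_00-00 tank/docker/volumes@auto-daily-2024-04-03_00-00 | ssh chronus zfs receive -s tank/replicas/titan/docker/volumes
# …and so on for each successive pair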

2 Likes

As far as I understand, zettarepl uses traditional zfs commands.
I have noticed a few major differences in how ZFS handles data and snapshots on the replicated side.
There are 2 options I can think of (rough command-line sketches below):

  1. Destroying the token on the remote.
  2. Rolling back the last snapshot that is giving you trouble on the remote side.
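On the command line, those would look roughly like this, run on the destination while no replication is active (dataset path from this thread; the snapshot name is just an example):

# Option 1: abort the interrupted receive, discarding the partially received state and its token;
# the next replication run then starts that increment over instead of trying to resume it:
zfs receive -A tank/replicas/titan/docker/volumes

# Option 2: roll back to the last snapshot that fully made it across (example name only):
zfs rollback -r tank/replicas/titan/docker/volumes@auto-daily-2024-04-11_00-00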
1 Like

These failures are on the initial replication; I'm not actually sure if the same thing applies after the initial replication completes.

So, I erased the dataset in question on the destination, and now it replicates.

But it got stuck on the next partial. That partial is 1.6 TB so far, so I'm not quite happy to erase it.

I intend to do some more diagnosis to work out exactly when/why it's failing, but right now it feels like a feature (automatic partial replication resume) is bugged.

(I solved the docker dataset issue by deleting the dataset on the destination, but I didn't want to delete the other dataset, which had 1.6 TiB transferred already.)
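(For anyone finding this later, “deleting the dataset on the destination” was essentially the following, which of course throws away everything received so far; path from this thread:)

zfs destroy -r tank/replicas/titan/docker/volumes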

I updated the Dragonfish RC1 system to Release today.

As part of that I upgraded the zpool features on the main pool.

Afterwards, it seems like the replication is now progressing.

It could've been the upgrade to Dragonfish Release 24.04.0, or it could've been the zpool feature upgrade (the source pool had already been upgraded).

You see, Cobia and Dragonfish have the same OpenZFS version. So, when I tested the Dragonfish beta… I didn't upgrade the ZFS pool to the latest features yet… meanwhile, I hadn't upgraded the source ZFS pool since… TrueNAS CORE something or other…

Since the Dragonfish beta was compelling, and Cobia was working well, I upgraded the pool on the source system. On the beta system I did not, because I didn't think I'd be able to downgrade to Cobia if I upgraded the pool to Dragonfish's OpenZFS features.

As it turns out, Dragonfish and Cobia have the same features.
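If anyone wants to verify that on their own pair of systems, comparing something like this on both ends shows which feature flags are enabled (pool name from this thread):

zpool get all tank | grep feature@
# or, to list any pools that still have supported features disabled:
zpool upgrade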

tldr;

I think the issue was that the source pool's ZFS features were upgraded between partial replications, and once the destination pool was also upgraded, the pending replication could complete.

I think.

(Or it could've been that the bookmark was deleted as part of the upgrade…)
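(If it was the bookmark, this on the source should show whether one still exists for the dataset; path from this thread:)

zfs list -t bookmark tank/docker/volumes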

In the whole ball of wax, we still don't know what the root cause of the issue really is/was.
Time will tell, eventually.