Replication failure with partially-complete state. Is zfs receive -A safe to use?

TN ElectricEel-24.10.2

  • Running as VM under Proxmox on TN Mini X+ 3.0
  • Replication between two identical TN systems

Another replication failure due to a partially-complete state. See also:
Ref 1
Ref 2

As one of the referenced posts points out, it does seem strange that the resume process doesn’t work with a partially-complete state, since the whole point of resuming is to recover from exactly that state!
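
For context, this is my understanding of how a resumable receive is normally continued manually (it is what zettarepl attempts automatically; the dataset name is from my setup, <token-from-above> and <target-host> are placeholders, and the whole thing is only a sketch):

# On the target: read the resume token saved by the interrupted receive
zfs get -H -o value receive_resume_token Pool_00/Dset_BackupArchive/Dataset_PveBackup

# On the source: resume the send with that token and pipe it into a resumable receive on the target
zfs send -t <token-from-above> | ssh <target-host> zfs receive -s Pool_00/Dset_BackupArchive/Dataset_PveBackup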

That aside, I found a ZFS tip online indicating that I could use the following command to clear the partially-complete state and allow restarting the replication task from scratch:

zfs receive -A filesystem|volume

Is this command safe to use? Is there anything I can do in the TrueNAS GUI to achieve the same result? I would really prefer not to blow away and re-create my target dataset, as it’s supposed to be my long-lived archive dataset and already holds the results of previous replications.
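
For reference, this is roughly how I imagine the check-then-abort sequence would look on the target system (dataset name from my setup; I have not run the -A step yet, pending confirmation that it is safe):

# Confirm the destination really holds partially-received state (a token is shown if so)
zfs get receive_resume_token Pool_00/Dset_BackupArchive/Dataset_PveBackup

# Abort the interrupted receive and discard only the partially-received state
zfs receive -A Pool_00/Dset_BackupArchive/Dataset_PveBackup

My reading of the zfs-receive man page is that -A deletes the saved partial state from an interrupted "zfs receive -s" and should leave previously completed snapshots on the target untouched, but I’d appreciate confirmation.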

Further background info:

I noted that the log contains 60 repeated WARNING-level messages about the partially-complete state before finally ending with an ERROR-level message when the replication fails. Last WARNING and final ERROR messages:

[2025/02/16 18:40:34] WARNING  [replication_task__task_2] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2025/02/16 18:41:34] INFO     [replication_task__task_2] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2025/02/16 18:41:34] INFO     [replication_task__task_2] [zettarepl.replication.run] Resuming replication for destination dataset 'Pool_00/Dset_BackupArchive/Dataset_PveBackup'
[2025/02/16 18:41:34] INFO     [replication_task__task_2] [zettarepl.replication.run] For replication task 'task_2': doing push from 'Pool_00/Dataset_PveBackup' to 'Pool_00/Dset_BackupArchive/Dataset_PveBackup' of snapshot=None incremental_base=None include_intermediate=None receive_resume_token='1-1443f22cb7-138-789c636064000310a501c49c50360710a715e5e7a69766a630408176fffa3d2c6bbffe5600b2d991d4e52765a52697303054f342d461c8a7a515a79630c001489e0d493ea9b224b518486fb00b3739c08ca9bf241fe28a67d7d89cbe33bbaff64092e704cbe725e6a6323004e4e7e7c41b18e8bb249624022d8c0f284b754a4cce2e2d70482c2dc9d735323032d53530d235348b07922646607bb81910fe4fcecf2d284a2d2ececf86884940dd09932f4a2c8749310000fbab31aa' encryption=False
[2025/02/16 18:41:35] WARNING  [replication_task__task_2] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2025/02/16 18:41:35] ERROR    [replication_task__task_2] [zettarepl.replication.run] For task 'task_2' non-recoverable replication error ContainsPartiallyCompleteState()
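
(For the record, I got the count of repeated warnings with something along these lines on the source system; the log path is my guess for SCALE:)

# Count how many times the partially-complete warning repeats (log path is an assumption)
grep -c "partially-complete state" /var/log/zettarepl.log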

It seems the partially-complete state may have been caused by an earlier replication failure due to a “dataset is busy” error on the target:

[2025/02/16 04:42:01] INFO     [replication_task__task_2] [zettarepl.replication.run] For replication task 'task_2': doing push from 'Pool_00/Dataset_PveBackup' to 'Pool_00/Dset_BackupArchive/Dataset_PveBackup' of snapshot='auto-2025-02-16_02-42' incremental_base='auto-2025-01-03_22-42' include_intermediate=False receive_resume_token=None encryption=False
[2025/02/16 16:24:38] ERROR    [replication_task__task_2] [zettarepl.replication.run] For task 'task_2' unhandled replication error ExecException(1, 'cannot receive incremental stream: dataset is busy\n')
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
    retry_contains_partially_complete_state(
  File "/usr/lib/python3/dist-packages/zettarepl/replication/partially_complete_state.py", line 16, in retry_contains_partially_complete_state
    return func()
           ^^^^^^
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 182, in <lambda>
    lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 278, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 672, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, include_intermediate, encryption, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 713, in replicate_snapshots
    run_replication_step(step, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 793, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 33, in run
    raise self.process_exception
  File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
    self.replication_process.wait()
  File "/usr/lib/python3/dist-packages/zettarepl/transport/ssh.py", line 167, in wait
    stdout = self.async_exec.wait()
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/zettarepl/transport/async_exec_tee.py", line 104, in wait
    raise ExecException(exit_event.returncode, self.output)
zettarepl.transport.interface.ExecException: cannot receive incremental stream: dataset is busy
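
In case it helps, these are the checks I could think of running on the target to see what might be holding the dataset busy (the saved partial receive state itself may be the culprit; the commands are only a sketch):

# Is a zfs receive still running against the target?
ps ax | grep -E "zfs (receive|recv)"

# Is there saved partial receive state on the destination dataset?
zfs get receive_resume_token Pool_00/Dset_BackupArchive/Dataset_PveBackup

# Is the dataset mounted somewhere that could be holding it open?
zfs get mounted,mountpoint Pool_00/Dset_BackupArchive/Dataset_PveBackup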

While investigating this on the target system, selecting the target dataset in the GUI produced an error:
“Failed retreiving GROUP quotas”. Expanding that error revealed another “dataset is busy” message:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/dataset_quota.py", line 76, in get_quota
    with libzfs.ZFS() as zfs:
  File "libzfs.pyx", line 534, in libzfs.ZFS.__exit__
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/dataset_quota.py", line 78, in get_quota
    quotas = resource.userspace(quota_props)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libzfs.pyx", line 3800, in libzfs.ZFSResource.userspace
libzfs.ZFSException: cannot get used/quota for Pool_00/Dset_BackupArchive/Dataset_PveBackup: dataset is busy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 112, in main_worker
    res = MIDDLEWARE._run(*call_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 46, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 34, in _call
    with Client(f'ws+unix://{MIDDLEWARE_RUN_DIR}/middlewared-internal.sock', py_exceptions=True) as c:
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 40, in _call
    return methodobj(*params)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs_/dataset_quota.py", line 80, in get_quota
    raise CallError(f'Failed retreiving {quota_type} quotas for {ds}')
middlewared.service_exception.CallError: [EFAULT] Failed retreiving GROUP quotas for Pool_00/Dset_BackupArchive/Dataset_PveBackup
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 211, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1529, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1460, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool_/dataset_quota.py", line 48, in get_quota
    quota_list = await self.middleware.call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1629, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1468, in _call
    return await self._call_worker(name, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1474, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1380, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1364, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
middlewared.service_exception.CallError: [EFAULT] Failed retreiving GROUP quotas for Pool_00/Dset_BackupArchive/Dataset_PveBackup
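
If it’s useful, I assume the equivalent query can be reproduced directly from the shell on the target to see whether it fails outside the GUI as well (just my guess at the underlying lookup):

# Roughly the same group-space/quota lookup the middleware performs
zfs groupspace Pool_00/Dset_BackupArchive/Dataset_PveBackup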

Replying to my own post to add:

I was able to clear the error loading group quotas (and, I think, the “dataset is busy” error) by restarting the TN VM, but I still have the partially-complete state issue.
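
Before deciding on zfs receive -A, I intend to re-check whether the partial receive state survived the reboot (sketch; a value other than “-” would mean it is still there):

# Shows the saved resume token, or "-" if no partial state remains
zfs get -H -o value receive_resume_token Pool_00/Dset_BackupArchive/Dataset_PveBackup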