Replication failed with IO error

Hi,

I’m running TrueNAS 25.04.2.5 on a primary and a backup server. Data is replicated daily to the backup server. We had a network outage that led to the failure of one of the replication tasks, the one serving a particular dataset. Since then, that replication task has stopped working.

I’ve tried deleting the receive_resume_token on the backup server, deleting the snapshot on the backup server that was being replicated during the network outage, and finally deleting the entire dataset from the backup server and starting the replication task from scratch. No luck, unfortunately. With each attempt, some of the 55 TB of data gets transferred before the replication task eventually fails.
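For reference, the token-clearing step looked roughly like this on the backup NAS (a sketch; dataset name as in the logs below):

```shell
# Show the saved resume token, if any, on the receiving dataset:
zfs get -H -o value receive_resume_token StoragePool/ImmutableStorage
# Abort the partially received state; this also discards the token:
zfs receive -A StoragePool/ImmutableStorage
```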

I suspected data corruption on the backup, but whatever corruption happened should have been “fixed” by deleting the dataset entirely. Scrub tasks on the primary don’t report any errors on their end, and the network outage shouldn’t have caused any data corruption on the primary in the first place.
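For completeness, the health checks on the primary looked roughly like this (a sketch; pool name from the logs below):

```shell
# Scrub results plus per-device read/write/checksum error counters:
zpool status -v StoragePool
# Recent low-level ZFS events; transient I/O errors can show up here
# even when they leave no persistent pool errors behind:
zpool events -v | tail -n 50
```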

Below are some logs of the replication failure after the full dataset deletion on the backup.

[2025/11/21 16:15:23] DEBUG    [replication_task__task_7] [zettarepl.transport.local] [shell:1] [async_exec:133548] Running ['zfs', 'get', '-H', '-p', '-t', 'filesystem,volume', 'type', 'StoragePool/ImmutableStorage']
[2025/11/21 16:15:23] DEBUG    [replication_task__task_7] [zettarepl.transport.local] [shell:1] [async_exec:133548] Success: 'StoragePool/ImmutableStorage\ttype\tfilesystem\t-\n'
[2025/11/21 16:15:23] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] Connecting...
[2025/11/21 16:15:24] ERROR    [retention] [zettarepl.replication.task.snapshot_owner] Failed to list snapshots with <Shell(<SSH Transport(root@172.10.70.30)>)>: DatasetDoesNotExistException(1, "cannot open 'StoragePool/ImmutableStorage': dataset does not exist\n"). Assuming remote has no snapshots
[2025/11/21 16:15:24] DEBUG    [Thread-7982] [zettarepl.paramiko.replication_task__task_7] [chan 0] EOF received (0)
[2025/11/21 16:15:24] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133549] Waiting for exit status
[2025/11/21 16:15:24] DEBUG    [Thread-7982] [zettarepl.paramiko.replication_task__task_7] [chan 0] EOF sent (0)
[2025/11/21 16:15:24] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133549] Error 1: "cannot open 'StoragePool/ImmutableStorage': dataset does not exist\n"
[2025/11/21 16:15:24] DEBUG    [replication_task__task_7] [zettarepl.transport.local] [shell:1] [async_exec:133558] Running ['zfs', 'list', '-t', 'filesystem,volume', '-H', '-o', 'name', '-s', 'name', '-r', 'StoragePool/ImmutableStorage']
[2025/11/21 16:15:25] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133562] Error 1: "cannot open 'StoragePool/ImmutableStorage': dataset does not exist\n"
[2025/11/21 16:15:25] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133564] Error 1: "cannot open 'StoragePool/ImmutableStorage/aws': dataset does not exist\n"
[2025/11/21 18:36:08] WARNING  [replication_task__task_7.process] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [replication_process:task_7] Listen side has not terminated within 5 seconds after connect side error
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.process] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133863] Stopping
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.process] [zettarepl.paramiko.replication_task__task_7] [chan 116] EOF sent (116)
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.process] [zettarepl.transport.local] [shell:1] [async_exec:133864] Stopping
[2025/11/21 18:36:08] ERROR    [replication_task__task_7] [zettarepl.replication.run] For task 'task_7' unhandled replication error SshNetcatExecException(ExecException(1, "cannot send 'StoragePool/ImmutableStorage/aws': I/O error\n"), None) @cee:{"TNLOG": {"exception": "Traceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/run.py\", line 181, in run_replication_tasks\n    retry_contains_partially_complete_state(\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/partially_complete_state.py\", line 16, in retry_contains_partially_complete_state\n    return func()\n           ^^^^^^\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/run.py\", line 182, in <lambda>\n    lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,\n            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/run.py\", line 278, in run_replication_task_part\n    run_replication_steps(step_templates, observer)\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/run.py\", line 672, in run_replication_steps\n    replicate_snapshots(step_template, incremental_base, snapshots, include_intermediate, encryption, observer)\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/run.py\", line 713, in replicate_snapshots\n    run_replication_step(step, observer)\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/run.py\", line 793, in run_replication_step\n    ReplicationProcessRunner(process, monitor).run()\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py\", line 33, in run\n    raise self.process_exception\n  File \"/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py\", line 37, in _wait_process\n    self.replication_process.wait()\n  File \"/usr/lib/python3/dist-packages/zettarepl/transport/ssh_netcat.py\", line 
210, in wait\n    raise SshNetcatExecException(connect_exec_error, self.listen_exec_error) from None\nzettarepl.transport.ssh_netcat.SshNetcatExecException: Passive side: cannot send 'StoragePool/ImmutableStorage/aws': I/O error", "type": "PYTHON_EXCEPTION", "time": "2025-11-21 18:36:08.030180"}}
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133863] Stopping
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.local] [shell:1] [async_exec:133864] Stopping
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.async_exec_tee.wait] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133863] Error -1: None
[2025/11/21 18:36:08] INFO     [replication_task__task_7.close_sftp] [zettarepl.paramiko.replication_task__task_7.sftp] [chan 5] sftp session closed.
[2025/11/21 18:36:08] DEBUG    [replication_task__task_7.listen_exec.wait] [zettarepl.transport.base_ssh] [ssh:root@172.10.70.30] [shell:2668] [async_exec:133862] Error -1: ''

I’ve considered intermittent network outages as a cause, but I don’t see any outages on the network monitoring dashboards. Besides, other replication tasks have been working as expected, so it must be something else.

Please help with this, and let me know if you need further information.

Thanks,

Please describe the hardware at each end…

So the error seems to be replicating one specific dataset and not others?

If so, please describe the dataset…

@Captain_Morgan thanks very much for your reply. The pool consists of two zvols of 30 and 15 disks. (I know the layout is horrendous, but it’s the one we’re stuck with.) The dataset size is 55 TB. Please see the screenshots for the dataset options.

Please let me know if you’d like me to share further details.

Full hardware description is always appreciated…

Do you mean zvols… or “vdevs”? Please specify the RAID-Z layout.

Part of any diagnosis is looking for configuration or hardware issues… until other people report the same issue, that is the most likely cause.

Sorry, yes, I meant vdevs. The pool is RAID-Z2 with two vdevs, 30 and 15 disks wide.

Both servers are identical. Here’s the full hardware description:

Server: SuperMicro SSG-6049P-E1CR60H

CPU: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz

Memory: 250GB DDR4

Disks: 45 × 18 TB HDDs

Total usable capacity: 630 TB

Not sure if it’s a bug, but I suspect the issue may be some sort of data corruption that isn’t getting picked up by the scrub.

If you suspect data corruption on the sending side, I think you could confirm your suspicions by doing a zfs send to /dev/null.

Something like:

zfs send -w -L -c -p -V dataset@snapshot > /dev/null

in a tmux session. You could watch the progress using ps ax.
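A sketch of watching that send’s progress, assuming the -V (proctitle) flag above is used:

```shell
# -V makes `zfs send` write a per-second progress report into its
# process title, so the bytes-sent counter shows up in ps output.
# The [z] bracket trick keeps grep from matching its own command line:
ps ax | grep '[z]fs send'
```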

But please check dmesg and tail /var/log/messages on both sides. Check for any log messages around the time the replication failed.
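A sketch of the kind of log check I mean (the grep patterns are just examples; adjust them to your disk naming):

```shell
# Kernel messages that often accompany a failing send/recv:
dmesg -T | grep -iE 'i/o error|blk_update_request|ata[0-9]|nvme' | tail -n 20
# System log entries around the failure timestamp:
grep -i 'error' /var/log/messages | tail -n 50
```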

Your replication seems to run for over two hours before failing. When you resume replication, how long does it take to fail? Does the amount of time until failure vary a lot?

@bacon thanks very much for your response; it was very helpful. I found out that it is indeed something on the network side: zfs send dataset > /dev/null didn’t run into any errors, and restarting the replication from scratch multiple times (after deleting the dataset from the backup NAS) resulted in it failing at different points each time.

Now, the problem is that once the replication fails, it should just pick up from where it left off when I start it again. But in my case, I get the error below and the replication fails to start:

Replication "StoragePool/ImmutableStorage" failed: cannot destroy 'StoragePool/ImmutableStorage/aws': dataset is busy..

I found out that I need to clear the receive_resume_token on the backup NAS for it to work, but as I understand it, deleting the token will restart replication from scratch, which isn’t ideal. Also, the backup NAS doesn’t let me delete the token; I get the same dataset is busy error there too. Is there any way to get rid of the dataset is busy error and resume the replication from where it left off?

You shouldn’t have to clear the receive_resume_token. At least not unless you get an error that the token is no longer valid.

Maybe part of the replication process is still running and is blocking the dataset. I had such an issue recently when a replication was unexpectedly interrupted.

On the receiving side, check the process list. Maybe do something like ps auxf | grep recv to see if the zfs recv command is still active. A restart should also hopefully resolve the dataset is busy error.
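If no zfs recv shows up, a couple of other things can hold a dataset busy; a sketch, run on the backup NAS (dataset path from this thread):

```shell
# Holds on any snapshot of the dataset would block destroying it:
zfs list -H -t snapshot -o name -r StoragePool/ImmutableStorage/aws \
  | xargs -r -n1 zfs holds
# Any process with files open under the mountpoint also counts as "busy":
fuser -vm /mnt/StoragePool/ImmutableStorage/aws
```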

The replication is running as we speak (7 TB out of 55 TB), and I don’t see a zfs recv process on the backup NAS. Is there another way of looking for it?
As for the restart: when I do it, the replication restarts from scratch for some reason.

You can use htop to browse processes. There should be a zfs recv process somewhere, parented under the middlewared (zettarepl) process.

But I have always used the SSH transport and never the netcat-based transport. I don’t know if the process structure is different there.

I have only used replication resume manually on the command line (using zfs send -t ...), and I don’t know whether zettarepl (the replication software TrueNAS uses) properly supports resume in this scenario.
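For reference, a manual resume from the command line looks roughly like this (a sketch, not what zettarepl does internally; host and dataset names taken from this thread, and it assumes the token on the backup is still valid):

```shell
# On the primary: fetch the resume token from the backup...
TOKEN=$(ssh root@172.10.70.30 zfs get -H -o value receive_resume_token \
    StoragePool/ImmutableStorage/aws)
# ...then resume the send from that token. -s on the receive keeps the
# stream resumable again if this attempt is also interrupted:
zfs send -t "$TOKEN" | ssh root@172.10.70.30 \
    zfs receive -s StoragePool/ImmutableStorage/aws
```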