Replication failing

I’m running two instances of TrueNAS Community Edition: a relatively high-powered machine (source) running some containers and virtual machines, and a really basic one (target) that I periodically replicate some important data to; or at least, I used to.

I had a replication task set up on the target machine to PULL the data when it’s online; I intend to power the target down whenever it isn’t replicating.
This worked fine for several months, until it suddenly did not. I think the problem started when a drive in the source’s data vdev went offline unexpectedly. I recovered without issues: brought the drive back online, let it resilver, and no data was reported as lost at all.
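
For context, the task is a plain ZFS pull over SSH. As I understand it, it boils down to roughly the following, run from the target; the host, dataset, and snapshot names here are placeholders rather than my actual config:

    # Raw (-w), incremental (-I) send of everything between two snapshots,
    # pulled over SSH and received resumably (-s) into the local backup dataset.
    ssh source-host zfs send -w -I \
        bulk/backups/mydataset@auto-2025-08-01_10-30 \
        bulk/backups/mydataset@auto-2025-08-09_10-30 \
      | zfs receive -s -F backup/mydataset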

However, since then I’ve been getting this error within maybe 20 minutes of starting the replication:

    [EFAULT] resume token contents:
    nvlist version: 0
        object = 0x3
        offset = 0x12d7c0000
        bytes = 0x12e13bf7c
        toguid = 0xe12b27047ecd28d1
        toname = bulk/backups/mydataset@auto-2025-08-09_10-30
        compressok = 1
        rawok = 1
    client_loop: send disconnect: Broken pipe
    cannot receive resume stream: checksum mismatch or incomplete stream.
    Partially received snapshot is saved.
    A resuming stream can be generated on the sending system by running:
        zfs send -t 1-<removed>
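
For reference, the resume the error suggests would, for a pull setup like mine, presumably be run from the target and look roughly like this (token shortened as above, dataset name is a placeholder):

    # The saved partial receive exposes the resume token on the target dataset...
    zfs get -H -o value receive_resume_token backup/mydataset
    # ...which the sending side can use to continue the interrupted stream.
    ssh source-host zfs send -t 1-<removed> | zfs receive -s backup/mydataset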

If I delete the offending snapshot, I just get the same error with another snapshot.
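
By “delete the offending snapshot” I mean roughly the following; I’m assuming the partially received state on the target also has to be thrown away first (names are placeholders):

    # Abort the saved partial receive on the target, discarding its state
    zfs receive -A backup/mydataset
    # Destroy the snapshot the failing stream was sending, on the source
    zfs destroy bulk/backups/mydataset@auto-2025-08-09_10-30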

What I’ve tried:

  • Multiple (5 or 6) scrubs on the source pool (rough CLI equivalents below). It comes up clean every time.
  • Nuking the dataset on the target as well as the replication task, and starting over. Same result.
  • Cleaning up most snapshots on the source. As mentioned, the same problem just occurs with a different snapshot.
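
Roughly the CLI equivalents of the above, with placeholder pool and dataset names:

    # Scrub the source pool and check the result afterwards
    zpool scrub tank
    zpool status -v tank
    # Wipe the replicated dataset on the target before recreating the task
    zfs destroy -r backup/mydataset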

The source machine is still on SCALE 24.04.2.5 because I have some Kubernetes stuff that still needs to be migrated. The target used to run the exact same version, but when I ran out of ideas I upgraded it to a newer one: I tried 24.10.x and am currently running 25.04.2.4.

I don’t think I really have enough spare storage to completely recreate the source pool on temporary disks, so I’ve held off on that for now. If ZFS reports everything to be okay, there should be no reason for me to have to do that, right? :neutral_face: I have no other ideas to try at this point, though. Any suggestions?