Broken replication every couple of years

VulcanRidr · November 19, 2024, 3:56pm

I had a situation happen back in September 2022, where all of my TrueNAS to TrueNAS replications stopped working. This just happened to me again today.

I set up replications from our production NAS boxes to our snapshot/cold storage NAS. Each has it’s own ssh key. However, we also got another NAS which we are using for off-site replication. We brought up the offsite box in the local data center and set up a key in August 2024 on the off-site box, to replicate everything before we sent it off-site. Got all datasets replicated and sent to the new home.

Well, all of the replications, both the ones with the September 2022 keys and the ones with the August 2024 keys stopped working today, with the following error:

2024/11/19 00:30:00] INFO     [Thread-59] [zettarepl.paramiko.replication_task__task_15] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2024/11/19 00:30:00] INFO     [Thread-59] [zettarepl.paramiko.replication_task__task_15] Authentication (publickey) failed.
[2024/11/19 00:30:00] ERROR    [replication_task__task_15] [zettarepl.replication.run] For task 'task_15' non-recoverable replication error ReplicationError('Authentication failed.')

Now to fix, I generate a new ssh key and new ssh connection in System…My question is why this is happening, and why at seemingly random points…

Has anyone seen this behavior?

Thanks,
–vr

kris · November 19, 2024, 5:41pm

I haven’t seen reports of this so far. But some questions to clarify:

What version of TrueNAS on both sides of the send/recv?

What type of key format was created? Did you do that through the UI or manually?

Typically keys don’t expire AFAIK, usually when they stop working its because one end no longer accepts an older key format.

Also, have you tested SSH from the CLI to the box in question and confirmed the key is rejected? Usually the side you are connecting into will have information in /var/log/auth.log somewhere that tells you the exact reason the connection was rejected.

VulcanRidr · November 19, 2024, 6:59pm

All NAS’ are running 13.0u6.2. All keys are RSA, created through the UI.

Generally speaking, I don’t think you can expire an ssh key, you can revoke, but there is not, in my experience, a way to set a date for it to expire.

The only reliable way to fix the problem appears to be to create a new key and a new ssh connection, and then update the replication tasks with the new connection. However, as I stated, this happened to me in 2022 as well.

I can ssh perfectly well into the NAS, but root has it’s own key on each server.

Additional question. in order to get the replication to update, I have to also enable replication from scratch. If you click the (?) box in the nas, it gives the warning that

Warning: enabling this option can cause data loss or excessive data transfer if the replication is misconfigured.

exactly as I found in the docs. What kind of misconfiguration is this referencing?

kris · November 19, 2024, 9:09pm

On the receive side system, can you look at /var/log/auth.log after a failed replication? Any SSH auth errors mentioned here?

VulcanRidr:

Additional question. in order to get the replication to update, I have to also enable replication from scratch. If you click the (?) box in the nas, it gives the warning that
Warning: enabling this option can cause data loss or excessive data transfer if the replication is misconfigured.
exactly as I found in the docs. What kind of misconfiguration is this referencing?

This option basically means it will blow away all snapshots and data on the receive side if there is no common snapshot origin to replicate from incrementally. Typically you don’t enable this, only if you know a system got way out of sync.

Example, you have a 7 day snap retention on the sending side, but your receive system was offline for 9 days and no longer has snaps in common. This option will force the replication to “clean up” the receive side first, before starting from scratch again. But usually you don’t want that happening automatically, since that failure is often a warning indication something is wrong with your replication in general.

VulcanRidr · November 20, 2024, 2:07pm

I ran a manual replication from Tasks → Replication Tasks → Run Now. Absolutely nothing appearing in the auth.log, however, I am attaching the output of middlewared.log and zettarepl.log:

middlewared.log

[2024/11/20 08:38:29] (DEBUG) middlewared.plugins.zettarepl._process_command_queue():226 - Running task <Replication Task 'task_1'>
[2024/11/20 08:38:29] (ERROR) middlewared.job.run():367 - Job <bound method accepts.<locals>.wrap.<locals>.nf of <middlewared.plugins.replication.ReplicationService object at 0x819d9ea60>> failed

zettarepl.log

[2024/11/20 08:38:29] INFO     [MainThread] [zettarepl.scheduler.clock] Interrupted
[2024/11/20 08:38:29] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Replication Task 'task_1'>]
[2024/11/20 08:38:29] INFO     [Thread-170] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2024/11/20 08:38:29] INFO     [Thread-170] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) failed.
[2024/11/20 08:38:29] ERROR    [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' non-recoverable replication error ReplicationError('Authentication failed.')
[2024/11/20 08:38:29] INFO     [Thread-172] [zettarepl.paramiko.retention] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2024/11/20 08:38:29] INFO     [Thread-172] [zettarepl.paramiko.retention] Authentication (publickey) failed.
[2024/11/20 08:38:29] WARNING  [retention] [zettarepl.zettarepl] Local retention failed: error listing snapshots on <SSH Transport(root@seir.example.org)>: AuthenticationException('Authentication failed.')

kris · November 20, 2024, 3:11pm

At this point I’m not seeing anything that looks like an obvious smoking gun. Nor do we have other reports that I’m aware of for this behavior. Go ahead and file a ticket, but be warned, odds are it is already resolved in SCALE and that is going to be the feedback given.

VulcanRidr · November 20, 2024, 4:52pm

Ok, so that begs another question. Where on the filesystem are the ssh keys that are generated by System → SSH Keypairs? Or do they only exist in the middleware?