Best practice - How to use 2nd truenas box as failover?

Hi, we are using TrueNAS (SCALE) at work to store our day-to-day data/files.

  1. Our data are stored in a dataset in the 1st truenas box.
  2. We have four windows computers which access the data via smb in that box.
  3. We have a 2nd truenas box and we replicate the dataset from the 1st to the 2nd truenas box every night.
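For context, the nightly replication task is roughly doing something like the following under the hood (a sketch only — the pool/dataset names `tank/office` and `backup/office`, the host `truenas2`, and the snapshot names are all made up; the TrueNAS GUI task manages this for you):

```shell
# Hypothetical names -- substitute your own pools/hosts.
SRC="tank/office"       # dataset on the 1st box
DST="backup/office"     # replica on the 2nd box

# 1. Take tonight's snapshot on the primary.
zfs snapshot "${SRC}@nightly-2024-06-02"

# 2. Send only the changes since last night's snapshot to the secondary.
zfs send -i "${SRC}@nightly-2024-06-01" "${SRC}@nightly-2024-06-02" \
    | ssh truenas2 zfs recv "${DST}"
```

The key point for the failover question below is that incremental replication only works while both sides share a common snapshot.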

So far so good. Everything works as expected.

My question is what happens in case of failure of the 1st truenas box?
Suppose, 1st truenas fails. We can easily switch the four computers to read from / save to the 2nd truenas box.

But what happens when the 1st truenas box is repaired and gets back online? How are the data(set) then synchronized from the 2nd to the 1st box? What happens to the existing replication task of the 1st box (which replicates data from 1st to 2nd)?

I have had these questions for some time now, and I would like to learn what to do before a failure hits. What do you do in your case? How do you deal with such scenarios? Are you using more sophisticated methods?

Thanks.

Rsync?

Syncthing.

Do what with syncthing? Use it instead of smb? Or instead of replication between the 2 truenas boxes?

At work we have Active Directory, and DFS acts as a global namespace. Both primary and secondary systems are bound to AD, and both have entries in DFS: the primary is priority 1, the secondary priority 2. We take snapshots every hour and send them across. The secondary is in read-only mode by default, which we like, but if the primary were to fall over, all SMB connections mapped through DFS would switch across to the secondary system. In DFS each share/dataset is created as a separate entry. We use AD groups to manage permissions and what people can and can’t see within the namespace.


This. Replication is one-way by design; Syncthing works both ways.

A lot of this comes down to how much data you have on each TN and how long your users can be offline for. My systems can hold over a PB of data and commonly have upwards of 500 TB, so realistically, if one of my primaries fails to the point the pool can’t be saved within a ‘fast’ period of time, I need to have the secondary ready to fail over to; I’m not going to recover over 500 TB to another system within any reasonable time.

Practically, if I promoted my secondary system to primary (changing it from read-only to read/write), I am essentially kissing that old primary goodbye and starting again (as I’m breaking replication); once it is fixed or new hardware is introduced, I start the long, slow process of backing up the secondary (now primary).

With smaller amounts of data you have more options.

Good that you’re thinking about this before/if it ever happens.


If the primary server failed but not its pool, and the secondary pool has been made read-write to serve as failover/new primary, I suppose that the former primary pool could be turned into the secondary and receive incremental replication from the new primary as long as there’s still a common snapshot.
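In ZFS terms, that reverse sync might look something like this (a sketch, not a recipe: dataset names, hostnames, and snapshot names are hypothetical, and in practice you would do this via a reversed replication task in the GUI rather than by hand):

```shell
# On the repaired box (old primary): roll back to the newest snapshot
# that both sides still have in common. -r destroys any later snapshots.
zfs rollback -r tank/office@nightly-2024-06-01

# On the secondary (acting primary): send everything created since that
# common snapshot back to the repaired box. -F on the receive discards
# anything written on the old primary after the common snapshot.
zfs send -I backup/office@nightly-2024-06-01 backup/office@nightly-2024-06-05 \
    | ssh truenas1 zfs recv -F tank/office
```

If no common snapshot survives, the incremental option is gone and a full re-send is the only way back.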


Potentially. You’d need to create the snapshot schedule on the secondary (new primary) and then see if it would happily send back without starting again.

I may have a play with that…


I make no assumption about which server handles what, or where the drives will stay. One may set replication tasks in reverse, or swap drives between primary and secondary.
But setting up new tasks should be a lot faster than replicating 500 TB anew.


:100: If you can get them to play nice again, that’s a much preferred option. You’d have to tread very carefully, however: if your secondary had been promoted and marked read/write and, let’s say, a few days later you managed to get your old primary up and running, you’d have to disable replication on it first, otherwise it would blow away the last few days’ worth of work.


woah, lots of replies, I will look into them more carefully soon. Thanks.

For what it’s worth, we are a small architectural / civil engineering office. We deal mostly with documents (docx, pdf), photos and of course CAD files. All our data amounts to about 450 GB and we don’t expect it to grow quickly. Maybe 10-20 GB per year.

My initial thought when I created this post was about common hardware failures like a PSU, motherboard, etc. In case of pool failure, there is not much to sync when the old primary server comes back online as a secondary anyway.

Ideally, we would like the switch to the secondary server to be instant, taking only as much time as is needed to point the 4 computers at the secondary’s SMB share.

crappy way?

Use local DNS: the 4 computers map the shares via a domain name; if you need to switch to the secondary, update the local DNS.
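As a sketch of what that switch could look like, assuming the local DNS is a BIND-style server that accepts dynamic updates (every hostname and address here is made up; a short TTL helps clients notice the change quickly):

```shell
# Repoint the share's hostname from the primary (192.168.1.10)
# to the secondary (192.168.1.11) on the DNS server at 192.168.1.1.
nsupdate <<'EOF'
server 192.168.1.1
update delete nas.office.lan A
update add nas.office.lan 60 A 192.168.1.11
send
EOF
```

Windows clients may still need an `ipconfig /flushdns` and an SMB reconnect before they follow the new address.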


You can reverse the flow. This is where the option to verify the read-only state of the destination comes in handy: it prevents the old primary from blowing away the secondary (now primary) when it comes back online.
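For reference, at the ZFS level that destination property can be checked and flipped like this (dataset name is hypothetical; in TrueNAS the same thing is driven by the replication task’s read-only policy and the dataset settings):

```shell
# While acting as a replica, the destination should report "on".
zfs get -H -o value readonly backup/office

# On failover, promote it to read/write:
zfs set readonly=off backup/office
```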


Now to automate it!

Set up a script to monitor the primary NAS: if it times out after X minutes, update the local DNS to switch over to the secondary, and send out an email/text/SMS/pager/fax to inform the sysadmin: “hey stupid, we switched over to the secondary NAS - something is very wrong!”

From here, it’s up to your sysadmin workflow how to deal with it.
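A bare-bones sketch of such a watchdog in POSIX shell. Everything here is hypothetical: the probe is a single ping, “X minutes” is roughly `MAX_FAILS * INTERVAL` seconds, and the actual DNS repoint and alert are left as placeholders you would fill in for your own environment.

```shell
#!/bin/sh
# Hypothetical failover watchdog: probe the primary NAS, and after enough
# consecutive failures flip DNS to the secondary and alert the sysadmin.

PRIMARY="truenas1.lan"      # hostname of the primary box (made up)
MAX_FAILS=5                 # consecutive failed probes before failing over
INTERVAL=60                 # seconds between probes

# One probe: success (exit 0) if the primary answers one ping within 2 s.
probe_primary() {
    ping -c 1 -W 2 "$PRIMARY" > /dev/null 2>&1
}

# Pure decision logic, split out so it can be tested without a network:
# prints "failover" once the failure count reaches the threshold, else "ok".
decide() {
    if [ "$1" -ge "$MAX_FAILS" ]; then
        echo "failover"
    else
        echo "ok"
    fi
}

do_failover() {
    # Placeholders -- replace with your own DNS update and alerting, e.g.
    # an nsupdate call or your DNS server's API, plus mail/SMS to the admin.
    echo "switched to secondary NAS"
}

# Main loop (commented out so sourcing this file has no side effects):
# fails=0
# while :; do
#     if probe_primary; then fails=0; else fails=$((fails + 1)); fi
#     [ "$(decide "$fails")" = "failover" ] && { do_failover; break; }
#     sleep "$INTERVAL"
# done
```

Keeping the threshold logic in its own function makes it easy to tune (or unit-test) how tolerant the watchdog is before it pulls the trigger.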


Now I need to figure out why recursive readonly=off isn’t working anymore in SCALE like it did in CORE :man_facepalming:t2:
