OK, so we have two TrueNAS arrays, with truenas1 replicating to truenas2 on a regular basis. I have shares (NFS and SMB) set up on truenas1, and I can see those datasets on truenas2, along with all the snapshots, just in read-only mode. In the event of a true DR scenario, I think the failover process would be:
1. Make sure truenas1 is no longer replicating to truenas2.
2. Edit each replicated dataset on truenas2 and set it to RW (sketched below).
3. Create shares on truenas2 that point to the now-RW datasets.
4. Repoint all users at the shares on truenas2.
Then, in the event that truenas1 is repaired and back online, I assume I will need to destroy all data on it (provided there is still data after the failure), set up replication jobs from truenas2 to truenas1, and repeat the process in a downtime window to fail back. Is this basically the process? Am I missing anything?
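For concreteness, a minimal sketch of what step 2 might look like at the ZFS level, assuming a hypothetical pool named tank and dataset named shares. On TrueNAS you would normally flip this in the web UI, but these are the underlying commands:

```sh
# Optionally roll back to the last replicated snapshot first, if the
# dataset drifted after replication stopped (destructive to newer data):
# zfs rollback -r tank/shares@latest-snap

# Flip the replicated dataset from read-only to read/write.
zfs set readonly=off tank/shares

# Confirm the change took effect.
zfs get readonly tank/shares
```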
I would use a logical IP address rather than repoint clients.
Have a production IP that is an extra IP on truenas1, and if it fails, once you break replication and make truenas2 read/write, move the extra IP from truenas1 to truenas2. Once the clients' ARP caches time out they will automatically hit truenas2.
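As a rough illustration, moving the floating IP by hand on TrueNAS SCALE (Linux) might look like the following, assuming the shared address is 192.168.200.10/24 on interface eth0 (both assumptions; on CORE/FreeBSD you would use an ifconfig alias instead, and in practice the TrueNAS UI should own the address):

```sh
# On truenas1 (if still reachable): release the shared IP.
ip addr del 192.168.200.10/24 dev eth0

# On truenas2: claim the shared IP as a second address on the existing NIC.
ip addr add 192.168.200.10/24 dev eth0

# Send gratuitous ARP replies so clients update their ARP caches right
# away instead of waiting for the cache entries to time out.
arping -c 3 -A -I eth0 192.168.200.10
```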
This part is a bit more complicated. It starts with your hardware and the design of your system. I personally opt for a separate server head and JBOD for precisely this reason. If my server head fails, I can relatively quickly introduce a new head, connect it to the JBOD, upload my config, and we are back in business. If, however, my JBOD fails (which, believe it or not, has been known to happen), I move the 90 drives over to a spare JBOD I have and connect it back to the head. Both of these scenarios are fairly quick to action, with probably no need to make your backup system RW.

I use Microsoft DFS to act as a global namespace and have two entries per share (primary and backup), so in the event the primary vanishes, users auto-redirect to the backup server in RO mode. It's also unlikely that I have lost the pool or any data, but if that day ever comes then, like you say, activate your backup system.

I like to keep my pools confined to a single JBOD so I can move them around a bit like Lego blocks if needed. Things can get messy with all-in-one systems or when your pool spans multiple chassis.
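For the JBOD-swap scenario, the ZFS side of that drive move is essentially an export/import cycle. A rough sketch, again assuming a hypothetical pool named tank (TrueNAS would normally do this via the Export/Disconnect and Import Pool actions in the UI):

```sh
# On the head, before pulling the drives from the failed JBOD:
zpool export tank

# After the drives are cabled into the spare JBOD:
zpool import        # with no arguments, lists pools found on attached disks
zpool import tank   # import the pool under its original name
```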
In this scenario, is the extra IP the one you point your DNS at then? So I add an extra IP to truenas1, say 192.168.200.10, and then truenas1 is bound to 192.168.200.11 and truenas2 is bound to 192.168.200.12. All jobs, replications, and whatnot use the .11/.12 combo, but clients map to shares via .10? Then in a failure, move .10 to truenas2 and remount shares?
Each TN has a "local" address for managing it, and there is one "shared" address used by only one of the two TN at a time.
You want the users to use the shared address (and make sure it is configured on only ONE of the two TN at a time).
But you want things like TN1 to TN2 replication and SSH keys to use the individual IP addresses. You do not want TN2 trying to replicate from TN1 via the shared IP address since that address will be assigned to TN2 during a failure (or even maintenance outage) of TN1.
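To make that concrete with the hypothetical addresses from above (.11 for truenas1, .12 for truenas2): the replication task's SSH connection should be pinned to the per-box IPs. A hand-rolled sketch of roughly what the transport does underneath, with tank/shares again a placeholder dataset name:

```sh
# From truenas1, push an incremental snapshot to truenas2's OWN address
# (192.168.200.12), never the floating 192.168.200.10 -- after a failover
# the floating IP would point the replication job at the wrong box.
zfs send -i tank/shares@snap1 tank/shares@snap2 | \
    ssh root@192.168.200.12 zfs recv tank/shares
```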
Yep… this makes sense. Thanks for the advice, I will implement it this way. As a follow-up, do you configure your SMB service to only listen on the shared IP address, or do you bother with this step at all?
Right, I would leave the SMB service disabled on the DR site. Just thinking that in the event of a DR, I would move the logical IP from the production TN to the DR TN, and then user shares which were mapped to the prod TN would still work on the same IP once it's on the DR TN; nothing would have to change as far as user config or GPO-mapped shares. But I agree with not binding SMB to only that IP; it wouldn't really matter anyway, since if the prod TN is dead it's not listening on that IP anymore.
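If it helps, a quick post-cutover sanity check from any client, assuming the shared 192.168.200.10 from earlier and that smbclient/showmount are installed:

```sh
# List shares anonymously via the floating IP; getting a share listing
# (or an access-denied error rather than a connection refusal) means
# SMB is answering on the shared address.
smbclient -L //192.168.200.10 -N

# Confirm the NFS exports are visible on the same address.
showmount -e 192.168.200.10
```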
One more clarification. When adding this logical IP address, do I just add it to the existing physical interface that has the local IP address on it? So that interface would have two IPs, and then in DR I add the logical IP to the DR TN's existing physical interface?