Insight into how DNS works in SCALE

Are the TrueNAS DNS servers, as specified in the Nameservers section of the Network tab in SCALE, tried sequentially or in a random / round-robin manner?

I have a setup where I’m running a Samba DC inside a jail, which is also acting as the DNS server for my customer’s network. Last Friday, during a reboot, the jail did not come up (I haven’t been able to diagnose the issue, yet), and the whole network lost Internet connectivity on account of not being able to resolve names.

Ideally, I would like to have additional, fallback DNS servers, but in the past having both the AD DNS server (listed first) and another, public DNS server, caused issues during domain joins, and, IIRC, some reboots.

Any suggestions?

P.S.: I’m putting together a quick post init script to check for the jail status, then start it if stopped, and afterwards test name resolution and restart the jail if there are issues. Not a satisfactory solution, but if it works…
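A minimal sketch of such a watchdog, in Python rather than shell, and not the script actually used in this thread. It assumes that `jlmkr` is on the PATH and that `jlmkr start <jail_name>` is harmless when the jail is already running; the jail name `dc1` and the probe name `dc1.example.lan` used in the note below are hypothetical placeholders. It simply tests name resolution and tries to start the jail when the test fails.

```python
import socket
import subprocess
import time
from typing import Callable


def can_resolve(name: str) -> bool:
    """Return True if the system resolver can resolve `name`."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def ensure_jail_dns(
    jail: str,
    probe_name: str,
    resolve: Callable[[str], bool] = can_resolve,
    run: Callable[..., object] = subprocess.run,
    attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to start `jail` until `probe_name` resolves, up to `attempts` times.

    `resolve`, `run` and `sleep` are injectable so the logic can be
    exercised without touching a real jail or a real DNS server.
    """
    for _ in range(attempts):
        if resolve(probe_name):
            return True
        # Assumption: starting the jail is enough to bring DNS back.
        run(["jlmkr", "start", jail], check=False)
        # Give the jail and its DNS service time to come up.
        sleep(wait_seconds)
    # One last check after the final start attempt.
    return resolve(probe_name)
```

From a post-init task this could be called as `ensure_jail_dns("dc1", "dc1.example.lan")`, after an initial delay that gives `jlmkr.py startup` a chance to finish on its own. The probe name should be one that only the jail's DNS can answer, otherwise a fallback server would mask the failure.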

I expect it to be the former option.
Tried in order from top to bottom.

More advanced failover typically requires a server side solution, keepalived or equivalent.

At first I expected that to be the case, but the issues joining a domain happen when there are both an internal, AD-authoritative and an external one (like 8.8.8.8), even if the internal is listed first.

It is way more reliable when the AD DNS server is the only one listed.

It could be a coincidence - maybe the first query, to the internal one, times out too quickly - but I have had enough instances of this causing problems to be leery of adding a non-AD DNS server to my list when TN is joined to a domain.

I don’t know the best practices for AD management but my gut feeling is that having a non-AD DNS like that will continue to cause an endless stream of problems.

This is correct. It’s basically an unsupportable configuration.

In addition to the correct response about not listing an internal and external DNS server from @awalkerix, another thing to remember is that if the first DNS server returns a response, that’s the end of it.

So, if the first DNS server returns NXDOMAIN (meaning the name does not exist, as far as that server knows), the client will take that as the truth, even if the result is in error for some reason. Servers after the first are only contacted if the first doesn’t reply at all.
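The behavior described above can be modeled with a small simulation. This is only a sketch of the policy, assuming the stock in-order resolver behavior (no `options rotate` in resolv.conf); it does not send real DNS queries, and the addresses `10.0.0.2` and `203.0.113.10` and the names under `example.lan` are hypothetical.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

NXDOMAIN = "NXDOMAIN"


class NoReply(Exception):
    """Raised when a server does not answer at all (timeout, host down)."""


def resolve_in_order(
    name: str,
    servers: Sequence[str],
    query: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Ask each server in list order; the first reply of any kind wins."""
    for server in servers:
        try:
            # NXDOMAIN is a reply too, so it ends the search.
            return server, query(server, name)
        except NoReply:
            # Only silence moves us on to the next server.
            continue
    raise NoReply(f"no server answered for {name}")


def make_fake_query(down: FrozenSet[str] = frozenset()) -> Callable[[str, str], str]:
    """Build a fake query function; servers listed in `down` never reply."""
    zones = {
        # Internal AD DNS: knows the domain and forwards external names.
        "10.0.0.2": {
            "dc1.ad.example.lan": "10.0.0.2",
            "example.com": "203.0.113.10",
        },
        # Public DNS: knows nothing about the internal domain.
        "8.8.8.8": {"example.com": "203.0.113.10"},
    }

    def query(server: str, name: str) -> str:
        if server in down:
            raise NoReply(server)
        return zones[server].get(name, NXDOMAIN)

    return query
```

With the internal server up, `resolve_in_order("dc1.ad.example.lan", ["10.0.0.2", "8.8.8.8"], make_fake_query())` is answered by `10.0.0.2`. With `down=frozenset({"10.0.0.2"})`, the same lookup falls through to `8.8.8.8` and comes back as NXDOMAIN, while `example.com` still resolves: Internet names keep working, AD lookups do not.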

I understand that but would still like to know how the listed DNS servers are queried, when there is more than one - sequentially, randomly or in a round-robin fashion?

I can shell script around my problem, so chalk it up to mere curiosity.

P.S.: I had just posted before reading @nabsltd’s response - thanks! If I understood this correctly, using a second, external DNS server would let me keep name resolution working, but not AD operations, if the first, AD-aware, internal DNS server went down for whatever reason?

My reading of nabsltd’s reply is that he agrees with awalkerix: you specifically should not mix internal and external DNS servers like that.

Microsoft must have some form of HA functionality attached to their AD solution; that ought to be a good start if you’re using their server software for DNS.

I’m not using Microsoft’s DNS server solution, exactly because I’d like to get rid of Windows Server.

Let me explain my setup better:

My AD domain is Samba-hosted, with a jail running a Samba Domain Controller (TN itself cannot function as a DC). This jail is set up via jailmaker on the TrueNAS SCALE box itself and uses bridged networking to allow it to operate on a separate IP address. The TN SCALE box, in turn, is joined to the domain hosted at the jail.

AD-joined workstations need a working DNS with the proper domain entries to be able to function correctly. This DNS server runs alongside the Samba DC (it is a part of the Samba suite). For this reason my router’s DHCP server leases have their DNS pointing to the IP of the Samba jail - when started, this DNS can resolve both internal and external names (via forwarders set up in smb.conf).

This setup works, and the domain can be managed with Microsoft’s own administration tools, including AD Users and Groups, Group Policies, logon scripts, etc., exactly as if it was hosted on a Windows Server machine.

However, if the Samba jail fails to start, then, along with AD issues, there is also no external connectivity for the workstations, as their sole DNS server (normally running inside the jail) is down. Users get upset very quickly when their web access isn’t working…

It is worth noting that the same thing would happen if you were running a single Windows server with no DNS server redundancy (as was the case, for example, with the old Small Business Server product that MS used to sell, licensed for a single physical server)

What I’m trying to investigate is the possibility of having an additional, fallback DNS server to allow at least Internet access while the issue of the non-functional jail is being worked on.

Hope this clears some of the confusion - English is not my first language, so my posts can be somewhat stilted. Sorry for that.

Is it possible to run two instances of your samba controller?

Obviously, the real issue is the jail failing to start. Perhaps “jailmaker startup” timed out, or maybe there was a race condition on the bridge startup?

It could be. Just the day before, I had upgraded this box to 24.04.1 without incident and everything came up all right, except for a single (expected) alert about DNS problems locating the AD domain. The DC/DNS jail of course wasn’t up yet that soon after the reboot, and these alerts can be safely disregarded, as they clear themselves after a couple of minutes. My problem was with 24.04.1.1 on this particular server: I have other TN servers with bridged networking jails, and no issues on those.

Having my DHCP server configured with an additional external DNS server did work - of course, as long as the Samba DC jail is down, AD domain access is unavailable, but at least my users are able to browse their Facebook and TikTok accounts :clown_face:

As for the jails themselves, I’m not sure why they did not start on boot, even though I have the correct

jlmkr.py startup

command set, and this particular machine has been rebooted uneventfully a few times in the past…

I have put an additional mechanism in place to force start my jails manually a couple minutes after TrueNAS boots. Hopefully this will solve my immediate concern, that of losing administrative access to the server.

jlmkr startup will start the jails configured to auto-start.

Check the config file for the jail or use the jlmkr command to make it autostart.

The relevant jails are set to auto-start, and they had been auto-starting over the prior few reboots. Something happened during the last one, and none of them came up.

It was probably a one-off thing. After putting in place an emergency fallback script to start them manually a few minutes after boot, if needed, I did a couple of reboots, and it wasn’t needed.

Ah! This important clue was not in your initial post:
“And they had been auto-starting over the prior few reboots”

Guess you sort of did with this, but clarity will always get you further sooner:
“and this particular machine has been rebooted uneventfully a few times in the past”

If “jlmkr.py startup” does not auto-start all jails with startup=1 in their config file,
the issue is jlmkr itself (guessing that jlmkr start <jail_name> works?).

Errors, messages, logs, journalctl?

Thank you for your input, but the error was a one-off thing. I have several other boxes that auto-start their jails flawlessly, and the jails in this one did auto-start today. I’m quite satisfied with the reliability of jlmkr startup, but still it is nice to have a fallback script.

Anyway, this was not originally a thread about jailmaker, and I have marked the original topic as solved.