My suspicion is that this is to do with the fact that we want to keep affinity between the client IP and a backend server (which OP mentions in their blog). And the question is "do you break that affinity if the backend server goes down?" But I'll reply to my own comment when I know more.
Please remember to include a TTL so I know how long I can cache that answer.
We are going to make the change. This will improve our free accounts so it's a win for everyone. Thanks to OP for writing this up!
PS Thanks for writing this up. Glad we were able to change this behaviour for everyone.
Thanks for bringing it to the Free accounts, great outcome!
The new solution for load balancing seems to be the new HTTPS and SVCB DNS records. As I understand it, they are standardized by people wanting to add extra parameters to the DNS in order to jump-start the TLS1.3 handshake, thereby making fewer roundtrips. (The SVCB record type is the same as HTTPS, but generalized like SRV.) The HTTPS and SVCB DNS record types both have the priority parameter from the SRV and MX record types, but HTTPS/SVCB lack the weight parameter from SRV. The standards have been published, and support seems to have been implemented in some browsers, but not all have enabled it. We will see what browsers will actually do in the near future.
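A minimal sketch of how a client might order HTTPS/SVCB targets, assuming the records have already been parsed into priority/target pairs. `ServiceRecord` and `order_targets` are hypothetical names for illustration, not any browser's implementation. Since there is no weight field, targets sharing a priority are simply shuffled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    # Hypothetical, already-parsed view of one HTTPS/SVCB record.
    priority: int  # lower value is tried first; 0 marks an alias-mode record
    target: str


def order_targets(records, rng=random):
    """Return targets in the order a client might try them."""
    # Alias-mode records (priority 0) redirect the lookup rather than name
    # an endpoint, so this sketch leaves them out.
    by_priority = {}
    for record in records:
        if record.priority > 0:
            by_priority.setdefault(record.priority, []).append(record.target)
    ordered = []
    for priority in sorted(by_priority):
        group = by_priority[priority]
        rng.shuffle(group)  # no weight field, so equal priorities are shuffled
        ordered.extend(group)
    return ordered


records = [
    ServiceRecord(2, "backup.example.net."),
    ServiceRecord(1, "a.example.net."),
    ServiceRecord(1, "b.example.net."),
]
print(order_targets(records))
```

The two priority-1 targets come out in random order and the priority-2 target always comes last, which gives failover ordering but no way to send, say, 60% of traffic to one target.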
The other big advantage of the HTTPS record is that it allows for proper CNAME-like delegation at the domain apex, rather than requiring CNAME flattening hacks that can cause routing issues on CDNs which use GeoDNS in addition to or instead of anycast. If you've ever seen a platform recommend using a www subdomain instead of an apex domain, that's why, and it's part of why Akamai pushed for HTTPS records to be standardized since they use GeoDNS.
However, using MX-style records safely can be tricky if you can’t rely on DNSSEC.
Golang HTTP2 clients will reuse the first server they can connect to over and over and the DNS is never re-resolved. This can lead to issues where clients will not discover new servers which are added to the pool.
A particularly pathological case is when all serving backends go down: the clients will all pin to the first serving backend which comes up and they will not move off. As other servers come up, few clients will connect to them since they are already connected to the first server which came back.
A similar issue happens with grpc-go. The grpc DNS resolver will only re-resolve when the connection to a backend is broken. Similarly grpc clients can all gang onto a host and never move off. There are suggestions that on the server side you can set `MAX_CONNECTION_AGE` which will periodically disconnect clients after a while which causes the client to re-resolve the DNS.
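`MAX_CONNECTION_AGE` is a server-side setting, but the same idea can be sketched on the client side: recycle the connection after a maximum age so that a fresh DNS lookup happens. This is an illustration only, not how grpc-go or Go's HTTP2 transport work; `AgedConnection`, `resolve` and `connect` are hypothetical names.

```python
import time


class AgedConnection:
    """Reconnect (and re-resolve) once a connection is older than max_age."""

    def __init__(self, resolve, connect, max_age, clock=time.monotonic):
        self._resolve = resolve    # callable returning a list of addresses
        self._connect = connect    # callable taking an address, returning a connection
        self._max_age = max_age    # seconds before the connection is recycled
        self._clock = clock
        self._conn = None
        self._opened_at = None

    def get(self):
        now = self._clock()
        expired = self._conn is not None and now - self._opened_at >= self._max_age
        if self._conn is None or expired:
            # Fresh lookup on every reconnect, so newly added servers are seen.
            # A real client would also close the old connection here.
            addresses = self._resolve()
            self._conn = self._connect(addresses[0])
            self._opened_at = now
        return self._conn
```

Adding some random jitter to `max_age` would keep a fleet of clients from all reconnecting at the same moment, which matters in the "everyone pinned to the first server back up" case.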
I really wish there was a better standard solution for service discovery. I guess the best you can do is implement a request based load balancer with a virtual IP and have the load balancer perform health checks. But you are still kicking the can down the road as you are just pushing down the problem to the system which implements virtual IPs. I guess you assume that the routing system is relatively static compared to the backends and that is where the benefits come in.
I'm curious how do people do this on bare metal? I know AWS/GCP/etc... have their internal load balancers, but I am kind of curious what the secret sauce is to doing this. Maybe suggestions on blog posts or white papers?
I’m not a DNS expert but shouldn’t it re-resolve when the TTL expires?
If I’m reading the code right, round trips (HTTP requests) go through queueForIdleConn, which picks up any pre-existing connections to a host. The only time these connections are cleaned up (in HTTP2) is if keepalives are turned off and the connection has been idle for too long OR the connection breaks in some way OR the max number of connections is hit and LRU cache evictions take place.
Furthermore, the golang dnsclient doesn’t even expose record TTLs to callers so how could the HTTP2 transport know when an entry is stale? https://github.com/golang/go/blob/master/src/net/dnsclient_u...
Querying DNS can be expensive, so it makes sense to build a cache to avoid querying again when you don't need to, but typical APIs for name resolution such as gethostbyname / getaddrinfo don't return the TTL, so people just assume forever is a good TTL. Especially for a persistent (http) connection, it kind of makes sense to never query DNS again while you already have a working connection that you made with that name, and if it's TLS, it's quite possible that you don't check if the certificate has expired while you're connected or if you do a session resumption.
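To make the missing-TTL problem concrete, here is a sketch of a cache wrapped around Python's `socket.getaddrinfo`, which likewise returns addresses but no TTL. The 60 second lifetime is a guess by the caller, not the record's real TTL, and `AssumedTtlCache` is a hypothetical name.

```python
import socket
import time


class AssumedTtlCache:
    """Cache lookups for a fixed, assumed lifetime.

    socket.getaddrinfo() returns addresses but no TTL, so the lifetime
    here is chosen by the caller and may be wrong in either direction.
    """

    def __init__(self, assumed_ttl=60.0, lookup=socket.getaddrinfo, clock=time.monotonic):
        self._assumed_ttl = assumed_ttl
        self._lookup = lookup
        self._clock = clock
        self._entries = {}  # (host, port) -> (expires_at, result)

    def resolve(self, host, port):
        key = (host, port)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now < cached[0]:
            return cached[1]
        result = self._lookup(host, port)
        self._entries[key] = (now + self._assumed_ttl, result)
        return result
```

Setting `assumed_ttl` to infinity reproduces the "forever is a good TTL" behavior described above.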
But innocent things like this add up to make operating services tricky. Many times, if you start refusing connections, clients figure it out, but sometimes the caches still don't get cleared.
Oh wow I didn’t know this but I looked it up and you’re right. Interesting.
Your machine -> Local router -> Configured upstream DNS Server (ISP/CF/Quad8/etc) -> ? -> Authoritative DNS Server
Any one of those layers can override/mess with/cache in a variety of ways including TTL. This is why Cloudflare and a variety of other providers use IP anycast. They accepted DNS for what it is and worked around it.
Not only is the IP always the IP, the "global" BGP routing table actually universally and consistently updates much faster than DNS. Then whatever routers, machines, etc downstream from that don't matter.
I would need to re-read the code to refresh my memory.
also, java historically had -1 ttl (i.e. infinite) by default. causing a lot of headaches with ephemeral/container services.
> service nginx stop
But that's not how you should test this. A client will see the connection being refused, and go on to the next IP. But in practice, a server may not respond at all, or accept the connection and then go silent.
Now you're dependent on client timeouts, and round robin DNS will suddenly look a whole lot less attractive to increase reliability.
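A rough sketch of what "dependent on client timeouts" means in practice, assuming the client already has a list of (host, port) pairs. `connect_any` is a hypothetical helper and the 3 second default is an arbitrary choice.

```python
import socket


def connect_any(addresses, timeout=3.0, connect=socket.create_connection):
    """Try each (host, port) in order and return the first connection that opens."""
    errors = []
    for address in addresses:
        try:
            # A refused connection fails fast; a silent server costs the full timeout.
            return connect(address, timeout=timeout)
        except OSError as exc:
            errors.append((address, exc))
    raise OSError(f"all {len(addresses)} addresses failed: {errors}")
```

With N addresses that all go silent, the caller waits roughly N times the timeout before giving up. This only covers connection setup; a server that accepts the connection and then goes silent needs a separate read timeout.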
This is the nasty key point. The reliability is decided client-side.
For example, systemd-resolved at times enacted maximum technical correctness by always returning the lowest IP address. After all, DNS-RR is not well-defined, so always returning the lowest IPs is not wrong. It got changed after some riots, but as far as I know, Debian 11 is stuck with that behavior, or was for a long time.
Or, I deal with many applications with shitty or no retry behavior. They go "Oh no, I have one connection refused, gotta cancel everything, shutdown, never try again". So now 20% - 30% of all requests die in a fire.
It's an acceptable solution if you have nothing else. As the article notices, if you have quality HTTP clients with a few retries configured on them (like browsers), DNS-RR is fine to find an actual load balancer with health checks and everything, which can provide a 100% success rate.
But DNS-RR is no loadbalancer and loadbalancers are better.
There were definitely some warts in that system but as those sorts of systems go it was fast, easy to introspect, and relatively bulletproof.
It also puts failover in those same hands. If one of your regions goes down, do you want the traffic to spread evenly to your other regions? Or pile on to the next nearest neighbor? If you care what happens, then you want to retain control of your traffic management and not cede it to others.
I'd argue it isn't acceptable at all in this day and age and that there are other solutions one should pick today long before you get to the "nothing else" choice.
You also need to find yourself some IP ranges. And learn BGP and find providers where you can use it.
DNS round robin works as long as you can manage to find two boxes to run your stuff on, and it scales pretty high too. When I was at WhatsApp, we used DNS round robin until we moved into Facebook's hosting where it was infeasible due to servers not having public addresses. Yes, mostly not browsers, but not completely browserless.
We're talking about today.
The reason why I said Anycast is cause the vast majority of people trying to solve the need for having multiple servers in multiple locations, will just use CF or any one of the various anycast based CDN providers available today.
We didn't have many outages due to DNS, because we had fallback ips to contact chat in our clients. Usage was down in the 24 hours after our domain was briefly hijacked (thanks Network Solutions), and I think we lost some usage when our DNS provider was DDoSed by 'angry gamers'. But when FB broke most of their load balancers, that was a much bigger outage. BGP based outages broke everything, DNS and load balancers, so no wins there.
Exactly! When you control the client, you don't even need DNS. Things are actually even more secure when you don't use it, nothing to DDoS or hijack. When FB broke one set of LB's, the clients should have just routed to another set of LB's, by IP.
The servers behind it were fine, if you could get to one. You could push broken DNS responses, I suppose, but it's harder than breaking a load balancer.
To [hesitantly] clarify a pedantry regarding "DNS automatic offline detection":
Out of the box, RR-DNS is only good for load balancing.
Nothing automatic happens on the availability state detection front unless you build smarts into the client. TFA introduction does sort of mention this, but it took me several re-reads of the intro to get their meaning (which to be fair could be a PEBKAC). Then I read the rest of TFA, which is all about the smarts.
If the 1/N server record selected by your browser ends up being unavailable, no automatic recovery / retry occurs at the protocol level.
p.s. "Related fun": Don't forget about Java's DNS TTL [1] and `.equals()' [2] behaviors.
[1] https://stackoverflow.com/questions/1256556/how-to-make-java...
[2] https://news.ycombinator.com/item?id=21765788 (5y ago, 168 comments)
On average, does this really matter/make sense?
Personally, my default for names that are likely to change often is 5 minutes, but 1 minute is ok, but might drive a lot more DNS traffic.
However, you should understand that not ALL clients will respect those TTLs. Some resolvers enforce a minimum TTL threshold, where IF TTL < Threshold, TTL == Threshold. This is common with some ISPs, and there may also be cases where browsers and operating systems ignore TTLs or fudge them.
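The threshold rule above, written out as a tiny sketch. `effective_ttl` is a hypothetical name and the 300 second floor is just an example value.

```python
def effective_ttl(record_ttl, min_ttl=300):
    """TTL (in seconds) that a resolver with a minimum-TTL floor would apply."""
    # IF TTL < threshold, TTL == threshold
    return max(record_ttl, min_ttl)


print(effective_ttl(30))    # 300: the 30 second TTL is raised to the floor
print(effective_ttl(3600))  # 3600: already above the floor, left alone
```

So a 30 second TTL set by the operator can turn into five minutes of stale answers for clients behind such a resolver.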
> "It's an amazingly simple and elegant solution that avoids using Load Balancers."
When a server is down, you have a globally distributed / cached IP address that you can't prevent people from hitting. https://www.cloudflare.com/learning/dns/glossary/round-robin...
Load balancing isn't without cost, and load balancers subtly (or unsubtly) messing up connections is an issue. I've also used providers where their load balancers had worse availability than our hosts.
If you control the clients, it's reasonable to call the platform DNS API to get a list of IPs and shuffle and iterate through in an appropriate way. Even better if you have a few stably allocated IPs you can distribute in client binaries for when DNS is broken; but DNS is often not broken and it's nice to use for operational changes without having to push new configuration/binaries every time you update the cluster.
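A sketch of the "resolve, shuffle, fall back to baked-in IPs" approach, assuming a Python client. `FALLBACK_IPS` and `candidate_ips` are hypothetical names, and the addresses are documentation-range placeholders.

```python
import random
import socket

# Hypothetical addresses baked into the client binary for when DNS is broken.
FALLBACK_IPS = ["192.0.2.10", "192.0.2.11"]


def candidate_ips(hostname, port, fallback=FALLBACK_IPS, lookup=socket.getaddrinfo, rng=random):
    """Return shuffled DNS answers, followed by any fallback IPs not already listed."""
    try:
        infos = lookup(hostname, port, type=socket.SOCK_STREAM)
        # info[4] is the socket address; its first element is the IP string.
        resolved = list(dict.fromkeys(info[4][0] for info in infos))
    except OSError:
        resolved = []  # DNS is broken; rely on the baked-in addresses
    rng.shuffle(resolved)
    extras = [ip for ip in fallback if ip not in resolved]
    return resolved + extras
```

The client would then walk this list in order with its own connect timeouts. Putting the fallback IPs last means DNS stays the normal path for operational changes, and the baked-in addresses only matter when lookups fail or every resolved address is unreachable.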
If your clients are browsers, default behavior is ok; they usually use IPs in order, which can be problematic [1], but otherwise, they have good retry behavior: on connection refused they try another IP right away, in case of timeout, they try at least a few different IPs. It's not ideal, and I'd use a load balancer for browsers, at least to serve the initial page load if feasible, and maybe DNS RR and semi-smart client logic in JS for websockets/etc; but DNS RR is workable for a whole site too.
If your clients are not browsers and not controlled by you, best of luck?
I will 100% admit that sometimes you have to assume someone built their DNS caching resolver to interpret the TTL field as a number of days, rather than number of seconds. And that clients behind those resolvers will have trouble when you update DNS, but if your loadbalancer is behind a DNS name, when it needs to change addresses, you'll deal with that then, and you won't have experience.
[1] one of the RFCs suggests that OS APIs should sort responses by prefix match, which might make sense if IP prefixes were hierarchical as a proxy to get to a least network distance server. But in the real world, numerically adjacent /24s are often not network adjacent, and if your servers have widely disparate addresses, you may see traffic from some client IPs gravitate towards numerically similar server IPs.
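A simplified sketch of prefix-match sorting for IPv4 only. I believe the rule in question is the longest-matching-prefix rule in RFC 6724, but real address selection applies several other rules first, so treat this as an illustration of the effect rather than the actual algorithm. Function names are hypothetical.

```python
import ipaddress


def common_prefix_len(a, b):
    """Number of leading bits two IPv4 addresses share."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


def sort_by_prefix_match(client_ip, server_ips):
    """Order servers so those sharing the longest prefix with the client come first."""
    return sorted(server_ips, key=lambda ip: common_prefix_len(client_ip, ip), reverse=True)


servers = ["203.0.113.7", "198.51.100.7", "192.0.2.7"]
print(sort_by_prefix_match("198.51.100.200", servers))
```

For this client, 192.0.2.7 sorts ahead of 203.0.113.7 only because 192 and 198 happen to share one more leading bit than 203 and 198 do, which says nothing about network distance. That is the gravitation effect described in the footnote.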
You know, not many apps do this but in particular WhatsApp does! Was it you?
Chat is basically binary encoded XMPP, with essentially a compression dictionary, so per iq overhead is minimal. Especially for the start of connection stuff (login, offline message delivery), we counted bytes and made accommodations for typical network issues we would see. Not acking a big chunk of offline messages after a few tries? Let's send one at a time and see if that works, etc.
Our socket timeouts were rather long as well. Before the move into Facebook infra, servers were in the US only, and rural India is a long ways from the US; and last mile contention on 2G gets real rough out there too... I want to say timeouts were on the order of 30 seconds?
Multimedia (attachments) was https, with resumption. I don't remember the full history, originally I don't think we had resumption on uploads, there's some coordination required for that, which IIRC started as more or less send an IQ that you want to upload a file with a hash of the file, and get a response of either what the download url is if the file was complete, or where to upload and what byte to start with if not. I think it's likely different now, but probably still https based. I wanted to move it so multimedia would be either multiplexed on the chat channel or using a similar protocol to the chat channel, but I didn't have the pull, and I got redirected into pushing TLS 1.3 into our Android client's mms upload/download instead; I didn't do the code there, just prototyping to show it could be possible, and then was more of a facilitator than a contributor. I'm not sure I got all the benefits I was looking for, but there were some, and it kept me busy while I was wrapping up our pre-FB hosting and my time at WA.
I’ve run a min ttl of 3600 on my home network for over a year. No one has complained yet.
Of course somebody will inevitably misconfigure their local DNS or use a bad client. Either you accept an outage for people with broken setups or you reassign the IP to a different server in the same DC.
Design for failure. Don't fabricate failure.
A large number of services successfully achieve their failure tolerances via these kinds of DNS methods. That doesn't mean all services would or that it's always the best answer, it just means it's a path you can consider when designing for the needs of a system.
It is an absurd train of thought that nobody in their right mind would consider... just like using DNS-RR as a replacement for load balancing.
LWS get away with it because of Anycast...
https://www.cloudflare.com/en-gb/learning/cdn/glossary/anyca...
More directly - is there some set of common web clients I've been missing for many years that just doesn't follow DNS TTLs or try alternate records? I think the article gets it right with the wish list at the end containing an Amazon Route 53-like "pull dead entries automatically" note but maybe I'm missing something else? I've used this approach (pull the dead server entries from DNS, wait for TTL) and never caught any unexpected failures during outages but maybe I haven't been looking in the right places?
If you mean it's possible to design something with round-robin DNS in a way that more clients than you expect will fail then absolutely, you can do things the wrong way with most any solution. Sometimes you can be fine with a subset of clients not always working during an outage or you can be fine with a solution which provides slower failover than an active load balancer. What I'm trying to find is why round-robin DNS must always be the wrong answer in all design cases.
I don't know if there is such a list but older versions of Java are pretty famous for caching the DNS responses indefinitely. I don't hear much about it these days so I assume it was probably fixed around Java 8.
Yes. There are tons of people with outdated and/or buggy software still using the internet today.
I don't understand why anyone would argue for this as a solution when there are near zero effort better ways of doing this that don't have any of the negative downsides.
Running load balancers does have a downside: every single design choice other than "don't do anything" is another point of configuration and cost. Round-robin based DNS solutions often require nothing more than adding a second A record and are possibly the simplest solution to many problems for that reason. Many cloud DNS systems offer automatic pullout functionality if that's even a need, so even cases where pullout is a must don't need to move to more complex answers.
Solutions only make sense in context of what service one is delivering, not in what one thinks sounds sexiest, what is the absolute best, or what could be a possible problem in some other use case. That you can think of a case it could possibly not work out is not the same thing as an example of why it's a bad design for everyone - or even that scenario. If you can't gather data the answer is to find a way to do so and make a data driven decision, not swag based on personal opinion. Not every app is only correctly scoped when resources are put in to make it a fluid 144 FPS native experience in <1 MB package, not every DC needs 2n redundancy to be up enough for its customers, not every database needs to be designed to scale to a billion users, and not every web service needs a load balancer to be reliable enough for the use case.
If you get to the point of needing what you think RRD is providing you, then you might as well do it using a solution that doesn't have the negative side effects of RRD.
If you are going as far as using a cloud dns system with "automatic pullout", then you might as well just use a cloud dns, like CF, that solves the round robin dns known issues for you.
An example of a time failover needs to be instant or failover doesn't matter at all is completely unrelated to whether or not there are times "somewhat decent" failover is needed. Not to mention times load balancing primary role may be to balance the load rather than boost redundancy.
As my personal example: waiting seconds (or a couple minutes in the absolute worst case) to reconnect to a web terminal session in the occasional failover is not an impacting issue, waiting for someone to troubleshoot and diagnose a single server outage (a couple of minutes to many hours in the worst case) is an event worthy of handing out free vouchers to do the training another time. We've never had to do the latter due to remote training infrastructure failover issues in many years without a traditional load balancer (despite many outages) and it's allowed the training infrastructure to be extremely lightly staffed.
As the example from the blog: waiting seconds (or a couple minutes in the absolute worst case) for free map tiles to load in the occasional failover is probably preferable when weighed against things like spending limited money on load balancers vs additional servers for all-round performance and scalability (tying back to the "balance the load" use case being the bigger value per dollar).
> If you are going as far as using a cloud dns system with "automatic pullout", then you might as well just use a cloud dns, like CF, that solves the round robin dns known issues for you.
Not sure what you mean here, CF's cloud dns is indeed one example of what I meant by a cloud dns system with "automatic pullout". It's referenced in the article, Zero Downtime Failover. Perhaps you meant to say "why not just use Cloudflare Load Balancing at that point" instead? The answer to that, if it were the question, is it's a paid addon ("Running load balancers does have a downside") as mentioned in the article. If that wasn't the intended question, then yes - you've got it, though I'm not sure how it's "might as well" rather than exactly what was said to use.
If I had to guess (and I could be very wrong) you come more from a background on the for profit high end datacenter hosted services side. Large scale, high performance, bleeding edge services for high dollar, 2n redundancy, high dollar equipment support contracts, the idea of not having cold spares on site for things with n+2 (or more) hot redundancy unthinkable given the target SLAs shouldn't allow waiting for equipment to show up until redundancy levels are back. That's fine and dandy. It's a fun type of environment and comes with certain assumptions... but trying to apply the common sense logic you'd use in those kinds of scenarios like "just assume you need full load balancers if you're going to make any uptime guarantee at all" doesn't necessarily apply to everyone else in all other scenarios. That's why engineering starts with asking more about what in the use case drives that decision rather than declaring a solution universally wrong out of the gate.
Exact implementation of TTL, is a suggestion.
As OP clearly shows, it's also not useful for geographically routing traffic to the nearest endpoint. Clients are dumb and may do things against their interest, the user will suffer for it, and you will get the complaints. Use a DNS provider with proper georouting if this is important to you.
The only genuinely valid reason for multiple A addresses is redundancy. If you have a physical NIC, guess what, those fail sometimes. If you get a virtual IP address from a cloud provider, guess what, those abstractions leak sometimes. Setting up multiple servers with multiple NICs per server and multiple A records pointing to those NICs is one of those things you do when your usecase requires some stratospherically high reliability SLA and you systematically start to work through every last single point of failure in your hot path.
We had a dedicated DNS host and various other dedicated hosts for various services related to order fulfillment. A batch job would be downloaded in the morning to the order server (app) and split up amongst the symbol scanners which ran basic terminals. To keep latency as low as possible the scanners would dns round robin. I'm not sure how much that helped because the wifi was by far the biggest bottleneck simply for the fact of interference, reflection and so on.
With this setup an outage would have no effect on the throughput of the warehouse since the batch job was all handled locally. As we moved toward same day shipping of course this was no longer a good solution and we moved to redundant, dedicated fiber and cellular data backup, then almost completely remote servers for everything but app servers. So what we were left with was a million dollars of HVAC to cool a quarter rack of hardware and a bunch of redundant onsite tech workers.
> Curl also works correctly. First time it might not, but if you run the command twice, it always corrects to the nearest server.
This took two tries for me, which begs the question how curl is keeping track of RTT (round trip times), interesting.
I use this feature, and there are options to control Affinity, Geolocation and others. I don't see this discussed in the article, so I'm not sure why Cloudflare load balancing is mentioned if the author does not test the whole thing.
Their Cloudflare wishlist includes "Offline servers should be detected."
This is also interesting because when creating a Cloudflare load balancing configuration, you create monitors, and if one is down, Cloudflare will automatically switch to other origin servers.
These screenshots show what I see on my Load Balancing configuration options:
https://cdn.geekzone.co.nz/imagessubs/62250c035c074a1ee6e986...
https://cdn.geekzone.co.nz/imagessubs/04654d4cdda2d6d1976f86...
Also, the article is about DNS-RR, not the L7 solution.
I always assumed curl was stateless between invocations. What's going on here?
Firefox and Chrome use DNS over HTTPS by default I believe, which may mean they use a different name resolution path.
The above is entirely conjecture on my part, but the guess is heavily informed by the surprise of curl's behavior.
But the operating system resolver only speaks with DNS servers. It does not make HTTPS connections to calculate latency and pick "the closest server". Also, DNS has no way to tell what port you will be using; maybe the service is on 8443 or something.
For geo DNS I've built a custom backend for PowerDNS with geo DNS capabilities and healthchecks to quickly remove a broken VPS from the DNS responses.
No way macOS parses the TLS ClientHello looking for SNI.
Also I doubt a DNS resolver runs in the Mac kernel, ring 0 to pull this off.
The thing with DNS is that it works on layer 3. Hold on, what? Yes, layer 3 because you obtain network address for layer 3 (ip4, ipv6) but latency can be measured only in layer4 (tcp, quic). Of course I know that common wisdom says DNS is a layer 7 but from functional perspective, you are yet to establish your destination network address, therefore functionally it's like layer 3 to me. Or even lower, because without destination, you can't even start creating a packet and inspecting your routing table entries figuring out if you can even reach it ;)
There is zero chance Mac resolver libraries can connect you to the fastest responding server - unless, instead of Berkeley sockets, there is something that allows you to do a connect(char * fqdn) and the system library returns you two pipes, one for write, the other for read, that you can close independently. I doubt there is such a thing, but I don't know the macOS API.
[1] https://github.com/mlhpdx/cloudformation-examples/tree/maste...
Is it true then that before HE, most round-robin implementations simply cycled and no one considered latency? That's a very surprising finding.
However, as is common with web tech, the old SRV record has been reinvented as the SVCB record with a smidge of DANE for good measure.
There is never a delay if one of them is down.
I am using a closed-source client (Bluezone Rocket), but I'm assuming that it pulled a lot of code from PuTTY as it uses the PPK format.
How your OS sorts DNS responses also comes into play. It also depends on how your browser makes DNS requests.
The real solution with Cloudflare is to use their Load Balancing (https://developers.cloudflare.com/load-balancing) which is a paid feature.
preach on.
Wish I could add instructions like:
- random choice #round robin, like now
- first response # usually connects to closest server
- weights (1.0.0.1:40%; 2.0.0.2:60%)
- failover: (quick | never)
- etc: naming countries, continents
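The weights item in the wish list above could be sketched client-side like this. Nothing in DNS carries such an instruction today, so the `WEIGHTED_SERVERS` table and `pick_weighted` are purely hypothetical.

```python
import random

# Weights copied from the wish list: 1.0.0.1 gets 40%, 2.0.0.2 gets 60%.
WEIGHTED_SERVERS = {"1.0.0.1": 40, "2.0.0.2": 60}


def pick_weighted(servers=WEIGHTED_SERVERS, rng=random):
    """Pick one address with probability proportional to its weight."""
    addresses = list(servers)
    weights = [servers[a] for a in addresses]
    return rng.choices(addresses, weights=weights, k=1)[0]


print(pick_weighted())
```

Each call is an independent draw, so the 40/60 split only shows up across many clients or many connections.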
DNS has one job. Hostname -> IP. Nothing further. You can mess with it on server side like checking to see if HTTP server is up before delivering the IP but once IP is given, the client takes over and DNS can do nothing further so behavior will be wildly inconsistent IME.
Assuming DNS RR is standard where Hostname returns multiple IPs, then it's only useful for load balancing in similar latency datacenters. If you want fancy stuff like geographic load balancing or health checks, you need fancy DNS server but at end of day, you should only return single IP so client will target the endpoint you want them to connect to.
It was specifically built for multi DC or multi cloud or hybrid operations that are on separate continents, with geo DNS, healthchecks and failover on the DNS level at the same time. When all USA servers in the WRR pool are down, or the DC is down, it starts to answer with the next closest set of WRR (Canada) automatically. WRR pools are dynamic and auto healing, constantly doing HTTP healthchecks.
It is also dirt cheap, like 100x cheaper than acquiring provider independent IP address space, running and operating AnyCast, and having 24/7 NOC teams on this AnyCast constantly adjusting BGP communities etc. And it is not like anycast and BGP solve anything when one server is down but others work. You can't stop announcing a whole prefix if you run 200 machines but only one or two are down.
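Since the backend itself isn't public, here is only a guess at the selection logic described: pools ordered nearest-first, a weighted pick among healthy servers, and a fall through to the next pool when one is entirely down. `POOLS`, `answer`, the weights and the addresses are all hypothetical.

```python
import random

# Hypothetical pools, ordered nearest-first for a client in the USA.
POOLS = [
    ("usa", {"192.0.2.1": 1, "192.0.2.2": 1}),
    ("canada", {"198.51.100.1": 2, "198.51.100.2": 1}),
]


def answer(pools, healthy, rng=random):
    """Return one IP from the nearest pool that still has a healthy server."""
    for _name, servers in pools:
        alive = {ip: weight for ip, weight in servers.items() if ip in healthy}
        if alive:
            ips = list(alive)
            return rng.choices(ips, weights=[alive[ip] for ip in ips], k=1)[0]
    return None  # every pool is down; a real server needs a policy for this


# Both USA servers are down, so the answer comes from the Canada pool.
print(answer(POOLS, healthy={"198.51.100.1", "198.51.100.2"}))
```

The `healthy` set would be maintained by the HTTP healthchecks running in the background, and the short TTL bounds how long clients keep using an answer that has since gone bad.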
TTL I'm using is 30 seconds.
I never shared this backend with the world, you can't test it or purchase it. But maybe some day I'll launch a route53 competitor ;)
What can be useful: dynamically adjusting DNS responses depending on what DC is up. But at this point shouldn't you be doing something via BGP instead? (This is where my knowledge breaks down.)
If you want cheaper load balancing and are ok with some downtime while DNS reconfigures, DNS system that returns IP based on which Datacenter is up works. Examples of this are Route53, Azure Traffic Manager and I assume Google has solution, I just don't know what it is.
but please do continue reading on…