Earlier, we discussed how IP addresses and route authorizations work, before we took a break to talk about how the RIRs issue ASNs. As promised, I’ll now cover BGP route selection, how it enables anycasting, and how we can use it to achieve low latency and high availability. We’ll also cover some of the pitfalls of this approach and how it led to an infamous outage.
For those not familiar with the concept, anycasting means the same IP address is shared by devices in multiple locations, with routers sending packets to the “nearest” location. This can result in lower latency than the speed of light would allow from any single location [1], although, as you will see later, the routers’ concept of “nearest” may not necessarily be what we expect.
Now, if one location stops announcing the IP address via BGP, then routers will select the next best location, enabling high availability as long as there is one location still available. Somewhat morbidly, I’ve claimed that this website will stay up even if Yellowstone erupts, which is theoretically true since my servers in Europe would still be able to serve traffic to the rest of the world even if every server in North America is down, but I’ve not tested this (and hope it will never be tested).
Side note: AS200351 turns one year old today! 🎂
BGP Route Attributes
As alluded to in the very first post of this series, every route sent over BGP has a bunch of attributes describing it. Here are the ones that will be relevant for today’s discussion:
- Prefix: the IP prefix this route is meant to reach;
- Local preference: a value assigned to the route inside an AS. Typically, downstream routes have higher values than peer routes, which have higher values than transit routes. This value is passed between iBGP routers inside the AS but not between ASes (eBGP);
- Next hop: the IP address to which the peer should forward the packets destined to the prefix;
- AS path: the chain of ASNs through which the final destination is reached;
- Origin: whether the route originated from an interior gateway protocol (“IGP”), the Exterior Gateway Protocol (“EGP”), or if this is unknown (“incomplete”). This is not really used these days;
- Multiple Exit Discriminator (MED): a 32-bit unsigned integer used to convey to a peer the optimal entry point to the local AS;
- Router ID: a 32-bit integer used to identify the router, typically set to one of the router’s IPv4 addresses; and
- Communities: extra tags on the route.
BGP Route Selection
First of all, traffic is always routed through the most specific prefix that matches the destination IP. For example, if there are routes for both 2001:db8::/48 and a shorter prefix that covers it, then traffic to 2001:db8::1 would always be routed through the route for 2001:db8::/48, and never through the less specific one.
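As a rough illustration of the most-specific-prefix rule, here is a minimal Python sketch using the standard `ipaddress` module. The covering prefix `2001:db8::/32` and the function name `longest_prefix_match` are assumptions made for the example; real routers use specialised lookup structures rather than a linear scan.

```python
import ipaddress


def longest_prefix_match(destination, prefixes):
    """Return the most specific prefix containing destination, or None."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        net
        for net in map(ipaddress.ip_network, prefixes)
        if net.version == addr.version and addr in net
    ]
    # The longest prefix length is the most specific match.
    return max(candidates, key=lambda net: net.prefixlen, default=None)


# 2001:db8::/32 is an assumed covering prefix, used only for illustration.
table = ["2001:db8::/32", "2001:db8::/48", "2001:db8:1000::/48"]
print(longest_prefix_match("2001:db8::1", table))  # 2001:db8::/48
```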
If multiple routes exist for the same prefix, then the best route is selected based on the route attributes. The details differ slightly depending on the vendor, but the fundamentals are mostly the same. We’ll use one implementation’s algorithm as a reference:
1. Prefer routes with the highest local preference attribute. If they are the same, then
2. Prefer routes with the shortest AS path length. If they are the same, then
3. Prefer IGP origin over EGP, and EGP over incomplete origin. If they are the same, then
4. Prefer routes with the lowest MED value. If they are the same, then
5. Prefer routes received via eBGP over routes received via iBGP. If they are the same, then
6. Prefer routes with lower internal distance to a boundary router (for iBGP). If they are the same, then
7. Prefer the route with the lowest value of router ID.
Other implementations may do things slightly differently. For example, some may prefer the oldest route before comparing router IDs.
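The tie-breaking order above can be sketched as a sort key. This is a simplified model rather than any router’s actual code: the `Route` class and its field names are made up for illustration, and it ignores details such as MED usually being compared only between routes from the same neighbouring AS.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP over EGP over incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of a BGP route for one prefix."""
    local_pref: int
    as_path: tuple
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_distance: int = 0
    router_id: int = 0


def preference_key(route):
    """Tuple that sorts the most preferred route first."""
    return (
        -route.local_pref,          # highest local preference
        len(route.as_path),         # shortest AS path
        ORIGIN_RANK[route.origin],  # IGP < EGP < incomplete
        route.med,                  # lowest MED
        0 if route.ebgp else 1,     # eBGP over iBGP
        route.igp_distance,         # closest boundary router
        route.router_id,            # lowest router ID
    )


def best_route(routes):
    return min(routes, key=preference_key)


short = Route(local_pref=100, as_path=(1299, 64500), router_id=2)
long_ = Route(local_pref=100, as_path=(6939, 64501, 64500), router_id=1)
preferred = Route(local_pref=200, as_path=(6939, 64501, 64500), router_id=3)

print(best_route([short, long_]).as_path)  # (1299, 64500)
print(best_route([short, long_, preferred]).local_pref)  # 200
```

Because Python compares tuples element by element, a later field is only consulted when every earlier field ties, which mirrors the “if they are the same, then” chain.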
In theory, anycasting is just announcing routes for the same prefix from multiple locations. Note that the minimum announcement size accepted on the Internet is /24 for IPv4 and /48 for IPv6, so an anycast announcement needs to be at least that size, which can represent a huge commitment in the case of IPv4.
In practice, from the route selection algorithm described above, you might see a problem. For example, say you are AS64500 [2], announcing an anycast prefix 2001:db8:1000::/48 from a server in Germany with AS1299 as your upstream and from a server in Canada with AS64501 as your upstream, whose sole upstream is AS6939.
(Note that AS1299 is a tier 1 ISP that peers with all tier 1 ISPs, while AS6939 is an almost tier 1 ISP over IPv6 that peers with every tier 1 ISP except AS174 [3]. In this setup, traffic from AS174 will only ever go to Germany, while everyone else can theoretically reach either server. We’ll ignore AS174 in this discussion.)
You would expect users in the US to be routed to your server in Canada, and users in France to be routed to your server in Germany. Unfortunately, it doesn’t quite work the way you’d expect.
For example, if a user in France uses AS6939 as their upstream, they would prefer the server in Canada, since according to rule #1, the highest local preference route is selected, and downstream routes are the most preferred. Note that you are downstream of AS6939 since they are your upstream in Canada (through AS64501). Similarly, users of AS1299 in the US would prefer the server in Germany.
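The effect can be reproduced with a toy model. The local preference values below are assumptions (customer 200, peer 150, transit 100), applied to both networks purely for illustration, and the function name `pick` is made up for the example.

```python
# Assumed local preference per relationship; real networks pick their own.
LOCAL_PREF = {"customer": 200, "peer": 150, "transit": 100}


def pick(routes):
    """routes: list of (relationship, as_path, location) tuples.

    Local preference is compared first, AS path length only on a tie.
    """
    return min(routes, key=lambda r: (-LOCAL_PREF[r[0]], len(r[1])))


# AS6939 learns the Canadian route from its customer AS64501
# and the German route from its peer AS1299.
as6939_view = [
    ("peer", (1299, 64500), "Germany"),
    ("customer", (64501, 64500), "Canada"),
]

# AS1299 learns the German route directly from its customer AS64500
# and the Canadian route from its peer AS6939.
as1299_view = [
    ("customer", (64500,), "Germany"),
    ("peer", (6939, 64501, 64500), "Canada"),
]

print(pick(as6939_view)[2])  # Canada
print(pick(as1299_view)[2])  # Germany
```

Neither decision takes the user’s location into account, which is why a French customer of AS6939 ends up in Canada.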
There are two solutions to this:
- Make AS1299 and AS6939 your upstream everywhere. This involves buying transit from AS1299 in Canada, and AS6939 in Germany, which may be expensive; or
- Convince AS1299 and AS6939 to lower the local preference for your route. For example, AS1299 has the BGP community 1299:150 (full list), which sets the local preference to 150, the same as for routes from peers (the default for customers of AS1299 is 200), giving your route the same preference as the route they receive from AS6939. To use it, announce your route with this community added. Alternatively, you can use 1299:10050 to lower the local preference to 50 outside of the announced continent (Europe), or 1299:15050 to do so explicitly in North America. On the other hand, AS6939 doesn’t have BGP communities to control local preference, so you are out of luck unless you get them as your upstream on your German server, or remove them altogether.
Let’s say you use 1299:10050 to prefer alternative routes in North America and add AS6939 as an upstream in Germany.
This still doesn’t work as you’d expect. You’ll find that while AS1299 does the correct thing now, everyone else will prefer the German server. This is because according to rule #2, the shorter AS path wins. Let’s say you are the transit customer of a tier 1 ISP with ASN x and don’t have any peers. Then, the AS path you see for the German server would be x 1299 64500 or x 6939 64500, while the AS path you see for the Canadian server would be x 6939 64501 64500 (remember that all tier 1 ISPs peer with each other, so they are always directly adjacent). Clearly, the former is shorter, and so is preferred. Similarly, inside AS6939, you’ll see 6939 64500 for the former and 6939 64501 64500 for the latter, with the same effect. To fix this, you would need to make the AS paths the same length.
To achieve this, you can use AS path prepends, taking advantage of the fact that you can repeat the same ASN in the AS path. For example, if you prepend 64500 once in your announcements to both AS1299 and AS6939 in Germany, the resulting AS path for transit customers of a tier 1 ISP with ASN x would be x 1299 64500 64500 or x 6939 64500 64500, which is the same length as x 6939 64501 64500 from the Canadian server. This means rules #3-7 will be used to choose the optimal route, and there’s a decent chance that the routing is sane globally.
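To make the path-length arithmetic concrete, here is a small sketch that builds the AS paths as tuples. The helper names `announce` and `propagate` are made up for the example, and 65000 (a private-use ASN) stands in for the tier 1 ISP “x”.

```python
def announce(origin_asn, prepends=0):
    """AS path as sent by the origin: its ASN repeated 1 + prepends times."""
    return (origin_asn,) * (1 + prepends)


def propagate(path, *asns):
    """Each AS that passes the route on puts its own ASN at the front."""
    for asn in asns:
        path = (asn,) + path
    return path


X = 65000  # hypothetical stand-in for the tier 1 ISP "x"

# Without prepends, the German paths are one hop shorter.
germany_plain = propagate(announce(64500), 1299, X)
canada = propagate(announce(64500), 64501, 6939, X)
print(len(germany_plain), len(canada))  # 3 4

# With one prepend in Germany, all three paths have the same length.
germany_via_1299 = propagate(announce(64500, prepends=1), 1299, X)
germany_via_6939 = propagate(announce(64500, prepends=1), 6939, X)
print(germany_via_1299)  # (65000, 1299, 64500, 64500)
print(len(germany_via_1299), len(germany_via_6939), len(canada))  # 4 4 4
```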
Essentially, you want to use prepends to keep the AS path length to tier 1 ISPs the same, and either use the same tier 1 ISPs everywhere or use communities to ensure their routing is sane. You can check the routing with the ISP’s looking glasses. If you wish to announce your anycast route to your peers, the same principle applies, but there is no rule of thumb for the AS path length, and you must prepend as necessary to influence the route selection or block the route from being announced to a troublesome peer.
Note that sometimes, you may need to prepend to one of your upstream’s upstreams, but not their other upstreams. In such cases, your upstream may provide you with BGP communities to achieve exactly that, and you can add the community to your route. The exact community would be defined by your upstream, so you would need to ask them what it is. If it’s not available, you are simply out of luck.
Single Location Outage
Now, let’s consider what happens if a single location has an outage, or more concretely, if your server in Canada loses power.
The first thing that happens is that the BGP session with its upstream goes down, causing the upstream to withdraw all the routes your server has been announcing, including 2001:db8:1000::/48. This will get propagated to the whole Internet, causing every router that routed 2001:db8:1000::/48 to your Canadian server to find an alternative route [4]. In this case, it will find the announcement of 2001:db8:1000::/48 from the German server and switch to that, allowing your service to be reached despite one of the servers going down.
Note that this switchover may not happen instantaneously, so your service may be unreachable for a bit. If you are explicitly shutting down the BGP session, this would take seconds, but for a power loss, your upstream may only detect the connection loss after a few minutes while it waits for the BGP session to time out.
Now, say your service is unhealthy but the BGP session is still up in Canada. Then, users in North America would still hit your location in Canada, even though the service is unhealthy! This is a problem, but it can be fixed by having a regular health check. If the service is unhealthy, the route can be manually withdrawn, and all traffic would be directed to Germany. This is how many highly available services are built.
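One way to automate this is a small loop that withdraws the route after a few consecutive failed checks and announces it again once the service recovers. The sketch below is only an outline: the class name `AnycastHealthGate` and the `check`, `announce`, and `withdraw` callbacks are hypothetical, and in a real deployment the callbacks would talk to your BGP daemon.

```python
class AnycastHealthGate:
    """Announce while a health check passes; withdraw after repeated failures."""

    def __init__(self, check, announce, withdraw, fail_threshold=3):
        self.check = check              # returns True when the service is healthy
        self.announce = announce        # hypothetical hook into the BGP daemon
        self.withdraw = withdraw        # hypothetical hook into the BGP daemon
        self.fail_threshold = fail_threshold
        self.failures = 0
        self.announced = False

    def tick(self):
        """Run one health check and update the announcement state."""
        if self.check():
            self.failures = 0
            if not self.announced:
                self.announce()
                self.announced = True
        else:
            self.failures += 1
            # Tolerate brief blips: only withdraw after several failures in a row.
            if self.announced and self.failures >= self.fail_threshold:
                self.withdraw()
                self.announced = False
        return self.announced


# Simulated run: healthy, three failures in a row, then healthy again.
events = []
results = iter([True, False, False, False, True])
gate = AnycastHealthGate(
    check=lambda: next(results),
    announce=lambda: events.append("announce"),
    withdraw=lambda: events.append("withdraw"),
)
states = [gate.tick() for _ in range(5)]
print(states)  # [True, True, True, False, True]
print(events)  # ['announce', 'withdraw', 'announce']
```

The failure threshold is a design choice: a single failed probe shouldn’t move all North American traffic to Germany, but a sustained failure should.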
While anycasting with health checks can combat most outages, it is not perfect. Perhaps the most famous example of an outage caused by this is the Facebook (now Meta) outage in 2021 (details), so we’ll dive into it as a case study.
Facebook’s DNS servers used anycast to achieve low latency and high availability, with a health check to withdraw the route if the DNS server couldn’t reach Facebook’s datacenters. Normally, this worked quite well—if a DNS server felt unhealthy, it would withdraw the route to failover to the nearest alternative.
However, during the outage, Facebook accidentally disconnected all the datacenters, which was bad enough. Worse, every single one of Facebook’s DNS servers found that it couldn’t reach the datacenters, causing the health check to fail and every server to withdraw its routes. At that point, nothing was announcing the IP prefix for their DNS servers, so their DNS servers became completely unreachable, resulting in their domains failing to resolve… which broke a lot of their internal systems, making it very difficult to recover.
I suppose the lesson here is to be more mindful of the potential for health checks to fail everywhere. Ideally, the health checks for the DNS servers should only involve checking the DNS server itself without any external dependencies, which would avoid the problem. Alternatively, perhaps they should have convinced their peers and upstreams to lower the local preference instead of withdrawing the route outright, which would still allow traffic to flow if there were no viable alternatives. This could have been achieved with AS-specific BGP communities, or by using the well-known graceful shutdown community.
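To illustrate the difference, here is a toy comparison of two health check policies. It is not a description of what Facebook actually ran: the function names, the site names, and the idea of combining both suggestions (withdraw only on local failure, deprioritise on external failure) are assumptions made for the example.

```python
def naive_policy(local_ok, backend_ok):
    """Withdraw whenever anything looks wrong, including external dependencies."""
    return "announce" if local_ok and backend_ok else "withdraw"


def cautious_policy(local_ok, backend_ok):
    """Withdraw only when the server itself is broken.

    If only an external dependency is unreachable, keep announcing but ask
    neighbours to deprioritise the route (for example, via a community).
    """
    if not local_ok:
        return "withdraw"
    if not backend_ok:
        return "deprioritize"
    return "announce"


def prefix_reachable(decisions):
    """The anycast prefix stays reachable if at least one site still announces it."""
    return any(d != "withdraw" for d in decisions)


# Every site's DNS server is fine, but none of them can reach the datacenters.
sites = {"site-a": (True, False), "site-b": (True, False), "site-c": (True, False)}

naive = [naive_policy(*status) for status in sites.values()]
cautious = [cautious_policy(*status) for status in sites.values()]

print(prefix_reachable(naive))     # False
print(prefix_reachable(cautious))  # True
```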
This concludes the basic introduction to BGP route selection and anycasting. I hope you learned something useful. This also covers everything I wanted to cover about BGP theory. If you want to learn more, I think there is no substitute for running your own network.
If there’s anything else about BGP that you’d like to know, feel free to make a suggestion in the comments. Otherwise, if I find the time and energy, I might start talking about the practical side of setting up a BGP network.
[1] While the speed of light is often spoken of as very fast, it is actually quite slow on a global scale and creates a lot of engineering challenges for networking. For example, the great circle distance between Toronto, Canada and Sydney, Australia is 15 565 km, which light takes 52 ms to cover in a vacuum, meaning a round trip takes 104 ms. This is the fundamental limit set by physics and it is impossible to go faster. In practice, accounting for the slower speed of light in optical fibre, the processing time in routers, and the non-ideal placement of fibres, I’ve not seen ping times less than 200 ms.
This means that if you visit a server in Toronto from Sydney, it will take at least 200 ms to load no matter what. If you need to go back and forth ten times before the request completes, then it will take two seconds. ↩
[2] According to RFC 6382, each location for an anycast should use a separate ASN. However, in practice, almost no one does this due to the hassle of obtaining ASNs for each location, especially for larger networks with hundreds of points of presence. For this reason, we’ll use the traditional approach of anycasting with a single ASN. ↩
[3] This is the long-running peering dispute between AS174 and AS6939, in which AS6939 asserts that it is a tier 1 ISP on IPv6 (not IPv4) and doesn’t purchase any transit, while AS174 asserts AS6939 should be buying IPv6 transit from them since AS6939 buys IPv4 transit. The result is that, despite both sides claiming to be tier 1 networks, they can’t reach each other over IPv6, causing endless headaches for their users. To avoid visibility problems, their users must purchase transit from another tier 1 ISP. I feel like AS174 should just peer with AS6939 and end this ridiculous situation: they aren’t making any money from AS6939 today and their users would thank them for it. Besides, AS6939 baked them a cake asking them nicely. ↩
[4] Yes, every router in the “default-free zone” (DFZ), which receives the full Internet routing table instead of using default routes, will receive the message and update its routing table, even if it has never forwarded a packet to or from you. Your server outage is changing the shape of the Internet itself! This is what makes this hobby cool, and also why it carries a lot of responsibility. ↩