Building a global WireGuard mesh “backbone” network with OSPF
Years ago, before I tried to build my own autonomous system and run BGP, I had a few servers in different locations,1 and I wanted them to be able to talk to each other over some sort of encrypted VPN, allowing plaintext protocols to be run between locations securely. WireGuard was a good option, but the classic deployment would require all servers to connect to one main server. That simply wouldn’t do when I had multiple servers on each side of the Atlantic: if I put the main server on one side, then two servers on the other side talking to each other would have to cross the Atlantic twice.
Instead, I thought to myself: what if I had a bunch of WireGuard tunnels between the different locations I had, and then have something intelligent select an optimal path between nodes? Open Shortest Path First (OSPF) seemed like the perfect option, selecting optimal routes with Dijkstra’s algorithm based on the lowest total cost (typically latency), while routing around any failed links. Better yet, it did so without requiring every location to be connected to every other location. This meant that I could connect nodes on each side of the Atlantic to each other for low latency, while creating redundant trans-Atlantic links. At times, it could even route traffic faster than the direct connection, when that connection used horrible routing.2
Years later, when I started building my own BGP network, I ran into the problem of trying to use the same IPv4 /24 in multiple locations.3 While I could announce the same prefix from multiple locations as anycast, if I wanted to route an IP to a single host, I’d need some way to route packets to that host, no matter which location they entered my network through. Typically, a backbone network is used for this purpose. For large networks, this involves actual optical fibre between locations, but we can make do with tunnels for smaller networks. To keep things simple, I turned my existing WireGuard OSPF mesh into my very own “backbone”.
In this post, we’ll explore how to use WireGuard, how to use OSPF, and how to use them to construct a backbone network and an encrypted VPN connecting distinct sites.
Background
To understand this post, a basic understanding of IP and BGP networking is required. It might help to read the BGP series that I’ve written on this blog, especially the first introductory post.
You should also have a decent understanding of the OSI model, dividing networking protocols into layers of encapsulation. A very quick summary is this:
- L2 is the data link layer and includes things like Ethernet, which can encapsulate L3 protocols like IP, but could also encapsulate other protocols. The exact protocol is determined by the EtherType, which is 0x0800 for IPv4 and 0x86DD for IPv6.
- L3 is the network layer, consisting of protocols like IP, which encapsulate L4 protocols like TCP and UDP. The exact protocol is determined by the IP protocol number, e.g. 6 for TCP and 17 for UDP.
- L4 is the transport layer, consisting of protocols like TCP and UDP, which encapsulate familiar protocols like HTTP over TCP port 80 or HTTPS over TCP port 443.
We’ll also be focused on using Linux software routers, not hardware routers. Examples will be for Debian, not things like Cisco console commands.
Choice of technology
It is important to note that building this mesh/backbone network involves two distinct pieces—the underlying transport and the routing protocol.
For the transport, we can use actual L2 links between locations, which probably involves renting an L2 service, a wave,4 or dark fibre from a vendor, which is very expensive. Alternatively, we can use some sort of tunnel, such as WireGuard, which is free. We’ll talk a bit more about the consequences later.
The routing protocol decides which link is used to reach which destination. This is the piece that manages the routes and selects the optimal one to each destination based on latency. Most routing protocols would do, but popular options for this kind of thing include OSPF and BGP. I chose OSPF because the default behaviour is finding the lowest cost route based on the connections that are available, whereas BGP would require a bit more convincing.
To run routing protocols, we’ll also need a routing daemon. For this exercise, I chose bird, because in addition to OSPF, it can also handle other protocols like BGP. This would prove to be a good choice once I started building my own autonomous system, since the same daemon could be used for all my routing needs. Specifically, I am using the 2.x series, because it’s more battle-tested than the new 3.x series, and it’s more memory efficient, able to take in the full Internet routing table with 1 GB of RAM.
Ultimately, transport and routing are independent pieces and can be swapped out. If you are doing something like this, you can use a different tunnel and still use OSPF as your routing protocol, or use WireGuard and some other routing protocol instead. You can also use a different routing daemon instead, but note that bird’s OSPF implementation is reputed to not be very compatible with other implementations.
Choice of transport
To understand the consequences of our choice of transport, we must first understand what a maximum transmission unit—or as it’s commonly abbreviated, MTU—is. This is effectively the maximum size of an IP packet that can be sent over a link, in bytes.
Traditionally, IP over Ethernet has an MTU of 1500. This is basically the expected MTU on the Internet when you aren’t using tunnels. Some ISPs use PPPoE, which is effectively a form of tunnelling and has an 8-byte overhead, resulting in an MTU of 1492.5 If path MTU discovery is broken to certain destinations, this may cause packet loss, where TCP connects but is unable to transmit any data, resulting in a “blackhole” connection. This can be solved by applying TCP MSS clamping, which shrinks the maximum segment size of TCP so that the resulting IP packets fit within the MTU.
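On a Linux router, MSS clamping for forwarded traffic can be done with a rule like the following (a minimal sketch using iptables; adapt it to nftables or whatever firewall you use):
# Clamp the TCP MSS of forwarded SYN packets to fit the path MTU
sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu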
If you are using a real L2 link, it probably will have an MTU of at least 1500, and oftentimes, larger jumbo packets are supported. However, this, along with everything else, is highly dependent on your L2 provider. For the rest of this post, we’ll focus on tunnels, since those are a lot more accessible and standard.
There are two types of tunnels based on which OSI layer they encapsulate:
- L2 tunnels encapsulate full L2 frames, such as Ethernet. This allows connected devices to appear directly on the LAN and work with things like Ethernet broadcasts, but has more overhead due to the Ethernet header.
- L3 tunnels encapsulate only the IP packet and have less overhead.
Since we are routing between locations, we don’t actually need L2 access when L3 tunnels will do the job just fine, so we’ll ignore L2 tunnels and focus only on L3 ones. There are several common L3 tunnel options:
- Simple IP-over-IP: This is a tunnel using a special IP protocol number to encapsulate another IP packet inside—4 for IPv4 and 41 for IPv6. The MTU overhead is just the size of the IP packet header, which is 20 bytes for IPv4 and 40 bytes for IPv6, resulting in an MTU of 1480 when transporting over IPv4 and 1460 over IPv6 with full Ethernet MTU. This has the smallest possible overhead and yields the largest possible MTU, but there is zero encryption or checksum, and it doesn’t work with most NATs.6 Furthermore, only one tunnel can exist between any pair of IP addresses.
Traditionally, an IP-over-IP tunnel can only do either IPv4 or IPv6 inside the tunnel, but modern Linux has removed the restriction. For example, assuming you are on 192.0.2.1 and want to create a tunnel named v4transport to 192.0.2.2, a tunnel encapsulating both IPv4 and IPv6 with an IPv4 transport can be created with:
ip link add name v4transport type sit mode any local 192.0.2.1 remote 192.0.2.2 ttl 255
Similarly, an IPv6-based tunnel named v6transport from 2001:db8::1 to 2001:db8::2 can be created with:
ip link add name v6transport type ip6tnl mode any local 2001:db8::1 remote 2001:db8::2 ttl 255 encaplimit none
The same command with local and remote flipped would need to be run on the other end of the tunnel.
- Generic routing encapsulation (GRE): This is another tunnel with a
separate header inside the IP packet, containing an EtherType, allowing
non-IP protocols to be encapsulated. It optionally also has:
- checksums for packet integrity;
- sequence numbers to prevent out-of-order delivery; and
- a key, allowing multiple tunnels to run between the same source and destination IP address pair.
The basic header has 4 bytes of overhead, and enabling checksums, sequence numbers, and a key each requires 4 additional bytes. This results in a variable overhead between 4 and 16 bytes. Since GRE is encapsulated inside IP as protocol number 47, the resulting MTU is between 1464 and 1476 over IPv4 and between 1444 and 1456 over IPv6 with full Ethernet MTU.
A GRE tunnel could be created on Linux as follows, assuming the same endpoints as the IP-over-IP example before:
ip link add name v4gre type gre local 192.0.2.1 remote 192.0.2.2 ttl 255
ip link add name v6gre type ip6gre local 2001:db8::1 remote 2001:db8::2 ttl 255 encaplimit none
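If you want the optional features, a sketch with a key and checksums enabled might look like this (the key value 100 and the interface name v4gre2 are made up for illustration); the key is what allows a second tunnel between the same pair of addresses:
# Hypothetical second IPv4 GRE tunnel to the same endpoint, distinguished by key 100, with checksums
ip link add name v4gre2 type gre local 192.0.2.1 remote 192.0.2.2 key 100 csum ttl 255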
- WireGuard: This is an encrypted tunnel over UDP, supporting only IPv4 and IPv6 inside the tunnel, and usable over both IPv4 and IPv6 transport. Using UDP, it can punch through NAT easily, as long as one end has a public IP to connect to. It naturally has more overhead, including the IP header, the 8-byte UDP header, and 32 bytes of additional protocol overhead, resulting in an MTU of 1440 over IPv4 and 1420 over IPv6. To ensure wide compatibility, no matter the underlying transport, the default MTU is 1420. It also has more CPU overhead due to the cryptography.
WireGuard uses public key cryptography and allows peers to be defined with their public key. It also has some basic routing support, allowing multiple peers to connect and be routed based on AllowedIPs in the peer configuration. When using a routing protocol like OSPF over it, we have to use it peer-to-peer and allow all possible IPs over the peer.
There are obviously other tunnelling protocols, such as FOU,7 GUE, VXLAN, and OpenVPN, but using those is left as an exercise for the reader.
For this post, we’ll use WireGuard as the example, as that’s what I am using. The actual routing protocol setup should apply to any tunnel type you may choose to use.
Note that WireGuard is only recommended if you need encrypted transport. If you are just building a tunnelled backbone network between locations for data that’s sent over plaintext over the Internet already, IP-over-IP is probably a better bet due to its simplicity and lower overhead. I’ll also show a quick IP-over-IP example.
Example layout
For our example, we will define three routers—A, B, and C—each in a distinct location, but the idea can easily be extended to many more locations. On a bigger scale, it might make sense to write a configuration generator, but that’s left as an exercise for the reader.
We’ll assume the following public IPs for each server:
- A: 192.0.2.1 and 2001:db8::1
- B: 192.0.2.2 and 2001:db8::2
- C: 192.0.2.3 and 2001:db8::3
For the VPN portion, we’ll use the following IP allocations:
- A: 10.137.0.0/24 and 2001:db8:1000::/48
- B: 10.137.1.0/24 and 2001:db8:1001::/48
- C: 10.137.2.0/24 and 2001:db8:1002::/48
For the backbone, we’ll assume that all servers are advertising the same 198.51.100.0/24 and 2001:db8:2000::/48 prefixes,8 but partitioning them as follows:
- A: 198.51.100.0/29 and 2001:db8:2000::/64
- B: 198.51.100.8/29 and 2001:db8:2000:1::/64
- C: 198.51.100.16/29 and 2001:db8:2000:2::/64
For the tunnels, we’ll use link-local addresses for IPv6, and for IPv4 we’ll allocate /31s for point-to-point links from 203.0.113.0/24, per RFC 3021.
We’ll connect all three sites with tunnels in this example, but note that a full mesh actually isn’t necessary and specifically isn’t required in bigger setups. The only real requirement is that it must be possible to reach all locations via some sequence of tunnels.
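To keep the later examples concrete, here is one possible per-tunnel addressing plan; the specific /31s and link-local suffixes are arbitrary choices for this post, and the B-C line in particular is an assumption that doesn’t appear again later:
# Hypothetical per-tunnel point-to-point addressing plan
# A-B: 203.0.113.0/31 (A = .0, B = .1), fe80::a:b:1 (A) and fe80::a:b:2 (B)
# A-C: 203.0.113.2/31 (A = .2, C = .3), fe80::a:c:1 (A) and fe80::a:c:2 (C)
# B-C: 203.0.113.4/31 (B = .4, C = .5), fe80::b:c:1 (B) and fe80::b:c:2 (C)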
Setting up WireGuard tunnels
For this example, we’ll use Debian’s wireguard-tools package. It comes with the wg tool for configuring tunnels, and the wg-quick command and the systemd unit wg-quick@.service for configuring a bunch of tunnels with an INI-like config file. The process may differ on other distros.
To create a WireGuard tunnel, we first have to generate a private key and the corresponding public key for each end. This can be done with the wg tool:
$ wg genkey
EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M=
$ echo EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M= | wg pubkey
mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=
We’ll use this private key for the A end of the tunnel between A and B. Obviously, use your own keys instead of copying the ones I generated while writing this post.
Let’s also generate a key pair for the B end of that tunnel:
$ wg genkey
UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY=
$ echo UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY= | wg pubkey
a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=
Since WireGuard runs over UDP, it is important for one side to be listening on a certain IP and port and the other side to connect to that. We’ll let A be the listening end, using port 15000.
Now, on A, we create /etc/wireguard/wg_b.conf
with the following contents:
[Interface]
ListenPort = 15000
Address = 203.0.113.0/31, fe80::a:b:1/64
Table = off
PrivateKey = EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M=
MTU = 1420
[Peer]
PublicKey = a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=
AllowedIPs = 0.0.0.0/0, ::/0
Note that WireGuard interfaces don’t come with a link-local address, so we make one up. I am using distinct link-local IP addresses for each tunnel for some reason, and I can’t remember if I ran into problems doing fe80::1 and fe80::2 with WireGuard for every tunnel. You can try using fe80::1 and fe80::2 if you feel bold enough, and let me know in the comments if it worked.
Further note that we use Table = off to avoid WireGuard adjusting the routing table based on AllowedIPs, since we intend to run our own routing protocol instead of sending all traffic on the system to the other end.
We can then start the A end of the tunnel with:
sudo systemctl enable --now wg-quick@wg_b.service
If you have a firewall, you will need to open port 15000 to all IPs (or at least, any IP whence B may choose to connect).
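With iptables, for example, that could look something like this (shown as a sketch; use the equivalent rule for your firewall of choice):
# Allow incoming WireGuard traffic for this tunnel on UDP port 15000
sudo iptables -A INPUT -p udp --dport 15000 -j ACCEPT
sudo ip6tables -A INPUT -p udp --dport 15000 -j ACCEPT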
On B, we create /etc/wireguard/wg_a.conf
with something similar:
[Interface]
Address = 203.0.113.1/31, fe80::a:b:2/64
Table = off
PrivateKey = UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY=
MTU = 1420
[Peer]
PublicKey = mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=
AllowedIPs = 0.0.0.0/0, ::/0
Endpoint = 192.0.2.1:15000
Start the B end of the tunnel with:
sudo systemctl enable --now wg-quick@wg_a.service
At this point, A and B should be able to talk to each other. You should be able to ping 203.0.113.1 and fe80::a:b:2%wg_b on A, and 203.0.113.0 and fe80::a:b:1%wg_a on B.
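Concretely, on A that looks like the following; the %wg_b suffix scopes the link-local address to the tunnel interface:
# On A: test the tunnel to B over IPv4 and IPv6
ping -c 3 203.0.113.1
ping -c 3 fe80::a:b:2%wg_b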
You’ll of course need to repeat this process for any other pair of hosts that you need to connect. In this example, a tunnel is required between B and C, and also between A and C, using very similar configurations. Remember that ListenPort has to be different for each WireGuard tunnel on the same host, and the UDP port must not be used by anything else on the system.
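For instance, the A end of the A-C tunnel, in /etc/wireguard/wg_c.conf on A, could look like the following, assuming a freshly generated key pair for this tunnel, UDP port 15001, and the next /31 and link-local pair from our addressing plan; the keys and port here are placeholders, not values to copy:
[Interface]
ListenPort = 15001
Address = 203.0.113.2/31, fe80::a:c:1/64
Table = off
# Generate a new private key for this tunnel with wg genkey
PrivateKey = <private key for the A end of the A-C tunnel>
MTU = 1420
[Peer]
# The public key corresponding to C's private key for this tunnel
PublicKey = <public key for the C end of the A-C tunnel>
AllowedIPs = 0.0.0.0/0, ::/0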
Aside: Setting up IP-over-IP tunnels
You can replicate the WireGuard setup with IP-over-IP tunnels. For this example,
we’ll use the ifupdown2
package on Debian and set up the tunnels with IPv4
transport, but the concept should be easily generalizable to IPv6 or even GRE.
On A, add the following block to /etc/network/interfaces
:
auto sit_b
iface sit_b
pre-up ip link add name "$IFACE" type sit mode any remote 192.0.2.2 local 192.0.2.1 ttl 255
post-down ip link delete "$IFACE"
address 203.0.113.0/31
address fe80::a:b:1/64
mtu 1480
Then bring the tunnel up with sudo ifup sit_b.
Similarly, on B, add the following block to /etc/network/interfaces
:
auto sit_a
iface sit_a
pre-up ip link add name "$IFACE" type sit mode any remote 192.0.2.1 local 192.0.2.2 ttl 255
post-down ip link delete "$IFACE"
address 203.0.113.1/31
address fe80::a:b:2/64
mtu 1480
Then bring the tunnel up with sudo ifup sit_a.
Note that if you have a firewall, you’ll need to allow IP protocols 4 and 41 between the hosts. It’s very important to note that these aren’t port numbers!
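With iptables, that could look something like this on A (a sketch; adjust the source address and tooling to your setup):
# Allow IP-over-IP (protocol 4) and IPv6-over-IPv4 (protocol 41) from the other endpoint
sudo iptables -A INPUT -s 192.0.2.2 -p 4 -j ACCEPT
sudo iptables -A INPUT -s 192.0.2.2 -p 41 -j ACCEPT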
Setting up OSPF
After setting up the underlying transport links between the hosts, it’s time to
set up the routing protocol. For this exercise, we are using bird2, so let’s get that installed on each host:
sudo apt install bird2
We then replace /etc/bird/bird.conf
with the following block on A:
log syslog all;
# Change this to an IPv4 address on the server. It must be unique per router.
router id 192.0.2.1;
protocol kernel {
scan time 60;
ipv4 {
export where source = RTS_OSPF;
};
}
protocol kernel {
scan time 60;
ipv6 {
export where source = RTS_OSPF;
};
}
protocol device {
scan time 60;
}
protocol ospf v3 {
ipv4 {
import all;
export none;
};
area 0 {
# Change these to the prefixes you want to run OSPF on.
stubnet 10.137.0.0/24;
stubnet 198.51.100.0/29;
interface "wg_b" {
type ptp;
cost 10; # change this based on the actual latency
hello 5; retransmit 2; wait 10; dead 20;
};
interface "wg_c" {
type ptp;
cost 50; # change this based on the actual latency
hello 5; retransmit 2; wait 10; dead 20;
};
};
}
protocol ospf v3 {
ipv6 {
import all;
export none;
};
area 0 {
# Change these to the prefixes you want to run OSPF on.
stubnet 2001:db8:1000::/48;
stubnet 2001:db8:2000::/64;
interface "wg_b" {
type ptp;
cost 10; # change this based on the actual latency
hello 5; retransmit 2; wait 10; dead 20;
};
interface "wg_c" {
type ptp;
cost 50; # change this based on the actual latency
hello 5; retransmit 2; wait 10; dead 20;
};
};
}
Out of laziness, we are using import all and export none and declaring the prefixes we want in OSPF with stubnet, but you can instead export routes obtained from other protocols into OSPF by configuring an appropriate export filter. Doing this is left as an exercise for the reader.
Naturally, if you are only after the VPN, don’t bother with /29s, and if you only want the backbone to split your /24 for BGP, don’t bother with the VPN prefixes.
For the kernel protocol, we are only exporting routes obtained via OSPF. A better export filter is required if you are using bird for other purposes, such as BGP. Also note that routes exported into OSPF on another host will show up with source = RTS_OSPF_EXT1 or RTS_OSPF_EXT2, so be prepared to allow those if you are exporting external routes into OSPF, e.g. with export where source = RTS_OSPF || source = RTS_OSPF_EXT1.
The cost should be based on the latency between hosts. I typically use half of the ping between the tunnel endpoints in milliseconds.9 For example, if the ping between A and B is 20 ms, I’d use a cost of 10. This allows OSPF to find the optimal path based on latency between endpoints. You can periodically recalculate this latency and update the bird configuration to make this more dynamic and reflective of real network conditions.10
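As an illustration, a periodic job along these lines could measure the latency and regenerate a cost definition that bird.conf includes; this is only a sketch of the idea (the peer address, file path, and wg_b_cost symbol are made up for this example), not my actual script:
#!/bin/sh
# Hypothetical helper: recompute the OSPF cost for the A-B tunnel from measured latency.
# bird.conf would contain: include "/etc/bird/cost_wg_b.conf"; and use cost wg_b_cost; on wg_b.
PEER=203.0.113.1
# Average round-trip time in milliseconds over 10 pings
RTT=$(ping -n -q -c 10 "$PEER" | awk -F/ '/^(rtt|round-trip)/ { print $5 }')
# Cost is half the RTT, rounded to the nearest integer, with a floor of 1
COST=$(awk -v rtt="$RTT" 'BEGIN { c = int(rtt / 2 + 0.5); if (c < 1) c = 1; print c }')
echo "define wg_b_cost = $COST;" > /etc/bird/cost_wg_b.conf
birdc configure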
Also, hello 5; retransmit 2; wait 10; dead 20; configures various timeouts for OSPF. This is a really popular set of numbers, since the default is commonly deemed way too generous and slow at detecting outages. You can naturally also use BFD to detect outages even quicker, but that’s left as an exercise for the reader.
Finally, if you are using OSPF over an unencrypted tunnel, you are advised to turn on authentication so that people can’t just inject random packets into the tunnel by spoofing. This can be done by adding the following snippet into each interface block:
authentication cryptographic;
# Change this to something unique.
# It has to be the same on both ends of the same tunnel.
password "hunter2";
Once bird is configured, reload it by running sudo birdc configure. It should tell you if there are any syntax errors and reload the configuration if it’s valid.
Repeat this process on every router, and you should be able to see the routes to other hosts in ip route. For example, on A, you should see something like:
$ ip -4 route | grep 'proto bird'
10.137.1.0/24 via 203.0.113.1 dev wg_b proto bird metric 32
10.137.2.0/24 via 203.0.113.3 dev wg_c proto bird metric 32
198.51.100.8/29 via 203.0.113.1 dev wg_b proto bird metric 32
198.51.100.16/29 via 203.0.113.3 dev wg_c proto bird metric 32
$ ip -6 route | grep 'proto bird'
2001:db8:1001::/48 via fe80::a:b:2 dev wg_b proto bird metric 32 pref medium
2001:db8:1002::/48 via fe80::a:c:2 dev wg_c proto bird metric 32 pref medium
2001:db8:2000:1::/64 via fe80::a:b:2 dev wg_b proto bird metric 32 pref medium
2001:db8:2000:2::/64 via fe80::a:c:2 dev wg_c proto bird metric 32 pref medium
If you don’t, something has gone terribly wrong. Check sudo journalctl -u bird.service to see if there are any errors.
You can also try debugging with sudo birdc show ospf neighbors
and see if any
neighbours are found. If not, then OSPF traffic is blocked somehow. If you are
using a firewall, remember to allow IP protocol 89 (remember, this is not a port
number!). Otherwise, run tcpdump
and hope you can figure it out.
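As a sketch, the firewall rule and a tcpdump invocation could look like this; bird’s OSPFv3 speaks over IPv6 even for the IPv4 topology, so it’s ip6tables that needs to allow protocol 89 (the wg_+ wildcard matches all the WireGuard interfaces):
# Allow OSPFv3 (IP protocol 89) in on the tunnel interfaces
sudo ip6tables -A INPUT -i wg_+ -p 89 -j ACCEPT
# Watch for OSPF hello packets on the tunnel to B
sudo tcpdump -ni wg_b ip6 proto 89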
If neighbours are found, then check out the topology by running sudo birdc show ospf state ospf1 and sudo birdc show ospf state ospf2 to see if it’s seeing the networks you expect. If not, something has gone wrong with defining the networks.
Finally, double check birdc show route protocol ospf1 and birdc show route protocol ospf2 to see if the routes made their way to the internal bird routing table. If so, then the problem has to do with exporting the routes to the kernel. Otherwise, something is wrong with importing routes from the OSPF protocol, and you may want to double check your import filter.
Turn on forwarding
At this point, you may find yourself unable to reach endpoints on the other
networks, even though the routes exist. This is because you need to turn on IP
forwarding. You’ll need to configure the following sysctls:
- net.ipv4.ip_forward=1
- net.ipv6.conf.all.forwarding=1
On Debian, this can be done by uncommenting these lines in /etc/sysctl.conf and then running sudo sysctl -p. At this point, you should be able to ping endpoints in the other locations.
If you are using BGP and announcing 198.51.100.0/24 in all locations, you should be able to see that 198.51.100.0/29 goes to A from the entire world. Useful tools for verifying this include ping.sx, ping.pe, and mtr.tools.
Conclusion
At this point, you should have your own mesh network that intelligently routes based on the lowest latency, and this can be used for a multi-site encrypted VPN or a backbone network, depending on your needs.
With OSPF, it can detect outages and route traffic around them. For example, if the link between A and C is down, it can send traffic from A to B to C. Similarly, if the latency from A to C is greater than the sum of the latency from A to B and B to C, then it would take the indirect route for better latency.
Armed with something like this, the possibilities are endless. In my case, I use the encrypted VPN for many things, such as running MariaDB replication and Galera for my PowerDNS anycast cluster, while cramming unicast IPs for a bunch of different locations into a single /24. I hope you found this post useful in building your own network.
Notes
1. This was back when I was a student, so my main concern was how cheap the servers were, not whether they were close by. ↩
2. And this was the moment I learned how ISPs, especially really cheap hosting ones, have terrible “scenic” routing instead of short and direct ones, which eventually started my journey towards playing around with my own BGP network. For more details about route selection, see my post on the topic. ↩
3. Remember that /24 is the minimum announcement size for IPv4, but due to IPv4 exhaustion, it is not in plentiful supply. So instead of getting a /24 for each location, you’d often want to fit all your locations into a single /24 to save money if they don’t actually need that many addresses. ↩
4. By “wave”, we typically mean renting a specific wavelength on an existing optical fibre owned by someone else. ↩
5. To work around this MTU issue, the ISP could bump the MTU on the underlying network to 1508, resulting in an MTU of 1500 inside the PPPoE connection. My home ISP, Bell Canada, does this. This is called “baby jumbo”, because “jumbo frames” refers to any Ethernet frame encapsulating an IP packet larger than 1500 bytes, but typically, jumbo frames are closer to 9000 bytes, not 1508. ↩
6. Network address translation, or NAT, is a hack to deal with IPv4 address exhaustion, breaking end-to-end connectivity in the process. Unless it’s a one-to-one mapping used by certain cloud providers to allow the same IP address to be pointed to different hosts, it will break a lot of tunnelling protocols, such as IP-over-IP or GRE. On residential routers, the DMZ host option may or may not work for the tunnel. Note that with NAT, the local IP address for the tunnel should be the private address inside the NAT. ↩
7. Note that FOU can’t encapsulate both IPv4 and IPv6 packets over the same tunnel, at least on Linux. For some reason, the Linux implementation makes it an encapsulated variant of IP-over-IP, requiring a single IP protocol number when the FOU tunnel is created; it’s unable to just read the first four bits of the IP header to see whether the packet is IPv4 or IPv6. Therefore, it will not work with OSPFv3 for IPv4, since for that, OSPF communications happen over IPv6. You can try separate tunnels and use OSPFv2 for the IPv4 part, but that’s not worth it. If you need UDP encapsulation and don’t want encryption, GUE is the better bet. If you want encryption, definitely go with WireGuard. ↩
8. Note that /48 is the minimum announcement size for IPv6. I can’t really think of a very good reason to do this instead of advertising separate /48s from each site, since IPv6 is cheap. ↩
9. I use half the ping since ping shows the round-trip time. Using half the ping makes the total cost of the path the time it takes to get a packet one way to the destination, assuming the route is symmetric. ↩
10. I have a script that does this every hour. Since my network is officially called “Dynamic Quantum Networks,” it’d be awkward if it’s not at least a bit dynamic. ↩