The technology behind the Facebook failure explained
The massive outage of Facebook, WhatsApp and Instagram raises a number of questions: What exactly is BGP, and should the Internet really rely on such a protocol?
On Tuesday, Facebook, WhatsApp and Instagram, as well as Workplace and Oculus VR, were suddenly unavailable worldwide. Attempting to open one of the services resulted in an error message. Only about six hours later did the affected services appear to be stable again.
In a blog post, Facebook named configuration changes to the routers that coordinate network traffic between the company's own data centers as the cause. This interruption of network traffic had a cascading effect and brought communication between the data centers to a standstill.
As a result of the faulty configuration, DNS name resolution apparently stopped working, and the IP addresses of the affected services' infrastructure could no longer be reached. How did that happen?
The Internet is a global, decentralized network that consists of many smaller, interconnected networks. These networks largely consist of hosts and intermediate systems, the so-called routers. Information can traverse a network along one of many possible paths. The process of selecting the path that is currently cheapest for forwarding is called routing. The routers, which are responsible for the functioning of the Internet, maintain huge, constantly updated tables of the possible routes over which network packets can be directed to their destination.
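To make the idea of a route lookup more concrete, here is a minimal sketch in Python. It assumes a tiny, made-up routing table in which every entry has a prefix, a next hop and a cost; the router names, prefixes and costs are purely illustrative and are not taken from any real network. The sketch prefers the most specific matching prefix and, among equally specific ones, the cheapest route. Real routers use considerably more elaborate selection rules.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop, cost). All values are made up.
ROUTES = [
    ("0.0.0.0/0", "router-a", 10),         # default route, matches everything
    ("198.51.100.0/24", "router-b", 5),
    ("198.51.100.0/24", "router-c", 2),    # same prefix, but cheaper
    ("198.51.100.128/25", "router-d", 7),  # more specific prefix
]


def lookup(destination, routes=ROUTES):
    """Return the next hop for destination, or None if no route matches."""
    addr = ipaddress.ip_address(destination)
    candidates = []
    for prefix, next_hop, cost in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            candidates.append((network.prefixlen, cost, next_hop))
    if not candidates:
        return None
    # Prefer the most specific (longest) prefix, then the lowest cost.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates[0][2]
```

In this toy table, a packet for 198.51.100.10 would be handed to the cheaper of the two /24 routes, while a packet for 198.51.100.200 would follow the more specific /25 route even though it costs more.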
The Internet's standard protocol for exchanging reachability information between networks, and for selecting paths between them, is called BGP. The abbreviation stands for Border Gateway Protocol. Put simply, without BGP the Internet's routers wouldn't know where to send traffic, and the Internet wouldn't work.
BGP enables a so-called autonomous system (AS for short), such as Facebook, to announce its presence to other networks on the Internet. If Facebook does not announce its presence, service providers and other networks cannot find the Facebook network. An autonomous system is an administrative domain, i.e. a network or a group of networks under a common administration with a common routing policy. Each autonomous system has a so-called autonomous system number (ASN), a unique identifier that can be thought of as a kind of address for the network as a whole. The autonomous system announces so-called prefixes, i.e. the IP address ranges it is responsible for, so that the network, in this case Facebook, can be found. These announcements are made via BGP.
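The effect of announcing and withdrawing prefixes can be illustrated with a small toy model in Python. This is only a sketch under simplifying assumptions: the class `ToyInternet` and its methods are hypothetical, the whole Internet is reduced to a single shared table, and the AS number 64500 and the prefix 203.0.113.0/24 come from ranges reserved for documentation. It is not an implementation of BGP itself.

```python
import ipaddress


class ToyInternet:
    """Toy model: maps announced prefixes to the AS number that announced them."""

    def __init__(self):
        self.table = {}

    def announce(self, asn, prefix):
        # The AS tells the rest of the network which address range it serves.
        self.table[ipaddress.ip_network(prefix)] = asn

    def withdraw(self, asn, prefix):
        # Only the AS that announced a prefix can withdraw it in this model.
        network = ipaddress.ip_network(prefix)
        if self.table.get(network) == asn:
            del self.table[network]

    def origin_of(self, address):
        """Return the AS number responsible for address, or None if unknown."""
        addr = ipaddress.ip_address(address)
        matches = [network for network in self.table if addr in network]
        if not matches:
            return None
        best = max(matches, key=lambda network: network.prefixlen)
        return self.table[best]


net = ToyInternet()
net.announce(64500, "203.0.113.0/24")
found_before = net.origin_of("203.0.113.7")   # the network can be found
net.withdraw(64500, "203.0.113.0/24")
found_after = net.origin_of("203.0.113.7")    # nobody knows where to send traffic
```

Once the prefix has been withdrawn, the lookup no longer yields an AS, which corresponds to other networks being unable to find the network in question.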
BGP is considered a comparatively simple protocol. It has been in use since the commercialization of the Internet and was long considered stable and reliable. With the rapid growth of the Internet over the past two decades, however, performance and security problems with BGP have repeatedly come to light.
Routing tables must be consistent with the network and are constantly updated by a BGP implementation in accordance with changes in the network infrastructure. Examples of such changes are failed and restored routers or broken and restored connections. Such occurrences are considered normal and occur all the time.
However, if a router is configured incorrectly, as was the case at Facebook, routes can evidently disappear from the routing tables. Without a working route, no one can reach a service from outside, which may be why the problem could apparently only be fixed by technicians with physical access to the routers in question.
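Why name resolution fails along with the routes can be sketched in a few lines of Python. The model below is a deliberate simplification and not a description of Facebook's actual setup: the name `service.example`, the name server address and the prefix are hypothetical values from documentation ranges, and a real resolver involves caches, several name servers and timeouts.

```python
import ipaddress

# Hypothetical data: the prefixes currently visible in the global routing table,
# one name server, and the records that name server would hand out.
ANNOUNCED = {ipaddress.ip_network("203.0.113.0/24")}
NAME_SERVER = ipaddress.ip_address("203.0.113.53")
RECORDS = {"service.example": "203.0.113.10"}


def reachable(address, prefixes):
    """An address is reachable only if some announced prefix contains it."""
    return any(address in prefix for prefix in prefixes)


def resolve(name, prefixes):
    """Return the address for name, or None if the name server is unreachable."""
    if not reachable(NAME_SERVER, prefixes):
        return None  # the query never arrives at the name server
    return RECORDS.get(name)
```

With the prefix announced, `resolve("service.example", ANNOUNCED)` returns an address. With an empty set of prefixes, the same call returns `None`, because the name server sits inside the address range that has vanished from the routing tables.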
Whether the alternatives to BGP proposed so far can really be a viable substitute for the protocol has been debated for some time. The Facebook outage shows impressively, however, that a new solution will be needed at some point: social networks, VR and a messenger are one thing, but such a misconfiguration could theoretically also hit critical infrastructure.