What's Wrong with Facebook

Early today Facebook was down or unreachable for many of you for about 2.5 hours. This is the worst outage we've had in over four years, and we wanted to first of all apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the persistent store is invalid.
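To make the mechanism concrete, here is a minimal sketch of a check-and-repair read path. It is an illustration under stated assumptions, not the actual system: `is_valid`, `get_config`, `CountingStore`, the key `feature_limit`, and the rule that a valid value is a non-empty string are all hypothetical.

```python
# Hypothetical sketch of a client that repairs invalid cached config values.
# Every name and the validity rule are assumptions made for illustration.

class CountingStore(dict):
    """Dict-backed stand-in for the persistent store that counts reads."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return super().__getitem__(key)


def is_valid(value):
    """Assumed validity rule: a config value must be a non-empty string."""
    return isinstance(value, str) and value != ""


def get_config(key, cache, persistent_store):
    """Return the config value for key, repairing the cache if needed."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # Cached value is missing or invalid: refetch from the persistent store.
    fresh = persistent_store[key]
    cache[key] = fresh
    return fresh


# Case 1: transient cache problem, store is good. One read repairs the cache.
good_store = CountingStore({"feature_limit": "100"})
good_cache = {"feature_limit": ""}
for _ in range(3):
    get_config("feature_limit", good_cache, good_store)

# Case 2: the store itself holds an invalid value. Every read hits the store.
bad_store = CountingStore({"feature_limit": ""})
bad_cache = {}
for _ in range(3):
    get_config("feature_limit", bad_cache, bad_store)
```

In the first case the store is read once and later calls are served from the cache. In the second case the repair can never succeed, so each of the three calls falls through to the store.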

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.
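The loop can be reproduced in a toy model. The sketch below is a simplified simulation, not a reconstruction of the real system: the client count, database capacity, tick structure, and all names (`ToyDatabase`, `run_tick`, `simulate`) are assumptions. In each tick, all clients read the cache at the same moment; if the key is missing they all query the database, which can serve only a fixed number of queries per tick.

```python
# Toy model of the feedback loop; every name and number is an assumption.

class OverloadedError(Exception):
    """Raised by the toy database when it gets too many queries in a tick."""


class ToyDatabase:
    def __init__(self, capacity_per_tick):
        self.capacity = capacity_per_tick
        self.load = 0

    def start_tick(self):
        self.load = 0

    def query(self, key):
        self.load += 1
        if self.load > self.capacity:
            raise OverloadedError(key)
        return "good-value"


def run_tick(cache, db, clients, delete_on_error):
    """All clients read the cache at once, then act on what they saw."""
    db.start_tick()
    if cache.get("cfg") == "good-value":
        return 0  # every client is served from the cache
    queries = 0
    for _ in range(clients):
        queries += 1
        try:
            cache["cfg"] = db.query("cfg")
        except OverloadedError:
            if delete_on_error:
                # The flaw: an error is treated like an invalid value,
                # so the key is deleted and the next tick must query again.
                cache.pop("cfg", None)
    return queries


def simulate(delete_on_error, clients=100, capacity=10, ticks=5):
    """Return the number of database queries issued in each tick."""
    db = ToyDatabase(capacity)
    cache = {}
    return [run_tick(cache, db, clients, delete_on_error)
            for _ in range(ticks)]


with_flaw = simulate(delete_on_error=True)
without_flaw = simulate(delete_on_error=False)
```

With deletion on error, the failed queries erase the value that the successful ones just wrote, so the load never drops: `with_flaw` is `[100, 100, 100, 100, 100]`. When errors leave the cache alone, the first tick's successful queries populate the cache and the load falls to zero: `without_flaw` is `[100, 0, 0, 0, 0]`.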

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
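One widely used pattern for damping this kind of loop is to retry with capped exponential backoff and random jitter, and to keep serving the last known value when the backend keeps failing. The sketch below illustrates that general technique only; it is not the design Facebook adopted, and `backoff_delays`, `refresh_with_backoff`, and their parameters are hypothetical.

```python
import random

# General-purpose sketch of backoff with jitter plus a stale-value fallback.
# Names and default parameters are assumptions made for illustration.


def backoff_delays(base=0.1, cap=30.0, attempts=6, rng=random.random):
    """Yield capped exponential backoff delays with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def refresh_with_backoff(fetch, stale_value, sleep, **kwargs):
    """Try to fetch a fresh value; after repeated errors keep the stale one."""
    for delay in backoff_delays(**kwargs):
        try:
            return fetch()
        except ConnectionError:
            # Wait longer after each failure instead of retrying immediately.
            sleep(delay)
    return stale_value


calls = {"n": 0}


def flaky_fetch():
    """Stand-in backend that fails twice, then recovers."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database overloaded")
    return "fresh"


def always_failing_fetch():
    raise ConnectionError("database overloaded")


slept = []
result = refresh_with_backoff(flaky_fetch, "stale", slept.append)

slept_failing = []
fallback = refresh_with_backoff(always_failing_fetch, "stale",
                                slept_failing.append, attempts=4)
```

Passing `slept.append` as the `sleep` argument records the delays instead of waiting, which keeps the example fast; real code would pass `time.sleep`. The jitter spreads retries from many clients over time, and the stale fallback means a failing backend does not translate into a deleted value and a fresh wave of queries.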

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.