What is Wrong with Facebook

Earlier today Facebook was down or unreachable for most of you for approximately 2.5 hours. This is the worst outage we have had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself is invalid.
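As a rough illustration of that check-and-repair idea, here is a minimal Python sketch. It is not the actual system: the names `CountingStore`, `is_valid`, and `verify_and_repair`, and the rule that a valid value is a positive integer, are all assumptions made for the example.

```python
class CountingStore:
    """Hypothetical persistent store that counts how often it is read."""

    def __init__(self, data):
        self.data = dict(data)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


def is_valid(value):
    # Assumed validity rule, for illustration only: a positive integer.
    return isinstance(value, int) and value > 0


def verify_and_repair(cache, store, key):
    """Return the cached value, repairing it from the store if it is invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # Cached value is missing or invalid: replace it from the persistent store.
    fresh = store.read(key)
    cache[key] = fresh
    return fresh
```

With a bad cache entry and a good store, the first call repairs the cache and later calls never touch the store. If the store itself holds an invalid value, the cache can never be repaired, so every call turns into a store read.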

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error while querying one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
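The feedback loop can be reproduced with a toy simulation. The sketch below is a simplified model, not the production code: `Store`, `simulate`, the fixed per-tick capacity, and the assumption that all clients read the cache at the same instant are illustrative choices.

```python
class DatabaseError(Exception):
    """The hypothetical store could not serve the query (for example, overload)."""


class Store:
    """Stand-in for the database cluster: serves at most `capacity` queries per tick."""

    def __init__(self, value, capacity):
        self.value = value
        self.capacity = capacity
        self.queries_this_tick = 0

    def start_tick(self):
        self.queries_this_tick = 0

    def query(self):
        self.queries_this_tick += 1
        if self.queries_this_tick > self.capacity:
            raise DatabaseError("overloaded")
        return self.value


def simulate(clients, capacity, ticks, delete_on_error):
    """Return the number of store queries issued in each tick.

    The stored value is already valid, which models the period after the
    original bad value had been fixed.
    """
    store = Store(value="good", capacity=capacity)
    cache = {}  # shared cache, empty because clients had deleted the key
    history = []
    for _ in range(ticks):
        store.start_tick()
        # All clients check the cache at the same moment.
        missing = [c for c in range(clients) if "config" not in cache]
        for _client in missing:
            try:
                cache["config"] = store.query()
            except DatabaseError:
                if delete_on_error:
                    # The flaw: an error is treated like an invalid value.
                    cache.pop("config", None)
        history.append(store.queries_this_tick)
    return history
```

In this model, `simulate(10, 3, 4, delete_on_error=True)` keeps sending all ten clients to the store on every tick, because the failed queries erase the key that the successful ones just wrote. With `delete_on_error=False` the load drops to zero after the first tick.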

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we have turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
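The post does not describe what those new designs look like. Purely as an illustration of one commonly used pattern, the Python sketch below keeps store errors separate from invalid values, never deletes a cache key on error, and spaces out retries with capped exponential backoff and random jitter. All names (`StoreUnavailable`, `backoff_delays`, `refresh_config`) and parameters are hypothetical.

```python
import random
import time


class StoreUnavailable(Exception):
    """The hypothetical store failed to answer; this says nothing about the value."""


def backoff_delays(attempts, base=0.1, cap=5.0):
    """Yield capped exponential delays with random jitter, in seconds."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))


def refresh_config(cache, key, fetch, is_valid, attempts=4, sleep=time.sleep):
    """Return a valid value for `key`, or None if one cannot be obtained."""
    cached = cache.get(key)
    if is_valid(cached):
        return cached
    for delay in backoff_delays(attempts):
        try:
            fresh = fetch(key)
        except StoreUnavailable:
            # An error is not an invalid value: leave the cache alone and
            # wait before retrying so a struggling store can recover.
            sleep(delay)
            continue
        if is_valid(fresh):
            cache[key] = fresh
            return fresh
        # The store itself holds an invalid value; retrying cannot help.
        return None
    return None
```

The two early exits are the point of the sketch: an invalid stored value produces exactly one query per call instead of a retry storm, and a failing store sees progressively fewer, randomly spread retries.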

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.