Something Wrong with Facebook
The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the persistent store itself is invalid.
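The behavior described above can be sketched roughly as follows. This is an illustration only, not the actual system: `CountingStore`, `is_valid`, `read_config`, the `max_conn` key, and the "positive integer" validation rule are all hypothetical names and assumptions chosen for the example.

```python
# Hypothetical sketch: names and the validation rule are assumptions for illustration.
class CountingStore:
    """Stands in for the persistent store and counts the queries it receives."""

    def __init__(self, data):
        self.data = dict(data)
        self.queries = 0

    def get(self, key):
        self.queries += 1
        return self.data[key]


def is_valid(value):
    # Assumed validation rule: a configuration value must be a positive integer.
    return isinstance(value, int) and value > 0


def read_config(key, cache, store):
    """Return the cached value, repairing the cache from the store if it looks invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value
    fresh = store.get(key)  # every repair costs one query against the persistent store
    cache[key] = fresh      # written back even if the store's copy is itself invalid
    return fresh


# Transient cache problem: one repair query, then the cache serves every read.
cache = {"max_conn": None}
store = CountingStore({"max_conn": 100})
for _ in range(1000):
    read_config("max_conn", cache, store)
transient_queries = store.queries

# Invalid persistent copy: the repair never succeeds, so every read queries the store.
bad_cache = {"max_conn": None}
bad_store = CountingStore({"max_conn": -1})
for _ in range(1000):
    read_config("max_conn", bad_cache, bad_store)
persistent_queries = bad_store.queries
```

In this sketch a bad cache entry costs a single query, while a bad persistent value turns every read into a query.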
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
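A toy simulation can make the feedback loop concrete. It is a deliberately simplified model, not a reconstruction of the real system: the function `simulate`, the single shared cache key, the fixed `db_capacity`, and the rule that any failure in a tick leaves the key deleted are all assumptions made for illustration.

```python
def simulate(num_clients, db_capacity, ticks):
    """Return the number of database queries issued in each tick.

    Assumptions of this toy model: the persistent value has already been fixed,
    all clients share one cache key, and queries beyond db_capacity fail.
    """
    cache_has_key = False  # the key was deleted during the incident
    history = []
    for _ in range(ticks):
        if cache_has_key:
            history.append(0)  # every client is served from the cache
            continue
        queries = num_clients  # every client misses the cache and queries the database
        history.append(queries)
        failures = max(0, queries - db_capacity)
        # Successful queries write the key back, but a client that sees an error
        # treats it as an invalid value and deletes the key again.
        cache_has_key = failures == 0
    return history


overloaded = simulate(num_clients=1000, db_capacity=100, ticks=5)
throttled = simulate(num_clients=50, db_capacity=100, ticks=5)
```

Under these assumptions the overloaded run never settles, because each round of failures deletes the key and triggers the next round of queries. The throttled run recovers after a single round, since the load stays below what the databases can serve.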
The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
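The post does not say which design patterns are being considered. As one generic illustration of handling this failure mode more gracefully, the sketch below treats a failed query differently from an invalid value, keeps serving the last known good value, and backs off exponentially between repair attempts. All names (`StoreUnavailable`, `read_config`, `BASE_DELAY`, `MAX_DELAY`, the `state` dictionary) are hypothetical, and nothing here is meant to describe Facebook's actual design.

```python
BASE_DELAY = 1   # assumed initial wait between repair attempts, in seconds
MAX_DELAY = 30   # assumed upper bound on the wait


class StoreUnavailable(Exception):
    """Raised by the hypothetical store client when a query fails."""


def is_valid(value):
    # Assumed validation rule: a configuration value must be a positive integer.
    return isinstance(value, int) and value > 0


def read_config(key, cache, store_get, state, now):
    """Read a config value without deleting cache keys or retrying in a tight loop."""
    value = cache.get(key)
    if is_valid(value):
        state["good"] = value
        state["delay"] = BASE_DELAY
        return value
    if now < state["retry_at"]:
        return state["good"]  # still backing off: serve the last good value, no query
    try:
        fresh = store_get(key)
    except StoreUnavailable:
        fresh = None  # an error says nothing about the value; the cache is left alone
    if is_valid(fresh):
        cache[key] = fresh
        state["good"] = fresh
        state["delay"] = BASE_DELAY
        return fresh
    # The query failed or the persistent copy is invalid: wait longer before retrying.
    state["retry_at"] = now + state["delay"]
    state["delay"] = min(state["delay"] * 2, MAX_DELAY)
    return state["good"]


calls = 0

def failing_store_get(key):
    """Simulates a database cluster that rejects every query."""
    global calls
    calls += 1
    raise StoreUnavailable(key)


cache = {"max_conn": None}
state = {"good": 100, "retry_at": 0, "delay": BASE_DELAY}
results = [read_config("max_conn", cache, failing_store_get, state, now) for now in range(60)]
```

In this run, 60 reads against a failing store produce only a handful of queries, and callers keep receiving the last good value. A production version would normally add random jitter to the delay so that many clients do not retry in lockstep; it is omitted here to keep the example deterministic.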
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.