Will everybody stop tweeting "Facebook is down"? I *know* that already, people, even if I weren't using a super-fancy Web site monitoring system - I am on Twitter :-).
So we know that Facebook had some pretty serious performance problems today. I'd like to help deconstruct the situation, because it's pretty trivial to know when a site is down, but less trivial to ascertain what is causing the performance problem, or how you can prevent such downtime. To help me in the analysis, I picked out my favorite arsenal of tools - Twitter and the free KITE. Running an instant measurement on KITE takes about 60 seconds, and it tests the Facebook URL from 5 cities: San Francisco, New York, London, Hong Kong and Frankfurt. From SF, the measurement is taken both from a real DSL line and from a high-speed backbone. The test results showed:
Basically, this told me that the page failed to load from Frankfurt, partially loaded in Hong Kong, and was slow in the other cities. But wait, San Francisco took 7.089s on DSL, but 34.399s on a high-speed connection (like a T1 or T3 line). Whoa! This couldn't be true all the time - perhaps it had to do with the timing of when each request was made. And anyway, it was just a single test. What was really happening across all cities? For this, I needed to look at the data in the paid Keynote service.
Looking at a measurement of Facebook over the past 24 hours, I see some more comprehensive data that tells me how good or bad the situation really was. What our Keynote dashboard tells us is that, over the last hour, availability fell to as low as 43.59% for a transaction (go to Facebook's home page; log in) and to 66.67% for just its home page. But availability is coming back up - over the last 15 mins it was 83%, and now, over the last 5 minutes, it's 91.67%. So Facebook is still unavailable for some users, but it's getting better, and will soon be up completely.
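For readers who want to see how rolling availability figures like the ones above can be derived, here is a minimal sketch. It assumes availability is simply the share of successful measurements inside a trailing time window; the function name `availability`, the sample data and the timestamps are all made up for illustration, and this is not a description of how the Keynote dashboard works internally.

```python
from datetime import datetime, timedelta

def availability(samples, now, window):
    """Return the percentage of successful samples in the trailing window.

    samples: iterable of (timestamp, succeeded) pairs, one per measurement.
    Returns None when no measurement falls inside the window.
    """
    cutoff = now - window
    recent = [ok for ts, ok in samples if cutoff < ts <= now]
    if not recent:
        return None
    return 100.0 * sum(1 for ok in recent if ok) / len(recent)

# Hypothetical data: one measurement per minute for the last hour.
# Everything older than 30 minutes failed, plus one failure 3 minutes ago.
now = datetime(2010, 1, 1, 12, 0)  # arbitrary reference time
samples = [
    (now - timedelta(minutes=m), m < 30 and m != 3)
    for m in range(60)
]

for minutes in (60, 15, 5):
    pct = availability(samples, now, timedelta(minutes=minutes))
    print(f"last {minutes:>2} min: {pct:.2f}%")
```

The shorter the window, the faster the number reacts, which is why the 5-minute figure recovers well before the 1-hour figure does.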
What really happened was that availability ping-ponged between 100% (available all the time) and 0% (not available at all), starting at 2:00 pm ET today, as shown in the graph below - the problems were all with DNS servers not responding (or timing out):
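To make "DNS servers not responding (or timing out)" a bit more concrete, here is a rough sketch of how a single DNS lookup could be classified as ok, failed or timed out. It is only an illustration and not how the Keynote agents do it: `check_dns` is a hypothetical helper, the 5-second timeout is an assumption, and because Python's `socket.getaddrinfo` has no timeout argument the lookup runs in a worker thread so we can stop waiting for it.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def check_dns(hostname, timeout=5.0, resolver=socket.getaddrinfo):
    """Classify one DNS lookup as 'ok', 'timeout' or 'error'.

    Returns (status, elapsed_seconds). The resolver is injectable so the
    function can be exercised without touching the network.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(resolver, hostname, 80)
    try:
        future.result(timeout=timeout)
        status = "ok"
    except FutureTimeout:
        # No answer within the deadline (assumed 5s by default).
        status = "timeout"
    except OSError:
        # socket.gaierror is a subclass of OSError: the lookup failed outright.
        status = "error"
    finally:
        # Stop waiting; a stuck lookup keeps running in the background thread.
        pool.shutdown(wait=False)
    return status, time.monotonic() - start
```

Calling `check_dns("www.facebook.com")` every few minutes and logging the status would give you a crude version of the up/down pattern in the graph, from one location only.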
A reporter asked me today how Facebook's availability generally is - and my answer was that it's excellent. For the 30 days ending Sept 16, 2010, Facebook's availability was 99.41%, and the page loaded in 0.582 seconds. All measurements reported here were taken from 10 US cities, every 15 mins using a real Internet Explorer browser.
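To put 99.41% in perspective, here is a back-of-the-envelope conversion into hours. It assumes, loosely, that the share of failed measurements maps directly onto the share of time the site was unreachable, which is only an approximation since the measurements are sampled every 15 minutes; `downtime_hours` is a made-up helper name.

```python
def downtime_hours(availability_pct, days):
    """Hours of unavailability implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * days * 24

hours = downtime_hours(99.41, 30)
print(f"{hours:.1f} hours over 30 days")
```

Under that assumption, 99.41% over 30 days works out to a bit over four hours in total.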