This is not a test
Unfortunately, we lost one of our data centers today at around noon (our automated health system alerted us at 12:06), which brought down several of our servers simultaneously (apparently the hosting service is having power problems). We've had to deal with hosting problems before, so we've not only built up our capacity but also designed our system to be able to route around server failures. Today was our first real-world trial by fire.
We lost a large chunk of our infrastructure, but we stayed partially operational (about 40%). There were a few hiccups with the process, but we were able to get back to full functionality (minus one data center) in about 20 minutes (12:30). Not bad for our first real trial where we lost an entire data center (but we can do a lot better!). We learned a lot from this outage, including finding a couple of bugs in our re-routing system. This means that next time we should be able to route around failures much more quickly! As we refine this system, we're getting closer to our goal of Grazr never going down.
I apologize for the flaky Grazr behavior during those 20 minutes.