Massive #FAIL: Gawker is still down.

Posted on Nov 2, 2012

So Gawker and its family of sites (Jalopnik, Jezebel, io9, Lifehacker, etc.) have been down since Monday. This is a site that does between 30 and 40MM uniques a month. They’ve since shifted to a bunch of Tumblr blogs, and I even saw a former Gawker writer praise them for monetizing their new Tumblr existence with State Farm ads. This is all sorts of crazy and, more importantly, it just didn’t need to happen.

(Disclaimer: I don’t work for Gawker, have never had any involvement with them, and have only been following this with some interest because it was so avoidable. Some of my clients were in the direct path of Sandy, and while we didn’t experience any downtime, we were prepared.)

Apparently they’re hosted in a data center in downtown Manhattan that still lacks power, whose lower floors flooded, and whose pipes might even be cut. BuzzFeed, HuffPo, and even Fog Creek all experienced some amount of outage as well, yet theirs was brief.

Rather obviously, Gawker lacks geographic failover. I’m sure some sort of local redundancy exists: multiple webservers, DB boxes, and the like. Maybe even switches. But those are more likely in place for load than for redundancy, given that they’re still down today.

Now, Gawker was hacked back in 2010, which resulted in the release of their source code and database. I’ve not looked at it personally (nor should you ever admit to doing so), but I’ve seen some analysis, and it’s a fairly straightforward PHP and MySQL setup. This provides some valuable insight: there are no one-of-a-kind appliances or ornate setups in the mix. It’s basically code + DB, like most sites.

There are a million mitigation strategies one could use to allow geographic failover without any downtime or data loss. Database clusters, for example. I won’t go into those details here. I’d argue a site as large as Gawker should be using them; they do increase hosting costs (obviously), but that’d be marginal for the cash cow that is Gawker. But of course no such plan was in place. I think it’s rather obvious they didn’t have a fucking clue what to do, even with days of warning about what was coming. The news didn’t overplay this one.

What they should have done (and this is the “they didn’t prepare shit until the skies went grey out the window” scenario):

  • When the shit started getting real, put the site into read-only mode.
  • Dump the DB and anything else that might be user/editor-generated content (images, for example).
  • Move that critical data off-site into something safer, say S3. (First sketch after this list.)
  • Have EC2 instances (or similar) ready to become your backup webservers and database boxes. This costs almost nothing if they’re not actively running; they’re simply sitting around as AMIs ready to be launched. (And considering we’re 5 days out, they could have even started from scratch and accomplished this on Monday night.)
  • I believe most of the data centers warned customers when they were about to go kaput, given their generators were flooding. Spin up your backup instances now. (Better yet, move to them before the inevitable happens, as everything below 39th is rapidly becoming part of the East River. Second sketch below.)
  • Bring your code up to date by pulling from your code repository or using the backup from your primary boxes.
  • Load in the latest DB snapshot.
  • Change DNS so the site points at the new IPs. (Third sketch below.)
  • Resume Lohan updates and snark funnel.
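
To make the dump-and-ship steps concrete, here’s a minimal sketch. It assumes a MySQL primary you can reach with mysqldump and uses the AWS SDK for Python (boto3 today; in 2012 it would have been plain boto, same idea). The host, bucket, and paths are hypothetical placeholders, not anything from Gawker’s actual setup:

```python
# Sketch: dump the primary MySQL database and ship it (plus user uploads) to S3.
# Host, bucket, and paths are hypothetical placeholders.
import datetime
import subprocess

import boto3  # AWS SDK for Python; in 2012 this would have been plain `boto`

DB_HOST = "db1.internal.example"   # hypothetical primary DB box
BUCKET = "example-dr-backups"      # hypothetical S3 bucket
STAMP = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
DUMP_FILE = "/tmp/site-%s.sql.gz" % STAMP

# 1. Dump the database. --single-transaction gives a consistent InnoDB snapshot
#    without locking tables (and the site is in read-only mode anyway).
with open(DUMP_FILE, "wb") as out:
    dump = subprocess.Popen(
        ["mysqldump", "-h", DB_HOST, "--single-transaction", "--all-databases"],
        stdout=subprocess.PIPE,
    )
    subprocess.check_call(["gzip", "-c"], stdin=dump.stdout, stdout=out)
    if dump.wait() != 0:
        raise RuntimeError("mysqldump failed")

# 2. Ship the dump off-site.
s3 = boto3.client("s3")
s3.upload_file(DUMP_FILE, BUCKET, "db/%s.sql.gz" % STAMP)

# 3. Do the same for user/editor-generated files (images, attachments). A sync
#    tool is the right call for a whole tree; the one-file version looks like:
# s3.upload_file("/var/www/uploads/example.jpg", BUCKET, "uploads/example.jpg")
```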
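
The spin-up step is basically one API call per role, assuming you’ve already baked AMIs with your web and DB configuration on them. Another rough sketch, again with boto3, and with entirely hypothetical AMI IDs, instance types, and key name:

```python
# Sketch: launch standby web and DB boxes from pre-baked AMIs in a region
# nowhere near the storm. AMI IDs, instance types, and key name are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

STANDBY = {
    "web": {"ami": "ami-11111111", "type": "m1.large",  "count": 4},
    "db":  {"ami": "ami-22222222", "type": "m1.xlarge", "count": 1},
}

launched = {}
for role, spec in STANDBY.items():
    resp = ec2.run_instances(
        ImageId=spec["ami"],
        InstanceType=spec["type"],
        MinCount=spec["count"],
        MaxCount=spec["count"],
        KeyName="dr-keypair",  # hypothetical key pair
    )
    launched[role] = [i["InstanceId"] for i in resp["Instances"]]

print(launched)

# On the new DB box, pull the latest dump from S3 and load it, roughly:
#   aws s3 cp s3://example-dr-backups/db/<stamp>.sql.gz - | gunzip | mysql
```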
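
And the DNS flip, if your zone lives somewhere with an API like Route 53, is a single record change. The hosted zone ID, domain, and IP below are placeholders:

```python
# Sketch: repoint the site's A record at the standby front end.
# Hosted zone ID, domain, and IP are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # hypothetical zone
    ChangeBatch={
        "Comment": "Fail over to the standby stack",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",   # placeholder domain
                "Type": "A",
                "TTL": 60,                    # low TTL = fast cutover and cutback
                "ResourceRecords": [{"Value": "203.0.113.10"}],  # standby web IP
            },
        }],
    },
)
```

The low TTL is the part to get right ahead of time: drop it days before the storm so resolvers aren’t caching the old record when you actually need to flip.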

So shall we call it incompetence? Probably. That’s completely fair 5 days out. It apparently didn’t bother State Farm, but I’d guess advertisers and even employees are wondering why they deal with a place that treats its core product with such lax concern.