
December 25, 2005

Santa Tracker

Last night while driving around I was reminded of something I hadn't thought of in a LONG time: the NORAD Santa Tracker. Looking at it last night, I could see the site has changed drastically since I was involved in it many years ago. It's really awesome to see just how much it has expanded, how well it's running, and frankly how popular it's become (it was even mentioned on Fark.com, twice in one day).

In honor of its success, I feel I should share the story of what was probably its worst year ever, partly to capture a bit of history, but also so that you might get some idea of how much work really goes into this.

Back in 1997 I was working for a company called Analytical Graphics (AGI), which made a piece of software that could plot and track various bits of data typically relegated to military use. Since years before I was born, NORAD had been running a yearly "Santa tracker" via a telephone calling system. Rumor had it that it began after a child one year found the number for NORAD, called, and asked where Santa was at the moment. It became a pretty cool and fun tradition after that. Somewhere along the way, it was decided that putting the Santa Tracker online would be a really great idea, and that the team at AGI was ready to help make this happen.

For weeks the execs at AGI, the creative department, and one of the IT staff had been working on creating what they thought would be the coolest online version of the site ever (granted, it was the first version too). Apparently it was just a little too cool. On December 24th at about 9am, I arrived at work as usual and headed upstairs to work on some final development bits and pieces. The network was running beyond slow and into the realm of unusable. Since my position was time shared between IT and development, I went downstairs to see if there was something I could help with. There was a lot of chaos at the time, but from what I can remember and have pieced together, a few things went wrong.

Here is a bit of the timeline as I remember it; much of it has probably been forgotten or skewed by time.

9 AM Dec 24th - I arrive downstairs to see a flurry of activity. Our head IT/Network guy, Dave, is scanning through log files and router traffic trying to decipher why this influx of traffic is arriving. Our web-master is sitting in front of the web-server machine sweating bullets as his machine is non-responsive. He proceeds to reboot it multiple times, but never really regains control of it.

9:30 AM Dec 24th - By now I've become involved in trying to restore network stability. The real issue is that we have external sites VPN'd in, and once that connection is lost, reestablishing it becomes an utter pain in the ass. Dave and I soon become aware of what is causing the network traffic spike, and look for some answers.

9:35 AM Dec 24th - It's painfully obvious this wasn't the expected response for the site and a huge disaster has now hit. Dave and I begin to step in and take the logical steps to save what has already failed. We begin by calling our upstream provider to see what kind of services they could offer us for hosting space.

10 AM Dec 24th - MSNBC declares the NORAD Santa Tracker to be one of the worst failures of Internet-dom. The head web-master noticed this and printed out a copy of the page for all of us to read during the process of the rescue.

10:30 AM Dec 24th - Our upstream provider has not moved quickly enough and is fast becoming a non-option. Dave and I have already set up round robin DNS'ing and replicated the web server to multiple machines. The machines themselves are responsive, but we still have no bandwidth left to serve pages. I take a chance and contact a former employer of mine, Microserve Information Systems, and ask if they might have some bandwidth they could burn and disk space to spare. I explain the situation, and with no hesitation the head of sysadmin there, Dan, agrees to help us. In a matter of 5 minutes we have multiple machines on a different network which we can upload pictures to.

10:32 AM Dec 24th - We take the external sites offline and use a series of multiplexed dial-up connections to upload the images to Microserve.

10:38 AM Dec 24th - We begin to change three of the four local web servers to send people to Microserve for the images. We were planning to set up the entire site there later, but DNS propagation stopped this from happening. We could round robin internally, but not to external sites.

11 AM Dec 24th - Dan calls to let us know he's seen a drastic spike in network traffic. We haven't yet seen it locally, but are happy that people are now being served the pages properly. The web-master has been busy creating lower resolution graphics that he's begun to distribute out everywhere now.

11:10 AM Dec 24th - Our upstream provider is finally able to share some bandwidth and space with us. We begin to spread ourselves further and set up yet another web-server to distribute the content from the upstream provider.

12 PM Dec 24th - Lunch is purchased for us by upper management, who have been watching with concern the entire time.

1 PM Dec 24th - We get to eat lunch.

1:30 PM Dec 24th - CNN interviews various members of the Santa staff for updates on Santa's location, how the tracking is done, and how we were able to coordinate this back to the website.

2 PM Dec 24th - Traffic levels begin to level off, and things seem to be okay. Everyone exchanges cellular numbers again, schedules check-in times, and promises to be on call all night long. We break for Christmas, all a little wary.

8 PM Dec 24th - After a 3-hour drive back to my parents', I catch a little of CNN, which is regularly advertising the URL and showing updates from it. I check the servers and everything seems to be handling the load okay.

The rest of the night went rather quietly. The servers stayed fairly stable, I could get to a lot of the content without problems, and more importantly CNN could too.
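
For readers unfamiliar with the round robin DNS trick mentioned in the timeline, here is a minimal Python sketch of the idea. It is only an illustration, not the setup used in 1997: the class name `RoundRobinRecords` and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical. Each lookup returns the same set of addresses in a rotated order, so clients that simply take the first answer get spread across the replicated servers.

```python
from collections import deque


class RoundRobinRecords:
    """Toy model of round robin DNS: rotate the address list on every lookup."""

    def __init__(self, addresses):
        # Hypothetical server addresses; the real ones are not known.
        self._addresses = deque(addresses)

    def resolve(self):
        """Return the current ordering, then rotate so the next lookup differs."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # move the first address to the end
        return answer


if __name__ == "__main__":
    records = RoundRobinRecords(
        ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
    )
    for lookup in range(6):
        # Most clients just connect to the first address in the answer.
        print(f"lookup {lookup} -> {records.resolve()[0]}")
```

Note that rotation only spreads requests across machines; it does not add bandwidth, which fits the situation in the timeline where the replicated servers were responsive but the line itself had nothing left to give.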

When I returned to work after the weekend, it was interesting to see the massive amounts of email we received about the site. Many were thankful for such an awesome website, and said how it made their kids' day to see where Santa was at that exact moment. What really confused me was the number of emails received complaining about who we used for hosting.

Important pieces that we learned in the aftermath:

  • During the planning of the Santa Tracker, no one kept our head sys-admin/network guy in the loop. When traffic on our internal network began to spike somewhere around 7am, he was at a loss to explain it.
  • The lone member of the IT staff on the team made a series of false assumptions. His first was that this site could be hosted on our corporate T1 line. As we had never once seemed to hit the maximum capacity of the line, it seemed plausible that a low volume site could be handled by it. The key here being low volume/traffic. Due partially to a long-standing UNIX vs Windows pride bet, he also believed his lone Windows NT 4 server could host the entire website with no difficulties. It could certainly handle everyone on the test and design team hitting it, but I don't believe any serious stress testing had ever been done on the machine.
  • The most important lesson was to create a disaster recovery plan. They were so sure of how everything would work that there was no fallback plan in case things just didn't work at all.
  • Days later we did some tallying from log files and believe we served some 6 million unique IP addresses in that short period of time. This did not include the crunch time early in the morning when everything just stopped working. We estimate that had to be a couple million more; all told, our estimate was about 10 million. A consultant at the company somehow injected his hand into everything too, and that upset a bunch of people.
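
The tally described in the last point can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used at the time: it assumes each log line starts with the client IP address followed by whitespace (as in the Common Log Format), which may not match what the real servers wrote, and the function name `count_unique_clients` and the sample lines are made up.

```python
def count_unique_clients(log_lines):
    """Count distinct client addresses in an iterable of log lines.

    Assumes the client IP is the first whitespace-separated field,
    which may not match the real server's log format.
    """
    seen = set()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            seen.add(fields[0])
    return len(seen)


if __name__ == "__main__":
    # Made-up sample lines using documentation-range addresses.
    sample = [
        '198.51.100.7 - - [24/Dec/1997:09:01:12 -0500] "GET / HTTP/1.0" 200 5120',
        '198.51.100.7 - - [24/Dec/1997:09:01:15 -0500] "GET /santa.gif HTTP/1.0" 200 40960',
        '203.0.113.42 - - [24/Dec/1997:09:02:03 -0500] "GET / HTTP/1.0" 200 5120',
    ]
    print(count_unique_clients(sample))
```

A count like this measures distinct addresses, not distinct people, and it can only see requests that were actually logged, which is why the early morning crunch had to be estimated separately.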

A few months later, I was invited to a dinner where I received a nice little plaque from NORAD for my efforts. I still have it today.

The second year went much smoother, partially because we outsourced the hosting to IBM, who boasted that we would be using their Olympics server farm (we crashed that too).

Posted by Dan at December 25, 2005 02:58 PM
