How To Learn From a Traffic Surge

I want to say a few things for my own benefit. Maybe that's
the only thing I do here. As always, I hope something I have
is useful to someone else. In this case, if you're in any
position to deal with a big surge on a small site, you might
get something useful from, or at least enjoy, what I have to
say about my experience getting a bump from some guy named
Mike Arrington with a little blog called TechCrunch.

This is about reaction, and what was the right and wrong way
to react to the impact of a week's traffic in a couple of
hours. Had our typical traffic grown to these levels naturally
(time will bring this), the means to handle it on a day-to-day
basis would have been put in place.

The sudden increase began to time out our FastCGI processes,
and I was alerted to this quickly. I confirmed it, and my
first response was to initiate a restart cycle, restarting
each process in turn, which did nothing to help. I brought
up a new instance on EC2 and prepared to roll it out as a
new production machine, with the same steps I use for every
rollout of software updates. The new instance ran fine, so
I initiated the rollout, associating our public IP with the
new instance to begin taking traffic. Immediately, the
staging machine, now in production, stumbled and began
behaving exactly the same way.

My next thought was the obvious thing both machines shared:
the database. I started looking at any metrics I could, and
with nothing obvious and the site already failing to respond,
it seemed a safe bet to restart the database. After some
comments from the fine folks in ##postgresql, it seemed
possible that badly terminated transactions might have been
hanging processes, and I was advised to restart PG, which is
a disruptive action. When it finally cycled, my staging
machine seemed fine and I deployed it, only to watch it start
to suffer once again.

This was when I got a message that we had gotten the bump from Mike Arrington, over at TechCrunch. Everything suddenly made sense, and dropping into the logs showed me a huge surge in traffic. There are things I could probably improve about our setup, but I'm mostly satisfied with its progress. Still, this surge was well beyond what the setup was prepared for at the rate it was coming in, and it would be unreasonable to expect a site this size to scale that quickly for such a large and relatively short burst (a few hours).

In the end, my final call is that the biggest problem was that I didn't have information in front of me that made it obvious the traffic, and not the system, was causing the problem. Everything I did was only making the problem worse, and my best course of action would have been to step back and cross my fingers. I'm looking at short-term reports I can consult to give me a better overview of recent activity: traffic rates over the last hour and server error ratios that can tell me what's going on without spending too much time digging into it. The more time it takes to figure out what's going on, the more likely someone is to jump to a conclusion in an attempt to get a solution moving as quickly as possible.
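
To make that kind of report concrete, here is a rough sketch in Python of what I have in mind. It is not what we run. It assumes the web server writes access logs in the usual Common Log Format, which may not match your setup, and the name summarize and the keys it returns are made up for this example. It counts the requests in the last hour and works out what share of them came back as 5xx server errors.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: Common Log Format lines, for example
# 1.2.3.4 - - [10/Oct/2008:13:55:36 +0000] "GET / HTTP/1.1" 200 2326
# Only the timestamp and the status code are extracted.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def summarize(lines, now, window=timedelta(hours=1)):
    """Summarize requests in the window ending at `now` (timezone-aware).

    Hypothetical helper: returns the request count, the average
    requests per minute, and the share of 5xx responses.
    """
    total = 0
    errors = 0
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not look like access log entries
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if not (now - window < stamp <= now):
            continue  # outside the reporting window
        total += 1
        if match.group("status").startswith("5"):
            errors += 1
    minutes = window.total_seconds() / 60
    return {
        "requests": total,
        "per_minute": total / minutes,
        "error_ratio": errors / total if total else 0.0,
    }
```

Calling summarize(open("access.log"), datetime.now(timezone.utc)) on whatever log file you have (the path here is only a placeholder) gives a quick answer to "is this traffic or is this us?". A request rate far above normal alongside a rising error ratio points at a surge rather than a broken deploy.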
