Wednesday, September 30, 2009

How To Invest in Poor Decision Makers

I couldn't think of a better title to fit my "How To ..." pattern. The point is, I wanted to respond to the 37signals post, which I found a little harsh. Sure, if you were able to build your company without investors, that's a great thing! It doesn't make it a terrible thing to get a boost in the early stages, and it doesn't give you a license to insult people trying to pay the bills and put children through college.

Making great products is something a lot of us aspire to. Frankly, that simply isn't all of us, and there really are good developers out there who are only in it for the money. I don't know if that's the case with Mint.com, but neither does anyone over at 37signals. Belittling them for taking a quick-cash option assumes a lot about their intentions that may be completely wrong.

Now with a chunk of change, maybe the founders are planning to jump ship in a couple years and self-fund their real dreams.

On the matter of start-up investment itself, I do want to make some comments. Full disclosure: I've never been involved in a venture-backed startup, and I'm making this up entirely from my own opinions about the world!

Pretend I'm from your bank, calling you back about a loan application. You're taking out a small business loan to build an additional room in your home for a new child. Everything looks good, and I've got a few questions to go over before approving the loan.

"I'd like to make you an offer for 10% ownership in exchange for this investment in your new venture," I begin.

"What the hell are you talking about?" you quizzically respond.

"We're talking about a significant investment in a potentially very profitable new enterprise. This child may well become a doctor or lawyer and if we're going to help with the initial costs of raising this from the ground up, we all feel it is a reasonable request to share part ownership and benefit from that share over the lifetime of its profitability."

"Umm... I thought I'd pay the loan back. Plus interest, even. I don't even think I would own the child myself, technically. This is very strange..."

"Pay us back? A guarantee of interest accumulated as profit on our contribution? We'd rather take a chance of nothing or you paying us regularly for the rest of the child's entire lifespan. Oh, and all of it's children, of course."

*click*

If we look at everything in our world with neutral eyes that aren't used to our ways, things look weird. Does our investment model make sense, in this industry or any other? Why are initial investments not set up as high-risk, high-interest loans, most likely with some initial grace period to await profitability? Of course, we could make some comments about predatory loans and paying a cut of income for the rest of one's life, but at least banks pretend that isn't the deal upfront.

It isn't like this is an unusual idea. People get small business loans all the time. The tech sector seems to have skewed expectations that lead to dangerous and strange arrangements for funding. Still, I can't help but wonder if there are independent investors who would, or do, take such a (relatively) altruistic route. I imagine something like a traditional investment round, mandating a grace period of 1-2 years, a repayment schedule requiring full reimbursement, and interest accumulation that tapers off after repayment of the initial investment.
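To make the shape of that concrete, here's a toy sketch with made-up numbers; nothing here reflects any real deal, just the grace period, full repayment, and tapering interest I'm imagining:

def payment_schedule(principal, monthly_rate,
                     grace_months=24, repay_months=36, taper_months=12):
    """Yield one payment per month: a grace period, repayment of the
    principal plus interest on the remaining balance, then interest
    that tapers off to nothing instead of running forever."""
    balance = principal
    for _ in range(grace_months):
        yield 0.0  # grace period: nothing due while awaiting profitability
    installment = principal / repay_months
    for _ in range(repay_months):
        yield installment + balance * monthly_rate
        balance -= installment
    for month in range(taper_months):
        # after full reimbursement, the investor's cut winds down
        yield principal * monthly_rate * (1 - (month + 1) / taper_months)

# e.g. a $500,000 round at 1% a month, fully settled in six years:
total_repaid = sum(payment_schedule(500000, 0.01))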

The basic foundation could be extended to view all initial players as investors, be they investors of time or money. Invest your time to get a business started, helped by monetary investments from others, and after repaying yourself and those individuals, the company becomes its own entity. It is not burdened with paying out profit shares to you or anyone else. Yes, you'll still make your salary and you'll still run the company, but it might be a better one for it.

Sunday, September 27, 2009

How To Learn From a Traffic Surge

I want to say a few things for my own benefit. Maybe that's the only thing I do here. As always, I hope something I have is useful to someone else. In this case, if you're in any position to deal with a big surge on a small site, you might get something useful from, or at least enjoy, what I have to say about my experience getting a bump from some guy named Mike Arrington with a little blog called TechCrunch.

This is about reaction: the right and the wrong ways to react to the impact of a week's traffic in a couple of hours. Had natural growth brought our typical traffic to these levels (time will get us there), the means to handle it on a day-to-day basis would already have been in place.

The sudden increase began to time out our FastCGI processes, and I was alerted to this quickly. I confirmed it, and my first response was to initiate a restart cycle, restarting each process in turn, which did nothing to help. I brought up a new instance on EC2 and prepared to roll it out as a new production machine, with the same steps I use for every rollout of software updates. The new instance ran fine, so I initiated the rollout, associating our public IP with the new instance to begin taking traffic. Immediately, the staging machine, now in production, stumbled and began behaving exactly the same.

My next thought was the obvious thing both machines shared: the database. I started looking at any metrics I could, and with nothing obvious and the site already failing to respond, it seemed a safe bet to restart the database. After some comments from the fine folks in ##postgresql, it seemed possible that badly terminated transactions were hanging processes, and I was advised to restart PostgreSQL, a disruptive action. When it finally cycled, my staging machine seemed fine and I deployed it, only to watch it start to suffer once again.

This was when I got a message that we had gotten the bump from Mike Arrington over at TechCrunch. Everything suddenly made sense, and dropping into the logs showed me a huge surge in traffic. There are things I could probably improve about our setup, but I'm mostly satisfied with its progress. Still, this surge was well beyond what the site was prepared for, at the rate it was coming in, and it would actually be unreasonable to expect a site this size to scale that quickly for such a large and relatively short burst (a few hours).

In the end, my final call is that the biggest problem was that I had no information making it obvious that it was the traffic, and not the system, causing the problem. Everything I did only made things worse, and my best course of action would have been to step back and cross my fingers. I'm now looking at short-term reports I can consult for a better overview of recent activity: traffic rates over the last hour and server error ratios that can tell me what's going on without too much digging. The more time it takes to figure out what's going on, the more likely someone is to jump to a conclusion in an attempt to get a solution moving as quickly as possible.
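As a sketch of the kind of report I mean, here's roughly what I'd run against an access log. The log path and format are assumptions (common log format), not my actual setup:

import re
import time

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
# Matches the timestamp and status code of a common-log-format line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3})')

def report(window_seconds=3600):
    """Print the request rate and server error ratio for the last hour."""
    now = time.time()
    total = errors = 0
    with open(LOG_PATH) as log:
        for line in log:
            match = LINE_RE.search(line)
            if not match:
                continue
            # e.g. "27/Sep/2009:14:03:05 -0400"; ignoring the offset is
            # good enough for a rough report.
            stamp = time.mktime(time.strptime(
                match.group("ts").split()[0], "%d/%b/%Y:%H:%M:%S"))
            if now - stamp > window_seconds:
                continue
            total += 1
            if match.group("status").startswith("5"):
                errors += 1
    ratio = errors / float(total) if total else 0.0
    print("last hour: %d requests (%.1f/min), %.1f%% server errors"
          % (total, total / (window_seconds / 60.0), ratio * 100))

if __name__ == "__main__":
    report()

Even something this crude would have told me in seconds that the machines were fine and the traffic was not.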

Saturday, September 26, 2009

How To Turn Web Development Around (Part 3)

When I complained about the problem, I promptly outlined some ideas about solving it, vaguely. Now, I want to narrow that outline into systems I actually use. I do most of my work with Django, some hobby time is spent with App Engine and Twisted, and I enjoy Amazon Web Services, so I'm thinking from these perspectives when I approach this. Parts one and two were broad, but some of this might apply to fewer of you. Either ignore those parts or adapt them to whatever you use.

Django's cache layer sucks. Simply stated and simply true. Any time I decide I can cache something, I should ask myself if I could have built it before I even had the request in the first place. Doing that with the template caches simply isn't possible. It should be possible, and it should be the first path you take, instead of the framework forcing us to go out of our way to do the better thing. Anything I might want to cache, I also might want to be sure I'm not building in more than one place, and forcing cache blocks inline in my templates does not help. The template caches imply a copy-and-paste method of reuse when a cached portion is used in more than one place.

When I define a cache block, I name it and I specify a set of keys. This is exactly the information that, when changed, should trigger regenerating that block as a static snippet to be inserted. If it weren't for the lacking reuse mechanics, I would advocate parsing all your templates for cache blocks and pre-generating them. Instead, we need to pull the cached contents out of the normal templates and use the existing names and keys to find the generated snippets.
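As a minimal sketch of what I mean, assuming hypothetical build_snippet/get_snippet helpers; the only real Django pieces here are the cache and the template loader:

from django.core.cache import cache
from django.template.loader import render_to_string

def _snippet_key(name, keys):
    return "snippet:%s:%s" % (name, ":".join(str(k) for k in keys))

def build_snippet(name, keys, template, context):
    """Render a fragment ahead of any request, stored under the same
    name and keys a {% cache %} block would have identified it by."""
    cache.set(_snippet_key(name, keys), render_to_string(template, context))

def get_snippet(name, keys):
    """Fetch a pre-generated snippet; any template or view can reuse it."""
    return cache.get(_snippet_key(name, keys))

build_snippet would be called from whatever signal or job fires when the underlying data changes, never from the request itself.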

On a more basic level, there are some abstractions that need to be injected into Django proper to really be useful, by means of what they would standardize. We have no current means of standardizing our cache keys in a way that lets different applications cooperate about what data is where and how to get it. Even the types that are taken for granted in Django have no useful standards here. If they did, I would be able to drop a QuerySet object into the cache in a way that another query could find and reuse. And, since memcached is by far the most likely cache backend, we would be providing a mechanism that abstracts away its limitations on entry size, allowing us to trust dropping our QuerySet in safely.
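For illustration, a rough sketch of hiding the entry-size limitation; the helpers and the size figure are my assumptions, not anything Django provides:

import pickle
from django.core.cache import cache

CHUNK = 900 * 1024  # stay safely under memcached's roughly 1MB entry limit

def set_large(key, obj):
    """Pickle an object (pickling a QuerySet evaluates it) and spread the
    bytes across as many cache entries as needed."""
    data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    for i, chunk in enumerate(chunks):
        cache.set("%s:%d" % (key, i), chunk)
    cache.set(key, len(chunks))  # write the index entry last

def get_large(key):
    """Reassemble the entry, treating any missing chunk as a full miss."""
    count = cache.get(key)
    if count is None:
        return None
    parts = [cache.get("%s:%d" % (key, i)) for i in range(count)]
    if any(part is None for part in parts):
        return None  # a chunk was evicted out from under us
    return pickle.loads(b"".join(parts))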

Denormalization should be normal. I have revision tracking in a document system, and from a normalization perspective it makes sense for each version to hold a foreign key to either its previous or its next version, but not both. From a practicality perspective, if I have one version, I want to know the previous and next versions without doing a new query. Our Resources might offer a solution, by giving us some place outside our models for denormalized data. I could generate a record for each of my documents with all the revision information queried, built, and stored in one flat record, while keeping my base model clean.
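A sketch under assumed names (the revisions manager, the created field, and the key format are all hypothetical):

from django.core.cache import cache

def build_revision_resource(document):
    """Flatten a document's revision chain into one record, so a single
    fetch answers both "what came before?" and "what comes next?"."""
    revisions = list(document.revisions.order_by("created"))
    flat = []
    for i, rev in enumerate(revisions):
        flat.append({
            "id": rev.pk,
            "previous": revisions[i - 1].pk if i > 0 else None,
            "next": revisions[i + 1].pk if i + 1 < len(revisions) else None,
        })
    cache.set("document:%s:revisions" % document.pk, flat)
    return flat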

Queuing work should be as accessible as doing work. There is little or nothing inhibiting a developer from dropping one little query or action into an existing operation. I've recently built a weighted sort to replace our basic date-and-time ordering for posts. This means generating scores for all the posts and updating those scores when posts or votes change. Whenever we calculate scores, we now account for the age of all votes and the relative scores and ages of all posts and votes together. In other words, this is something I'd prefer not to add to the cost of a user actually posting content or voting on something. It would have been extremely easy for me to call one generate_scores() function inline, but it takes thought, planning, and infrastructure to have this done after the request is handled.

Borrowing from existing Python canon makes sense, so I think multiprocessing is a candidate for use here, in one form or another. multiprocessing.Pool.apply_async(), without a result returned, fits the bill as an interface for calling some function at another time, possibly in another process. Any function that works when passed through multiprocessing into another process should also work when queued up for execution at some later time, so borrowing here reuses existing semantics developers should already be familiar with.
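To make that concrete, a toy sketch; record_vote and the stubbed generate_scores are stand-ins for the real work:

import multiprocessing

def record_vote(user_id, post_id):
    """The cheap, in-request work: a single row insert (stubbed here)."""

def generate_scores():
    """Recompute weighted scores for all posts (stubbed here)."""

def handle_vote(pool, user_id, post_id):
    record_vote(user_id, post_id)
    # The expensive part runs in another process; we never need the
    # result, so we don't keep the AsyncResult it returns.
    pool.apply_async(generate_scores)

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    handle_vote(pool, user_id=1, post_id=42)
    pool.close()
    pool.join()  # in a real app the pool would live as long as the process

Swapping the pool for a persistent queue later wouldn't change the call site at all, which is exactly the point.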


Friday, September 25, 2009

How To Adopt/Kidnap a Project

Distributed version control is a good thing. I've started wondering, abstractly, about removing the middle word of that phrase. In other words, how are we being affected by "distributed control", and how will the landscape of free software politics change as it becomes more predominant and we all become more comfortable with it?

Even centralized version control began the distribution of control. At least, it made it easier for more than one person to control the changes of a codebase. In the old days of e-mailing patches around, it was pretty much a requirement that a single person be responsible for merging patches into any single codebase (or any section of that codebase). Source control allowed multiple developers to commit changes, putting less burden and less power in any one person's hands.

Anything that makes the submission of new code easier is going to thin that power even more. When anyone can come along and submit changes to change functionality or add something new, it takes a little bit of control away from the owners of that project. At some point, you start to feel that the community runs your project as much or more than you do. This has its good and its bad sides, but it is a shift we see more and more.

A few years ago there was a rift in the development team of the XFree86 project, and from it we got our current fork, X.org. The story is well known, and it brings to light an important political power of open source: fork and run. Even if you own a project, you can begin to lose power, both the power to make users and other developers happy and the power to keep control at all. A strong enough disagreement could mean everyone else just leaving you behind and taking the project with them, under a slightly different name and a forked codebase. This can be scary and obviously could be harmful, but like any democracy we trade that for the benefits willingly.

Today, forking is easier than ever. Project hosts like GitHub and Launchpad promote forks as the primary means of submitting patches. No longer do you submit your changes for scrutiny and wait for acceptance or denial. These are the days of "I liked your project, and I have my own version of it. Take it or leave it." Other developers are as welcome to use your version as the original. This raises the question: when does the original project stop mattering, and when do we come to realize that all forks are created equal?

The big question here is when a fork carrying a few patches, either yet to be pulled into the original or rejected over differences of opinion, becomes as reputable as the original. This can only happen if we get past looking at forks as either replacing or diverging, and understand them as ongoing versions with differences that exist for good reasons. Should I want to make some modification to a library I'm using that the current maintainer doesn't want to accept, there should be no social issue with those two branches, the original and my own, existing and being used in parallel. The choice of which to use should carry no more weight than a choice among its configuration options.

When we reduce to zero the social cost of taking a project for your own, to make the changes that fit your needs, we make many things easier. Abandoned projects become far easier to adopt, without the need to go through due process of contacting the creator and honoring their wishes. Difficult maintainers no longer hold control over users and developers who disagree, because, even more democratic than a democracy, we can allow everyone to truly get what they want.

Thursday, September 24, 2009

How To Turn Web Development Around (Part 2)

I did my best to outline the problem in Part 1. Now I have to stand up and propose some kind of solution. Otherwise, I'm just complaining and contributing nothing of real value.

Our frameworks make certain things easier. They don't provide tools to help us with other things, and for some other set of activities they may actively prohibit us. The problem here is a combination. Django makes it easy to query your database and wrap functionality up into reusable template tags. While I'm thankful for that, I'm also realizing that the ease of one thing can prohibit another: when one path is made easier, it creates the perception of greater difficulty in other paths. I think this is why, when our web frameworks give us all these tools to respond to a web request, we completely neglect everything we could do apart from that request.

How can we make it easier to work outside the web request?

We need some idea of what working outside the web request means. We also need to define these in terms that are useful when we do get around to that request handling we've already got.

Going back to the tag cloud example, look at the resources created when we generate one. Aside from the HTML snippet of the cloud itself, we build the data used in it: all the unique tags and their counts. This is the kind of data that makes sense to store in your cache, but it fails the normal cache use case. We don't want to lose these generated resources when caches reset, so we need something less ephemeral. Any decent key-value store would be a good solution here.
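A sketch of the idea, with redis standing in for "any decent key-value store" and the shape of the tag data assumed:

import json
import redis

store = redis.Redis()

def build_tag_cloud_resource(tag_counts):
    """tag_counts: an iterable of (name, count) pairs, however your
    tagging app provides them. Store both the raw data and a rendered
    snippet, neither of which vanishes on a cache reset."""
    counts = dict(tag_counts)
    store.set("resource:tagcloud:data", json.dumps(counts))
    html = " ".join(
        '<a class="tag size-%d" href="/tags/%s/">%s</a>'
        % (min(count, 9), name, name)
        for name, count in sorted(counts.items())
    )
    store.set("resource:tagcloud:html", html)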

Unfortunately, basic Django signals alone are lacking here. Another means of triggering the resource generation at the right times, with the right parameters, has to be found. It makes sense to use the existing signals as the trigger, but have them add to a job queue rather than do the work inline.
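The trigger side might look like this sketch; the enqueue stub stands in for a real job queue, and the TaggedItem import assumes the common tagging application:

from django.db.models.signals import post_delete, post_save
from tagging.models import TaggedItem  # assumed: the common tagging app

def enqueue(job_name):
    """Stand-in: the real version pushes onto a queue that a worker
    outside the request cycle consumes."""

def on_tagging_changed(sender, instance, **kwargs):
    enqueue("build_tag_cloud_resource")

post_save.connect(on_tagging_changed, sender=TaggedItem)
post_delete.connect(on_tagging_changed, sender=TaggedItem)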

The few remaining parts, easy mechanisms for inserting snippets into templates or grabbing generated datasets in views, are all very simple. Together, the three layers give us what our frameworks leave out today: Resources, to store non-cheap data; Jobs, to generate resources; and Tools, to acquire and use those resources. If I were an egotistical man, I might try to coin my own acronym and name this RJT.

I know this is nothing new. Rather than making the situation better, that actually makes it worse. As any project grows and matures, the cut corners need to be filled in. Everything here eventually gets built, in different variations and probably with a lot more forethought (or a lot less, depending on the pressure). The only difference is that large-scale applications need to divert more resources to pushing instead of pulling, whereas smaller-scale applications simply should, because the benefits exist in either case. We won't all need to grow at exponential rates, but we should be doing better with whatever resources and whatever workload our application is given, small or large.

Wednesday, September 23, 2009

How To Turn Web Development Around (Part 1)

Something has been bothering me this past week. I've been taking some stabs at reducing the maximum render time of a site, when the caches are all empty. I cache certain components and queries, and when the caches are primed the render time is under 500ms, which I think is pretty good. That worst-case scenario, however, is just not acceptable: worse than a couple of seconds. That isn't time that should be taken. I dug in and found a really bad pattern.

It isn't hard to make a page faster, but the default is to be as slow as possible, and we have to understand this pattern. I'm looking at this in relation to Django, but I have a feeling similar patterns exist in other places.

The common tagging application is a good example. It makes it really easy to tag objects, count them, query by them, and build those clever little clouds. You're given lots of new wrappers for all the common tag-related queries you'd need to do. This may be a source of the problem. We've gotten into a rut of complacency with components that give us more rope than we need to hang ourselves. Abstraction hides the cost of operations.

Are we asking the really simple question: Why are we pulling when we could be pushing? With one of the most read-heavy information systems in the world, everything revolves around the needs and demands of the almighty HTTP request. A browser asks a question and then we go and figure out the answer. We are, by default, building our software around the read when we should be building around the write.

Caching is lazy. We should be proactive. How often is a tag cloud going to change? Only when the taggings change, of course. No page request should ever generate a tag cloud. We should be building the cloud, as a static HTML snippet, every time the tags change. When we're actually rendering a page, we should just insert the current snippet.
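The read side then becomes trivial; a sketch, assuming the snippet was stored under a known cache key by whatever write-time job built it:

from django import template
from django.core.cache import cache

register = template.Library()

@register.simple_tag
def tag_cloud():
    # No queries and no counting at request time: just insert whatever
    # snippet the last write-time job left for us.
    return cache.get("tagcloud-html", "")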

The problem is that we make lots of tiny little increases in the pulling we do, and we do it all over the place. We hide it behind innocent-looking functions and properties, end up using a few of those inside one element that gets repeated, and it piles up. The amount of work in one page becomes insane for what that page is. The problem isn't that it is so difficult to do something better, but that the default should be better. I would like some practical answers to making that default better.

Friday, September 11, 2009

How To Track Changes in the Location Hash

As the web becomes more "2.0", we're collapsing collections of pages into fewer, more dynamic pages, even single ones. Instead of ten pages of event listings, we might have one page that loads further items dynamically as you scroll. The state that was once static to a page is now loose and can change over the lifetime of a page, which grows longer every day. Parameters of the page state have always sat in either the path, in URLs like http://myblog.us/post/how-are-you, or the querystring, in cases like http://www.coolstore.com/view.html?product=4356.

Neither approach works when those parameters change over the life of the page, and where a single URL needs to be able to represent multiple parameter values at any time. In most uses so far, the bullet is simply bitten. The user can browse to your site and click around, but if they bookmark the page or send the link to a friend, they'll always come back to the front page, because the state of the page is no longer held in that URL. This wasn't acceptable for long, and a few projects, including GMail, have taken to tossing some state information into the hash, or anchor, of the URL, after the # symbol. This has traditionally told a browser to scroll to an <a> tag with that name, but if none exists it becomes, essentially, a no-op. We now have a place in the URL where we can store state without causing page loads, persisting that state in links and bookmarks. Still, there haven't been great or standard ways to deal with this yet.

A couple of years ago I started my own attempt to make this easier, after I found existing libraries outdated or just not doing what I hoped for. They either depended on obsolete versions of other libraries or only gave a little trigger when the hash changed. I thought we needed something more than that, because this is really replacing everything we used to use querystrings for. Sure, I could toss #2 or #43 at the end of the URL depending on which page of results you saw, but what if the state was more than a single number? Querystrings can store lots of variables. That is what I wanted within the hash.

Born was hashtrack.js!

The API is pretty simple. You can check and set variables in the hash with hashtrack.getVar() and hashtrack.setVar(). Changes to the hash, or to specific variables in it, can be registered with callbacks via hashtrack.onhashchange() and hashtrack.onhashvarchange(). You can view the full documentation, including embedded interactive examples, at the github pages hosting it.

Tuesday, September 08, 2009

How To Select from a Range

I had some down time today to relax, and in true obsessive fashion I spent it coding for the hell of it. I got something in my head and whipped up a demo of the idea. Do you ever need to let someone select a range of things? Maybe they need to pick which and how many items to show in a search result or which letters of names they want to see from an address book? I wanted to allow selection of both "what" and "how much" in one click.


click for demo

The range being selected from can be anything: numbers, letters, weeks of the year, etc. Users can click through it like a list of page numbers, as they would expect. I think this works well in situations where the entry doesn't need to be exact, although it can be used for precise entries. Multiple quick selections would also be easy here, for example quickly changing the range you're viewing in an analytics app. I'd also like to look at adding a "zoom" feature, so that one selection fills the entire widget and you can then select within it to narrow down on the exact range or specific item you want.

Fork away! Especially if you're the kind of developer/designer who can make this not look like government-grade bread.

Github: http://github.com/ironfroggy/rangeselection/
Demo: http://ironfroggy.github.com/rangeselection/
License: MIT
