Monday, August 11, 2008

How to Understand AppEngine Datastore Under the Hood: Part 1 - An Overview of the Underview

There are a lot of misconceptions about the datastore in Google AppEngine. People both familiar with and new to AppEngine often don't really understand what the datastore is. There is a deeper system underneath the nice API we are given. Understanding the guts can help us understand the skin. We may also find there are times when we must shed the skin for new clothing.

The biggest misconception about the datastore is the assumption that "kinds" are anything like "tables". You could use a set of entity kinds much the way you would use a set of tables, but they are entirely different beasts. A table enforces a strict structure on its rows. Every entity, on the other hand, is free to hold any properties of the allowed types. The published Model API is an abstraction that gives us a nice interface on top of an otherwise much looser foundation.
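To make the difference concrete, here is a toy in-memory sketch. It is not the real datastore API: the `put` and `by_kind` helpers and the sample properties are invented for illustration. The only point is that a kind behaves like a label attached to an entity, not like a schema its entities must satisfy.

```python
# Toy stand-in for the datastore: each "entity" is a kind label plus a
# free-form property bag. put() and by_kind() are hypothetical helpers,
# not part of the AppEngine SDK.
_store = []

def put(kind, **properties):
    """Store an entity of the given kind with whatever properties it carries."""
    entity = {"kind": kind, "properties": dict(properties)}
    _store.append(entity)
    return entity

def by_kind(kind):
    """Return every stored entity labelled with this kind, whatever its shape."""
    return [e for e in _store if e["kind"] == kind]

# Two entities of the same kind with no properties in common.
put("Person", name="Ada", email="ada@example.com")
put("Person", nickname="Bob", age=42)

people = by_kind("Person")
shapes = [sorted(e["properties"]) for e in people]
```

Both entities come back from the same kind lookup even though they share no property names, which is something a table would never allow of its rows.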

Many people would be surprised to learn that a kind doesn't actually require anything of its entities, but from the right angle it makes perfect sense. The scale the datastore is designed for imposes some interesting limitations. Schema changes can't get in the way when the dataset may be so large that no operation can practically run over all of it at once. What was a simple matter of ALTER TABLE in SQL becomes practically impossible in this new world, because the logistics of updating and migrating potentially millions of entities to a new schema grow beyond the resources anyone can reasonably devote to a schema change. If we allow flexibility instead, we can simply start creating new entities in the updated form and make sure that when we load one of the previous versions, we're prepared to use or upgrade it on the spot. For this and other reasons, allowing all entities to be free-form is the simplest way to provide the foundation we need.
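The "upgrade it on the spot" idea can be sketched in a few lines. This is only an illustration under assumptions of my own: plain dicts stand in for stored entities, and the `schema_version` property, the `upgrade` function, and the name-splitting change are all hypothetical.

```python
# Hypothetical sketch: plain dicts stand in for stored property bags.
CURRENT_VERSION = 2

def upgrade(entity):
    """Return a copy of the entity in the current shape, upgrading old versions."""
    upgraded = dict(entity)
    # Entities written before versioning existed are treated as version 1.
    version = upgraded.get("schema_version", 1)
    if version < 2:
        # Version 1 stored a single "name"; version 2 splits it in two.
        first, _, last = upgraded.pop("name", "").partition(" ")
        upgraded["first_name"] = first
        upgraded["last_name"] = last
    upgraded["schema_version"] = CURRENT_VERSION
    return upgraded

old = {"name": "Ada Lovelace"}
new = {"first_name": "Grace", "last_name": "Hopper", "schema_version": 2}

# Old and new shapes can sit side by side; each is fixed up only when loaded.
loaded = [upgrade(e) for e in (old, new)]
```

No bulk migration ever runs. Each old entity is converted when it is next read, and could be written back in the new form at that point if desired.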

With a better understanding of our foundation we can better understand the abstractions in google.appengine.ext.db, including the Model subclasses most AppEngine developers know. I've seen quite a few people asking about migrating after changes to their db.Model subclasses, not understanding why or how their existing entities will change to match the newly defined properties. The behavior, and how to work with it, is a lot easier to understand when you view the individual entities as independent property bags, and not as rows following a defined column schema. We can also come to understand db.Expando as closer to the wire, so to speak, than its stricter Model cousin.

Perhaps a more exciting gain from this different view of the datastore is that we aren't bound by the published Model-centric API at all. In fact, we can access the underlying Entity class directly, which gives us a simple, persisted mapping object with nothing built on top of it. If we need some structure to our persistence, but the provided API simply isn't to our taste, then an understanding of this layer gives us what we need to build our own variant datastore API. We may even use this understanding to provide implementations compatible with existing ORM solutions, but powered by entities and BigTable rather than traditional SQL databases. The possibilities open up with our deeper understanding.
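As a rough sketch of what building our own layer might look like, the example below wraps a raw mapping in a small descriptor-based class. A plain dict stands in for the persisted Entity, and `Field`, `Record`, and `Person` are names invented here, not anything from the AppEngine SDK.

```python
# Hypothetical sketch: a plain dict stands in for the raw persisted mapping.
class Field(object):
    """Descriptor that reads and writes one key of the wrapped property bag."""
    def __init__(self, key, default=None):
        self.key = key
        self.default = default

    def __get__(self, instance, owner):
        if instance is None:
            return self
        # Missing keys fall back to a default instead of raising an error.
        return instance.raw.get(self.key, self.default)

    def __set__(self, instance, value):
        instance.raw[self.key] = value

class Record(object):
    """Base class for our own tiny API: structure lives here, not in the data."""
    def __init__(self, raw=None):
        self.raw = {} if raw is None else raw

class Person(Record):
    name = Field("name")
    email = Field("email", default="unknown")

stored = {"name": "Ada", "shoe_size": 7}  # carries a property Person never declared
person = Person(stored)
email_before = person.email               # not stored yet, so the default is used
person.email = "ada@example.com"          # written straight through to the mapping
```

The wrapper adds structure only at the edges. Undeclared properties such as `shoe_size` stay in the underlying mapping untouched, much as an older entity keeps whatever it was stored with.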

The more variation we have in what everyone is doing on AppEngine, the more value it has to all of us. Take this information and do something exciting. Share it and we'll all reap the benefits.

Look for Part 2: The Raw Datastore API

I write here about programming, how to program better, things I think are neat and are related to programming. I might write other things at my personal website.

I am happily employed by the excellent Caktus Group, located in beautiful and friendly Carrboro, NC, where I work with Python, Django, and Javascript.
