The Rockethouse RSS

and all those that dwell within the rockethouse...

follow forjared at http://twitter.com

Archive

Apr
9th
Wed
permalink

The distributed computing myth

GigaOm posted this article about Google’s App Engine and I couldn’t help but comment. Much of what Mr. McConnell has to say is right on. Yes, we want utility like computing that can scale up and down as needed. Yes, we want to be able to use our development language or platform of choice. Yes we want it to be standards based.

But where I take issue is with the notion that cloud computing should look like a single linux CPU. Or that it should provide what looks like a single MySQL database.

This will not scale. If you want scalable computing architecture, you need to look beyond the CPU. If you build your application assuming that all CPU’s have access to a single global memory, you will build your software in a way that is inconsistent with running on a distributed system.

For example, don’t assume that every CPU has access to all session state. As you scale up to hundreds or thousands or millions of CPU’s, you’re going to spend all of your time making sure that your CPU’s have the same and consistent version of memory. Computation will grind to a halt as your virtual servers continuously update everybody else. You’re preserving an abstraction that does not efficiently scale with growth of infrastructure. 

Same goes for the database. If you assume that everything sits in a single database image, you enable and in fact encourage the programmer to make bad decisions.

The key to achieving scale is to set up your programmers with an environment in which they can’t make bad decisions (or rather, make it such that the easiest decision for them to make is inherently scalable).

BigTable is a perfect example. Google realized that it just isn’t feasible to build a single database cluster that contains the entire web. So they sacrificed some of the features of SQL that limit it to a single database instance in favor of a subset (GQL) that can be supported across a massively distributed set of servers. Engineers design their tables using BigTable and they query it using the tools Google has provided. As long as they do that, they can scale their database from kilobytes to petabytes without sacrificing performance.

Virtualization is not about “making the Internet look like a single linux CPU”. It’s about changing the programming environment in ways that prevent bad decisions and encourage good ones.

The classic problem with distributed systems has been trying to hide the fact that you’re running on a distributed system. It’s better to embrace this fact and build our tools in a way that supports the right abstraction.

I DO NOT want a single linux CPU or a single MySQL database. I want a set of API’s that ensures that whatever code I write will scale from one CPU to 1,000,000 CPU’s without modification. BigTable’s a giant leap in this direction. Distributed Hash Tables are another. We need more technologies like these to make this happen.

Comments (View)
blog comments powered by Disqus