Back to the Future

Don't worry. As long as you hit that wire with the connecting hook at precisely 88mph the instant the lightning strikes the tower... everything will be fine. ~ Back to the Future (1985)

One of the great challenges of working in technology is that patterns of thinking change quickly, and from time to time, no matter how wired-in you are, you discover that everything you know is wrong. Novelist William Gibson is right: the future has already arrived -- it's just not evenly distributed. When I first learned Ruby on Rails back in 2006, it struck me as a wondrous advance over the Java development I was doing. Java had bulked up as an Enterprise solution, so now, 5 years later, it's little surprise that "Java End of Life" is something ThoughtWorks worries about.

In tech we often see Clayton Christensen's The Innovator's Dilemma play out. In TID, disruptive technologies catch on because whatever they lack in robust features they make up for in agility. With time, though, the PT Boats grow into Battleships, and the cycle starts anew.

There are signs that this is happening now with Internet technology -- our toolsets (like Rails) have grown so fit to the task that they seem a bit ponderous as the task shifts. With enough shift we again conclude that everything we know is wrong and the cycle starts again.

"It's not what you don't know that kills you, it's what you know for sure that ain't true." ~ Mark Twain

Here's what we know about Internet technology today:

  • "Computers" are how people interact with the Internet
  • Modern apps display web pages and submit information
  • Pages are served from servers (of course)
  • The client-server Internet model works fine

WRONG, WRONG, WRONG, and WRONG. Here's the world we've been living in for a while now:

  • Today there are more wireless handsets than there are people on earth
  • In 2011, nobody updates a whole page anymore -- Ajax rules
  • To paraphrase Bill Joy -- no matter where you are, most of the interesting content is somewhere else (on someone else's handset)
  • Pages are easier -- and if we wait maybe those pesky smartphones will just go away...

We're ready for a new programming world, and I've been investigating that new world for a while now. In my next post I'll write up what I've found. As in the sound clip below, you may not be ready for this yet -- but it'll be here soon, "...and your kids are gonna love it!"


Happy Birthday, Bobby Fischer

We opened a group meeting at work today with the classic icebreaker "Two truths and a lie." In TTL everyone in the class writes down two obscure truths about themselves along with a single lie, and hands them in to the instructor on a folded sheet of paper. The instructor then selects a sheet at random, reads the "3 truths" and the class has to guess 1) who they apply to, and 2) which one is the falsehood.

I have a great fun truth for the game that runs like this:

I once won a chess tournament, playing blindfolded, and then didn't go on a date for 3 whole years!

This is a true story -- I was a terrific chess player and really did once win a tournament playing blindfolded in the town I grew up in. I also really did NOT go on a date for the 3 years that followed -- but that was OK, not because I was hopelessly geeky but because I was only 13 years old when I won, blindfolded.

My skills were in part the product of the man above, who would have celebrated his 67th birthday today. Bobby Fischer was the greatest American chess player ever -- maybe the greatest chess player ever, period. He won the World Chess Championship back in the Cold-Warry summer of 1972, and made the game of chess as much a sensation as chess could be back then.

I played chess all the time because Fischer was a sensation, and you could find people to play against easily back then. I played blindfolded because I'd read about historical American champion Paul Morphy, who was said to have been a great blindfold player by age 12. I was 13, and how hard could it be? It really wasn't that hard, and my mental images of board positions weren't blurred by my opponents having to recite their every move to me.

That was a fun time -- Fischer-Spassky taught a generation of American kids to spell "Reykjavik," and books like Fischer's My 60 Memorable Games gave my dad and me hours of fun -- playing each other and playing the classics. For me Fischer's most memorable game is his "Game of the Century" -- a breathtaking classic by a 13-year-old boy against one of the strongest masters of his day.

It's a fine line between brilliance and madness, and the eccentric Fischer crossed over and back freely between his triumph in 1972 and his death in 2008. His gifts to us were wondrous games that we can still enjoy today, on his birthday (he was born in 1943).

Happy Birthday, Bobby Fischer.

The Game of the Century (scorecard)


Casi Casi ... Cassandra

I've written a couple of times about the "N+1 Queries" problem, and I've suggested that it's a bane of relational databases. But there's a way out of it -- let me tell you about it.

But first let's wallow in it a bit. I'm on Twitter, I've written a tweet, and I'm ready for it to be sent out to all of my (countless) followers... Here's what my code for that broadcast might look like:
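Something like this, say -- a sketch in which the model names and data are hypothetical, in-memory hashes stand in for the database, and a counter stands in for each SELECT an ORM would issue:

```ruby
# Each "database" lookup bumps a counter, standing in for one SELECT.
$selects = 0

USERS     = { 'mudcat' => { 'screen_name' => 'mudcat' } }
FOLLOWERS = { 'mudcat' => %w[alice bob carol] }

def find_user(screen_name)
  $selects += 1                       # SELECT ... FROM users WHERE ...
  USERS[screen_name]
end

def followers_of(screen_name)
  $selects += 1                       # SELECT ... FROM followers WHERE ...
  FOLLOWERS[screen_name]
end

me = find_user('mudcat')
deliveries = followers_of(me['screen_name']).map do |follower|
  [follower, 'Eggs for breakfast!']   # queue my tweet for each follower
end
```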

All fine so far -- that's a Ruby-ish take on the twittery world we all live in. I can send out my breathless message of what I had for breakfast, and then Twitter picks it up and broadcasts the message from me (as well as all the messages from the other tweeters):
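Twitter's side of the broadcast might be sketched the same way (again, names and data here are hypothetical stand-ins); note the query inside the loop:

```ruby
$selects = 0

FOLLOWERS = {
  'mudcat' => %w[alice bob],
  'alice'  => %w[bob carol dave]
}

def all_tweeters
  $selects += 1                        # 1 SELECT for the list of tweeters
  FOLLOWERS.keys
end

def followers_of(screen_name)
  $selects += 1                        # 1 more SELECT per tweeter -- the "+N"
  FOLLOWERS.fetch(screen_name, [])
end

all_tweeters.each do |tweeter|
  followers_of(tweeter).each do |follower|
    # ...deliver tweeter's latest message to follower...
  end
end
# $selects is now 1 + N for N tweeters: the N+1 queries pattern
```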

So here we're going to do a query for each of the X tweeters, and for them we'll do another query for each of their Y followers.

Code smell! Fail Whale!!!

(particularly when you consider Dare Obasanjo's take on Twitter combinatorics)

The problem here is Relational: we need a SELECT to find me, and then a new SELECT to get the info on each of my followers. This "N+1 SELECTS" problem is a simplified version of a real problem -- one where relational databases stagger, and where column-oriented databases are much more what we're looking for. Column-oriented databases are designed to be fast at grabbing all of the attributes (columns) associated with a given entity. To understand why this is vital for Twitter or any other social application, consider the one-to-manys: Twitter has many tweeters, who have many followers, who themselves have many followers... and so on.

Let's think, though, about the code that gets generated when I tweet. If we're using a relational database we'll follow a SELECT for each of my followers with a SELECT for each of their followers -- so we get a polynomial number of SELECTs grinding away for each tweet, and as I get more popular the disks whirr and the lights dim every time I tweet about anything.

So to save the power grid let's try a little Twitter application, but this time using the column-oriented data store Cassandra to handle our users and tweets.

I'll run this from the same Amazon Cloud instance that I've used for my previous postings:
So, in my terminal connected to Amazon, I enter:

sudo gem install cassandra

I've already put Java on my base instance, so I'm just about good to go! A single-line command, and it really does run...

Now, let's start Twitter and tweeting. We'll use the Ruby interpreter IRB on Amazon to enter our users and their tweets:

root@ip-10-245-133-190:/var/www/apps# irb

We're rolling -- first we'll enter our requirements: rubygems to run our additional toys, cassandra to link to the data store we just installed, and SimpleUUID to identify our tweeters:
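In irb, those requires might look like this (a sketch following the cassandra gem's conventions):

```ruby
require 'rubygems'      # load RubyGems so we can pull in our additional toys
require 'cassandra'     # the Cassandra client gem we just installed
include SimpleUUID      # time-sortable UUIDs for identifying tweets
```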

Now we'll start Twitter in Cassandra, and put in some users and screen names (I've mostly left the Cassandra responses out for brevity here):
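The session might run something like this -- modeled on the cassandra gem's own Twitter example, with the keyspace name, column family names, and sample data as assumptions, and a live Cassandra node on localhost assumed:

```ruby
# Connect to the 'Twitter' keyspace defined in conf/storage-conf.xml
twitter = Cassandra.new('Twitter', '127.0.0.1:9160')

# User 5, screen name "mudcat"...
twitter.insert(:Users, '5', { 'screen_name' => 'mudcat' })

# ...and a tweet for him
twitter.insert(:Statuses, '1', { 'user_id' => '5', 'text' => 'Nom nom nom nom nom.' })
```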

Great so far -- we have user 5, "mudcat," and we've given him a tweet. Let's give him someone to tweet to:
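Adding a follower might look like this -- the screen name and the Followers/Following column families here are assumptions in the Twissandra style:

```ruby
# User 6 will follow mudcat
twitter.insert(:Users, '6', { 'screen_name' => 'bobbysin' })

# Record the relationship in both directions, Twissandra-style
twitter.insert(:Followers, '5', { '6' => Time.now.to_s })
twitter.insert(:Following, '6', { '5' => Time.now.to_s })
```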

And there we are -- we have a reasonable data model for Twitter, backed by the Cassandra data store. Let's review what we've got here:

Cassandra works as a kind of multidimensional hash, and the data it contains can be referenced as:

  • A keyspace
  • A column family
  • An optional super column
  • A column, and
  • A key

Here's what these all mean:

The keyspace is the highest, most abstract level of organization. Our Cassandra conf/storage-conf.xml file contains our keyspace definitions at startup.

The column-family is the chunk of data that corresponds to a particular key. In Cassandra each column family is stored in a separate file on disk, so frequently-accessed data should be placed in a column family for fastest access. Column families are also defined at startup.

A super column is a named list, containing standard columns stored in recency order.

A column is a tuple: a key-value pair with a key (name) and a value.

A key is the permanent name of the record, and keys are defined on the fly.

With this structure we're basically defining a schema, and I'd like to claim it's original, but this one was taken from Twissandra by Eric Florenzano.
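One way to picture the addressing is as nested Ruby hashes -- a mental model only (Cassandra's storage is not literally a Ruby hash, and the names and values here are illustrative):

```ruby
# keyspace -> column family -> key -> (super column ->) column => value
twitter = {
  'Statuses' => {                     # a column family
    '1' => {                          # a key (the row)
      'user_id' => '5',               # columns: name => value pairs
      'text'    => 'Nom nom nom nom nom.'
    }
  },
  'UserRelationships' => {            # a column family of super columns
    '5' => {
      'user_timeline' => {            # a super column: a named list of columns
        'uuid-1' => '1'               # kept in recency order in real Cassandra
      }
    }
  }
}

twitter['Statuses']['1']['text']      # drill down to a single column's value
```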

The great thing about Cassandra is that it evolved to solve real-world problems: its data may be free-form, but it is NOT exactly schema-less. Cassandra may fall in the "NoSQL" class with Hadoop, but the use cases that apply to the two could scarcely be more different. Runtime lookups are handled really well in Cassandra, thanks to its low-latency organization and strict definitions. Asynchronous analytics, with the freedom of high latency and demands for greater flexibility, are a better fit for systems like Hadoop.

Cassandra generally offers terrific performance. There is a tradeoff in eventual consistency, something that perhaps I'll take up in my next blog post.


Inventing the Future

The best way to predict the future is to invent it. ~ Alan Kay

"If I'd asked people what they wanted they would have said a faster horse." ~ Henry Ford

I really like NoSQL data stores and I appreciate the new approaches because they tackle one of the most hidebound parts of information technology - enterprise storage. I still (dimly) recall the days before the predominance of the relational model where "database analyst" was a specific (rather than general) skill, and terms like "current of run-unit" could pretend to have meaning. The relational data model brought order and reason to enterprise data, but the seductiveness of the relational answer to the question of "How do we organize this?" over time came to preclude thoughts of any other form of organization.

It's not that there haven't been other successful data models and stores, but the other successful models have mostly been transparent -- they work behind the scenes, invisible to the user. Microsoft Office applications have terrific data stores, but they don't place data in a relational model because there is simply no need to -- for Office apps, speed and flexibility trump the structure and order of the relational model. Wikipedia lists dozens of Windows file types, and across different operating systems and application types there are probably hundreds of data store types in general usage.

NoSQL approaches are so interesting and promising because they represent a break from the functional fixation that the relational model is the only conceivable model for enterprise data. In an earlier posting, I wrote:

Relational data models have ruled enterprise data management for more than 20 years -- to the point where it may be hard for generations of developers to imagine that there could be any kind of data model other than rows and columns.

This is the world we've lived in -- terrific for tabulations and Accounts Payable, but not necessarily well aligned with other problem domains. The challenge in moving beyond Relational is that our assimilation of "Relational" hinders us in looking at problems in anything other than Relational terms. This is a manifestation of a pattern that psychologist Karl Duncker called functional fixity -- the condition in which we grasp our principal tool as a hammer, and thus our whole world comes to look like nails.

Eric Haseltine writes about functional fixity, creativity and invention in his book Long Fuse, Big Bang, in which he describes some of the opportunities that arise from seeing the world with new eyes. Haseltine writes:

In a classic functional fixity experiment, test subjects are asked how they would -- without grabbing the end of the rope -- get the ends of a rope suspended from a ceiling to touch opposite walls. Aside from the rope, the only object in the room is a pair of pliers sitting on the floor. Most test subjects don't realize that the only way to solve the problem is to tie the pliers to the end of the rope, then swing it so that the pliers -- acting as a weight -- carry the end of the rope to one wall, then, in pendulum style, to the other. The few subjects who perceive the pliers as a weight instead of a tool solve the problem.

The progression of Moore's Law and advancement of NoSQL data approaches, cloud deployment, and modern languages and tools can open doors to advancements that would have been inconceivable without the synchronicity of these technological advances. We have new solution archetypes:

  • Key-value storage (Memcached, Redis, Voldemort) for caches and transient data...
  • Hadoop and MapReduce for massively parallel data processing
  • Document databases (MongoDB, CouchDB) for log files and serial data
  • Graph databases (Neo4j) for the social graph and applications

...but the kinds of solutions possible with these new approaches are far broader than that. We just need to keep building the skills to see such solutions as they become possible.

“The real voyage of discovery consists not in seeking new landscapes but in having new eyes.” ~ Proust


Spreadsheets for the New Millennium

"We shape our tools, and thereafter our tools shape us." ~ Marshall McLuhan

I've done a lot of writing on "big data" and NoSQL solutions over the past couple of months, and it's time to take stock of it all: "Why should anyone care about any of this?" I've spent some time in 2011 playing with the four main varieties of new "NoSQL" data stores:

  • I started with a simple little note-taking app, hosted on the cloud and backed by the document-oriented store MongoDB
  • The second application was a URL shortener, hosted on the cloud and backed by the key-value NoSQL store Redis -- based on a terrific example: An URL shortener with Ruby on Rails 3 and Redis by Christoph Petschnig. You can play with the URL shortener here: Mini-URLs -- for the masses
  • The third application was a nice Hadoop / Cloudera model to perform word counts across a bunch of files. It was based on Phil Whelan's terrific posting, and it showed just how easy it can be to trigger MPP with Hadoop and Cloudera
  • My final example is a "6 Degrees of Kevin Bacon" social-graph-solution generator built on the graph database Neo4j, based on Ian Dees' terrific Everyday JRuby posting
    This is a fun little app, and you can play the 6-Degrees game here: 6-Degrees of Kevin Bacon

So we have some examples up, but why does this matter? Who cares?

As it turns out we have a confluence of a number of technologies and trends that make solutions of an entirely new type possible. Here's what's new:

  • The Internet makes it possible to market to the masses, and to keep detailed records of their responses to stimuli
  • The Cloud makes it possible to spin up supercomputer-level resources with no capital expenses
  • The "NoSQL" family of data stores has arisen to deal with non-relational data challenges: Hadoop for MPP; Memcached, Redis, and Voldemort for transient data; MongoDB and CouchDB for log data, etc.
  • Visualization tools like Tableau make it possible to create stories and narratives that result from the other tools here

So why does this matter, and how might things turn out for the pioneers of something new like this? For the answer let's look back to a previous example, in which an MBA student decided to try programming a "toy" personal computer to manage data for the business cases he encountered at school. In 1979 the MBA (Dan Bricklin) asked a buddy (Bob Frankston) to help him code up a solution on a primitive, toy Apple II. How did that turn out for them? As Steven Levy wrote, way back in 1984:

It is not far-fetched to imagine that the introduction of the electronic spreadsheet will have an effect like that brought about by the development during the Renaissance of double-entry bookkeeping. Like the new spreadsheet, the double-entry ledger, with its separation of debits and credits, gave merchants a more accurate picture of their businesses and let them see – there, on the page – how they might grow by pruning here, investing there. The electronic spreadsheet is to double entry what an oil painting is to a sketch. And just as double-entry changed not only individual businesses but business, so has the electronic spreadsheet.

And so it happened -- a novel application (the spreadsheet) on a toy machine gave business managers a new way to track and value their businesses. The "and value" is key here -- this was an application of the technology that Bricklin and Frankston never really considered, and advances to support valuation (a 1MB memory space and macros, in Mitch Kapor's Lotus 1-2-3) enabled the Mergers and Acquisitions boom that has continued to this day; this application (in vehicles like hedge funds) created more than a dozen billionaires in 2007 alone.

So this is why NoSQL, big data, the cloud and visualization are so fascinating to watch. They enable solutions to problems that would have been inconceivable even a decade ago. The cloud era doesn't have its "spreadsheet," yet, but it will...
