
Consumerizing Big Data

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
~ Antoine de Saint-Exupéry

These are great days for Big Data -- Oracle's now in the game with an appliance and a new database, Microsoft has all kinds of new initiatives post-Dryad, and Amazon is going big data and Enterprise with DynamoDB.

Where are we going with this? The new initiatives may validate the space, but they rest on the notion that "more is better." More is better, but only until the field gets swept by less. 37signals suggests that you Underdo your competition, and the late Steve Jobs raised simplicity to a high art. I suggest that Big Data will reach gestalt when we agree, not on more, but on less.

To appreciate the power of less, let's go back to one of my favorite Big Data solutions -- the one based on Phil Whelan's terrific article, Map Reduce with Ruby Using Hadoop. We got a nice solution working last year, and I posted about it then. In that posting, I noted that Cloudera's scripts make Hadoop accessible for the masses -- but was that all there was to it?

As with late-night TV, I have to offer: "But Wait! There's More..." Indeed there is -- and better yet, there's Less. To show where we're headed, let's take another look at that Hadoop solution.

The Hadoop app we wrote last year was based on an earlier version of Cloudera's Hadoop release -- CDH version 0.1.0+23. That version was a lot of Cloudera ago, so this time we'll explore Hadoop with the latest release, CDH version 3 Update 3. CDH3 U3 integrates Hadoop 0.20.2 with a lot of goodies that we'll see later, including:

  • Mahout 0.5+9.3 -- we'll see this later as part of our Recommendation Engine
  • Hive-0.7.1+42.36 and Pig 0.8.1+28.26 for programming
  • Whirr 0.5.0+4.8 -- which we'll use here for cloud integration, and
  • Zookeeper 3.3.4+19 -- to coordinate the processes we spawn

Download and installation proceed much as they did last year, and we'll start with the same word-count application we ran then. But first -- let's define our data input sources and output directory, and kick off our Hadoop run:
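The original listing didn't survive in this copy, so here's a minimal sketch of what that setup might look like with Hadoop Streaming. The directory names, the run_hadoop wrapper, and the streaming-jar path are assumptions -- they vary by install:

```shell
# Hypothetical layout -- adjust the paths to your own install.
export IN=input        # directory holding our dictionary file(s)
export OUT=output      # results directory; must not exist before the run

# Wrap the streaming invocation so we can re-run it as run_hadoop later.
# (The streaming jar lives in different places on different CDH layouts.)
run_hadoop () {
  hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input  "$IN"  -output  "$OUT" \
    -mapper map.rb -reducer reduce.rb \
    -file   map.rb -file    reduce.rb
}

# Kick off the run (guarded so this script is harmless without Hadoop).
if command -v hadoop >/dev/null; then
  run_hadoop
fi
```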

Now we've got our input ($IN) and output ($OUT) locations set, and after a bunch of output to STDOUT we pull things together with:
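The exact command isn't preserved here; a plausible sketch is merging the reducers' part files into a single local file (the wordcount.txt name is made up):

```shell
OUT=${OUT:-output}     # results directory from the run above

# Collapse the per-reducer part-NNNNN files into one local file.
# (Guarded so the script stays harmless on machines without Hadoop.)
if command -v hadoop >/dev/null; then
  hadoop fs -getmerge "$OUT" wordcount.txt
fi
```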

...and we can go to $OUT to see the results:
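Something along these lines -- assuming a local-mode run, where the part files land straight in $OUT (on HDFS you'd use "hadoop fs -cat" instead):

```shell
OUT=${OUT:-output}

# Peek at the top of the results. Per the text, the two-letter prefix
# "aa" -- our aardvarks and aardwolves -- comes out with a count of 13:
#
#   aa   13
if [ -d "$OUT" ]; then
  head "$OUT"/part-00000
fi
```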

So far, so good -- we've got the same 13 aardvarks and aardwolves we had last year, from the same Macintosh dictionary file. One dictionary is nice, but by parameterizing the input and output directories as we have, we can run Hadoop on much more than just one file. Since we routinely run on Ubuntu Linux, let's take its dictionary file as well and add it to the mix. Here I've got a copy of the Ubuntu dictionary, named "unix_words." Let's copy it in and have another run.

First we'll add in unix_words and kick off the Hadoop run:
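Sketched out -- the dictionary path and the run_hadoop wrapper from the earlier setup are assumptions:

```shell
IN=${IN:-input}
mkdir -p "$IN"

# Copy the Ubuntu word list in as "unix_words" (path assumed; it's
# /usr/share/dict/words on a stock Ubuntu box).
if [ -f /usr/share/dict/words ]; then
  cp /usr/share/dict/words "$IN"/unix_words
fi

# Clear the old results -- Hadoop refuses to overwrite an existing
# output directory -- and re-run the same one-line job over both files.
if command -v hadoop >/dev/null; then
  rm -rf "${OUT:-output}"
  run_hadoop
fi
```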

It runs much as before, and here are our results:
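Same drill as before to peek at the output -- the "a'" line below is quoted from the discussion that follows, not a fresh run:

```shell
OUT=${OUT:-output}

# With unix_words in the mix, "a'" now heads the list with a count of 21:
#
#   a'   21
if [ -d "$OUT" ]; then
  head "$OUT"/part-00000
fi
```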

Bingo! Our varks and wolves are supplanted at the top of our list by "a'", which now counts 21 entries. We could add more data -- hundreds or thousands more input files -- and it would still be a one-line command to perform the analysis. But that's not all we can do. As we did last year, we have simple map and reduce files -- let's try adjusting the map file to sort by the first THREE letters this time.

It's a simple 2-line change to make our map function grab 3-letter combinations. Here's our new map.rb function.
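The listing didn't survive in this copy, so here's a sketch of what the updated map.rb might look like. The helper name and structure are my guesses -- the point is that the two-line change is just the slice length in the prefix:

```ruby
#!/usr/bin/env ruby
# map.rb -- emit the first THREE letters of each word as the key
# (last year's version sliced word[0, 2]; that's the whole change).

def prefixes(line, length = 3)
  line.split.map { |word| "#{word.downcase[0, length]}\t1" }
end

# Hadoop Streaming feeds input lines on STDIN; we emit "prefix<TAB>1"
# pairs for the reducer to sum up.
STDIN.each_line { |line| puts prefixes(line) }
```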

We can save it, and since we've already defined a run_hadoop function and set $IN and $OUT, we can trigger ./run_hadoop and see the new results.

Simple start -- we'll clear out our previous $OUT results, and with the new map.rb file we'll kick off another Hadoop run. Here we made a simple change (2 letters to 3) but there's no reason we couldn't get more creative with our simple map and reduce functions. Let's see what we get:
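In sketch form, again leaning on the $OUT and run_hadoop setup from earlier:

```shell
OUT=${OUT:-output}

# Clear the previous results, then re-run with the new three-letter map.rb.
rm -rf "$OUT"
if command -v hadoop >/dev/null; then
  run_hadoop
fi
```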

So there we are. Our analysis is not exactly Turing Award material, but we've got a couple of things here that might really change the game for Big Data analysis. Specifically, we've got:

  • A standard input target directory (could be "file system," but this is a start)
  • A standard output target
  • A flexible, readable map function
  • Standard location and processing for output

We have the core components of a big data application emerging. Rather than "one-offing" Big Data analysis, we can standardize the basic approach by

  • Enriching the mappers and reducers
  • Expanding our input processing, and
  • Feeding our outputs to visualization tools like Jaspersoft or Tableau

If we put the platform on a standard (HBase) data store and tie in search-engine and matrix processing, we start to approach the long-sought spreadsheet for the new millennium. We're still just getting started, but the future is this way...