Sunday
Feb 05, 2012

Consumerizing Big Data

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
~ Antoine de Saint Exupéry

These are great days for Big Data -- Oracle's now in the game with an appliance and a new database, Microsoft has all kinds of new initiatives post-Dryad, and Amazon is going big data and Enterprise with DynamoDB.

Where are we going with this? The new initiatives may validate the space but they belie the notion that "more is better." More is better, but only until the field gets swept by less. 37Signals suggests that you Underdo your competition, and the late Steve Jobs raised simplicity to a high art. I suggest that Big Data will reach gestalt when we agree, not on more, but on less.

To appreciate the power of less, let's go back to one of my favorite Big Data solutions -- the one based on the terrific Phil Whelan article: Map Reduce with Ruby Using Hadoop. We got a nice solution working last year, and I posted about it then. In that posting, I noted that Cloudera scripts make Hadoop accessible for the masses, but was that all there is to it?

As with late-night-TV, I have to offer: "But Wait! There's More..." Indeed there is, and better yet there's Less. To show where we're headed let's take another look at that Hadoop solution.

The Hadoop app we wrote last year was based on an earlier version of Cloudera's Hadoop release -- CDH version 0.1.0+23. That version was a lot of Cloudera ago, so we'll explore Hadoop with the latest version, CDH version 3 Update 3. CDH3 U3 integrates Hadoop 0.20.2 with a lot of goodies that we'll see later, including

  • Mahout 0.5+9.3 -- we'll see this later as part of our Recommendation Engine
  • Hive-0.7.1+42.36 and Pig 0.8.1+28.26 for programming
  • Whirr 0.5.0+4.8 -- which we'll use here for cloud integration, and
  • Zookeeper 3.3.4+19 -- to coordinate the processes we spawn

Download and installation are much as we performed last year, and we'll start with a word-count application similar to the one we ran last year. But first -- let's define our data input sources and output directory, and kick off our Hadoop run:

Now we've got input $IN and output $OUT sources set, and after a bunch of output to STDOUT we pull things together with:

...and we can go to $OUT to see the results:

So fine so far -- we've got the same 13 aardvarks and aardwolves we had last year, from the same Macintosh dictionary file we looked at last year. One dictionary is nice, but by setting the input and output directories as we have, we can run Hadoop on much more than just one file. Since we routinely run on Ubuntu Linux, let's take its dictionary file as well and add it to the mix. Here I've got a copy of the Ubuntu dictionary, entitled "unix_words." Let's copy it on in, and have another run.

First we'll add in unix_words and kick off the Hadoop run:

It runs much as before, and here are our results:

Bingo! Our varks and wolves are now supplanted by "a'" at the top of our list, but there are 21 of them now. We could add more data, hundreds or thousands more input files, and it would still be a one-line command to perform the analysis. But that's not all we can do. As we did last year, we have simple map and reduce files -- let's try adjusting the map file to sort by the first THREE letters this time.

It's a simple 2-line change to make our map function grab 3-letter combinations. Here's our new map.rb function.
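The listing itself isn't reproduced here, so what follows is only a minimal sketch of what such a mapper could look like, assuming a Hadoop Streaming-style script that reads one word per line and emits a tab-separated key and count. The names `map_prefixes` and `PREFIX_LENGTH`, the lowercasing, and the skipping of words shorter than the prefix are my assumptions for illustration, not necessarily what the original map.rb does.

```ruby
require 'stringio'

# Hypothetical setting: changing this from 2 to 3 is the whole "2 letters to 3" adjustment.
PREFIX_LENGTH = 3

# Read one word per line from `input` and write "<prefix>\t1" to `output`.
# Lowercasing and skipping short words are assumptions made for this sketch.
def map_prefixes(input, output, prefix_length = PREFIX_LENGTH)
  input.each_line do |line|
    word = line.strip.downcase
    next if word.length < prefix_length
    output.puts "#{word[0, prefix_length]}\t1"
  end
end

# Local demonstration on a tiny in-memory word list.
sample = StringIO.new("aardvark\naardwolf\nab\nAaron\n")
result = StringIO.new
map_prefixes(sample, result)
puts result.string
```

In a streaming job the same function would presumably be called as `map_prefixes(STDIN, STDOUT)`; the in-memory demonstration just makes it easy to try without a cluster.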

We can save it, and as we've defined a run_hadoop function and set $IN and $OUT, we can trigger our ./run_hadoop and see the new results.

Simple start -- we'll clear out our previous $OUT results, and with the new map.rb file we'll kick off another Hadoop run. Here we made a simple change (2 letters to 3) but there's no reason we couldn't get more creative with our simple map and reduce functions. Let's see what we get:

So there we are. Our analysis is not exactly Turing-award rich, but we've got a couple of things here that might really change the game for Big Data analysis. Specifically, we've got

  • A standard input target directory (could be "file system," but this is a start)
  • A standard output target
  • A flexible, readable map function
  • Standard location and processing for output
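To make the reduce side of that list concrete, here is a minimal local sketch, assuming the mapper emits tab-separated "key, count" lines. The helper names `reduce_counts` and `top_keys` are hypothetical. A real streaming reducer receives its keys already sorted and can total them as they stream by; this sketch takes the simpler route of accumulating everything in a hash.

```ruby
# Sum the counts for each key in tab-separated "key\tcount" lines.
def reduce_counts(lines)
  totals = Hash.new(0)
  lines.each do |line|
    key, count = line.chomp.split("\t")
    next if key.nil? || count.nil?   # ignore malformed lines
    totals[key] += count.to_i
  end
  totals
end

# Return the n most frequent keys, breaking ties alphabetically.
def top_keys(totals, n = 5)
  totals.sort_by { |key, count| [-count, key] }.first(n)
end

# Local demonstration with hand-written mapper output.
mapped = ["aar\t1", "aba\t1", "aar\t1", "aba\t1", "aar\t1", "zeb\t1"]
totals = reduce_counts(mapped)
top_keys(totals, 2).each { |key, count| puts "#{key}\t#{count}" }
```

Keeping the mapper and reducer this small is the point: the input directory, the output directory and the run command stay fixed, and only these few lines change from one analysis to the next.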

We have the core components of a big data application emerging. Rather than "one-offing" Big Data analysis, we can standardize the basic approach by

  • Enriching the mappers and reducers
  • Expanding our input processing, and
  • Feeding our outputs to visualization tools like Jaspersoft or Tableau

If we put the platform on a standard (HBase) data store and tie in search engine and matrix processing we start to approach the long-sought spreadsheet for the new millennium. We're still just getting started, but the future is this way...

Sunday
Jan 15, 2012

You Only Live Twice (Basho and Riak)

You only live twice
Once when you are born, and once
When you look death in the face
Ian Fleming ~ "You Only Live Twice"

It's not about the bike. It's a metaphor for life...
Lance Armstrong ~ "It's Not About the Bike"

Today was a big day for me. Way back on June 6, 2008 I was in a terrible car-bike accident. It was so bad that the first word that got sent to a traffic copter overhead was that I'd been killed. I hadn't, but it was a couple of months of hospitalization and six months of hard rehab before I was back to anything like my life before the accident again. I got great support from my wife Barbara and son Bryan, and with great care and therapy I even got back on the bike again.

January 1, 2009 was my first post-accident bike ride - 1.4 miles around Clement Park lake here in Littleton, Colorado. As little as that was, I kept at it and today, 3 years later, I completed my 10,000th mile since the accident. It's true that you "only live twice," and the greatest gift in life is to come back from that edge.

The quotation above is a haiku coined by James Bond in the book "You Only Live Twice," which Bond himself declares "...after Basho..." -- referring to Matsuo Basho, the great Japanese poet (1644-1694). Basho was the master of the haiku, and a nice sampling of his work can be found here: A Selection of Matsuo Basho's Haiku.

Basho may be revered as a poet-laureate of Japan (something like Robert Frost is considered here) but it's a shame that there's so little awareness of his work. Our world is full of fine, obscure art, and the joy of an internet-enabled world is that it's not so hard to find it anymore.

Basho's name (if not his verse) lives on in the NoSQL datastore company Basho, and through their key-value store database Riak. I spent the weekend getting Riak rolling in the cloud -- it's not hard to set up, and it's scalable, flexible and fast as a key-value store. Here's a quick peek at how I got there:

Riak was designed for robustness, speed and scalability, and to get started with Riak you'll need to install the programming language Erlang first. Riak was built with Erlang, and Erlang is a terrific jackrabbit of a language that even on its own is absolutely worth a look. I was running Ubuntu 10.04 LTS (Lucid Lynx) on AWS, and in that world the Erlang install only took 4 steps:

curl -O http://erlang.org/download/otp_src_R14B03.tar.gz
tar zxvf otp_src_R14B03.tar.gz
cd otp_src_R14B03
./configure && make && sudo make install

The latest Erlang (R15B) doesn't work yet with the latest Riak (1.0.2), so you'll want to make sure you're pairing compatible versions of Erlang and Riak. Once that's complete, it's also a simple set of steps to install Riak:

curl -O http://downloads.basho.com/riak/riak-1.0.2/riak-1.0.2.tar.gz
tar zxvf riak-1.0.2.tar.gz
cd riak-1.0.2
make rel

With Erlang and Riak installed we're ready to get rolling. Inasmuch as I see "Big Data" as an emerging data structure and both NoSQL and Hadoop as tools forming the operating system around that data structure, I like (where I can) to stick to high-level languages and OBDM (object-big-data-mapping) tools for access to the structure. Fortunately, Sean Cribbs has just released Ripple, an Active Model-based document abstraction utility based on Active Record and MongoMapper. With Ripple added, we just need a bit of code (and a big assist from Justin Pease) to migrate our Redis-based URL shortener over to Riak. But first, let's get Riak working:

First we'll need a new Rails project to test Riak:

$ rails new riaktest

Then we'll go into riaktest and add Ripple and curb to our Rails 3.x Gemfile, and do a bundle install:

gem 'ripple', :git => 'http://github.com/seancribbs/ripple.git'
gem 'curb'

Save the Gemfile, and then

$ bundle install

Next we'll add Ripple to our config/database.yml:

ripple:
  development:
    port: 8098
    host: localhost

Next we'll add a little Url class in app/models/url.rb:

require 'ripple'
class Url
  include Ripple::Document
  property :ukey, String, :presence => true
  property :url,    String
end

And finally we'll fire up Riak:

$ /var/www/apps/riak-1.0.2/rel/riak/bin/riak start

With our Development environment complete, we can now dive into Rails on the console and play with our Riak data store:

$ rails console
Loading development environment (Rails 3.1.3)
ruby-1.9.2-p290 :001 > url = Url.new
 => <Url:[new] ukey=nil url=nil>
ruby-1.9.2-p290 :002 > url.ukey = "2432"
 => "2432" 
ruby-1.9.2-p290 :003 > url.url = "http://www.ibm.com"
 => "http://www.ibm.com" 
ruby-1.9.2-p290 :004 > url.valid?
 => true 
ruby-1.9.2-p290 :005 > url.save
 => true 
ruby-1.9.2-p290 :006 > exit

Great -- we've initialized our data store and gone away (thus the "exit" above). Now we can come back and access our Riak store:

$ rails console
Loading development environment (Rails 3.1.3)
ruby-1.9.2-p290 :001 > newurl = Url.first
 => <Url:TdxQ3iFGEwkmfMrYQBmvwcZYoCM ukey="2432" url="http://www.ibm.com">
ruby-1.9.2-p290 :002 > exit

So we have Riak operational on the Amazon cloud, and it's a small matter of coding to move our Redis URL shortener over to a new back end. In my next posting I'll show how we can do that, and do a little Apache Benchmark testing to see how our little example applications benchmark out.

We'll end with a little inspiration from Lance Armstrong:

Wednesday
May 25, 2011

How do I get started? A General Solution to Discovery in Big Data

Source: http://www.flickr.com/photos/41829005@N02/6162370327/

I've used the "spreadsheet" as a metaphor for an epiphany -- in this case combining enabling technologies (cheap PC processing, high-resolution displays and cheap memory) to provide a new metaphor for problem solving. Spreadsheet visual programming is a perfect metaphor for financial analysis because the rows-and-columns of financial ledgers map crisply to rows and columns on a computer screen. The final essential piece of the "PC Data" revolution arrived when a macro language was built into Lotus 1-2-3 that hadn't been built into Visicalc. This single feature guaranteed the hegemony of 1-2-3 and spreadsheets, as the macro language made them capable of solving problems outside of the domains envisioned by the first spreadsheet's developers.

Before spreadsheets, if you had a problem you could either lay it out on paper, or have a programmer write a specific program to perform the analysis you wanted. "Exploration" and "Discovery" were limited to what you could describe to a developer to program. Life before spreadsheets was brutish and short…

Source: http://appraisalnewsonline.typepad.com/photos/uncategorized/2008/01/08/matrix_data.jpg

So here we are today, at the dawn of the Big Data era. The core toolset is emerging (MapReduce via the Hadoop family of products) and word is spreading that remarkable solutions might be found in data that we formerly thought of as "disposable." The old problem is back, though -- if you (as a manager or executive) want solutions, you better go find a programmer. There are steps being taken to bring us spreadsheets for big data -- Datameer particularly is bringing spreadsheets to Big Data. Or, more properly, bringing Big Data to spreadsheets. They may move Big Data forward, but there's an impedance mismatch here -- if Big Data naturally fit in the rows and columns of spreadsheets it would already have made the jump and be found there. If Big Data describes a world beyond rows and columns, then the spreadsheet metaphor will end up fitting Big Data like a bad suit. Sure, we'll have our familiar rows and columns, but like Mozart played on a kazoo something in the essential nature of the data will be lost.

The answer for Big Data is a spreadsheet conceptually, but with a richer representational metaphor than rows and columns. We want fundamental insights from big data, so our building blocks should match the topologies that we're studying. Here's a first take at what "rows and columns" for Big Data might look like:

  • Predictive Modeling -- stripped of scale, are there linear relationships in the data that offer explanatory or predictive value?
  • Clustering Partition -- is the data uniformly distributed or clustered, and what can we learn from the clusters?
  • N-Dimensional Visualization -- US Supreme Court Justice Potter Stewart once said that he couldn't define pornography, but "…He knew it when he saw it." Are there visual representations of Big Data that provide insight?
  • Outlier Analysis -- does the data follow a predictable distribution (normal, exponential, Poisson, etc.)? If we can fit the data to control charts, what do the outliers on those charts mean?
  • AB Analysis -- The data may be noisy, but can we use it to measure the performance of key variables against each other?
  • Markov Chains -- You know the score this far into the game, and your customers' web interactions foreshadow their interests going forward. Where are we heading, and when do we get there?
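As one small illustration of these building blocks, here is a minimal sketch of the Markov chain idea: estimating, from observed page sequences, how likely a visitor is to move from one page to the next. The function name `transition_probabilities` and the sample sessions are hypothetical and made up for illustration. This is a first-order estimate from raw counts, not a full modeling approach.

```ruby
# Estimate first-order transition probabilities from sequences of visited pages.
# `sessions` is an array of arrays, each listing one visitor's pages in order.
def transition_probabilities(sessions)
  counts = Hash.new { |hash, state| hash[state] = Hash.new(0) }
  sessions.each do |pages|
    # Count each consecutive (from, to) pair in the session.
    pages.each_cons(2) { |from, to| counts[from][to] += 1 }
  end
  # Normalize each row of counts into probabilities.
  counts.each_with_object({}) do |(from, nexts), probs|
    total = nexts.values.sum.to_f
    probs[from] = nexts.transform_values { |n| n / total }
  end
end

# Made-up sessions, purely for demonstration.
sessions = [
  %w[home search product],
  %w[home product cart],
  %w[home search search]
]

transition_probabilities(sessions).each do |from, nexts|
  nexts.each { |to, p| puts format("%-8s -> %-8s %.2f", from, to, p) }
end
```

At Big Data scale the counting step is itself a natural map and reduce: the mapper emits each (from, to) pair and the reducer totals them.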

These are our rows and columns, and in my next post I'll describe the architecture I'm pursuing to explore them, an architecture built around:

  • HDFS for general data storage
  • HBase for data management
  • Hadoop for unstructured data analysis
  • Zookeeper for task management
  • SOLR for structured "free text" search
  • Thrift for access to external development languages and platforms
  • Massive_record to provide ORM-access to all that HBase data
  • JQuery for unobtrusive JavaScript and core visual presentation
  • SIMILE for advanced visual presentation
  • Tableau for advanced visual presentation
  • Node.js to serve up all that JavaScript

That's a lot to describe and it'll take some posting to do it, but the ultimate objective never changes -- to provide a sandbox that managers can play with and coax Big Data into giving up its secrets.

Monday
May 23, 2011

Spreadsheets for the New Millennium -- Part 3

So here's what comes next:

When I write about "spreadsheets," I'm thinking about technology bringing a real innovation to market. Spreadsheets were a breakthrough in modern business because they took new technologies - low-cost PCs, high-resolution displays and comparatively large amounts of RAM - and combined them into a facile metaphor that fit a rich set of problems. Hadoop and MapReduce are terrific but they are elemental -- they provide a rich, parallel, functional-programming approach, but they remain basically metaphor-free. They are to Big Data what Quicksort is to elementary computer science -- a nice step beyond Bubblesort, but in themselves just tools. The Killer App lies elsewhere.

For that reason I think Datameer and Factual are a step forward in the routinization of big data, but I don't think they've got it yet either. The metaphor is still wrong.

Visicalc and Lotus 1-2-3 were a big step forward because they gave a hands-on way for non-IT people to grasp the rows-and-columns world of financial analysis. The impedance barrier went away because you could make financial models in a visual domain-specific language (DSL) that mirrored the world you were modeling.



The DSL has to match the world you're modeling, thus I expect that jamming big data into a spreadsheet today will be like jamming financial calculations into Wordstar would have been back then. It's a step forward (maybe a big one) but the gestalt will arrive elsewhere.

When I wrote "big data needs a spreadsheet" in my earlier post, Spreadsheets for the New Millennium, what I meant was that big data needs a metaphor and a DSL -- a way to put big data understanding into the hands of everyday users. Putting big data in a spreadsheet is a start, but these aren't rows-and-columns problem domains, and stuffing them into rows and columns might provide some facility, but at a cost of richness and understanding.

Big Data deserves its own metaphor and a DSL ... somebody's incubating it ... even as I type this ... now, where is it??? In my next post I'll lay out a few steps to the epiphany.

Thursday
May 19, 2011

The Big Easy - Spreadsheets for the New Millennium Part 2

Back in January I wrote a post that I called Spreadsheets for the New Millennium, and in that posting I suggested that Big Data would never take hold in the public consciousness until there were gateways to it -- tools that could let anybody play with it, tools that could make it easy.

I love the idea of Mardi Gras -- everybody picks a time (the Tuesday before the start-of-Lent Ash Wednesday) and a place (New Orleans' Bourbon St.) to get together to have a party. I love it because that practically never happens in technology. As much as we celebrate advances in technology, we need to celebrate them because progress (both discovery and adoption) is so hard.

Why does it seem to be so hard for communities to assemble and take up advances in science? This is one of the key questions in Thomas Kuhn's The Structure of Scientific Revolutions.

What does it take for an idea to break through? Here is Kuhn's answer:

Kuhn saw that for a new candidate for paradigm to be accepted by a scientific community, "First, the new candidate must seem to resolve some outstanding and generally recognized problem that can be met in no other way.

Second, the new paradigm must promise to preserve a relatively large part of the concrete problem solving activity that has accrued to science through its predecessors..."

Or, as Steve Jobs might say: Think Different, but not Too Different.

For Big Data to become mainstream technology we need to satisfy two conditions:

  1. Solve a generally-recognized problem that can be met no other way. Check - Progressive Insurance can give you an estimated insurance quote on the spot, but only because they pre-calculate a rate quote for every car in the US every night -- using Hadoop. The problem and solution are simple, but without Hadoop they could never generate every car, every night...
  2. ...Preserve its predecessors. FAIL. Hadoop is a terrific tool, but unless you've had the good fortune to take MIT's 6.001 or Yale's CS 323 you've probably never seen anything like it.

The Hadoop community has desperately added Hive and Pig to try to reduce the foreignness-barrier of functional programming and get Hadoop over Kuhn's second barrier.

Brook Byers of Kleiner Perkins put me onto Kuhn back when I was at Stanford, so it's no surprise that KPCB announced a $9M round with the Hadoop-y Big Data company Datameer, stating:

Datameer’s Analytics Solution, which integrates the data mining power of Hadoop with a spreadsheet interface, enables business users to run analytics against very large data sets with no programming required. The product is designed to help users with little to no computer engineering experience handle massive amounts of data.

Kleiner and Datameer aren't alone in the race for Spreadsheets for the New Millennium; Factual is another player that's raised a lot of smart money.

These are great approaches, but in the best open-source tradition we'll bring up a solution that does a lot of the same things -- based in open-source code -- in one of my next postings.