Sunday, February 9, 2014

Life Beyond Hadoop

But I'm here to tell you … There's something else … the afterworld.

~ Prince

"Yet, for all of the SQL-like familiarity, they ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration."

Why the Days are Numbered for MapReduce as we Know It

I first started with Big Data back in 2008, when CouchDB introduced itself as New! and Different! and offered live links to the Ruby on Rails development that we were doing. It worked, but we couldn't easily explain why it was better than the MySQL and PostgreSQL that we were using. By 2010 I had hired (and gotten brilliant work from) Cindy Simpson and built a web system backed by MongoDB, and it was great for reasons we could understand, namely:

  • It could handle loosely structured and schema'd data
  • Mongoid and MongoMapper gave it a nice link to Ruby on Rails
  • It was a straightforward step beyond its more SQL-y cousins
  • Binary JSON (and the end of XML as we knew it)

I wrote about it here (NoSQL on the Cloud), and followed that writing with other notes on Redis, Hadoop, Riak, Cassandra, and Neo4J before heading off for the Big Data wilds of Accenture.

Accenture covered every kind of data analysis, but everyone's love there was All Hadoop, All the Time, and Hadoop projects produced a lot of great results for Accenture teams and customers. Still, it's been 5 years since I started in Big Data, and it's more than time to look at what else the Big Data world offers BEYOND Hadoop. Let's start with what Hadoop does well:

What Hadoop / MapReduce does well:

  • ETL. This is Hadoop's greatest hit: MapReduce makes it very easy to program data transformations, and Hadoop is perfect for turning the mishmash you got from the web into nice analytic rows-and-columns (a word-count sketch follows this list)
  • MapReduce runs in massively parallel mode right "out of the box." More hardware = faster, without a lot of extra programming.
  • MapReduce through Hadoop is open source and free (as in beer); data warehouse tools have recently run as much as $30K / terabyte in licensing fees
  • Hadoop has become the golden child, the be-all and end-all of modern advanced analytics (even where it really doesn't fit the problem domain)
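
Here's what that ETL sweet spot looks like in miniature: the classic word count, written in the Hadoop Streaming style. This is a hedged sketch in plain Python; the file name and the local pipeline are my own illustration, not part of Hadoop itself:

wordcount.py:

# A minimal word count in the Hadoop Streaming style: the mapper emits
# "word<TAB>1" pairs, the framework sorts by key, and the reducer sums
# consecutive counts. Try it locally with:
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys
from itertools import groupby

def mapper(stream):
    for line in stream:
        for word in line.split():
            print(word.lower() + "\t1")

def reducer(stream):
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(n) for _, n in group)))

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)

On a cluster, Hadoop Streaming runs the same two scripts across every block of the input; the local sort stands in for the shuffle phase.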

These are all great, but even Mighty Hadoop falls short of The Computer Wore Tennis Shoes, 'Open the pod bay door, HAL', and Watson. It turns out that there are A LOT of places where Hadoop really doesn't fit the problem domain. The first problems are tactical issues — things that Hadoop might do well, but it just doesn't:

Tactical Hadoop Issues:

  • Hadoop is run by a master node (namenode), providing a single point of failure.
  • Hadoop lacks acceleration features, such as indexing
  • Hadoop provides neither data integrity nor data provenance, making it practically impossible to prove that results aren't wooden nickels
  • HDFS stores three copies of all data (basically by brute force) — DBMS and advanced file systems are more sophisticated and should be more efficient
  • Hive (which provides SQL joins and rudimentary searches and analysis on Hadoop) is slow.

Strategic Hadoop Issues:

Then there are the strategic issues — places where map and reduce just aren't the right fit to the solution domain. Hadoop may be Turing-complete, but that doesn't make it a great match for every problem; as the Big Data Golden Child, though, Hadoop has been applied to everything! The realm of data and data analysis is (unbeknownst to many) so much larger than just MapReduce. These solution domains were once thought to be few: the first paper on the subject identified seven of them, so they were dubbed the "7 Dwarfs."

More have been revealed since that first paper, and Dwarf Mine offers a more general look at the kinds of solutions that make up the Data Problem Domain:

  1. Dense Linear Algebra
  2. Sparse Linear Algebra
  3. Spectral Methods
  4. N-Body Methods
  5. Structured Grids
  6. Unstructured Grids
  7. MapReduce
  8. Combinational Logic
  9. Graph Traversal
  10. Dynamic Programming
  11. Backtrack and Branch-and-Bound
  12. Graphical Models
  13. Finite State Machines

These dwarfs cover the Wide Wide World of Data, and MapReduce (and thus Hadoop) is merely one dwarf among many. "Big Data" can be so much bigger than we've seen, so

Let's see what paths to progress we might make from here…

If you have just some data: Megabytes to Gigabytes

  • Julia — Julia is a new entry in the systems-language armory for solving just about anything with data that may scale up to big. Julia has a sweet, simple syntax, and the benchmarks its authors publish show it already running blisteringly fast.

Julia is new, but it was built from the ground up with support for parallel computing, so I expect to see more from Julia as time goes by.

If you have kind of a lot of data: up to a Terabyte

Parallel Databases

Parallel and in-memory databases start from a familiar world (RDBMS storage and analytics) and extend it to order-of-magnitude greater processing speeds, with the same ACID guarantees and SQL access that a generation of developers has already used very successfully. The leading parallel players also generally offer the following advantages over Hadoop:

  • Flexibility: MapReduce gives the programmer great generality and almost limitless freedom, as long as you stay inside the map/reduce processing model and are willing to give up intermediate results and state. Modern database systems counter with user-defined functions and stored procedures, trading some of that freedom for a more conventional programming model.
  • Schema support: Parallel databases require data to fit the relational data model, whereas MapReduce systems let users work with free-format data. The added modeling work is a drawback, but since the principal patterns we're seeking are analytics and reporting, that "free format" generally doesn't last long in the real world.
  • Indexing: Indexes are so fast and valuable that it's hard to imagine a world without them. Moving text search from SQL to SOLR or Sphinx is the nearest comparison I can make in the web programming world: once you've tried it you'll never go back. The MapReduce paradigm, however, lacks indexing altogether (see the sqlite3 sketch after this list).
  • Programming Language: SQL is not exactly Smalltalk as a high-level language, but almost every imaginable problem has already been solved, and a Google search can take even novices to some pretty decent sample code.
  • Data Distribution: In modern DBMS systems, query optimizers can work wonders in speeding access times. In most MapReduce systems, data distribution and optimization are still often manual programming tasks.
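
To make the indexing point concrete, here's a toy experiment using Python's built-in sqlite3 module. The table, row counts, and names are all invented for illustration; the pattern is the point: the same query does a full scan before the index exists and a seek afterward.

index_demo.py:

# Toy demonstration of why indexes matter: time one point-lookup query
# before and after adding an index. Sizes and names are made up.
import random
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "cust-%d" % random.randrange(100000), random.random() * 100)
     for i in range(500000)],
)

def lookup_seconds():
    start = time.time()
    db.execute("SELECT SUM(total) FROM orders "
               "WHERE customer = 'cust-42'").fetchone()
    return time.time() - start

before = lookup_seconds()                          # full table scan
db.execute("CREATE INDEX idx_customer ON orders (customer)")
after = lookup_seconds()                           # index seek
print("scan: %.4fs   indexed: %.4fs" % (before, after))

Expect the indexed lookup to win by orders of magnitude, with the gap growing as the table does.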

Among the parallel players I've highlighted SciDB/Paradigm4 and VoltDB, not (only) because both are the brainchildren of Michael Stonebraker, but because both he and they have some of the best writing on the not-Hadoop side of the big data (re)volution.

Specific Solutions: Real-time data analysis

  • Spark
    • Spark is designed to make data analytics fast to write, and fast to run. Unlike many MapReduce systems, Spark allows in-memory querying of data, and consequently Spark outperforms Hadoop on many iterative algorithms (see the PySpark sketch after this list).
      • Spark Advantages:
        • Speed
        • Ease of Use
        • Generality (with SQL, streaming, and complex analytics)
        • Integrated with Hadoop (see my notes on Thomas Kuhn's Structure of Scientific Revolutions here and, most importantly, here)
  • Storm
    • Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
  • MPI: Message Passing Interface
    • MPI is a standardized and portable message-passing system designed to function on a wide variety of parallel computers. The MPI standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran or C (a minimal mpi4py sketch also follows this list).
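
To ground the Spark bullet above, here's a minimal PySpark sketch of the iterative pattern where in-memory caching pays off. The input path and the threshold loop are invented for the example; cache() is the real Spark call that keeps the working set in memory across passes, where a chain of MapReduce jobs would re-read the data from disk every time.

spark_iterate.py:

# Minimal PySpark sketch: cache a dataset in memory, then make several
# passes over it -- the pattern where Spark beats disk-bound MapReduce.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# Hypothetical input: one numeric measurement per line.
data = sc.textFile("hdfs:///data/measurements.txt").map(float).cache()

threshold = 100.0
for i in range(10):
    # Each pass reuses the cached RDD instead of re-reading from disk.
    count = data.filter(lambda x: x > threshold).count()
    print("pass %d: %d values above %.1f" % (i, count, threshold))
    threshold *= 0.9

sc.stop()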
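
And to ground the MPI bullet, here's a minimal send/receive sketch using mpi4py, one of the Python bindings for MPI. It assumes an MPI runtime such as Open MPI is installed; the message contents are invented.

mpi_ping.py:

# Minimal MPI ping: rank 0 sends a message, rank 1 receives it.
# Run with, for example: mpiexec -n 2 python mpi_ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"msg": "ping", "payload": [1, 2, 3]}, dest=1, tag=11)
    print("rank 0 sent ping")
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print("rank 1 received:", data["msg"])

Everything Hadoop hides (data distribution and inter-process communication) is explicit in MPI; that's what makes it both powerful and demanding.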

Specific Solution Types: Graph Navigation:

I've written about Graph databases before (Graph Databases and Star Wars), and they are the most DSL-like approach to many kinds of social-network problems, such as The Six Degrees of Kevin Bacon. The leaders in the field (as of this writing) include Neo4j, which I've covered in my earlier notes.

Specific Solution Types: Analysis:

Hadoop is great at crunching data yet inefficient for analyzing it, because each time you add, change, or manipulate data you must stream over the entire dataset.

  • Dremel
    • Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data.
  • Percolator
    • Percolator is a system for incrementally processing updates to a large data set. By replacing a batch-based indexing system with one based on incremental processing using Percolator, you significantly speed up the process and reduce the time to analyze data.
    • Percolator’s architecture provides horizontal scalability and resilience, and it reduces latency (the time between a page being crawled and appearing in the index) by a factor of 100 while simplifying the algorithm. Its big advantage is that indexing time becomes proportional to the size of the page being indexed, no longer to the size of the whole existing index.

If you really do have lots of data: Terabytes — but want something other than Hadoop

  • HPCC
    • HPCC enables parallel-processing workflows through Enterprise Control Language (ECL), a declarative, data-centric language (like SQL and Pig).
    • Like Hadoop, HPCC has a rich ecosystem of technologies. HPCC has two “systems” for processing and serving data: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop; Roxie is closer to a data warehouse (like HBase) and supports transactions. HPCC uses a distributed file system.
  • Other MapReduce-style solutions also tackle Big Data without Hadoop

Data is a rich world, and even this timestamped note will likely be outdated by the time it's published. The most exciting part of the "Big Data" world is that "Big" is increasingly redundant: ALL data is "big", and ever more powerful tools are appearing for all scales of data. Hadoop is a great tool, but in some aspects it has "2005" written all over it. Review the field, choose the tools for your needs, and…

"…go crazy — put it to a higher floor…"

Saturday, January 11, 2014

Predictions for 2014 - Part #1

"The most exciting phrase to hear in science... is not 'Eureka!' (I found it!) but 'That's funny.'" ~ Isaac Asimov

"One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them." ~ JRR Tolkien

1. The Big Easy — Big Data gets smaller part #1

I started with big data solutions back in 2008, not because (like Twitter or Facebook) I needed a way to escape the CAP limitations of SQL solutions, but in search of new value from data that we’d otherwise have discarded. CouchDB came first as an experiment in moving off MySQL for Rails apps. MongoDB came next, and persisted because of the following features:

  • Easy structure and protocols for SQL-trained DBAs to adopt
  • Mongoid and MongoMapper data modeling gems for Rails
  • JSON syntax and conventions

This got things started, and as time went on the Hadoop environment produced richer and richer toolsets for bigger and bigger data. These are great for global web-scale companies, but might miss the point for the rest of us. I made the point a couple of years ago that, for the rest of us, the key thing was NOT that big data was BIG — it was that big data was FAST. Now in 2014 we’re ready to take the next step forward: in 2014 everything is FAST, and so big data now needs to be EASY.

I’ve also written before that “there are only two kinds of problem that big data solves”: "Hindsight" (where something has happened and you want to know what in your pile of data might have predicted it) and "Foresight" (where you have a pile of data and want to know where it leads). Foresight solutions probably outnumber Hindsight 10:1, and, it being 2014, everybody should be familiar with recommendation engines (a toy sketch follows the list below). So here’s what we can expect in 2014:

  • Seeking insight from all forms (web, orders, social, searches) of social data has moved from Innovative to Best Practice
  • All that data is unstructured and needs structure. Hadoop-as-ETL rules the day — leading to…
  • Return of the Jedi: SQL databases and reporting tools were always good but couldn’t handle unstructured data. Hadoop is magical with unstructured data, but doesn’t easily provide real-time results, reporting, or hands-on analysis. So…
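
Since recommendation engines are the canonical Foresight case, here's a toy sketch of the germ of one in plain Python: user-to-user cosine similarity over an invented ratings table. Every name and number here is made up for illustration.

recommend.py:

# Toy "Foresight": recommend items by finding the most similar user.
from math import sqrt

ratings = {
    "alice": {"widget-a": 5, "widget-b": 3, "widget-c": 4},
    "bob":   {"widget-a": 4, "widget-b": 3, "widget-d": 5},
    "carol": {"widget-b": 2, "widget-c": 5, "widget-d": 1},
}

def cosine(u, v):
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    norms = (sqrt(sum(x * x for x in u.values())) *
             sqrt(sum(x * x for x in v.values())))
    return dot / norms

def recommend(user):
    _, nearest = max((cosine(ratings[user], ratings[other]), other)
                     for other in ratings if other != user)
    # Suggest whatever the most similar user rated that `user` hasn't seen.
    return [item for item in ratings[nearest] if item not in ratings[user]]

print(recommend("alice"))   # -> ['widget-d']

Real engines swap in smarter similarity measures and far bigger matrices, but the Foresight shape is the same: find patterns in the pile, then project them forward.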

2. Business Intelligence is BACK - Big Data gets smaller part #2

The ubiquity of Y2K-spawned ERP and enterprise data systems led to a golden age for BI, but those implementations are mossbacked now, and more than 65% of the new data generated today is unstructured. Standard BI solutions before the Hadoop boom might run $50K per terabyte in licensing fees alone, and their cost and structure made them a tough fit for vast, sloppy piles of interaction data. MapReduce to the rescue, and the operative part of the term is Reduce: petabytes of document-data crystalize like diamonds into gigabytes of nice, reduced rows and columns in conventional data stores (a miniature sketch follows below). Manage your data right, and the "Spreadsheets for the New Millennium" that I’ve written about previously here, here and here become exactly that:

Spreadsheets!
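
That crystallization is easy to picture in miniature. Here's a hedged Python sketch (the event records are invented) of raw interaction data reducing into the rows and columns a spreadsheet or BI tool wants:

reduce_events.py:

# Miniature Hadoop-as-ETL: reduce a pile of loosely structured events
# into tidy rows and columns.
from collections import defaultdict

events = [
    {"user": "u1", "action": "view", "sku": "A", "ts": "2014-01-02"},
    {"user": "u1", "action": "buy",  "sku": "A", "ts": "2014-01-02"},
    {"user": "u2", "action": "view", "sku": "B", "ts": "2014-01-03"},
    {"user": "u2", "action": "view", "sku": "A", "ts": "2014-01-03"},
]

# The "Reduce": one row per (sku, action), with a count. Spreadsheet fodder.
rows = defaultdict(int)
for e in events:
    rows[(e["sku"], e["action"])] += 1

print("sku | action | count")
for (sku, action), count in sorted(rows.items()):
    print("%3s | %6s | %5d" % (sku, action, count))

Scale the input up to petabytes and swap the dict for a Hadoop job, and that's the whole play: unstructured in, rows and columns out.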

3. JavaScript everywhere - MEAN and Meteor

Much of my earlier big data work started with Ruby on Rails — Rails was a great DSL for the web, and it provided a couple of wonders that are still wonders today, and were absolutely magical in 2005:

  • A web domain-specific-language. Write for the Web in the language (thinly-veneered http/html) of the web
  • Active Record — object-relational mapping for the rest of us — escaping the nasty .Net antipattern of starting with brittle stored procedures first and subsequently coding your way out into the user domain
  • Full-stacked-ness — with Active Record and similar patterns it only took one language (Ruby) to create your entire application — front end, back end, databases and presentation layer and all. Seemingly gone was the need for distinct (and non-cross-communicating) teams of DB-developers, middleware developers, and front-end developers.

I still love Rails, but by version 4 Rails has left its simple past. Gone are the days when DHH (and everyone like him) could produce a good demo of a blog application, crafted from scratch in the course of a 15-minute video. As Rails has gotten bigger and richer, its universality has declined: it became cumbersome, and newer programming models now start from the assumptions Rails pioneered.

I still love full-stack development and tools, and much as I like Rails as a replacement for sweet-but-overgrown J2EE, it’s now time to see what might advance us beyond sweet-but-overgrown Rails. Such an überstack might feature:

  • Full-stack: one programming language and model, top-to-bottom
  • Fast: I’ve always loved Smalltalk (Alan Kay’s gift to programming languages), but the cobra still bites at anything > 100 ms.
  • Universal: Separate teams with separate development languages is SO 1994! Even worse, desktops are so 1981 and modern code needs to expect to run on everything from handsets to big-screen displays.

Can any language and platform meet all these requirements? Fortunately, there is a solution!

Here’s what we need — power and consistency at the database, application, and presentation layers, with a common language and syntax across all layers. In Rails we covered the layers with Ruby from Active Record up to ERB, and if we’re going to get better and faster, we’ve got to get MEAN:

  • MongoDB (the database)
  • Express (the server-side web framework)
  • AngularJS (the front-end framework)
  • Node.js (the JavaScript runtime and web server)

Meteor and Ember.js are great emerging frameworks as well, and they all lead to full-stack development with JavaScript everywhere. As I’ve written before, Node.js is wicked fast with modern JS engines, other nice frameworks (Derby.js among them) are rising in the JS world, and JavaScript, the world’s most popular programming language, has moved as near to ubiquity as a programming language can. For a great introduction to JavaScript and what makes it good, plus a nice intro to MapReduce, you might look here: Can Your Programming Language Do This?. To paraphrase William Gibson, the future is already here, and it's about to become more evenly distributed...

We’ll need these tools and popularity for the yin and the yang of the modern web age: The Internet of Things and BubblePop, which I’ll cover next time.

Friday, November 29, 2013

Wicked - Cold!

I love computer science for the beauty of it, but often that beauty is obscured by the Moore's Law-rate of progress in the field. Things get so much better so quickly that it's easy to lose track of just how far we've come. It's also possible that even the most profound advances in processing speed get swallowed up by still-faster-bloating software.

The Moore's Law I mentioned above states roughly that computing speed doubles about every two years. So the computers we buy this Christmas should run about twice as fast as computers we got back in 2011. Sounds great, and while we still might be stuck waiting for computers of any generation to boot, worries about things like "speed of macro calculation" are lost to the distant past. Big Data is the computing frontier, because advanced analytics is the only area where calculations aren't instantaneous - yet.

At the other end of the spectrum, computer graphics is one of the best areas to see just how far computer power (and the human ingenuity that it advances) have come. My wife Kate and I went out to see the Disney film Frozen over Thanksgiving, and it was a great example of just how far we've come. The animation was breathtaking and beautiful, and it was even more beautiful because it was INVISIBLE.

Frozen is a fun movie and I hope you go see it, but forget the computer animation for a minute: great songs and two strong female characters, with the remarkable Idina Menzel as the "bad witch" and Kristen Bell playing the good witch this time. The songs are straight from (or possibly heading STRAIGHT TO) Broadway -- Frozen is really not a cartoon animated by computers, but a Broadway show animated by computers!

Forget "Frozen" -- they might have called it "Wicked - Cold!"

But back to our main point. The wonder of the computer animation of Frozen is NOT that it's wonderful (it is, but we've grown to expect that from that-which-once-was-Pixar). The remarkable thing is that the animation is invisible -- John Lasseter is back with his storytelling genius, and the show unfolds beautifully before our eyes.

John Lasseter is one of the wonders of our time, and it's a joy to see the computers let him tell his story. Such has it ever been: even 25 years ago (in 1988) Lasseter was already at work with a set of Macintosh IIs at his command. Each machine in his arsenal now is roughly 6,000 times more powerful than each box was then (25 years of Moore's Law is about 12.5 doublings, and 2^12.5 is roughly 5,800), but even then he was the sorcerer and the machines did his bidding.

To appreciate the storyteller and how much richer his current computer canvases are, just watch this wonderful time-piece -- his video "Pencil Test" from way back in 1988!

Mmmmmmmm... Computer animation sure has come a long way in the past 25 years. Craftsmen are still craftsmen, and whizzy graphics are nothing without a great story and a compelling storyteller. The great wonder of a John Lasseter or a Steve Jobs is that they could envision stories like "Frozen" -- back in 1988 and probably even earlier...

...and that's something we can all enjoy and be thankful for this Thanksgiving.

...Happy Thanksgiving, everybody...!

Wednesday, October 16, 2013

Pearls After Breakfast ~ Systems Orchestration with Ansible

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. ~ Antoine de Saint Exupéry

I've long been a fan of violinist Joshua Bell, and I've always been amazed at his skill. I've seen him in concert a half-dozen times in the past 5 years, and in each performance he has succeeded in producing what I can only call "Joshua Bell Moments": instances in time, perfect in themselves, in which you can just watch, listen and marvel. Bell is a dynamic performer onstage, and it's remarkable to see musical precision generated so dynamically. If you're not familiar with Joshua Bell or his music, you should start with the Pulitzer Prize-winning introduction here: Pearls Before Breakfast

In Pearls Before Breakfast, Bell describes his skills as 'largely interpretive', and his ability to mesh with a symphony orchestra is a wonder to behold. Orchestration in any field is hard, and it's the holy grail of modern computer architectures. MapReduce might be elegant but Hadoop surely isn't - the wonder isn't that it works elegantly, but that it works at all. These days it's routine to require a dozen or dozens of systems to support modern applications, so when a system promises "Radically simple IT orchestration," I've got to take a look.

But is the problem really that bad? Can we just "do it by hand" or simply script these things? In his book Taste Test: Puppet vs Chef vs Salt vs Ansible, Matt Jaynes writes:

There were about 60 servers altogether and no two were exactly the same since they had all been set up by hand. Over time, the systems engineers would periodically update the servers, but they always seemed to miss some. Every server was a “beautiful snowflake” and unique from all the others.

The development environments were all slightly different too. So, an app release that worked on one developer’s environment, wouldn’t work on another developer’s environment. And an app release that worked on all the developers’ environments would have bugs in the staging environment. And an app release that worked in the staging environment would break in production. You get the idea.

If you've worked in cloud applications you certainly do get the idea. So what Ansible offers is a simple, YAML-based approach to managing the raft of servers that run most modern web applications. Does it work? Is it Radically Simple? Let's try it out and see what we find.

We'll be doing a lot of cloud-work on the new platform that I'll call "Cloudburst", so first we'll update the Linux base machine I'm running on. I'm on Ubuntu 13.04 "Raring Ringtail" for the usual reasons: not because it's necessarily a "better" distribution, but because there is practically limitless documentation on Ubuntu distributions online.

So first let's update Ringtail, and make sure we've got an up-to-date-enough Python version for Ansible:


$ sudo apt-get update
$ sudo apt-get upgrade
$ python --version
Python 2.7.4


Great so far. Ansible does its work via SSH (not unlike the soon-to-be-written-about GitHub), so let's make sure our SSH keys are in place as well:


$ ssh-keygen -t rsa
$ cd ~/.ssh
$ cp id_rsa.pub authorized_keys


Ansible will look for its keys in the SSH directory file "authorized_keys", so once our copy is complete we're ready to install Ansible. Let's do that, then use ifconfig to identify the IP address of our target machine.


$ sudo apt-get install ansible
$ ifconfig
$ sudo vi /etc/ansible/hosts

    # Here's another example of host ranges, this time there are no
    # leading 0s:

    #db-[99:101]-node.example.com

    127.0.0.1

$ sudo apt-get install openssh-server


All set! Generally I'll use ifconfig to get the IP address for servers in my cloud stack, but here we'll just use the IP address of our base Ubuntu machine. Now let's try a quick call-out from Ansible to our target machine to confirm that our installation is complete, and that all is well:


$ ansible all -m ping
127.0.0.1 | success >> {
    "changed": false, 
    "ping": "pong"
}

$ ansible all -a "/bin/echo hello Ansible"
127.0.0.1 | success | rc=0 >>
hello Ansible


GREAT! We're almost finished. To confirm Ansible (and have it do a little work for us), I'm going to create a simple YAML file telling Ansible to make sure the web server nginx is installed and its config file is in the right place. Our YAML file will just be a start, but even as an example it's great at showing how closely the YAML description matches our task list. So what we want to do is (in pseudocode):

  • For all of our servers
    • Make sure https support is enabled
    • Make sure nginx is installed
    • Make sure the nginx.conf file is installed in the right place, and
    • Make sure nginx is running

So how much scripting and syntactic sugar do we need to get these tasks done? With Ansible, not much:


nginx.yml:

---
- hosts: all
  tasks:

  - name: Ensure https support for apt is installed
    apt: pkg=apt-transport-https state=present

  - name: Ensure nginx is installed
    apt: pkg=nginx-full state=present

  - name: Ensure the nginx configuration file is set
    copy: src=/app/config/nginx.conf dest=/etc/nginx/nginx.conf

  - name: Ensure nginx is running
    service: name=nginx state=started


Let's try it out, and see how our script works:


$ ansible-playbook nginx.yml


Success! This is just the beginning - from here we can use Ansible to

  1. Make a standard server install definition (Note: Not a script!)
  2. Apply that definition to all the servers in our stack
  3. Use the definition for regular stack updates

Our stack is now just another variable in our solutions, and with the setup-magic taken away, we can focus on human interaction and delivery-magic! It might be too much to aspire to the performance of a Joshua Bell, but with Ansible we can set up and tune the whole backing orchestra correctly every time...

Tuesday, October 15, 2013

Cloudburst - What "Client Server" Grew Up Into

Never trust a computer you can't throw out a window. ~ Steve Wozniak

Computing is not about computers any more. It is about living. ~ Nicholas Negroponte

At this point in the countdown, it seems appropriate to say to the crew of Discovery -- good luck, have a safe flight and to say once again 'Godspeed John Glenn' ~ Scott Carpenter

I grew up in the era of The Computer Wore Tennis Shoes, and I think it's in spite of this mindset that information systems have come as far as they have. Computers won't replace human thought, and that leaves us with the yin and yang of computers as tattletales and spies VERSUS computers as creation and communication tools.

I'm going to side with the creative Miller Puckettes of this world and write a bit today about what the computer - minus tennis shoes, minus client-server, minus "the Internet changes everything!" - has evolved into. Back in 2008 I had the privilege of designing an application (the revolutionary Sales Sonar by Innovadex) that employed many of the latest developments in the web delivery stack: Amazon AWS, Ruby on Rails v2.x, nginx, thin, mysql, JQuery, god, haproxy, Google Maps and more. Today I'm going to list out the soul of a 2013 new machine, secure in the knowledge that this genie too shall be surpassed, but equally sure that there will be fun to be had in building a platform of the latest pieces.

Here's where we begin -- I've written about many of the pieces here before, and in 2013 AFAIK the killer environment has/does the following:

  • Web deployed - Everything connects through the 'web
  • Handheld - Never trust a computer bigger than a Whopper
  • Cloud based - Nothing like the ability to spin up dozens of servers when we need them
  • Big Data'd - Lots of the data will be unstructured, and we'll want to ETL it with Hadoop
  • Data'd - Our MySQL fork MariaDB will handle routine data chores
  • Git'd - Git is de rigueur now, more and in different ways than Subversion was then
  • Scrumban'd - JIRA is very popular, but for friendliness and flexibility I'm going to give the nod to VersionOne
  • Ruby'd - this is still my favorite language, and with Ruby 2 and Rails 4 new vistas (like embedded applications) are possible
  • Handheld - here we'll use two orthogonal tools: RubyMotion and PhoneGap
  • Automatically deployed - Chef and Puppet are worthy candidates, but we're going to go lighter and faster with Ansible

I'll have lots more tools as people weigh in on their favorites (and slam the unworthy), but this gives us a start so let's kick the tire, light the fire and see how this new platform flies!

Godspeed Scott Carpenter - May 1, 1925 - October 10, 2013
