One of the great things about working in technology is that it's marked by seasons, and by watching the seasons you learn to plan for what's ahead. Much as a robin is a harbinger of Spring, a new McKinsey Technology report signals the arrival of a new technology for McKinsey to ponder. Now, I like McKinsey -- they are fine strategists, and I have a bunch of friends from Stanford GSB who landed there. Still, as technologists they might do well to stretch out their fingers and log in a bit deeper sometimes. At the beginning of 2009 their "Clearing the air on cloud computing" declared that 'Cloud computing' is approaching the top of the Gartner Hype-cycle. That was two full years ago, and the clouds haven't exactly burned off "cloud computing" since.
Well, like the baddies in Poltergeist II, "They're baaaack…" This time McKinsey weighs in on Big Data, and in classic McKinsey fashion they deliver terrific facts without providing any insight into why all this is happening around them. McKinsey's latest, "Big data: The next frontier for innovation, competition, and productivity," starts with the common Big Data red herring: "...the volume of data is growing at an exponential rate..." which is indisputable but totally misses the point. Data has been growing exponentially since at least the IBM 360 era -- almost 50 years now. The key point is NOT that the data is "Big." The data has always been big. The question is not Why Big Data? but Why Now?
The answer is not "Now the data is big" -- the answer is "Now the data is fast!" Google didn't become Google because their data was big -- Google went to MapReduce so they could keep growing the number of sites-crawled while still returning results in < 200 milliseconds, and now they're going to Google Instant because even 200 milliseconds isn't fast enough anymore. Consider all the action we're seeing today in NoSQL data stores -- the point is NOT that they are big -- the point is that apps need to quickly serve data that is globally partitioned and remarkably de-normalized. Even the best web-era app isn't successful if it isn't fast.
So for now let's forget about McKinsey. If you are looking for opportunity, the question to ask is NOT "Where is there Big Data?"; the question to ask is "Where can fast data really make a difference?"
…Even the best web-era app isn't successful if it isn't fast… This is the thinking that brought all the NoSQL data stores to social networking software. The new applications like Twitter and Facebook are huge and distributed but still have to be fast. To the billionaires who founded them, throwing out the conventions of the Relational model was a small price to pay for the success and scale that speed brought.
The core idea behind HBase and Cassandra as NoSQL leaders is that they may be schemaless (which is nice for web data) but they are not unstructured! What makes the column-oriented databases so magical is that they avoid the "6-JOIN" database push-up problem that Dare Obasanjo wrote up in When Not to Normalize your SQL Database. To get speed we're willing to make compromises with some of the core components of heretofore-modern data processing. To get speed we change some of the rules of the game.
Here are the new rules for software delivery in the Web era:
- You have 100 milliseconds to respond to a user action in a web application. This is where we ended up in my last post: Much over 100ms == FAIL.
- 100ms is one tough target, because:
  - Accessing a web server in Palo Alto from a site in NY costs 50-80 ms in latency alone (unless you can increase the speed of light)
  - Every router hop = about 3 ms
  - ESB response times = 10s of ms (maybe 100s)
  - XML marshalling / unmarshalling = 10s of ms (maybe 100s) -- this is why JSON is replacing XML in web apps
  - DB access = ~1 ms (good) to 10 ms (cheap) -- this is why 6-JOINs FAIL
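To make the arithmetic concrete, here is a rough budget check in Python. It is only a sketch: the per-component numbers are taken from the low end of the ranges in the list above, while the two request profiles, the hop count, and the names `remote_request` and `local_request` are assumptions made up for illustration.

```python
BUDGET_MS = 100  # the response-time target from the first rule


def total_ms(costs):
    """Add up the per-component latencies (in ms) for one request."""
    return sum(costs.values())


def over_budget(costs, budget_ms=BUDGET_MS):
    """True when the request cannot meet the response-time target."""
    return total_ms(costs) > budget_ms


# Hypothetical request served from the far coast through an ESB,
# with XML payloads and six sequential lookups on a 10 ms database.
remote_request = {
    "cross_country_latency": 50,  # low end of the 50-80 ms range
    "esb": 10,                    # low end of "10s of ms"
    "xml": 10,                    # low end of "10s of ms"
    "db": 6 * 10,                 # six lookups at 10 ms each
}

# Hypothetical request served from a nearby cache-backed store.
local_request = {
    "router_hops": 2 * 3,  # assumed two hops at about 3 ms each
    "db": 1 * 1,           # one lookup on a 1 ms database
}

print(total_ms(remote_request), over_budget(remote_request))
print(total_ms(local_request), over_budget(local_request))
```

Even with every component at its most optimistic value, the remote profile blows the budget, and the six database lookups are the largest single line item.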
So if we believe that the faster cobra always wins, here are the rules that fall out from this:
Rules for App Delivery in the Web Age
- You need to cache data near users -- round-the-world transmission = FAIL right off the bat (too far, too much latency and too many hops to be fast)
- ESBs for enterprise apps may be fine, but probably fail for web apps
- XML went away in web-space because it had to (JBoss' Marc Fleury once wrote a great article on this)
- One DB access is not fatal, but 6-10 surely are -- thus for the biggest web apps we find no JOINs, no Transactions, no Stored Procedures, and ultimately NO DATABASES (see the classic eBay Architecture, slides 22-23, and note that eBay has already moved most DB ops into App/RAM space)
- Zero lookups are better than one: Hello Memcached!
- If you have a DB, you had better get all you need with that one lookup. Thus columnar databases like HBase and Cassandra -- if I look up "Bill Smith" I get a big chunk of EVERYTHING known about Bill -- I then work on the chunk, and write it to storage as an object. RAM is cheap, buses are fast and this approach works in web-app-land.
- "Eventual" consistency is fine, as long as you have some idea of how eventual "eventual" really is
- Hadoop can prowl around in the background, making sure our data stores all eventually sync up
- Conventional data models no longer work here -- the world of fast big data is all about denormalization and deliberate duplication
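The "zero lookups are better than one" rule is usually implemented as a cache-aside read path. The sketch below shows the pattern only: a plain in-process dict stands in for Memcached, and `CountingStore` and `CacheAside` are hypothetical names invented for this example, not part of any real client library.

```python
class CountingStore:
    """Stand-in for a slow backing database; counts how often it is hit."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self._rows.get(key)


class CacheAside:
    """Check the cache first; fall through to the store only on a miss."""

    def __init__(self, store):
        self._store = store
        self._cache = {}  # assumption: a dict standing in for Memcached

    def get(self, key):
        if key in self._cache:
            return self._cache[key]       # zero database lookups
        value = self._store.get(key)      # one database lookup
        if value is not None:
            self._cache[key] = value      # populate for the next reader
        return value


store = CountingStore({"bill.smith": {"name": "Bill Smith", "city": "Palo Alto"}})
users = CacheAside(store)

first = users.get("bill.smith")   # miss: goes to the store
second = users.get("bill.smith")  # hit: served from the cache
print(store.lookups)
```

A real deployment also needs expiry and invalidation, which is exactly where the question of how eventual "eventual" is comes in.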
In this big data-driven world the data model morphs to provide the fast data that apps require. We thus have a new kind of app / data model -- much more object-oriented than the pure-data stores that have taken us this far.
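The "Bill Smith" rule from the list above can be illustrated with plain dictionaries. This is a toy sketch, not how HBase or Cassandra store data internally: the tables, keys, and function names are all made up, and each dictionary access stands in for one round trip to a database.

```python
# Normalized layout: one "table" per entity, linked by ids.
NORMALIZED = {
    "users": {1: {"name": "Bill Smith", "address_id": 10}},
    "addresses": {10: {"city": "Palo Alto"}},
    "orders": {
        100: {"user_id": 1, "item": "book"},
        101: {"user_id": 1, "item": "lamp"},
    },
}

# Denormalized layout: everything known about Bill lives under one key.
DENORMALIZED = {
    "Bill Smith": {
        "name": "Bill Smith",
        "city": "Palo Alto",
        "orders": ["book", "lamp"],
    },
}


def profile_from_normalized(db, user_id):
    """Assemble a profile from three tables; returns (profile, lookups)."""
    user = db["users"][user_id]                      # lookup 1
    address = db["addresses"][user["address_id"]]    # lookup 2
    orders = [o["item"] for o in db["orders"].values()
              if o["user_id"] == user_id]            # lookup 3
    profile = {"name": user["name"], "city": address["city"], "orders": orders}
    return profile, 3


def profile_from_denormalized(store, key):
    """Fetch the whole profile in one lookup; returns (profile, lookups)."""
    return store[key], 1


print(profile_from_normalized(NORMALIZED, 1))
print(profile_from_denormalized(DENORMALIZED, "Bill Smith"))
```

Both calls return the same profile; the difference is the number of round trips, and the price is that Bill's city and orders are now stored wherever they are needed rather than in one canonical place.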
In this world we go to NoSQL for access speed, and gain all kinds of other processing possibilities in the process. The beauty of Google (and other similar Hadoop-y efforts) is that once you get used to working in a Googleplex with MapReduce as a routine operation, you discover that there are all kinds of other operations that you can do in similarly massively parallel fashion. It's likely that most of the wins we're seeing in Big Data are coming, not from intrepid data explorers, but from routine operations-people who went in looking for speed and figured out that the approach yielded other discoveries as well.
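For readers who have not worked with it, the MapReduce shape is small enough to sketch in a single process. The code below is an illustration of the map, shuffle, and reduce phases only; it is not Hadoop's or Google's API, the function names are invented here, and a real framework would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Total the counts emitted for one word."""
    return sum(values)


pages = ["fast data wins", "big data is fast data"]
counts = map_reduce(pages, word_mapper, sum_reducer)
print(counts)
```

Swapping in a different mapper and reducer is all it takes to ask a different question of the same data, which is the "other operations" effect described above.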
What would happen if we STARTED with a data framework with an infinite distributed data store, MapReduce built in for unstructured data analysis, and Apache Solr as well for free-text search and structured data querying? We'd have an environment where the speed was free, and we could devote our energies to finding patterns in data. Now THAT would be magical…
Indy: That's the Ark of the Covenant.
Elsa: Are you sure?
Indy: Pretty sure.