Thursday, Jan 06, 2011

Why NoSQL Matters Today

“But what is it good for?”
(Engineer at the Advanced Computing Systems Division of IBM, commenting on the microchip, 1968)

It wasn't that long ago that "Procedural Programming" was all the rage. The Pascal and C programming languages ruled the universities, and Modula-2 was presented as the be-all and end-all of that world. Pascal and C were (and are) fine languages, but it didn't turn out like that.

The problem was that too many interesting problem domains just didn't match Pascal-and-C syntax, and trying to wedge them into that model was just too hard. "Object-oriented" programming took hold as a more logical match for many programming domains, and the web programming that followed swung to approaches that closely match web formalisms and protocols. Pascal and C were terrific, but they reached their limits when it became clear that too much of the world just didn't work that way.

So things stand today with the data management of business and web systems. Relational data models have ruled enterprise data management for more than 20 years -- to the point where it may be hard for generations of developers to imagine that there could be any kind of data model other than rows and columns. But the model is straining under business "big data" and Internet-era solutions. While these problems may still seem to fit into the old Payables / Receivables rows-and-columns world, we might want to take stock of what we are losing in jamming that square peg into that round hole. As Dare Obasanjo wrote:

What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you tend to notice that you’ve given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. And the end all you have left is that your data is Durable (i.e. it is persistently stored) which isn’t much better than you get from a dumb file system.
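To make that trade-off concrete, here is a minimal sketch of what a sharded layout can look like from the application's side. Everything in it is an assumption made for illustration: the shard count, the `shard_for` helper, and the plain Python dictionaries standing in for separate database servers. It is not how any particular product shards data.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this sketch


def shard_for(key: str) -> int:
    """Map a key to a shard using a stable hash (built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Each dictionary stands in for a separate database server.
user_shards = [dict() for _ in range(NUM_SHARDS)]
order_shards = [dict() for _ in range(NUM_SHARDS)]


def put_user(user_id: str, name: str) -> None:
    user_shards[shard_for(user_id)][user_id] = {"name": name}


def put_order(order_id: str, user_id: str, total: float) -> None:
    # Orders are sharded by order id, so an order and its user may live on
    # different shards. No foreign key can enforce the link between them.
    order_shards[shard_for(order_id)][order_id] = {"user_id": user_id, "total": total}


def orders_with_user_names():
    """Application-level 'join': scan every order shard, then look up each user."""
    rows = []
    for shard in order_shards:
        for order_id, order in shard.items():
            user = user_shards[shard_for(order["user_id"])].get(order["user_id"])
            rows.append((order_id, user["name"] if user else None, order["total"]))
    return sorted(rows)


put_user("u1", "Ada")
put_user("u2", "Grace")
put_order("o1", "u1", 25.0)
put_order("o2", "u2", 40.0)
put_order("o3", "no-such-user", 5.0)  # dangling reference: nothing stops this write

for row in orders_with_user_names():
    print(row)
```

The join has moved out of the database and into application code, and the dangling order is accepted without complaint. That is the loss of relational guarantees the quote describes.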

In the era of the Internet and "big data," a rich, powerful data store still makes sense, but a model originally designed for the 1950s accounting department may not make sense anymore. To better fit this new generation of problems, a new family of data store approaches has risen to prominence, characterized by the explanation that they are "Not Only SQL" -- the "NoSQL" data stores.

NoSQL approaches make sense for problem domains outside the traditional relational world, and can give vastly better performance for certain families of uses:

  • Frequently written, rarely read data (like web hit counters, or data from logging devices or space probes) work well in key-value stores like Redis, or document-oriented databases like MongoDB
  • Frequently-read, seldom written or updated data (see Facebook statistics below) benefit from several NoSQL data approaches: Memcached for transient data caching, Cassandra or HBase for searching, and Hadoop and Hive for data analysis
  • High-availability applications which demand minimal downtime do well with clustered, redundant data stores like Riak or Cassandra
  • Data that will be sync'd across multiple locations can benefit from the replication features of a database like CouchDB
  • Transient data (like web sessions and caches) do well in transient key-value data stores like Memcached
  • Big data arising from business or web analytics that may not follow any apparent schema but which will still require rich (possibly parallel) querying will do well in the family of access tools like Hadoop
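Two of the use cases above, hit counters and transient session data, come down to a very small set of operations on keys. The sketch below illustrates that idea with a hypothetical in-memory class, `TinyKeyValueStore`. It is a stand-in written for this post, not a Redis or Memcached client, and its method names are our own.

```python
import time


class TinyKeyValueStore:
    """In-memory stand-in for a key-value store (hypothetical, for illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> value
        self._expires = {}   # key -> absolute expiry time on the injected clock
        self._clock = clock

    def _purge_if_expired(self, key):
        deadline = self._expires.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._data.pop(key, None)
            self._expires.pop(key, None)

    def incr(self, key, amount=1):
        """Write-heavy pattern: bump a counter without reading it first."""
        self._purge_if_expired(key)
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]

    def set(self, key, value, ttl_seconds=None):
        """Transient pattern: store a value that disappears after ttl_seconds."""
        self._data[key] = value
        if ttl_seconds is None:
            self._expires.pop(key, None)
        else:
            self._expires[key] = self._clock() + ttl_seconds

    def get(self, key, default=None):
        self._purge_if_expired(key)
        return self._data.get(key, default)


store = TinyKeyValueStore()
store.incr("hits:/home")
store.incr("hits:/home")
store.set("session:abc123", {"user": "example"}, ttl_seconds=30)
print(store.get("hits:/home"), store.get("session:abc123"))
```

There is no schema, no join, and no transaction here, which is a large part of why stores built around this shape can be so fast for these workloads.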

A growing number of leading websites and business applications have migrated to NoSQL solutions, driven by the needs arising from their size, scale, and the unavoidable gap between the problem domains they serve and the structure of previously existing SQL solutions. The demand for NoSQL solutions didn't arise because of problems with the SQL language, but rather because of limitations in the relational model itself. In 2000 Eric Brewer outlined the core deficiency of the relational model in a partitioned, global data world with his CAP Theorem, which states that Consistency and high Availability cannot both be maintained when a database is Partitioned across a (fallible) wide area network. The CAP Theorem opened the door to consideration of data models where Partitioning and high Availability are the requirements, and Consistency is delayed (or "eventual") to meet Availability needs in a Partitioned world. NoSQL data store solutions, which provide partitioning and high availability while settling for "eventual" consistency, have been the result.
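A toy model may help show what "eventual" consistency means in practice. In the sketch below, two replicas keep accepting writes while they cannot reach each other, then reconcile afterward. The `Replica` class, the caller-supplied logical timestamps, and the last-write-wins rule are all assumptions chosen for brevity. Real data stores use a variety of reconciliation strategies, and this is not a description of any one of them.

```python
class Replica:
    """One copy of the data; stays available for reads and writes on its own."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, replica_name, value)

    def write(self, key, value, timestamp):
        self._apply(key, (timestamp, self.name, value))

    def read(self, key):
        entry = self.data.get(key)
        return entry[2] if entry else None

    def _apply(self, key, entry):
        current = self.data.get(key)
        # Last-write-wins: the higher (timestamp, replica_name) pair is kept.
        # The replica name only breaks ties between equal timestamps.
        if current is None or entry[:2] > current[:2]:
            self.data[key] = entry

    def sync_from(self, other):
        for key, entry in other.data.items():
            self._apply(key, entry)


def reconcile(a, b):
    """Exchange state in both directions once the partition has healed."""
    a.sync_from(b)
    b.sync_from(a)


east, west = Replica("east"), Replica("west")

# During a partition, both replicas accept writes (Availability) ...
east.write("status", "online", timestamp=1)
west.write("status", "away", timestamp=2)
during_partition = (east.read("status"), west.read("status"))

# ... and they agree again only after they can talk (eventual Consistency).
reconcile(east, west)
after_healing = (east.read("status"), west.read("status"))

print(during_partition, after_healing)
```

While the partition lasts, the two replicas give different answers to the same read. That window of disagreement is the price paid for staying available.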

NoSQL solutions have risen to prominence in many "social web" companies because the rigor and restrictions of a purely Relational world could never have met their data needs. We can see this from a look at the scalability challenges that a company like Facebook faces:

  • 570 billion page views per month
  • More photos than all other photo sites combined
    • More than 3 billion photos uploaded every month
    • 1.2 million photos served per second
  • 25 billion pieces of content, served by more than 30,000 servers
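Monthly totals are hard to picture, so it can help to convert them to per-second rates. The arithmetic below uses only the figures listed above and assumes a 30-day month, which is our simplification. Since the upload figure is stated as "more than" 3 billion, the upload rate is a lower bound and the read-to-write ratio is an upper bound.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

page_views_per_month = 570e9
photos_uploaded_per_month = 3e9   # stated as "more than", so a lower bound
photos_served_per_second = 1.2e6

page_views_per_second = page_views_per_month / SECONDS_PER_MONTH
photo_uploads_per_second = photos_uploaded_per_month / SECONDS_PER_MONTH
photo_reads_per_write = photos_served_per_second / photo_uploads_per_second

print(f"{page_views_per_second:,.0f} page views per second")
print(f"{photo_uploads_per_second:,.0f} photo uploads per second")
print(f"{photo_reads_per_write:,.0f} photos served per photo uploaded")
```

Under those assumptions the numbers work out to roughly 220,000 page views per second, and on the order of a thousand photos served for every photo uploaded: a textbook case of frequently read, seldom-written data.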

Facebook is clearly not a 1950s-era accounting department, and its data processing needs are vastly different from anything even considered in the Relational data era. Facebook has adopted a rich set of tools to meet these data challenges:

  • Memcached. Facebook runs thousands of Memcached servers with tens of TB of cached data at any point in time
  • Cassandra (now replaced by HBase). Distributed storage with no single point of failure
  • Hadoop and Hive. Used for massive data analysis and marketing analytics

These data architectures are key to Facebook's growth and scalability and they underlie the growth of other "big data" web companies like Yahoo, Foursquare and Twitter as well. These companies may represent the frontier of big data tools, but the core technologies that underlie their growth are generally available (often open-sourced) and are finding wide experimentation and adoption in the broader business community.

Much as Procedural Programming once gave way to approaches that were a better match to new problem domains, we expect the richness and flexibility of NoSQL solutions to play a growing role in business solutions now and in the future. Whole business solution approaches, such as Predictive Analytics, are becoming widely available only because of advances in data store technology. As NoSQL solutions evolve, the problems we can solve with them will become richer and more business-critical. There are already roughly a dozen major NoSQL packages under broad industry review, and as the NoSQL platform matures we are confident that NoSQL approaches will grow into an important direction in business solutions and database technology.

“First, solve the problem. Then, write the code.”
(John Johnson)

Reader Comments (1)

Great to see you back writing again John. Brilliant read!

January 12, 2011 | Unregistered Commenter jim