Happy Birthday, Bobby Fischer

We opened a group meeting at work today with the classic icebreaker "Two truths and a lie." In TTL everyone in the class writes down two obscure truths about themselves along with a single lie, and hands them in to the instructor on a folded sheet of paper. The instructor then selects a sheet at random, reads the "3 truths" and the class has to guess 1) who they apply to, and 2) which one is the falsehood.

I have a great fun truth for the game that runs like this:

I once won a chess tournament, playing blindfolded, and then didn't go on a date for 3 whole years!

This is a true story -- I was a terrific chess player and really did once win a tournament playing blindfolded in the town I grew up in. I also really did NOT go on a date for the 3 years that followed -- but that was OK, not because I was hopelessly geeky but because I was only 13 years old when I won, blindfolded.

My skills were in part the product of the man above, who would have celebrated his 67th birthday today. Bobby Fischer was the greatest American chess player ever -- maybe the greatest chess player ever, period. He won the World Chess Championship back in the Cold-Warry summer of 1972, and made the game of chess as much a sensation as chess could be back then.

I played chess all the time because Fischer was a sensation, and you could find people to play against easily back then. I played blindfolded because I'd read about historical American champion Paul Morphy, who was said to have been a great blindfold player by age 12. I was 13, and how hard could it be? It really wasn't that hard, and my mental images of board positions weren't blurred by my opponents having to recite their every move to me.

That was a fun time -- Fischer-Spassky taught a generation of American kids to spell "Reykjavik," and books like Fischer's My 60 Memorable Games gave my dad and me hours of fun -- playing each other and playing the classics. For me Fischer's most memorable game is his Game of the Century -- a breathtaking classic by a 13-year-old boy against one of the strongest masters of his day.

It's a fine line between brilliance and madness, and the eccentric Fischer crossed over and back freely between his triumph in 1972 and his death in 2008. His gifts to us were wondrous games that we can still enjoy today, on his birthday (he was born in 1943).

Happy Birthday, Bobby Fischer.

The Game of the Century (scorecard)


Casi Casi ... Cassandra

I've written a couple of times about the "N+1 Queries" problem, and I've suggested that it's a bane of relational database applications. But there's a way out of it -- let me tell you about it.

But first let's wallow in it a bit. I'm on Twitter, I've written a tweet, and I'm ready for it to be sent out to all of my (countless) followers... Here's what my code for that broadcast might look like:
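Something like this, say -- a hedged sketch with invented model names, written in an ActiveRecord-ish spirit rather than as real Twitter code:

```ruby
# Toy models standing in for ActiveRecord classes (my invention):
Tweet = Struct.new(:user, :text)
User  = Struct.new(:name, :followers, :inbox)

# Broadcast my tweet to everyone following me
def send_tweet(me, text)
  tweet = Tweet.new(me, text)
  me.followers.each do |follower|   # in SQL terms: a SELECT to load my
    follower.inbox << tweet         # followers, then work done per follower
  end
  tweet
end

alice = User.new('alice', [], [])
me    = User.new('mudcat', [alice], [])
send_tweet(me, 'Eggs for breakfast. Again.')
p alice.inbox.first.text   # => "Eggs for breakfast. Again."
```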

All fine so far -- that's a Ruby-ish take on the twittery world we all live in. I can send out my breathless message of what I had for breakfast, and then Twitter picks it up and broadcasts the message from me (as well as the messages from all the other tweeters):
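The site-wide broadcast loop is where the trouble hides. Here's a toy version that counts the "queries" a relational backend would issue (all names are my own, purely for illustration):

```ruby
User = Struct.new(:name, :followers, :tweets)

$queries = 0
def query(rows)         # stand-in for a SQL SELECT
  $queries += 1
  rows
end

bob   = User.new('bob',   [], [])
carol = User.new('carol', [], [])
alice = User.new('alice', [bob, carol], ['tea. again.'])
dave  = User.new('dave',  [bob],        ['coffee. again.'])

tweeters = query([alice, dave])           # 1 SELECT for the tweeters
tweeters.each do |tweeter|
  followers = query(tweeter.followers)    # +1 SELECT per tweeter: N+1
  tweeter.tweets.each do |text|
    followers.each { |f| f }              # deliver(f, text) would go here
  end
end

p $queries   # => 3
```

One query for the list of tweeters, then one more per tweeter: that's the N+1 shape, before we even touch the followers' own followers.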

So here we're going to do a query for each of the X tweeters, and for them we'll do another query for each of their Y followers.

Code smell! Fail Whale!!!

(particularly when you consider Dare Obasanjo's take on Twitter combinatorics)

The problem here is Relational: we need a SELECT to find me, and then a new SELECT to get the info on each of my followers. This "N+1 SELECTS" problem is a simplified version of a real problem, where relational databases stagger and where column-oriented databases are much more what we're looking for. Column oriented databases are designed to be fast at grabbing all of the attributes (columns) associated with a given entity. To understand why this is vital for a Twitter or any other social application, consider the one-to-manys: Twitter has many tweeters, who have many followers, who themselves have many followers... and so on.

Let's think, though, about the code that gets generated when I tweet. If we're using a relational database, we'll follow a SELECT for each of my followers with a SELECT for each of their followers -- so we get a polynomial number of SELECTs grinding away for each tweet, and as I get more popular the disks whirr and the lights dim every time I tweet about anything.

So to save the power grid let's try a little Twitter application, but this time using the column-oriented data store Cassandra to handle our users and tweets.

I'll run this from the same Amazon Cloud instance that I've used for my previous postings. In my terminal connected to Amazon, I enter:

sudo gem install cassandra

I've already put Java on my base instance, so I'm just about good to go! A single-line command, and it really does run...

Now, let's start Twitter and get tweeting. We'll use the Ruby interpreter IRB on Amazon to enter our users and their tweets:

root@ip-10-245-133-190:/var/www/apps# irb

We're rolling -- first we'll enter our requirements: rubygems to run our additional toys, cassandra to link to the data store we just installed, and SimpleUUID to identify our tweeters:

Now we'll start Twitter in Cassandra, and put in some users and screen names (I've mostly left the Cassandra responses out for brevity here):
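The real cassandra gem needs a live server behind it, so as a stand-in here's a sketch that mimics the shape of the gem's insert call with a nested Hash. The column-family names (:Users, :UserTweets) are my own illustration rather than Twissandra's exact schema, and stdlib SecureRandom stands in for the SimpleUUID gem:

```ruby
require 'securerandom'   # stdlib stand-in for the SimpleUUID gem

# column_family => { key => { column_name => value } }
STORE = Hash.new { |h, cf| h[cf] = Hash.new { |h2, k| h2[k] = {} } }

# Mimics the shape of the gem's client.insert(column_family, key, columns)
def insert(cf, key, columns)
  STORE[cf][key].merge!(columns)
end

insert(:Users, '5', 'screen_name' => 'mudcat')
insert(:UserTweets, '5', SecureRandom.uuid => 'Bass are biting today!')

p STORE[:Users]['5']   # => {"screen_name"=>"mudcat"}
```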

Great so far -- we have user 5, "mudcat," and we've given him a tweet. Let's give him someone to tweet to:
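Continuing the same Hash stand-in idea (again, family and column names are my own guesses, not Twissandra's exact schema): user 6 follows user 5, and reading 6's timeline means one direct lookup per followed user, with no JOINs anywhere:

```ruby
# column_family => { key => { column_name => value } }
DB = Hash.new { |h, cf| h[cf] = Hash.new { |h2, k| h2[k] = {} } }

def put(cf, key, columns)
  DB[cf][key].merge!(columns)
end

put(:Users, '5', 'screen_name' => 'mudcat')
put(:Users, '6', 'screen_name' => 'fishfan')
put(:Following, '6', '5' => Time.now.to_s)      # user 6 follows user 5
put(:UserTweets, '5', 'uuid-1' => 'Bass are biting today!')

# Timeline for user 6: gather the tweets of everyone 6 follows
timeline = DB[:Following]['6'].keys.flat_map do |followed|
  DB[:UserTweets][followed].values
end
p timeline   # => ["Bass are biting today!"]
```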

And there we are -- we have a reasonable data model for Twitter, backed by the Cassandra data store. Let's review what we've got here:

Cassandra works as a kind of multidimensional hash, and the data it contains can be referenced as:

  • A keyspace
  • A column family
  • An optional super column
  • A column, and
  • A key

Source: http://nimbledais.com/?tag=column-family

Here's what these all mean:

The keyspace is the highest, most abstract level of organization. Our Cassandra conf/storage-conf.xml file contains our keyspace definitions at startup.

The column-family is the chunk of data that corresponds to a particular key. In Cassandra each column family is stored in a separate file on disk, so frequently-accessed data should be placed in a column family for fastest access. Column families are also defined at startup.

A super column is a named list, containing standard columns stored in recency order.

A column is a tuple: a key-value pair with a key (name) and a value.

A key is the permanent name of the record, and keys are defined on the fly.

With this structure we're basically defining a schema, and I'd like to claim it's original, but this one was taken from Twissandra by Eric Florenzano.
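As a concrete picture of those levels, the whole hierarchy can be drawn as one nested Ruby hash (the Twitter keyspace and Users family echo the toy above; the FollowerTimelines super-column family is my own invented illustration):

```ruby
# keyspace -> column family -> key -> (super column ->) column -> value
TWITTER = {
  'Twitter' => {                               # keyspace: defined at startup
    'Users' => {                               # column family: defined at startup
      '5' => {                                 # key: created on the fly
        'screen_name' => 'mudcat'              # column: a name/value pair
      }
    },
    'FollowerTimelines' => {                   # a family using super columns
      '6' => {
        'timeline' => {                        # super column: a named list...
          'uuid-1' => 'Bass are biting today!' # ...of ordinary columns
        }
      }
    }
  }
}

p TWITTER['Twitter']['Users']['5']['screen_name']   # => "mudcat"
```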

The great thing about Cassandra is that it evolved to solve real-world problems: it may be free-form, but it is NOT exactly schema-less. Cassandra may fall in the "NoSQL" class with Hadoop, but the use cases that apply to the two could scarcely be more different. Runtime lookups are handled really well in Cassandra, thanks to its low-latency organization and strict definitions. Asynchronous analytics, with the freedom of high latency and demands for greater flexibility, are a better fit for systems like Hadoop.

Cassandra generally offers terrific performance. There is a tradeoff in eventual consistency, something that perhaps I'll take up in my next blog post.


Inventing the Future

The best way to predict the future is to invent it. ~ Alan Kay

"If I'd asked people what they wanted they would have said a faster horse." ~ Henry Ford

I really like NoSQL data stores and I appreciate the new approaches because they tackle one of the most hidebound parts of information technology -- enterprise storage. I still (dimly) recall the days before the predominance of the relational model, when "database analyst" was a specific (rather than general) skill, and terms like "current of run-unit" could pretend to have meaning. The relational data model brought order and reason to enterprise data, but the seductiveness of the relational answer to the question "How do we organize this?" over time came to preclude thoughts of any other form of organization.

It's not that there haven't been other successful data models and stores, but the other successful models have mostly been transparent -- you won't see the data model working because it is invisible to you. Microsoft Office applications have terrific data stores, but they don't place data in a relational model because there is simply no need to -- for Office apps, speed and flexibility trump the structure and order of the relational model. Wikipedia lists dozens of Windows file types, and across different operating systems and application types there are probably hundreds of data store types in general usage.

NoSQL approaches are so interesting and promising because they represent a break from the functional fixation that the relational model is the only conceivable model for enterprise data. In an earlier posting, I wrote:

Relational data models have ruled enterprise data management for more than 20 years -- to the point where it may be hard for generations of developers to imagine that there could be any kind of data model other than rows and columns.

This is the world we've lived in -- terrific for tabulations and Accounts Payable, but not necessarily well aligned with other problem domains. The challenge in moving beyond Relational is that our assimilation of "Relational" hinders us from looking at problems in anything other than Relational terms. This is a manifestation of a pattern that psychologist Karl Duncker called functional fixity -- the condition in which, because our principal tool is a hammer, the whole world comes to look like nails.

Eric Haseltine writes about functional fixity, creativity and invention in his book Long Fuse, Big Bang, in which he describes some of the opportunities that arise from seeing the world with new eyes. Haseltine writes:

In a classic functional fixity experiment, test subjects are asked how they would -- without grabbing the end of the rope -- get the ends of a rope suspended from a ceiling to touch opposite walls. Aside from the rope, the only object in the room is a pair of pliers sitting on the floor. Most test subjects don't realize that the only way to solve the problem is to tie the pliers to the end of the rope, then swing it so that the pliers -- acting as a weight -- carry the end of the rope to one wall, then, in pendulum style, to the other. The few subjects who perceive the pliers as a weight instead of a tool solve the problem.

The progression of Moore's Law, together with the advance of NoSQL data approaches, cloud deployment, and modern languages and tools, opens doors to solutions that would have been inconceivable without the synchronicity of these technologies. We have new solution archetypes:

  • Key-value storage (Memcached, Redis, Voldemort) for caches and transient data...
  • Hadoop and MapReduce for massively parallel data processing
  • Document databases (MongoDB, CouchDB) for log files and serial data
  • Graph databases (Neo4j) for the social graph and applications

...but the kinds of solutions possible with these new approaches are far broader than that. We just need to keep building the skills to see such solutions as they become possible.

“The real voyage of discovery consists not in seeking new landscapes but in having new eyes.” ~ Proust


Spreadsheets for the New Millennium

"We shape our tools, and thereafter our tools shape us." ~ Marshall McLuhan

I've done a lot of writing on "big data" and NoSQL solutions over the past couple of months, and it's time to take stock of it all: "Why should anyone care about any of this?" I've spent some time in 2011 playing with the four main varieties of new "NoSQL" data stores:

  • I started with a simple little note taking app, hosted on the cloud and backed by document-oriented store MongoDB --
  • The second application was a URL shortener, hosted on the cloud and backed by the key-value NoSQL store Redis -- based on a terrific example, An URL shortener with Ruby on Rails 3 and Redis by Christoph Petschnig. You can play with the URL shortener here: Mini-URLs -- bit.ly for the masses
  • The third application was a nice Hadoop / Cloudera model to perform word counts across a bunch of files. It was based on Phil Whelan's terrific posting and it showed just how easy it can be to trigger MPP with Hadoop and Cloudera
  • My final example is a "6 Degrees of Kevin Bacon" social-graph solution generator built on the graph database Neo4j, based on Ian Dees' terrific Everyday JRuby posting
    This is a fun little app, and you can play the 6-Degrees game here: 6-Degrees of Kevin Bacon

So we have some examples up, but why does this matter? Who cares?

As it turns out we have a confluence of a number of technologies and trends that make solutions of an entirely new type possible. Here's what's new:

  • The Internet makes it possible to market to the masses, and to keep detailed records of their responses to stimuli
  • The Cloud makes it possible to spin up supercomputer-level resources with no capital expenses
  • The "NoSQL" family of data stores has arisen to deal with non-relational data challenges: Hadoop for MPP; Memcached, Redis, and Voldemort for transient data; MongoDB and CouchDB for log data; etc.
  • Visualization tools like Tableau make it possible to create stories and narratives that result from the other tools here

So why does this matter, and how might things turn out for the pioneers of something new like this? For the answer, let's look back to a previous example, in which an MBA student decided to try programming a "toy" personal computer to manage data for the business cases he encountered at school. In 1979 the MBA (Dan Bricklin) asked a buddy (Bob Frankston) to help him code up a solution on a primitive, toy Apple II. How did that turn out for them? As Steven Levy wrote, way back in 1984:

It is not far-fetched to imagine that the introduction of the electronic spreadsheet will have an effect like that brought about by the development during the Renaissance of double-entry bookkeeping. Like the new spreadsheet, the double-entry ledger, with its separation of debits and credits, gave merchants a more accurate picture of their businesses and let them see – there, on the page – how they might grow by pruning here, investing there. The electronic spreadsheet is to double entry what an oil painting is to a sketch. And just as double-entry changed not only individual businesses but business, so has the electronic spreadsheet.

And so it happened -- a novel application (the spreadsheet) on a toy machine gave business managers a new way to track and value their businesses. The "and value" is key here -- this was an application of the technology that Bricklin and Frankston never really considered, and advances to support valuation (a 1MB memory space and macros, in Mitch Kapor's Lotus 1-2-3) enabled the Mergers and Acquisitions boom that continues to this day; this application (in vehicles like hedge funds) created more than a dozen billionaires in 2007 alone.

So this is why NoSQL, big data, the cloud and visualization are so fascinating to watch. They enable solutions to problems that would have been inconceivable even a decade ago. The cloud era doesn't have its "spreadsheet," yet, but it will...


Graph Databases and Star Wars

Source: http://livegreenstlouis.files.wordpress.com/2007/10/star-wars-bar.jpg

Often when we speak of the social graph, big data and new applications, we present them as steps to the epiphany: "Wouldn't it be great if you could do THIS!?" This is a great approach in good times, and even in a tough economy it's a fine message for visionaries, as it appeals to one of the two core emotions that generally underlie crossing the chasm and signing on to a deal. It appeals, at base, to greed.

That's a great way to get a deal done, but it sure isn't the only way. In difficult times, many executives are driven not by greed but by fear. Even in the strongest businesses, executives are staring down veritable Sarlacc Pits of worry, and the winning message is often NOT fulfilling their aspirations at the top of Maslow's pyramid, but calming their fears at Maslow's base.

Google is a great example of "the strongest of businesses." Google has been the most compelling business on the Internet, but today even Google has some real problems. Google's challenges have made it as far as Paul Kedrosky, and Google has one very big, very current problem: it is weak in "local" and "social" search. In a world of content farms and successful walled gardens, Google's Pagerank model is finally exposed as context-limited and not semantic. For local and social search Google created Percolator -- something more real-time than Pagerank/MapReduce. That's a step forward, and Google still might show you the most popular content, but with the rise of content farms such as Demand Media and Answers.com, it may no longer be showing the best content.

In Google's world links cost money, and Google probably never imagined that it would be economically viable to spam Pagerank. The key point here isn't about Google; it's the recognition that in 2011 it's possible to spam anything! No Marketing executive is immune from the question: "How do we keep from getting spammed out of the marketplace?" The answer is to use social linkages to de-spammify their messages and their marketing. "Social" might be a great visionary message, but we're pitching fear here: if you don't go social, your marketing message may no longer be seen by anyone!

What does "Social" mean, and why does it matter so much now? In Pagerank, loosely speaking, one link offers the same level of validation as any other link. This is fine in an asocial world, but in our social world we all know that some links count WAY more than others. A billion Google users can't be wrong, but I don't value their opinions nearly as much as those from co-workers, friends and family -- the people close to me. But how can Google tell who is close to me? From the Marketing standpoint, wouldn't it be great to know how close anyone is to anyone else? But how can you know that?

To understand this problem, let's play a game. It's a once-popular parlor game called "Six Degrees of Kevin Bacon," and you can play along here: Six Degrees of Kevin Bacon. The original code for the game can be found at Pragmatic Programmers - Everyday JRuby, with modest updates for Ruby and Amazon hosting by me. The game is one that actor Kevin Bacon brought on himself:

In a February 1994 Premiere magazine interview about the film The River Wild, Kevin Bacon commented that he had worked with everybody in Hollywood or someone who's worked with them.

Our game tests this premise, and you play it by naming any other actor and trying to span the personal connections that link Kevin Bacon with that actor. Let's try it with my favorite character actor, the late John Cazale. (Note: Cazale acted in 5 movies during his life; all 5 were nominated for Best Picture, and 3 won. In fact, Cazale appeared in one further picture after his death, and THAT was nominated for Best Picture, too!)

So let's let the computer play:

Not bad! John Cazale has a "Kevin Bacon #" of 2, as he is not Kevin Bacon himself (#0), but he's acted with Bruno Kirby, who has a Bacon # of 1. The game is a fun one, and on playing it you discover that Kevin Bacon is basically correct -- practically the whole movie industry has a Bacon # of 3 or less. The game offers other wrinkles as well: we can look up Athletes as Actors here:

We can also make up our own categories, such as "Famous 4-letter celebrities who also acted in films" such as the Cher-Bono linkage below:

The point here is that tracing social linkages is actually a pretty hard problem: (in this case) we have each Movie casting many Actors, and each Actor appearing in many Movies. This is not unsolvable, but it's a pretty challenging problem if all of our data is kept in a Relational database. If instead our data is stored in a Graph database, the problem is a lot easier, and possibly a lot faster to solve as well. A graph database, such as the Neo4j database used here, will reasonably have Dijkstra's algorithm for shortest-path traversal in its instruction set, and the command to solve the Kevin Bacon problem can be a one-liner as easy as:

database.shortest_path 'Cher', 'Bono'

This is the command I executed to produce the output shown above. I won't go into the programming or cloud setup in this posting, but if you're curious you can download and review the code at Six Degrees of Graph Databases. To run your own demo you'll need JRuby (as shortest_path draws on core Java libraries) and the Neo4j graph database, but the setup is pretty straightforward from there (and I'll write on it if anybody's interested).
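If you'd like to see the idea without standing up Neo4j, the heart of shortest_path is just a graph traversal. Here's a plain-Ruby breadth-first-search sketch over a tiny hand-built graph, using only the Bacon-Kirby-Cazale edges from the example above:

```ruby
# Adjacency list: actor => actors they've shared a film with
# (just the connections from the Cazale example; a real graph is far bigger)
GRAPH = {
  'Kevin Bacon' => ['Bruno Kirby'],
  'Bruno Kirby' => ['Kevin Bacon', 'John Cazale'],
  'John Cazale' => ['Bruno Kirby']
}

# Breadth-first search: the number of hops from Kevin Bacon to the target
def bacon_number(graph, target, source = 'Kevin Bacon')
  dist  = { source => 0 }
  queue = [source]
  until queue.empty?
    actor = queue.shift
    return dist[actor] if actor == target
    (graph[actor] || []).each do |costar|
      next if dist.key?(costar)       # already visited
      dist[costar] = dist[actor] + 1
      queue << costar
    end
  end
  nil   # no path: not in the Bacon-verse at all
end

p bacon_number(GRAPH, 'John Cazale')   # => 2
```

Neo4j's shortest_path does the same kind of walk over a persistent, indexed graph, which is why the one-liner above stays fast as the graph grows.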

There are a number of key lessons here, and there's a reason I started this posting with a picture of the Cantina at Mos Eisley from Star Wars.

Anything can be spammed -- if you want a close relationship with your customers, you're going to have to get social with them. Social relationships aren't easy to formalize -- a lot of the data is fuzzy, and most of the relationships are many-to-many. Graph databases like Neo4j are a great match for capturing social information, relatively easy to program, and fit well with a dynamic-language "cloud" world.

Social graphing is the kind of problem that offers great solutions if you JUST HAVE THE RIGHT TOOLS!

Finally, about Mos Eisley above. My personal social graph of people I worked with at Oracle probably has nearly a thousand people with a "Repko #" of 1, and untold tens of thousands with numbers of 3 or less. Two key points on that:

  • I worked at other tech companies (Apple, HP, MSFT) at various points in the past, and I have similar communities from each of them
  • I'm not unusual at this -- we've all worked in the past with lots of great people!

I've often described Oracle as the "Star Wars Bar Scene" of high tech -- at some point everybody in the universe will wander through there. My badge number there was around 10,000, and I believe that if they still number badges, then (with Oracle's acquisitions) the numbers may top 250,000 now.

We've all worked with lots of great people....
