Journal - Insight from Visual Mathematics

Thursday

May 19, 2011

The Big Easy - Spreadsheets for the New Millennium Part 2

Thursday, May 19, 2011 at 1:30PM

Back in January I wrote a post that I called Spreadsheets for the New Millennium, and in that posting I suggested that Big Data would never take hold in the public consciousness until there were gateways to it -- tools that could let anybody play with it, tools that could make it easy.

I love the idea of Mardi Gras -- everybody picks a time (the Tuesday before the start-of-Lent Ash Wednesday) and a place (New Orleans' French Quarter) to get together to have a party. I love it because that practically never happens in technology. As much as we celebrate advances in technology, we need to celebrate them because progress (both discovery and adoption) is so hard.

Why does it seem to be so hard for communities to assemble and take up advances in science? This is one of the key questions in Thomas Kuhn's The Structure of Scientific Revolutions.

What does it take for an idea to break through? Here is Kuhn's answer:

Kuhn saw that for a new candidate paradigm to be accepted by a scientific community, "First, the new candidate must seem to resolve some outstanding and generally recognized problem that can be met in no other way.

Second, the new paradigm must promise to preserve a relatively large part of the concrete problem-solving ability that has accrued to science through its predecessors..."

Or, as Steve Jobs might say: Think Different, but not Too Different.

For Big Data to become mainstream technology we need to satisfy two conditions:

Solve a generally-recognized problem that can be met no other way. Check - Progressive Insurance can give you an estimated insurance quote on the spot, but only because they pre-calculate a rate quote for every car in the US every night -- using Hadoop. The problem and solution are simple, but without Hadoop they could never generate every car, every night...
...Preserve its predecessors. FAIL. Hadoop is a terrific tool, but unless you've had the good fortune to take MIT's 6.001 or Yale's CS 323 you've probably never seen anything like it.

The Hadoop community has desperately added Hive and Pig to try to reduce the foreignness-barrier of functional programming and get Hadoop over Kuhn's second barrier.

Brook Byers of Kleiner Perkins put me onto Kuhn back when I was at Stanford, so it's no surprise that KPCB announced a $9M round with the Hadoop-y Big Data company Datameer. From the announcement:

Datameer’s Analytics Solution, which integrates the data mining power of Hadoop with a spreadsheet interface, enables business users to run analytics against very large data sets with no programming required. The product is designed to help users with little to no computer engineering experience handle massive amounts of data.

Kleiner and Datameer aren't alone in the race for Spreadsheets for the New Millennium; Factual is another player that's raised a lot of smart money.

These are great approaches, but in the best open-source tradition we'll bring up a solution that does a lot of the same things -- based in open-source code -- in one of my next postings.

John Repko | Comments Off |

Friday

May 13, 2011

Not "Big Data" -- FAST Data

Friday, May 13, 2011 at 9:23PM

One of the great things about working in technology is that it's marked by seasons, and watching the seasons you learn that you can plan what's ahead. Much as a robin is a harbinger of Spring, a new McKinsey Technology report signals the arrival of a new technology for McKinsey to ponder. Now, I like McKinsey -- they are fine strategists, and I have a bunch of friends from Stanford GSB who landed there. Still, as technologists they might do well to stretch out their fingers and log in a bit deeper sometimes. At the beginning of 2009 their Clearing the air on cloud computing declared that 'Cloud computing' is approaching the top of the Gartner Hype-cycle. That was two full years ago, and the clouds haven't exactly burned off "cloud computing" since.

Well, like the baddies in Poltergeist II, "They're baaaack…" This time McKinsey weighs in on Big Data, and in classic McKinsey fashion they deliver terrific facts without providing any insight on why all this is happening around them. McKinsey's latest, Big data, the next frontier for innovation, competition and productivity, starts with the common Big Data red herring: "...the volume of data is growing at an exponential rate..." which is indisputable but totally misses the point. Data has been growing exponentially since at least the IBM 360 era -- almost 50 years now. The key point is NOT that the data is "Big." The data has always been big. The question is not Why Big Data? but Why Now?

The answer is not "Now the data is big" -- the answer is "Now the data is fast!" Google didn't become Google because their data was big -- Google went to MapReduce so they could keep growing the number of sites-crawled while still returning results in < 200 milliseconds, and now they're going to Google Instant because even 200 milliseconds isn't fast enough anymore. Consider all the action we're seeing today in NoSQL data stores -- the point is NOT that they are big -- the point is that apps need to quickly serve data that is globally partitioned and remarkably de-normalized. Even the best web-era app isn't successful if it isn't fast.

So for now let's forget about McKinsey. If you are looking for opportunity, the question to ask is NOT "Where is there Big Data?" -- the question to ask is "Where can fast data really make a difference?"

…Even the best web-era app isn't successful if it isn't fast… This is the thinking that brought all the NoSQL data stores to social networking software. The new applications like Twitter and Facebook are huge and distributed but still have to be fast. To the billionaires who founded them, throwing out the conventions of the Relational model was a small price to pay for the success and scale that speed brought.

The core idea behind HBase and Cassandra as NoSQL leaders is that they may be schemaless (which is nice for web data) but they are not unstructured! What makes the column-oriented databases so magical is that they avoid the "6-JOIN" database push-up problem that Dare Obasanjo wrote up in When Not to Normalize your SQL Database. To get speed we're willing to make compromises with some of the core components of heretofore-modern data processing. To get speed we change some of the rules of the game.
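To make that trade-off concrete, here is a minimal sketch in plain JavaScript. In-memory Maps stand in for storage, and every name in it (the "tables", the keys, the `lookups` counter) is hypothetical and invented for illustration; this is not HBase or Cassandra client code. It only counts how many round trips each layout needs to assemble the same profile.

```javascript
// Toy illustration only: Maps stand in for storage; all names are hypothetical.
var lookups = 0;

function get(store, key) {
  lookups += 1;              // count every round trip to "storage"
  return store.get(key);
}

// Normalized layout: one profile is spread across three "tables".
var users  = new Map([['u1', { name: 'Bill Smith', cityId: 'c1', planId: 'p1' }]]);
var cities = new Map([['c1', { city: 'Palo Alto' }]]);
var plans  = new Map([['p1', { plan: 'premium' }]]);

function normalizedProfile(id) {
  var user = get(users, id);
  var city = get(cities, user.cityId);   // first "JOIN"
  var plan = get(plans, user.planId);    // second "JOIN"
  return { name: user.name, city: city.city, plan: plan.plan };
}

// Denormalized layout: one wide row already holds everything.
var wideRows = new Map([['u1', { name: 'Bill Smith', city: 'Palo Alto', plan: 'premium' }]]);

function denormalizedProfile(id) {
  return get(wideRows, id);
}

lookups = 0;
var a = normalizedProfile('u1');
var normalizedCost = lookups;

lookups = 0;
var b = denormalizedProfile('u1');
var denormalizedCost = lookups;

console.log('normalized lookups:', normalizedCost);
console.log('denormalized lookups:', denormalizedCost);
```

Both calls return the same profile; the only difference is the number of trips to storage, which is the cost the wide-row layout is designed to avoid. The price, not shown here, is that the same fact may be stored in more than one row.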

Here are the new rules for software delivery in the Web era:

You have 100 milliseconds to respond to a user action in a web application. This is where we ended up in my last post: Much over 100ms == FAIL.

100ms is one tough target, because:
- Accessing a web server in Palo Alto from a site in NY costs 50-80 ms just in latency (unless you can increase the speed of light)
- Every router-hop = about 3ms
- ESB response times = 10s of ms (maybe 100s)
- XML marshalling / unmarshalling = 10s of ms (maybe 100s)
  (this is why JSON has replaced XML in web apps)
- DB access = ~1 (good) to 10 (cheap) ms -- this is why 6-JOINS FAIL
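The arithmetic behind that list can be sketched as a simple budget. The per-step costs below are taken from the low end of the ranges above; the hop counts, the local-latency figure, and the JSON cost are my own assumptions for illustration, not measurements. Hops are counted as a separate line item, the same way the list does.

```javascript
// Latency budget sketch. Costs marked "assumed" are illustrative guesses.
var BUDGET_MS = 100;

function totalMs(costs) {
  return costs.reduce(function (sum, c) { return sum + c.ms; }, 0);
}

function withinBudget(costs) {
  return totalMs(costs) <= BUDGET_MS;
}

// A request served across the country, enterprise-style.
var farAway = [
  { step: 'NY to Palo Alto latency',   ms: 50 },  // low end of 50-80 ms
  { step: 'router hops (5 x 3 ms)',    ms: 15 },  // hop count assumed
  { step: 'ESB',                       ms: 10 },  // low end of "10s of ms"
  { step: 'XML marshalling',           ms: 10 },  // low end of "10s of ms"
  { step: '6 DB accesses (6 x 10 ms)', ms: 60 }   // the 6-JOIN case on cheap storage
];

// The same request served from a cache near the user.
var nearby = [
  { step: 'local latency',             ms: 5 },   // assumed
  { step: 'router hops (2 x 3 ms)',    ms: 6 },   // hop count assumed
  { step: 'JSON encoding',             ms: 1 },   // assumed
  { step: '1 DB access',               ms: 1 }    // good storage
];

console.log('far away:', totalMs(farAway), 'ms, within budget:', withinBudget(farAway));
console.log('nearby:  ', totalMs(nearby), 'ms, within budget:', withinBudget(nearby));
```

Even with every step priced at its most favorable figure, the far-away path comes to 145 ms and blows the budget, while the nearby path comes to 13 ms.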

So if we believe that the faster cobra always wins, here are the rules that fall out from this:

Rules for App Delivery in the Web Age

You need to cache data near users -- round-the-world transmission = FAIL right off the bat (too far, too much latency and too many hops to be fast)
ESBs for enterprise apps may be fine, but probably fail for web apps
XML went away in web-space because it had to (JBoss' Marc Fleury once wrote a great article on this)
One DB access is not fatal, but 6-10 surely are -- thus for the biggest data we find no JOINS, no Transactions, no Stored Procedures, and ultimately NO DATABASES (see the classic eBay Architecture, and note that eBay has already moved most DB ops into App/RAM space (slides 22-23)) for web apps
Zero lookups are better than one: Hello Memcached!
If you have a DB, you better get all you need with that one lookup. Thus columnar databases like HBase and Cassandra -- if I look up "Bill Smith" I get a big chunk of EVERYTHING known about Bill -- I then work on the chunk, and write it to storage as an object. RAM is cheap, buses are fast and this approach works in web-app-land.
"Eventual" consistency is fine, as long as you have some idea of how eventual "eventual" really is
Hadoop can prowl around in background, making sure our data stores all eventually sync up
Conventional data models no longer work here -- the world of fast big data is all about denormalization and deduplication
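The "zero lookups are better than one" rule is usually implemented as a cache-aside read. Here is a minimal sketch of that pattern, assuming a plain Map as a stand-in for Memcached and a hypothetical `readFromDb` function as the slow store; neither is a real client API.

```javascript
// Cache-aside sketch: a Map stands in for Memcached; readFromDb is hypothetical.
var cache = new Map();
var dbReads = 0;

function readFromDb(key) {
  dbReads += 1;                        // each call represents one database access
  return { key: key, source: 'db' };
}

function fetchProfile(key) {
  if (cache.has(key)) {
    return cache.get(key);             // cache hit: zero database lookups
  }
  var value = readFromDb(key);         // cache miss: exactly one lookup
  cache.set(key, value);               // remember it for next time
  return value;
}

fetchProfile('bill-smith');
fetchProfile('bill-smith');
fetchProfile('bill-smith');
console.log('database reads for three requests:', dbReads);
```

This sketch never expires or invalidates an entry, so a cached value can go stale. That is exactly where the "how eventual is eventual" question from the list comes in.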

In this big data-driven world the data model morphs to provide the fast data that apps require. We thus have a new kind of app / data model -- much more object-oriented than the pure-data stores that have taken us this far.

In this world we go to NoSQL for access speed, and gain all kinds of other processing possibilities in the process. The beauty of Google (and other similar Hadoop-y efforts) is that once you get used to working in a Googleplex with MapReduce as a routine operation, you discover that there are all kinds of other operations that you can do in similarly massively parallel fashion. It's likely that most of the wins we're seeing in Big Data are coming, not from intrepid data explorers, but from routine operations-people who went in looking for speed and figured out that the approach yielded other discoveries as well.
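For readers who haven't seen the MapReduce shape before, here is a toy, single-process sketch of its three steps (map, group by key, reduce) as a word count. It assumes nothing about Hadoop's actual API; the function names are mine, and the real thing spreads each phase across many machines.

```javascript
// Toy MapReduce shape in one process. Function names are illustrative only.
function mapPhase(docs, mapFn) {
  var pairs = [];
  docs.forEach(function (doc) {
    mapFn(doc).forEach(function (pair) { pairs.push(pair); });
  });
  return pairs;                        // list of [key, value] pairs
}

function shuffle(pairs) {
  var groups = {};
  pairs.forEach(function (pair) {
    var key = pair[0];
    if (!Object.prototype.hasOwnProperty.call(groups, key)) {
      groups[key] = [];
    }
    groups[key].push(pair[1]);         // collect all values for the same key
  });
  return groups;
}

function reducePhase(groups, reduceFn) {
  var out = {};
  Object.keys(groups).forEach(function (key) {
    out[key] = reduceFn(key, groups[key]);
  });
  return out;
}

var docs = ['fast data', 'big data', 'fast fast data'];

var counts = reducePhase(
  shuffle(mapPhase(docs, function (doc) {
    return doc.split(' ').map(function (word) { return [word, 1]; });
  })),
  function (word, ones) {
    return ones.reduce(function (a, b) { return a + b; }, 0);
  }
);

console.log(counts);
```

Because each document is mapped independently and each key is reduced independently, both phases can be farmed out in parallel, which is why other operations drop into the same mold so easily.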

What would happen if we STARTED with a data framework with an infinite distributed data store, MapReduce built in for unstructured data analysis, and Apache SOLR as well for free-text search and structured data querying? We'd have an environment where the speed was free, and we could devote our energies to finding patterns in data. Now THAT would be magical…

Indy: That's the Ark of the Covenant.
Elsa: Are you sure?
Indy: Pretty sure.

John Repko | Comments Off |

Wednesday

May 11, 2011

Cobra strike - 100 milliseconds to understanding new architectures

Wednesday, May 11, 2011 at 5:58AM

I've written many times now about NoSQL architectures and the rise of whole new species of data stores as the software of the Facebook age. But what's going on here, really? As Ian Fleming wrote in Goldfinger:

"Once is happenstance. Twice is coincidence. The third time it's enemy action."

I've probably written a dozen versions of the same piece now on "new software architecture," so let's 1) take a look at what this is all about, and 2) see if we can see where it's all headed.

We see so many new components (Hadoop, NoSQL, Sphinx/SOLR, Node.js) with seemingly nothing to link them, other than as different exotic beasts in the new-software zoo. There are some fundamental truths behind why Mutual of Omaha's Software Kingdom is featuring them now, and Google has the answer. Not "Google the search engine" ... but Google the company.

Robin Bloor put a nice light on this back in 2009 with Why Google Won In The Search Market. In that post, Bloor might have been thinking of Google VP Marissa Mayer's famous ...Users really respond to speed... quotation when he wrote:

We can normally react to a stimulus in the 140-200 millisecond range, which is great news for cobras, because it takes a cobra about 100 milliseconds to bite. To put it another way, if a cobra is within striking range and it decides to bite you, it’s too late to stop it. If the mouse pointer moves more than 100 milliseconds after you move the mouse, it feels slow.

That brings me to the fundamental truths of modern software that link all the beasts in our zoo and point the direction ahead. They are:

In a high-resolution, handsetted, wifi'd world the distinctions blur between Enterprise software, Desktop software, and Handset software. It's all just software, delivered as a service everywhere.

If your software can't respond in about 100 milliseconds you're dead. Down to 100ms the faster cobra always wins.

If you can't make it down to 100ms, it doesn't matter how "good" your architecture is. It fails.

These three rules explain a lot about what's going on in software today. In my next postings we'll do a quick tour of the zoo with these new perspectives, and introduce a really neat package that is a harbinger of where this all is headed.

I've got 911 on speed dial. ~ Douglas Coupland

John Repko | Comments Off |

Saturday

Apr 23, 2011

Wicked Fast

Saturday, April 23, 2011 at 6:16PM

“Now, here, you see, it takes all the running you can do, to stay in the same place. If you want to get somewhere else, you must run at least twice as fast as that!” Lewis Carroll ~ Through the Looking-Glass

I really love Ruby on Rails. My biggest pet peeve with software development platforms has always been their quest for generality -- "with our program, you could build anything, from an iPhone tic-tac-toe app to systems code for the Space Shuttle!" The problem here is that nobody wants to build just "anything" -- people's needs at any given time tend to be pretty specific. A platform that claims to be good for everything is generally good for nothing. That's where Rails comes in -- web apps is all it does.

The best frameworks are in my opinion extracted, not envisioned. And the best way to extract is first to actually do. ~ David Heinemeier Hansson ~ Ruby on Rails

This is where the "PT boats to Battleships" metaphor I wrote about in my last post comes in. I believe Rails is unbeatable for web apps, as long as the definition of a web app doesn't change. It was perfect for what it did -- but do we still do that anymore?

As I mentioned last post -- the world is changing and the old patterns may not work anymore. So what do you do? Is there a "Rails" for Ajax applications between handheld devices?

There is -- or at least there's the start of a platform built around a very different set of assumptions of what Internet applications are all about. It's called Node.js, and it springs from work that Ryan Dahl first published in 2009.

Node is really interesting and it builds on a capability of its core JavaScript language that Joel Spolsky wrote about in 2006: Can Your Programming Language Do This? -- the ability to package rich objects (including inline functions) as parameters in function calls. Hmmmm ... this sounds like this could get deep and theoretical... but stay with me: here's why it matters:

In the web-pagey world, to respond to a request you compose and send a page. With simple web pages you can do this sequentially and are probably fine, and threads are there to bail you out wherever you aren't fine.
Today, with Big Data databases and media files, you might get a request and not know if that request is EVER going to complete!? Processes block, and even the fastest processor can't do much while it's just sitting and waiting.
The solution? A non-blocking architecture. It's fine to have long requests -- as long as you're not stuck waiting for them to finish.... so how do we do that?
This is the problem Dahl solves with Node.js. Node is an event-driven architecture -- when requests come in, Node processes them by attaching a callback routine to them and launching them, and then moving on to the next request.
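The steps above can be sketched in a few lines. This is a minimal illustration, assuming `setTimeout` as a stand-in for a slow database or media read; the `handle` and `runDemo` names are hypothetical and not part of Node's API.

```javascript
// Non-blocking sketch: setTimeout stands in for slow I/O. Names are hypothetical.
function handle(name, workMs, callback) {
  // Launch the work and return immediately; callback fires when it completes.
  setTimeout(function () { callback(name); }, workMs);
}

function runDemo(done) {
  var finished = [];
  var pending = 2;

  function onDone(name) {
    finished.push(name);
    pending -= 1;
    if (pending === 0) {
      done(finished);                  // report the completion order
    }
  }

  handle('chunky', 50, onDone);        // launched first, takes longer
  handle('skinny', 5, onDone);         // launched second, not stuck behind it
}

runDemo(function (order) {
  console.log('completion order:', order.join(', '));
});
```

The skinny request completes first even though it was launched second, because nothing sat waiting on the chunky one.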

This is the perfect architecture for a modern web age with a mix of skinny and chunky requests. Rather than grinding through it all in sequence, you tell each request "Here you go -- call me when you're done..." and move on to the next thing. It's a clean approach, and with modern JavaScript engines, such as Google V8 or Mozilla SpiderMonkey, this kind of approach is fast.

Wicked fast.

Node.js is tight and clean, and it's amazing what you can get done with just a little code. Like all Unix-y code since Kernighan and Ritchie, Node.js has its Hello World app:

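// Load Node's built-in HTTP module
var http = require('http');
// Create a server; the callback runs once for each incoming request
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});  // 200 OK, plain text
  res.end('Hello World\n');                            // send the body and finish
}).listen(8124);                                       // listen on port 8124
console.log('Server running at http://127.0.0.1:8124/');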

The explanation of the code is really simple -- it's just as it reads:

Create an HTTP server
Give it a callback that receives each request and its response
Write a 200 code for success with plain text
Write "Hello World" and end the response
Listen on port 8124
Tell the console that we're listening on 8124

It really is just that simple. I'm not sure how you'd make space shuttle code with it, but if you're looking for evented web apps with a tiny footprint, Node.js is it. For this blog posting I wanted to try something a bit bigger, so I put Ben Gourley's little NodeJS-driven presentation page up on the Amazon cloud:

You can access the page here: nodejs presentation page
You can download the source code here: nodejs_blog source code

The code was adapted from Ben's CLOCK BLO" site, and while the presentation isn't exactly full-featured, it has fewer than 100 lines of code invested in it, and it is...

Wicked Fast.

As you can see, the presentation is bare-bones, but it's in the best 15-minute tradition of new web platform development.

One of the beauties of Node.js with the Express package is that, despite its simplicity, it is still full Model-View-Controller, so setting up the code was easy, and laid out in a nice, clean, beautiful way.

There's a lot to write about Node.js, its package manager npm, and development packages like Express, Connect, and Websockets/socket.io, and those will come in other posts. There's a lot here -- maybe the future of the handheld, small-screened, peer-to-peer web.

It really is ... WICKED FAST!

"This is your last chance. After this, there is no turning back. You take the blue pill - the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill - you stay in Wonderland and I show you how deep the rabbit-hole goes." Morpheus ~ The Matrix

John Repko | Comments Off |

Saturday

Apr 23, 2011

Back to the Future

Saturday, April 23, 2011 at 1:24PM

Don't worry. As long as you hit that wire with the connecting hook at precisely 88mph the instant the lightning strikes the tower... everything will be fine. ~ Back to the Future (1985)

One of the great challenges of working in technology is that patterns of thinking change quickly and from time to time, no matter how wired-in you are, you discover that everything you know is wrong. Novelist William Gibson is right: the future has already arrived -- it's just not evenly distributed... When I first learned Ruby on Rails back in 2006, it struck me as a wondrous advance on the Java development I was doing. Java had bulked up as an Enterprise solution so now, 5 years later, it's little surprise that Java End of Life is something Thoughtworks worries about.

In tech we often see the tail end of Clayton Christensen's The Innovator's Dilemma. In TID, disruptive technologies catch on because whatever they lack in robust features they make up for in agility. With time, though, the PT Boats grow into Battleships, and the cycle starts anew.

There are signs that this is happening now with Internet technology -- our toolsets (like Rails) have grown so fit to the task that they seem a bit ponderous as the task shifts. With enough shift we again conclude that everything we know is wrong and the cycle starts again.

"It's not what you don't know that kills you, it's what you know for sure that ain't true." ~ Mark Twain

Here's what we know about Internet technology today:

"Computers" are how people interact with the Internet
Modern apps display web pages and submit information
Pages are served from servers (of course)
The client-server Internet model works fine

WRONG, WRONG, WRONG, and WRONG. Here's the world we've been living in for a while now:

Today there are more wireless handsets than there are people on earth
In 2011, nobody updates a whole page anymore -- Ajax rules
To paraphrase Bill Joy -- no matter where you are, most of the interesting content is somewhere else (on someone else's handset)
Pages are easier -- and if we wait maybe those pesky smartphones will just go away...

We're ready for a new programming world, and I've been investigating that new world for a while now. With my next post I'll write up what I've found. As in the sound clip below, you may not be ready for this yet -- but it'll be here soon "...and your kids are gonna love it!"

John Repko | Comments Off |