Sunday, Jul 21, 2013

Glittering data

All that is gold does not glitter,
Not all those who wander are lost; ~ J.R.R. Tolkien

Even the best of hunters finds no game in an empty field ~ I Ching

One of the wonders of data analysis is the nuggets of wisdom it can offer, and part of the thrill of Big Data is the notion that bigger claims will have more nuggets. That is often true, and I've been writing since 2009 about the different kinds of nuggets that can be found in the Gold Country of modern data analysis.

We'll take on a different task today: not that of finding nuggets, but of proving nugget-less-ness. You might think "It's impossible to prove a negative," but in analytics we do have tools that we can use to show (as Gertrude Stein said of Oakland) "There is no there there."

The magic trick we'll rely on here is a T-Distribution calculation. TDCs are terrific for analyzing small sets of data for descriptive features and equivalence, and it's not clear (in the literature I've seen, anyway -- happy to be corrected by mathier mathematicians) that their usefulness is limited to small data sets. A TDC is a simple calculation where the input of the following six data elements:

Sample Set 1

  • Number of elements in the set
  • Mean of the set
  • Standard deviation of the set

Sample Set 2

  • Number of elements in the set
  • Mean of the set
  • Standard deviation of the set

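will give us a result I'll call t crit. (Strictly speaking, this computed number is the t statistic, and textbooks reserve "critical t" for the table value it gets compared against; I'll call that table number the t value.) We can then test data set equivalence at different confidence levels: if

t crit < t value at a given confidence level

then we cannot, at that confidence level, reject the null hypothesis that the data sets are equivalent (that is, that they share the same mean). We generate t crit by the following formula: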

$$ t_{\text{crit}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2} \cdot \dfrac{n_1 + n_2}{n_1 n_2}}} $$

and we compare our t crit values against a table of t-distribution values for varying levels of confidence, such as a "Percentage Points of the t Distribution" table.
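As a minimal sketch, the formula above can be written as a small Python function. The name `t_from_summary` and its parameter names are illustrative labels, not anything from the source, and the sketch assumes the standard deviations are the divide-by-n kind, which is one reading of the n·s² terms in the formula.

```python
import math

def t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, degrees_of_freedom) from the six summary inputs.

    Follows the formula in the text: pooled variance built from n * s^2
    terms (an assumption that s is the divide-by-n standard deviation).
    """
    dof = n1 + n2 - 2
    if dof <= 0:
        raise ValueError("need more than two observations in total")
    pooled_var = (n1 * sd1**2 + n2 * sd2**2) / dof
    std_err = math.sqrt(pooled_var * (n1 + n2) / (n1 * n2))
    return (mean1 - mean2) / std_err, dof
```

The degrees of freedom come back alongside t because that is the row you need to look up in the table.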

Let's try this out with a simple example, taken from Texas Instruments' classic Sourcebook for Programmable Calculators:

Two classes are taking Differential Equations. The first class has 12 members, with a mean of 87 and a standard deviation of 3.6. The second class has 14 members, a mean of 85 and a standard deviation of 3.25. Can we say, with 80% confidence, that there's a statistical difference in the results for the two classes? How about 95%? 99%? 99.9%?

Here's what we find:

[Figure: T-Distribution Results]

Here our two sets of test results, with class averages of 87 and 85 respectively, show a statistical difference only at a comparatively forgiving 80% confidence level, and show no statistical difference at the more restrictive 90, 95, 99 and 99.9% confidence levels.
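For readers who want to reproduce these results, here is a rough Python sketch of the two-class example. It assumes a two-tailed test, which is the reading that matches the results described above, and the table values are the two-tailed entries for 24 degrees of freedom copied from a standard t table. The names `pooled_t`, `TABLE_T_DF24` and `verdicts` are illustrative only.

```python
import math

# Class results from the example: (size, mean, standard deviation)
class_1 = (12, 87.0, 3.6)
class_2 = (14, 85.0, 3.25)

# Two-tailed table values for 24 degrees of freedom (12 + 14 - 2),
# copied from a standard t-distribution table (assumed, not computed here).
TABLE_T_DF24 = {"80%": 1.318, "90%": 1.711, "95%": 2.064,
                "99%": 2.797, "99.9%": 3.745}

def pooled_t(a, b):
    """Computed t for two (n, mean, sd) summaries, per the formula in the text."""
    (n1, m1, s1), (n2, m2, s2) = a, b
    pooled_var = (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (n1 + n2) / (n1 * n2))

t = pooled_t(class_1, class_2)

# A difference is flagged only when the computed t exceeds the table value.
verdicts = {level: abs(t) > cutoff for level, cutoff in TABLE_T_DF24.items()}

print(f"computed t = {t:.3f}")
for level, differs in verdicts.items():
    print(f"{level:>6}: {'different' if differs else 'no difference shown'}")
```

The computed t comes out at about 1.43, which clears only the 80% table value of 1.318.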

These results are interesting because modern data analytics presents us with wide varieties of data sets, and sometimes little judgment goes into assessing just how nuggety those data sets can be. In our example data set here, even if the second class average were to fall a full standard deviation below the first class's, coming in at a paltry 83.4%, we could not establish a statistical difference at 99% confidence. With larger data sets our degrees of freedom rise and the table t values we compare against fall, and the standard error in the denominator will also generally fall as our samples grow.
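To check the 83.4 scenario, here is the arithmetic worked through with the formula given earlier. It assumes the second class keeps its size of 14 and its standard deviation of 3.25 and only its mean drops (the text does not spell this out), and it uses two-tailed table values for 24 degrees of freedom from a standard t table.

```latex
t_{\text{crit}}
  = \frac{87 - 83.4}{\sqrt{\dfrac{12\,(3.6)^2 + 14\,(3.25)^2}{12 + 14 - 2} \cdot \dfrac{12 + 14}{12 \cdot 14}}}
  = \frac{3.6}{\sqrt{\dfrac{303.395}{24} \cdot \dfrac{26}{168}}}
  \approx \frac{3.6}{1.3987}
  \approx 2.57
```

Since 2.57 sits above the 95% table value of 2.064 but below the 99% value of 2.797, the difference would show up at 95% confidence and not at 99%, under those assumptions.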

It's a rich world out there, but watch your data and never forget the Spanish proverb: "No es oro todo lo que brilla..." (all that glitters is not gold).