“Big Filter”: Intelligence, Analytics and why all the hype about Big Data is focused on the wrong thing

These days, it seems like the tech set, the VC set, Wall Street and even the government can’t shut up about “Big Data”.  An almost meaningless buzzword, “Big Data” is the catch-all used to try to capture the notion of the truly incomprehensible volumes of information now being generated by everything from social media users – half a billion Tweets, a billion Facebook activities, 8 years of video uploaded to YouTube… per day?! – to Internet-connected sensors of endless types, from seismographs to traffic cams.  (As an aside, for many more, often mind-blowing, statistics on the relatively minor portion of data generation that is accounted for by humans and social media, check out these two treasure troves of statistics on Cara Pring’s “Social Skinny” blog.)

http://thesocialskinny.com/216-social-media-and-internet-statistics-september-2012/

http://thesocialskinny.com/100-more-social-media-statistics-for-2012/

In my work (and occasionally from baffled relatives) I now fairly regularly get asked, “so, what’s all this ‘big data’ stuff about?”  I actually think this is the wrong question.

The idea that there would be lots and lots of machines generating lots and lots… and lots… of data was foreseen long before we mere mortals thought about it.  I mean, the dork set was worrying about IPv4 address exhaustion in the late 1980s.  This was back when AOL was still called “Quantum Computer Services” and made money by helping people connect their Commodore 64s to the Internet.  Seriously – while most of us were still saying “what’s an Internet?” and the nerdy kids at school were going crazy because, in roughly 4 hours, you could download and view the equivalent of a single page of Playboy, there were people already losing sleep over the notion that the Internet was going to run out of its roughly four-and-a-half billion IP addresses.  My point is, you didn’t have to be Ray Kurzweil to see there would be more and more machines generating more and more data.
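For the curious, the arithmetic behind that worry fits in a few lines of Python.  This is strictly a back-of-the-envelope sketch; the starting device count and doubling rate are made-up numbers for illustration, not anyone’s actual projection.

```python
# The IPv4 address space is 32 bits wide, so there are at most
# 2**32 distinct addresses (fewer in practice, since large chunks
# are reserved for private networks and special uses).
total_ipv4 = 2 ** 32
print(f"{total_ipv4:,} addresses")  # 4,294,967,296

# Illustrative only: if connected devices start at one million and
# double every two years, the pool runs dry within a few decades.
devices, year = 1_000_000, 1990  # assumed starting point, purely hypothetical
while devices < total_ipv4:
    devices *= 2
    year += 2
print(f"Hypothetical exhaustion around {year}")
```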

What I think is important is that more and more data serves no purpose without a way to make sense of it.  Otherwise, more data just adds to the problem of “we have all this data, and no usable information.”  Despite all the sound and fury lately about Edward Snowden and the NSA, including my own somewhat bemused comments on the topic, the seemingly omnipotent NSA is actually both the textbook example and the textbook victim of this problem.

It seems fairly well understood now that the NSA collects truly ungodly amounts of data, but it still struggles to make sense of it.  Our government excels at building ever more vast, capable and expensive collection systems, which only accentuates what I call the “September 12th problem.”  (Just Google “NSA, FBI, al-Mihdhar and al-Hazmi” if you want to learn more.)  We had all the data we ever needed to catch these guys; we just couldn’t see it in the zettabytes of other data with which it was mixed.  On September 12th it was “obvious” we should have caught them, and Congress predictably (and in my opinion unfairly) took the spook set out to the woodshed, perched on the high horse of hindsight.

What they failed to acknowledge was that the fact we had collected the necessary data was irrelevant.  NSA collects so much data they have to build their new processing and storage facilities in the desert, because there isn’t enough space or power left in the state of Maryland to support it.  (A million square feet of space, 65 megawatts of power consumption, nearly two million gallons of water a day just to keep the machines cool?  That is BIG data, my friends.)  And yet, what is (at least in the circles I run in) one of the most poignant bits of apocrypha?  The senior intelligence official’s lament: “Don’t give me another bit, give me another analyst.”

It is this problem that has made “data scientist” the hottest job title in the universe, and made the founders of Splunk, Palantir and a host of other analytical tool companies a great deal of money.  In the end, I believe we need to focus not just on rule-based systems, or cool visualizations, or fancy algorithms from Israeli and Russian Ph.D.s.  We have to focus on technologies that let people – people who know what they’re doing on a given topic – inform those systems so they can scale up to the volumes of data we now have to deal with.  We need to teach the machines to think like us, at least about the specific problem at hand.  Full disclosure: working on exactly this kind of technology is what I do in my day job, but just because my view is parochial doesn’t make it wrong.  The need for human-like processing of data based on expertise, not just rules, was poignantly illustrated by Malcolm Gladwell’s classic piece on mysteries and puzzles.
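To make that less abstract, here is a minimal sketch in Python of what “encapsulating expertise” might look like: an expert encodes what “relevant” means for one narrow topic, and the machine applies that judgment across more records than any team of analysts could ever read.  The field names, weights and threshold are all hypothetical, invented purely for illustration; this is not a description of any real product (mine included).

```python
def expert_score(record: dict) -> float:
    """Toy stand-in for encoded analyst expertise: weight the signals
    a (hypothetical) expert says matter for this one topic."""
    score = 0.0
    if record.get("source") in {"field_report", "watchlist_hit"}:
        score += 0.5
    score += 0.1 * len(record.get("cross_references", []))
    if record.get("confirmed_identity"):
        score += 0.4
    return score

def big_filter(stream, threshold=0.7):
    """Keep only the records worth a human analyst's time."""
    return (r for r in stream if expert_score(r) >= threshold)

# Usage: millions of records in, a handful of leads out.
records = [
    {"source": "social_media", "cross_references": []},
    {"source": "watchlist_hit", "cross_references": ["r1", "r2"],
     "confirmed_identity": True},
]
for lead in big_filter(records):
    print(lead)  # only the second record survives the filter
```

The point of the sketch isn’t the scoring rules themselves, which a real system would learn or elicit far more carefully; it’s that the expensive, scarce resource (expert judgment) gets captured once and then applied at machine scale.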

The upshot of that fascinating piece (do read it, it’s outstanding) was in part this: Jeffrey Skilling, the now-imprisoned CEO of Enron, proclaimed to the end that he was innocent of lying to investors.  I’m not a lawyer, and certainly the company did things I think were horrible, unethical, financially outrageous and predictably self-destructive – but that last point is the key.  They were predictably self-destructive because, despite reports to the contrary, Enron didn’t hide the evidence of what it was doing.  As Gladwell explains in his closing shot, for the exceedingly rare few willing to wade through hundreds or thousands of pages of incomprehensible Wall Street speak, all the signs that Enron was a house of cards – if not the out-and-out evidence – were there for anyone to see.

Jonathan Weil of the Wall Street Journal wrote the September 2000 article that got the proverbial rock rolling down the mountain, but long before that, a group of Cornell MBA students sliced and diced Enron as a school project and found it was a disaster waiting to happen.  Not the titans of Wall Street – six B-school students with a full course load.  (If you’re really interested, you can still find the paper online 15 years later.)  My point is this: the data were all there.  In a world awash in “Big Data”, the mere collection of information will have ever-declining value.  Cutting through the noise, filtering it all down to the bits that matter to your topic of choice, whether earthquake sensors, diabetes data or intelligence on terrorist cells – that is where the value, the need and the benefits to the world will lie.

Screw “Big Data”, I want to be in the “Big Filter” business.
