“Big Filter”: Intelligence, Analytics and why all the hype about Big Data is focused on the wrong thing

These days, it seems like the tech set, the VC set, Wall Street and even the government can’t shut up about “Big Data”.  An almost meaningless buzzword, “Big Data” is the catch-all used to try to capture the notion of the truly incomprehensible volumes of information now being generated by everything from social media users – half a billion Tweets, a billion Facebook activities, 8 years of video uploaded to YouTube… per day?! – to Internet-connected sensors of endless types, from seismographs to traffic cams.  (As an aside, for many more, often mind-blowing, statistics on the relatively minor portion of data generation that is accounted for by humans and social media, check out these two treasure troves of statistics on Cara Pring’s “Social Skinny” blog.)

http://thesocialskinny.com/216-social-media-and-internet-statistics-september-2012/

http://thesocialskinny.com/100-more-social-media-statistics-for-2012/

In my work (and occasionally by baffled relatives) I am now fairly regularly asked “so, what’s all this ‘big data’ stuff about?”  I actually think this is the wrong question.

The idea that there would be lots and lots of machines generating lots and lots… and lots… of data was foreseen long before we mere mortals thought about it.  I mean, the dork set was worrying about IPv4 address exhaustion in the late 1980s.  This is when AOL dial-up was still marketed as “Quantum Internet Services” and made money by helping people connect their Commodore 64s to the Internet.  Seriously – while most of us were still saying “what’s an Internet?” and the nerdy kids at school were going crazy because, in roughly 4 hours, you could download and view the equivalent of a single page of Playboy, there were people already losing sleep over the notion that the Internet was going to run out of its roughly four-and-a-half billion IP addresses.   My point is, you didn’t have to be Ray Kurzweil to see there would be more and more machines generating more and more data.

What I think is important is that more and more data serves no purpose without a way to make sense of it.  Otherwise, more data just adds to the problem of “we have all this data, and no usable information.” Despite all the sound and fury lately about Edward Snowden and NSA, including my own somewhat bemused comments on the topic, the seemingly omnipotent NSA is actually both the textbook example and the textbook victim of this problem.

It seems fairly well understood now that they collect truly ungodly amounts of data.  But they still struggle to make sense of it.  Our government excels at building ever more vast, capable and expensive collection systems.  Which only accentuates what I call the “September 12th problem.”  (Just Google “NSA, FBI al-Mihdhar and al-Hazmi” if you want to learn more.)  We had all the data we ever needed to catch these guys.  We just couldn’t see it in the zettabytes of other data with which it was mixed.  On September twelfth it was “obvious” we should have caught these guys, and Congress predictably (and in my opinion unfairly) took the spook set out to the woodshed, perched on the high horse of hindsight.

What they failed to acknowledge was that the fact we had collected the necessary data was irrelevant.  NSA collects so much data they have to build their new processing and storage facilities in the desert because there isn’t enough space or power left in the state of Maryland to support it.  (A million square feet of space, 65 megawatts of power consumption, nearly two million gallons of water a day just to keep the machines cool?  That is BIG data my friends.)  And yet, what is (at least in the circles I run in) one of the most poignant bits of apocrypha about the senior intelligence official’s lament?  “Don’t give me another bit, give me another analyst.”

It is this problem that has made “data scientist” the hottest job title in the universe, and made the founders of Splunk, Palantir and a host of other analytical tool companies a great deal of money.  In the end, I believe we need to focus not just on rule-based systems, or cool visualizations, or fancy algorithms from Israeli and Russian Ph.D.s.  We have to focus on technologies that can encapsulate how people, people who know what they’re doing on a given topic, can inform those systems to scale up to the volumes of data we now have to deal with.  We need to teach the machines to think like us, at least about the specific problem at hand.  Full disclosure: working on exactly this kind of technology is what I do in my day job, but just because my view is parochial doesn’t make it wrong.  The need for human-like processing of data based on expertise, not just rules, was poignantly illustrated by Malcolm Gladwell’s classic piece on mysteries and puzzles.

The upshot of that fascinating post (do read it, it’s outstanding) was in part this.  Jeffrey Skilling, the now-imprisoned CEO of Enron, proclaimed to the end he was innocent of lying to investors. I’m not a lawyer, and certainly the company did things I think were horrible, unethical, financially outrageous and predictably self-destructive, but that last is the point.  They were predictably self-destructive, predictable because, whatever else, Enron didn’t, despite reports to the contrary, hide the evidence of what they were doing. As Gladwell explains in his closing shot, for the exceedingly rare few willing to wade through hundreds or thousands of pages of incomprehensible Wall Street speak, all the signs, if not the out-and-out evidence, that Enron was a house of cards, were there for anyone to see.

Jonathan Weil of the Wall Street Journal wrote the September 2000 article that got the proverbial rock rolling down the mountain, but long before that, a group of Cornell MBA students sliced and diced Enron as a school project and found it was a disaster waiting to happen.  Not the titans of Wall Street: six B-school students with a full course load. (If you’re really interested, you can still find the paper online 15 years later.)    My point is this – the data were all there. In a world awash in “Big Data”, collection of information will have ever-declining value.  Cutting through the noise, filtering it all down to the bits that matter to your topic of choice, from earthquake sensors to diabetes data to intelligence on terrorist cells: that is where the value, the need and the benefits to the world will lie.

Screw “Big Data”, I want to be in the “Big Filter” business.

The Goal, Finding Ultra, and The Agile Manifesto, or “What does running 52 miles have to do with writing good code?”

So, in the mental Mulligan Stew that is my brain, I find odd patterns and connections emerging, or re-emerging, often out of whatever happens to be on my Reading List at the time.  This morning was a perfect example of this happening, and (if you can tough it out for the three minutes to the end of this post) I think there’s something useful in it, at least if you’re part of the nerd herd (yes, Jeanne, this one’s for you. 🙂 )

I was meeting with a colleague this morning and we were discussing one of the challenges organizations can face moving product/development teams to Scrum, a flavor of Agile Development.  The topic we were discussing was both the personal bias among some developers for, and the business or upper-management pressures to fall back on, short-term, informal or “hackish” solutions to problems when something just needs to get done and get into production.

A casual reader might even think that this might make sense.  Isn’t Agile after all, supposed to be, well, agile?  Get something out, test it, get feedback, fix it later as needs be?  Kind of all “Lean Startup“-y?

I’m still relatively new to Scrum myself.  I am a CSPO, but this is still my first year leading a Scrum product development initiative.  Even so, I can already say that I believe this casual read would be wrong.  One of the central tenets of Scrum and Agile is test-driven development, or, if you prefer to think of it in terms of the Lean Manufacturing process (from which the Agile disciplines were derived), “designing in quality from the get-go”.  In other words, yes, the principles (see the Principles document accompanying The Agile Manifesto) strive to be responsive, get stuff out the door, and iterate quickly.  However, whatever does go out the door is meant to be fully tested, production-ready and of high quality, even if it is very small in scope.
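For the nerd herd, here is a toy illustration of “small in scope, but fully tested” – a minimal Python sketch with a hypothetical function and its test (run it with pytest, for example); it isn’t from any real codebase, it’s just the shape of the idea:

# A deliberately tiny feature, shipped with its test from day one.
# The names here (parse_price, test_parse_price) are hypothetical,
# not from any real project.

def parse_price(text: str) -> float:
    """Convert a user-entered price like '$1,234.50' into a float."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


def test_parse_price():
    assert parse_price("$1,234.50") == 1234.50
    assert parse_price("  99 ") == 99.0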

You can test out a concept car with no power windows, radio or A/C and painted in primer, and people can still love the styling, fuel economy and future vision that’s rough and unfinished.  But if you put out a jalopy that can’t be trusted not to fall apart or crash, you’ll never get another fair shot at those early reviewers.  Rough is ok.  Even incomplete is ok.  Dangerous or unreliable, that’s not ok.

So, what’s wrong with a short-term hack that you know won’t hold up for the long term, or under heavy load, or whatever the future brings, if that hack buys you some time now or gets management off your back?  The problem, in my opinion, with kicking the can down the road is that it so often makes the eventual solution more expensive; sometimes – given the law of unintended consequences – vastly more so.  The actual comment my friend made this morning was along these lines: in this scenario, which happens all the time in the real world, “the team that takes the shortcut ends up saving half the time now, but spending ten times the effort when they’re all done.”

So, they cut today’s cost by 50%, and raise the total cost by 500%.  In some cases, and this is reality unfortunately, the fast fix is a source of praise or recognition, while the long-term impact is buried in later, routine work.  The result is that an organization can actually encourage the bad behavior that carries an eventual 10x cost.  I don’t have a calculator handy, but I’m pretty sure that’s a bad deal.  What really tickled my brain was what my colleague said next, which was roughly this: “Somehow I think some development teams lose sight of the actual goal.  In their effort to go faster, they end up actually slowing themselves down.”
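To put my colleague’s comment in back-of-the-envelope form (the numbers below are purely illustrative, not measurements from any real project):

proper_total = 10.0                  # effort units if you just do it right up front
shortcut_now = 0.5 * proper_total    # "saving half the time now"
shortcut_total = 10 * proper_total   # "...spending ten times the effort when they're all done"

print(f"Effort saved today: {proper_total - shortcut_now} units")                        # 5.0
print(f"Extra effort over the life of the work: {shortcut_total - proper_total} units")  # 90.0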

It was this particular phrasing that caused the asteroid collision of two books in my head.  I just finished “Finding Ultra” by Rich Roll, overweight-middle-aged-lawyer-turned-extreme-endurance-athlete [you should click that one – you gotta see the pictures].  Early in the book, Rich describes the first prescription he received from his coach, when he decided (with no real experience whatsoever) that he was going to become an Ultraman.  One of the first rules his coach imposed was that he had to learn and understand where his aerobic/anaerobic threshold was, and change his habits to manage his metabolism around this breakpoint.  He had not been moving at a steady and sustainable pace, and the pace he was told to adopt (once he switched to it) initially felt painfully slow.  This change, he was instructed, was necessary because without it he would burn out too fast and slow his later progress, or cause physical problems that would interrupt or end a long event.

In other words, until he changed how he approached each element or sub-part of the race, the faster he ran, the slower he finished.

Back in school, I read The Goal by Eli Goldratt.  In this fictional tale, a factory manager (and his Socratic mentor) work to understand and fix the problems in a production plant plagued by delays, high costs and poor output.  Everything from his marital life to a scene involving a marching Cub Scout troop eventually reveals the underlying principles that help solve the problem. (If you’re interested in production operations or business at all, this book remains a quick and relevant read.)  While there are a number of more detailed lessons on Operations Management to be found there, I remember discussing the “big takeaway” with Ricardo Ernst, my ops professor at Georgetown and one of the funniest, smartest and most valuable teachers it has been my honor to study with.  The bullet-point version was this:

  • If you have a guy putting 10 wheels an hour on cars, and you provide the right incentives to make it 11, he will.
  • If you have another guy putting on 14 hoods an hour and you provide the right incentives to make it 16, he will.
  • Do this all down the line, and what you have is a crew of “top performers”, every one of them beating their quotas and earning bonuses… and a factory that’s going to be shut down because everything is going wrong.

Huh?

The system can’t run any faster than its slowest step, and if you incent only speed, quality will suffer besides.  So what happens? Raw unit throughput is constrained by the slowest part of the process (say, the wheel guy), rework costs balloon (because quality inevitably falls), inventory expense explodes (because of all the half-finished cars piling up before the wheel station), and finished-product output craters. All the while, your individual performers are each beating their quotas and earning bonuses, while the business loses its shirt.

Oops.
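If you like seeing the arithmetic, here’s a toy version of that factory in a few lines of Python. The hourly rates are made up, but the punchline isn’t:

# Each station has an hourly rate; the line can only move as fast as the slowest one,
# and everything the faster stations build beyond that just piles up as inventory.
rates = {"hoods": 16, "wheels": 11, "paint": 14}   # illustrative units per hour
hours = 8

line_rate = min(rates.values())                      # the system's real speed
finished = line_rate * hours
wip_piled_up = (rates["hoods"] - line_rate) * hours  # half-built cars stacking up before the wheel station

print(f"Cars finished in a shift: {finished}")                        # 88
print(f"Half-built cars piled up at the bottleneck: {wip_piled_up}")  # 40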

What’s the point?  Well, here’s the (possibly?) useful thought I’m hoping came out of the mental Mulligan Stew.  Whether the Goal-with-a-capital-G (hey, there’s a reason he titled the book that way) is cars produced, the finishing time in a 320-mile race, or, where this all started, writing good software, when you focus on local rather than global optima, what you get is counter-productivity.  Maybe that tortoise was on to something…

Nate Silver, Fox News, and the Gutenberg Effect, or “How a World Awash in Data Explains GOP Befuddlement on Nov 7th”

Plenty of ink has already been spilled over, at and about Nate Silver and the 538 Blog this election cycle, and even now that the election is over, there are still some folks who deny his math and/or claim that the problem was Hurricane Sandy, Chris Christie, or that the Obama campaign “stole the election” or “suppressed the vote”.

What in the world does any of this have to do with the (somewhat intermittent) “Digital Water” meme I’m supposed to be so focused on and my obsession with how people will, and do, react to a world ever-more awash in data?

What was interesting to me as an analysis guy, and appalling to me as a data head and independent voter,  was watching the comments and criticisms of Silver’s 538 Blog before the election.  The astonishing litany of rationales assembled by Fox et al for why Silver was wrong, and just how wrong he was, defied both advanced statistics of the type in which Silver is an expert and the common sense in which we mere mortals are more versed.  While he admits to being an Obama supporter, he’s first and foremost a statistician and forecaster dedicated to understanding the science of accurate predictions.  Yet there were volumes written on critiques of his methodology, his assumptions, his math skills, and probably far more personal attacks on blogs I don’t read.

Nevertheless, Silver has now shown, over two elections in a row and 99 out of 100 states called correctly, that a deep understanding of polls and statistics, and a respect for math and facts, cannot be undone by all the denials (google “Karl Rove + election night + meltdown”) and logical contortions (see “Dick Morris + prediction + landslide”) that kept the conservative faithful engaged, entertained and, ultimately, completely unprepared for Election Day.

In the inevitable party navel-gazing that follows an election-year blowout, two questions have been haunting the conservative rank-and-file.  The first is the obvious “how could America have voted for this guy again?”  This is basically a partisan and political discussion of little interest to me, at least in this context.

More salient to this discussion is “How did we get it THAT wrong?”  This has mostly been addressed in the press by dissecting the exit polls, and by talking about changing demographics, Hispanic turnout and the fallout among sensible centrists like me from Republican candidates who don’t believe in eighth-grade biology or a planet much older than Hal Holbrook.  (While much ignored nationally compared to Todd Akin, this last one, an unopposed Congressman who believes Earth is 9,000 years old, that evolution is a lie created by Satan himself and – most insultingly – who also sits on the House Science Committee, is exactly the kind of story that sends sane moderates like me running into the arms of an otherwise completely beatable incumbent.  God bless Bobby Jindal and his “we have just GOT to stop saying stupid shit” speech.)

Is that what really happened?  I think there’s more going on here, and my answer is two parts.  The first comes from Silver, not in his blog, but in his book, The Signal and The Noise.  I was listening to it on audio CD in my car this week and had to back it up and listen to it three times.  Silver was speaking about the changes that came after Gutenberg’s invention of the printing press, but the same is even more relevant to the “Digital Water” phenomenon, where the world is awash not only in objective and numerical data but the self-published content of every opinion, theory and form of intellectual quackery imaginable.   He explained what I am calling here the “Gutenberg Effect” as follows:

“Paradoxically, the result of having so much more shared knowledge was increasing isolation…  The instinctual shortcut that we take when we have too much information is to engage with it selectively, picking out the parts we like and ignoring the remainder, making allies with those who have made the same choices, and enemies of the rest.”

Put into the context of the 2012 election cycle, I think what went wrong was the intellectual and media isolation that many partisans, but particularly those on the right, increasingly engaged in.  The so-called echo chamber, in which attitudes and platitudes of an openly partisan nature ricochet and amplify through the canyons of Fox News, RedState.com and Rush Limbaugh’s radio show (or, if you prefer, MSNBC, the Daily Kos and the Rachel Maddow Show), increasingly discounts or vilifies any opinion or person with an alternate view.

Many or even most of the criticisms, however, are ideological, personal, unsubstantiated and/or filled with logical fallacies and downright absurdity, but not facts, and not math.  And here is where the world awash in data rears its head in Election 2012.  The Gutenberg Effect that Silver describes appears to have caused the Republican Party to drink so much of its own pre-filtered Kool-Aid that a “shellshocked” Mitt Romney seems to have been telling the truth when he told reporters early on November 6th that his staff hadn’t even written a concession speech.

Despite the fact that (as Silver’s blog highlights) an objective read of the numbers showed Romney would have to essentially run the table on the swing states and catch every break to win, the Romney campaign – and millions of hardworking and genuinely dedicated supporters – quite literally couldn’t believe it when he, conclusively and resoundingly, lost.

If the first thing that happened was this Gutenberg Effect, an ideologically aligned group of people taking stock of data selectively to support their pre-established beliefs, I believe the second was a staggering act of exploitation by the very purveyors of that selectively-chosen information.  Check out the video below starting at 5:01, an exchange between David Frum and Joe Scarborough, two guys I don’t always agree with but who I think generally put “smart”, “factual”, and “conservative” rightly back together in one sentence.

To quote Frum, “…the real locus of the problem is the Republican activist base, and the Republican donor base. They went apocalyptic over the past four years, and that was exploited by a lot of people in the conservative world.  I won’t soon forget the lupine smile that played over the head of one major conservative institution when he told me that ‘our donors think the apocalypse has arrived‘. Republicans have been fleeced, exploited and lied to by a conservative entertainment complex.”

Taken together, I believe these two factors explain the root cause of the completely dumbfounded Republican reaction on November 7th, and they also point to a much truer understanding of on-the-ground election realities for any national campaign going forward.  A clear-eyed view of the state of the race should start with three things:

1.  Understand the Gutenberg Effect and realize the election-strategy dangers of an intentionally selective (and ideologically tilted) filter when viewing an over-abundance of opinions, polls and data;

2.  Acknowledge that the media makes far more money if they denigrate the opposition and radicalize and rile up the faithful than if they help their chosen team actually win elections; and

3.  Take these facts together and strive for the most objective, fact-based view possible of polls, voters, the economy and the country over the coming election cycle, and make sure you listen to, and account (literally) for the views, numbers and opinions presented by the people who most disagree with you.

While I think the right currently has a larger problem than the left in this area, at least for now (i.e. they are often a party whose candidates lose swing votes like mine when they not only ignore but vilify math, science, and objective, rigorous analysis), the lesson for all sides is, I believe, to separate your opinions from the data.  Stop attacking people like Nate Silver, and perhaps start reading his book instead.

Rebranding Day!

I haven’t been writing much, but the notion of “a world awash in data” is appearing in ever-more contexts.  Thus, I decided the Digital Water concept is actually a pretty good framework for most of the things that interest me, so I’m just taking what started as one post (then a bunch) and moving it to a kind of running thread through the conversation.  (Also, “ericolsonblog.wordpress.com” was annoying even to me.)

SO… welcome to DigitalWaterBlog.com and thanks for sticking around.  I’m relaunching the site as of today.  There’ll be much more frequent updates, a Twitter Feed (@DigitalH20) and hopefully a lot more useful stuff.  I’ve also been doing a bit more work with or about Law Enforcement lately, so there will be some interesting topics coming up regarding mining open source data, investigations vs. privacy, tech for LE and so on.

And the Chinese, the endless train of mass data breaches and, of course, extremely stupid people on Facebook will all keep the stream of topics far outpacing the time to talk about them all.

 

Thanks everybody!

 

Why Google is great, but not a complete solution, for Intel and Law Enforcement. (Part I)

I read Nick Selby’s piece on Police Led Intelligence this morning talking about more effective use of online search engines for police officers. Nick’s right – many in the Law Enforcement and Intelligence communities can do even more than they are by learning more about how search engines work, but there’s a second part to that story.

This has been on my mind a lot, especially since my company was acquired by a defense firm and I’ve been spending a lot more time with intelligence, law enforcement and other folks working in public health or safety.

Let me preface this by saying I’m not knocking the traditional search portals; they are extremely useful and powerful.

However, they do suffer from built-in biases, blind spots and restrictions that many analysts and law enforcement officers aren’t even aware are affecting them when they use a search engine.

By all means, use these fantastic tools to the fullest possible extent (see Nick’s comments and links to Google-hacking Jedi Master Johnny Long’s book and presentation).

Just understand that what they offer is not a search of “the Internet” or “the Web”; they offer a search of that portion of the online content world that is in their index, and then give you an even smaller slice of that.

With Johnny’s expansive help, and perhaps one or two easy tricks from yours truly, you can get to a decent portion of whatever they have. Just understand that knowing what they have, what you can get to of what they have, and what they don’t have at all is an important part of using them effectively.

The first three problems with traditional search engines

There is a wide range of built-in biases, problems and blind spots in the traditional search engines of the world, and understanding them will make you a better user and consumer of the great things they CAN do for you.

Problem 1 – They only find things that don’t mind being found

Whether it’s a full Web site, a specific user’s blog on WordPress or a subset of pages within a much larger set of content, it is usually a 10-second exercise to hide content from search engines.

Search engines harvest pages by “crawling”, that is downloading a page from an address (i.e. a full path URL, e.g. http://host.domain.com/page.html), finding links on the page and “crawling” or following the link by requesting the linked page, finding more links on that page and sequentially requesting and indexing those pages ad infinitum.
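For the technically inclined, the loop described above fits in a page of Python using nothing but the standard library. This is a bare-bones sketch (real crawlers handle politeness, errors, deduplication at scale and a hundred other things), and the seed URL is whatever you choose to feed it:

# A minimal sketch of the crawl loop: fetch a page, pull out its links,
# follow them, repeat. Standard library only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    seen, queue = set(), deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # skip anything we can't fetch
        parser = LinkExtractor()
        parser.feed(html)
        # Follow every link we find, ad infinitum (or until max_pages).
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen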

The problem? On arrival at a site, the first thing a crawler or spider will often request is a file called “robots.txt”, which essentially tells automated requesters (i.e. not people) which links they may follow and which pages, directories or entire domains they may or may not help themselves to. To permanently say “Google, Yahoo et al, you are not welcome to see anything on my entire Web site” requires the extremely sophisticated programming below:

User-agent: *
Disallow: /

In people speak, this says, “If you’re a robot (crawler, spider, automated agent etc.) not a human’s browser program, screw off.”

That’s it. Terrorists, child pornographers, anti-government radical or hate groups who want to hide from easy detection need less than 25 characters to ensure that an FBI or ATF agent can Google ’til the second coming, and their Web site will never show up in a traditional search engine.
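You can see this from the crawler’s side with Python’s built-in robots.txt parser. A quick sketch, with a placeholder domain standing in for whatever site you’re curious about:

# How a well-behaved crawler decides whether it is welcome.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")   # placeholder domain
rp.read()

# If the file contains "User-agent: *" / "Disallow: /", both of these print False.
print(rp.can_fetch("Googlebot", "https://example.com/any/page.html"))
print(rp.can_fetch("*", "https://example.com/"))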

On pre-built blogging platforms like WordPress, where the user is assumed to have NO technical or programming knowledge at all, it’s even easier. You check a box when you sign up that says “Do you want your posts to show up in search engines?” Check No, and you’re hidden from GoogleBot and its Bing-y, Yahoo!-y etc. cousins.

Yay for stupid criminals who don’t know this, and by all means let’s use Johnny-Fu to find what IS in those search engines’ indexes. Just understand that what Google or Bing has and what’s actually out there are not synonymous.

Problem 2 – We have nine BILLION results for your query…

“…but we’ll only show you 658 of them.”

Here’s another little-known and ill-understood factoid/problem. (By the way, to make this problem easy to see, I recommend you go to preferences or settings on Google and set your account to “show 100 results per page”.)

Type a query into Google that might have lots of results, e.g. Osama Bin Laden. Google will say something insane like “We have 44 million results for your query.”

Great. Insane as it sounds, suppose I had a room full of analysts and I actually wanted them to spend the next five years eyeballing every one of those results. Go to the bottom of the page where you see all the “O’s” in Goooooogle representing the next ten pages of results (at 100 per page). Click next three or four times….

Did you see ten pages of “next” shrink to 9 or 8 or 7? Click it some more. Without fail, around page 7 or 8, they cut you off. So Google has 44 million pages about Bin Laden. You can actually have about 658 of them.

Try it.
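If you’d rather watch the cutoff happen programmatically, here’s a rough sketch against Google’s Custom Search JSON API. You’ll need your own API key and search engine ID, and note that, as far as I know, the API cuts you off even sooner than the web interface does, at roughly the first hundred results:

# Compare the "claimed" result count with what you can actually page through.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_KEY = "YOUR_KEY"     # placeholders: get these from Google's developer console
ENGINE_ID = "YOUR_CX"

def count_retrievable(query):
    retrieved, start, claimed_total = 0, 1, None
    while True:
        params = urlencode({"key": API_KEY, "cx": ENGINE_ID, "q": query,
                            "num": 10, "start": start})
        try:
            data = json.load(urlopen("https://www.googleapis.com/customsearch/v1?" + params))
        except Exception:
            break                         # the API errors out once you page past its cap
        if claimed_total is None:
            claimed_total = data.get("searchInformation", {}).get("totalResults")
        items = data.get("items", [])
        if not items:
            break
        retrieved += len(items)
        start += 10
    return claimed_total, retrieved

# print(count_retrievable("Osama Bin Laden"))   # claimed total vs. what you can actually page through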

Problem 3 – Here are those 658 results….

Here’s the third of today’s often-unknown problems with search engines. By the way, there are six or seven more, but I’m already getting long-winded here.

The brilliant original insight that made Google a zillion-dollar success was a notion called PageRank (actually named for Larry Page, one of its four authors, and not for “Web Page” as many people think).

PageRank essentially codifies the notion of Vox Populi, Vox Dei. See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” if you feel like seeing where 100 billion dollars started.

Let me save you reading a Stanford Ph.D.’s worth of math in seven words:

“That which is most popular is best.”

Google, not you, decides which 658 of the 44,000,000 results you get, and it does so by ranking them using today’s version of PageRank.

PageRank, for all of its enormous evolution and complexity since the mid-90s, still says basically, the most popular answer is the best answer.

If you are looking for who actually has those paparazzi-snapped photos of Britney Spears in a bikini, you can be pretty sure that the page with the best, highest-resolution versions of those pics will have the most inbound links, comments etc.

If what you’re looking for is likely to be unloved, unpopular, private or hidden in a dark digital corner of the Web, you’d best hope it has some extremely unusual or deterministic keyword in it if you want it to appear in the top slice of results for your query.
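For the curious, the original idea really does fit in a few lines. Here’s a toy power-iteration sketch of PageRank; the little link graph is made up for illustration, and today’s Google obviously layers hundreds of other signals on top of this:

# Toy PageRank: a page's score is (roughly) the damped sum of the scores
# of the pages that link to it, iterated until it settles.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread its rank everywhere
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# A made-up graph: everyone links to the celebrity pics, nobody links to the crank.
links = {
    "celebrity_pics": ["blog_a", "blog_b"],
    "blog_a": ["celebrity_pics"],
    "blog_b": ["celebrity_pics"],
    "unloved_crank_post": ["celebrity_pics"],
}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))  # celebrity_pics first, crank last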

So to recap:

  1. There’s a lot of stuff that isn’t in Google or Yahoo and they’re easy to hide from.
  2. The stuff that is in there is only accessible up to about the 700th result.
  3. The roughly 700 results that are both actually in Google and within the slice you get to see are chosen by popularity. This is a terrible prioritization scheme if what you are looking for or care about is not likely to be, or not meant to be, found or popular.

Well, that all kind of sucks. So what can I do about it?

The first of these is the hardest to address and (sorry) really needs to be the subject of another post.

As for the second two, which are related, there are some good things you can do to get the best of what is in the index.  Here, in no particular order, are my three favorite things:

1. Read the works of Obi-Wan Kenobi (Johnny Long) on Google hacking. I am but a Padawan apprentice, and a poor one at that. If you need the distilled version, start with his Black Hat presentation, especially page 5.

2. Define the weirdest query you can. What I mean is, if Google has a zillion results and you can only have 600 of them, or if they have 600 and you want to read 20, not all of them, do not ask the most generic version of your question.

Let’s go back to the Bin Laden example (yes I know he’s dead, but for years it was all any govvie I talked to wanted to use as their example.)

If you had, a few months ago, typed in “Bin Laden whereabouts” or “Bin Laden location”, you’d have gotten millions of results. Here’s the crazy thing: While I don’t think it’s likely, is it possible, based on Problems #2 and #3, that there actually WAS a page saying “Hey, I heard a rumor OBL is living in Abbottabad. Anyone snooped around that big new house they built there yet?” Yep.

If that page/author/blog was, in the godlike wisdom of PageRank, considered an unpopular crank and therefore placed at, say, position 902 out of five million results, you’d NEVER see it, even though it was sitting in there for years. Why? The results cut off at 856.

However, if the assumption was, “Well, the guy’s getting messages out somehow, and living someplace, probably pretty secure” then you have a much better way to query. How about this:

Bin Laden +compound +fence +security +rumor
Bin Laden +messages +Pakistan +(courier OR messenger)
Bin Laden +videotape +al Jazeera +(courier OR messenger)

Would these have led right to OBL? Maybe not, but at least you’re slicing the available index of material much more granularly and intelligently.

Intelligence is about pulling tenuous threads and connecting sometimes-nearly-invisible dots. Results from a query like this might surface a datum, or even just trigger a thought in a talented analyst’s mind, that could lead somewhere useful.

3. Randomly “split” your queries: Here’s a neat trick. Even if you don’t have additional terms like the cases above, you can do something I call splitting your queries. Stick in random words like “Bin Laden + Thursday” or “Bin Laden + Baseball”.

Bin Laden + Baseball? Seriously?

Twenty-eight million results.

Seriously.

Will this get you the 28 million results? Nope. Will it get you a different 658 than “Bin Laden + Thursday”? Yep.  And that’s how you can carve out a different slice of what Google has but normally wouldn’t show you.

Try months, days of the week and colors.

If you find it useful to keep going, stick in sports and the last twenty US Presidents.

You’d be surprised how different the results are.

Throw in some of the available Google restriction parameters (date, blogs vs. normal Web pages, file type) and you can pull ever-bigger slices of what they have, along ever-more-granular axes.
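If you want to generate a pile of these “split” queries without typing them by hand, a few lines of Python will do it. This is pure string construction, so paste the output into whichever engine you like (filetype: is a standard Google operator; the splitter terms are just the throwaway words described above):

# Generate "split" variants of the same base query.
import itertools

base = "Bin Laden"
splitters = ["January", "February", "Thursday", "Friday", "red", "blue", "baseball"]
restrictions = ["", "filetype:pdf"]      # add or drop operators as you like

queries = [
    " ".join(filter(None, [base, "+" + term, restriction]))
    for term, restriction in itertools.product(splitters, restrictions)
]

for q in queries[:6]:
    print(q)    # e.g. "Bin Laden +January", "Bin Laden +January filetype:pdf", ...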

That’s probably more than anyone would want on the subject, but like I said, these are just three of about nine important limitations and attributes of traditional search engines that you should be aware of when using them for Intelligence or Law Enforcement.

I’ll try to post more later on the other six.

Disclaimer: The views expressed on this blog are mine alone, and do not represent the views, policies or positions of Cyveillance, Inc. or its parent, QinetiQ-North America.  I speak here only for myself and no postings made on this blog should be interpreted as communications by, for or on behalf of, Cyveillance (though I may occasionally plug the extremely cool work we do and the fascinating, if occasionally frightening, research we openly publish.)

Yet another water metaphor, or “Why physically securing our borders is a dumb way to spend money…”

I read this post earlier today, and was struck by the fact that (IMO and with all due respect to the author) it was the commenters, i.e. the community of people actually doing the work and using the technology, not the ones selling or buying it, who hit on the right point.  So many of them have so many valid, and in many cases obvious, complaints that it got me wondering about the border protection issue, one which I hadn’t really thought about before.

So, as I so often do, I find my mind melding the original bean-counter nerd (my formal education) with the computer nerd (most of my actual career), and here’s what comes out from under the green-eyeshade-cum-propeller-beanie: as a purely economic and technological matter, I can’t understand the math of trying to physically secure our borders.  It calls for all kinds of really expensive solutions and technologies that are unlikely to work, when the money could be better spent securing our SOCIETY in ways that would dis-incent illegal migration in the first place.

Here’s my unscientific argument:

People are like water too (see the Digital Water series here, here and here).  If you put something in their way, they flow around it to get where they want to go.  Given that fact, plus a bit of rudimentary economics and technology, “securing our borders” is, in my uninformed opinion, mathematically provable to be a dumb way to spend scarce resources, both human and financial.

What’s the problem?  Well, there are many, but let’s start with the water metaphor.  Imagine people as a river, pouring across one or several points along a thousand-mile stretch of border.  With water, the pull is gravity; with people, it is economic prosperity, freedom from persecution, reunification with loved ones or other human need.  Either way, the flow is basically unidirectional and essentially irresistible.  So what happens if you dam up one little piece of the border?  The water flows around it.  Duh, right?  Well, most of the proposals involve some combination of physical barriers (hopeless – I’m not making this up, it cost an average of nearly $4 million a mile!) and digital barriers.

If physical barriers are impractical (but we sure spent a bunch on them anyway), we get into cameras, drones, IR imaging, blah blah blah.  OK, suppose for a moment that the physical gear were actually available in sufficient volume to secure a thousand-mile stretch of border (remember that, with the crenellations of the Great Lakes, for example, a thousand miles of shoreline may span a lateral distance of only a few hundred miles in a two- or three-state region).  So now you’ve got how many thousands of cameras, sensors, drones and collection nodes running?  How much did that gear cost?

Here’s the resultant problem – how many people do you need to watch, sift, prioritize and act on all that data?  We’ve run into the same problem in the war in Afghanistan.  We’ve spent so much money on drones and other eyes in the sky, and they are so good at producing data, that we have far more full-motion video from every corner of the theatre than we have people to watch the screens.  So without some smart downstream systems or algorithms (full disclosure: I make prioritization algorithms for a living, so I have a bias here), what good are 300 TV screens’ worth of video running at the same time if you only have six guys to watch them?  And what systems will you need to address that problem?  Those cost more money.

Now calculate how many people sneaking over the border from Canada you have just stopped.  How much economic damage to the US have you just prevented?  So on a per-person-stopped basis, what was the investment?  What is the ROI?  Don’t forget to fully load the headcount (as we say in the bean counter biz) with the cost of catching, incarcerating, processing and deporting each of them.

Now add in how long it will take (in a world awash in data) for the coyotes (or Canadian equivalents) to work out the areas that are and aren’t effectively covered, camera’d and patrolled.  So, in a shrinking-budget political and fiscal environment, we have to ask, what is the ROI on all this technology and expense?
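Just to make the shape of that calculation concrete, here’s the back-of-the-envelope version. Every number below is a placeholder to be swapped for real figures; none of them are actual program costs or apprehension counts:

# Cost per person stopped, fully loaded.
fence_and_sensors = 4_000_000 * 1_000    # the ~$4M/mile figure above, over a notional 1,000 miles
monitoring_staff = 50_000_000            # hypothetical annual cost of the people watching the screens
apprehensions = 10_000                   # hypothetical number of people actually stopped
fully_loaded_each = 25_000               # hypothetical catch/detain/process/deport cost per person

total_cost = fence_and_sensors + monitoring_staff + apprehensions * fully_loaded_each
print(f"Cost per person stopped: ${total_cost / apprehensions:,.0f}")   # ~$430,000 with these placeholders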

So, what’s a possible alternative? Spend the money putting up “anti-gravity” barriers.  Make the pull less pull-y.  Securing physical space over long distances is not in the “sweet spot” of what technology is good at.  Gathering, storing, sharing and disseminating information? Computers ARE good at that.  For the amounts of money involved in “securing the border”, how much technology could we create that makes the appeal and viability of illegal immigration much lower?

Could we make it way harder to get a job or paid work of ANY kind?  Sure.  Could we map historical data and interview deportees to understand how they stayed as long as they did? Who hired them? What work they did?  Then use the data to identify the most likely places illegals are being employed now and put pressure on, and/or incent those industries to hire only documented workers? Yep.  Data is great for all that kind of stuff.

Here’s another thought – how did those people not just get paid, but live at all?  Proof of legal presence is (where I live) already required to get hired, register a car, get a license, or lease an apartment.  Close the loopholes that allow illegal immigrants to live and work (while simultaneously creating sensible, pro-economic growth policies to bring in needed guest labor), and there will be less incentive to come illegally.  If you’ll be just as broke, homeless and hungry in Texas as you will in Mexico, and smart data modeling ensures you will be found and deported a lot faster, the incentives to come in the first place start to dry up, don’t they?

Why can’t we do the things technology is GOOD at to address illegal immigration cost effectively? (Oh, bureaucratic inertia, local politics, resistance to change, agency turf wars and vested interests and lobbying on the part of the vendors getting paid to do it the dumb way. But aside from all that?)

ONE FINAL NOTE – the “security” argument.  Here’s one I love: “We HAVE to physically secure the borders to keep out terrorists.”  This is the most transparently dumb defense ever of profligate, wasteful spending.  A border fence, patrols, sensors, etc. are all meant to address illegal immigration as a macroeconomic issue and a legitimate crime/LE concern.  It’s about JOBS and routine criminal concerns (both valid, but there are better ways to spend the money, as I argue above).  You can make a dent in the river of people flowing over a border with nothing but the shirts on their backs.  It is supremely unlikely a fence will keep out the ONE guy who’s coming to blow up the Sears Tower.  Why?

A) You’ll never stop more than a portion of the river, so how do you know you’ll catch the part of the water containing the next Mohammed Atta?

B) A determined, well-funded illegal entrant, or someone with the backing of a terrorist organization, has options for getting in that don’t involve swimming the Rio Grande or Lake Ontario.

And C) the people we really need to be afraid of don’t come in that way anyway.  Let’s see here:

  • All of the 9/11 hijackers? Entered the US through legal channels.
  • Khalid Sheikh Mohammed?  Entered the US on a legal visa.
  • Ramzi Yousef, the 1993 WTC bomber? Arrived through JFK airport in NY.
  • Times Square bomber? Naturalized US citizen.
  • Y2K Bomber? Stopped by an astute agent at a standard border crossing station.

You get the idea.  Anyone who says securing our borders (not better customs control, not immigration control but our physical territorial lines with the rest of the world) is necessary because it will stop terrorism is either too dumb to know that’s a silly argument, knows it but thinks voters are too dumb to know it’s a silly argument, or has a stake in the contractors building the fence.  Any way you slice it, the data say otherwise and it makes me seriously question the person making the argument.

 

Disclaimer: The views expressed on this blog are mine alone, and do not represent the views, policies or positions of Cyveillance, Inc. or its parent, QinetiQ-North America.  I speak here only for myself and no postings made on this blog should be interpreted as communications by, for or on behalf of, Cyveillance (though I may occasionally plug the extremely cool work we do and the fascinating, if occasionally frightening, research we openly publish.)

Hello world!

Good morning, Dr. Chandra, this is HAL.  I’m ready for my first lesson…
