Why Google is great, but not a complete solution, for Intel and Law Enforcement. (Part I)

I read Nick Selby’s piece on Police Led Intelligence this morning talking about more effective use of online search engines for police officers. Nick’s right – many in the Law Enforcement and Intelligence communities can do even more than they are by learning more about how search engines work, but there’s a second part to that story.

This has been on my mind a lot, especially since my company was acquired by a defense firm and I’ve been spending a lot more time with intelligence, law enforcement and other folks working in public health or safety.

Let me preface this by saying I’m not knocking the traditional search portals, they are extremely useful and powerful.

However, they do suffer from built-in biases, blind spots and restrictions that many analysts and law enforcement officers aren’t even aware are affecting them when they use a search engine.

By all means, use these fantastic tools to the fullest possible extent (see Nick’s comments and links to Google-hacking Jedi Master Johnny Long’s book and presentation).

Just understand that what they offer is not search of “the Internet” or “The Web”, they offer search of that portion of the online content world that is in their index, and then give you an even smaller slice of that.

With Johnny’s expansive help, and perhaps one or two easy tricks from yours truly, you can get to a decent portion of whatever they have. Just understand that what they have, what you can get to of what they have, and what they don’t have at all, is an important part of using them effectively.

The first three problems with traditional search engines

There is a wide range of built in biases, problems and blind spots with the traditional search engines of the world, and understanding them will make you a better user and consumer of the great things they CAN do for you.

Problem 1 – They only find things that don’t mind being found

Whether it’s a full Web site, a specific user’s blog on WordPress or a subset of pages within a much larger set of content, it is usually a 10-second exercise to hide content from search engines.

Search engines harvest pages by “crawling”, that is downloading a page from an address (i.e. a full path URL, e.g. http://host.domain.com/page.html), finding links on the page and “crawling” or following the link by requesting the linked page, finding more links on that page and sequentially requesting and indexing those pages ad infinitum.

The problem? On arrival at a site, the first page a crawler or spider will often request is called “robots.txt”, which essentially says “If you are an automated requester, i.e. not a person, please follow these links or help yourself to these pages, or all pages, or the entire domain” or whatever they want. To permanently say “Google, Yahoo et al, you are not welcome to see anything on my entire Web site” requires the extremely sophisticated programming below:

User-agent: * Disallow: /

In people speak, this says, “If you’re a robot (crawler, spider, automated agent etc.) not a human’s browser program, screw off.”

That’s it. Terrorists, child pornographers, anti-government radical or hate groups who want to hide from easy detection need less than 25 characters to ensure that an FBI or ATF agent can Google ’til the second coming, and their Web site will never show up in a traditional search engine.

On pre-built blogging platforms like WordPress, where the user is assumed to have NO technical or programming knowledge at all, it’s even easier. You check a box when you sign up that says “Do you want your posts to show up in search engines?” Check No, and you’re hidden from GoogleBot and its Bing-y, Yahoo!-y etc. cousins.

Yay for stupid criminals who don’t know this, and by all means let’s use Johnny-Fu to find what IS in those search engines’ indexes. Just understand that what Google or Bing has and what’s actually out there are not synonymous.

Problem 2 – We have nine BILLION results for your query…

“…but we’ll only show you 658 of them.”

Here’s another little-known and ill-understood factoid/problem (By the way, to make this problem easy to see, I recommend you go to preferences or settings on Google and set your account to “show 100 answers per page).

Type a query into Google that might have lots of results, e.g. Osama Bin Laden. Google will say something insane like “We have 44 million results for your query.”

Great. Insane as it sounds, suppose I had a room full of analysts and I actually wanted them to spend the next five years eyeballing every one of those results. Go to the bottom of the page where you see all the “O’s” in Goooooogle representing the next ten pages of results (at 100 per page). Click next three or four times….

Did you see ten pages of “next” shrink to 9 or 8 or 7? Click it some more. Without fail, around page 7 or 8, they cut you off. So Google has 44 million pages about Bin Laden. You can actually have about 658 of them.

Try it.

Problem 3 – Here are those 658 results….

Here’s the third of today’s often-unknown problems with search engines. By the way, there are six or seven more, but I’m already getting long-winded here.

The brilliant original insight that made Google a zillion-dollar success was a notion called PageRank (actually named for Larry Page, one of its four authors, and not for “Web Page” as many people think).

PageRank essentially codifies the notion of Vox Populi Vox Dei. See, Anatomy of a Large-Scale Hypertextual Web Search Engine if you feel like seeing where 100 Billion Dollars started.

Let me save you reading a Stanford PhDs worth of math in seven words:

“That which is most popular is best.”

Google, not you, decides which 658 of the 44,000,000 results you get, and it does so by ranking them using today’s version of PageRank.

PageRank, for all of its enormous evolution and complexity since the mid-90s, still says basically, the most popular answer is the best answer.

If you are looking for who actually has those Paparrazi-snapped photos of Britney Spears in a bikini, you can be pretty sure that the page with the best, highest resolution versions of those pics will have the most inbound links, comments etc.

If what you’re looking for is likely to be unloved, unpopular, private or hidden in a dark digital corner of the Web, you’d best hope it has some extremely unusual or deterministic keyword in it if you want it to appear in the top slice of results for your query.

So to recap:

  1. There’s a lot of stuff that isn’t in Google or Yahoo and they’re easy to hide from.
  2. The stuff that is in there is only accessible up to about the 700th result.
  3. The 700 that are both actually in Google and in the top 700 you get to see are chosen by popularity. This is a terrible prioritization scheme if what you are looking for or care about is not likely to be, or meant to be, found or popular.

Well, that all kind of sucks. So what can I do about it?

The first of these is the hardest to address and (sorry) really needs to be the subject of another post.

As for the second two, which are related, there are some good things you can do to get the best of what is in the index to be found. Here, in no particular order are my favorite three things:

1. Read the works from Obi Wan Kenobi (Johnny Long) on Google Hacking. I am but a Padawan apprentice and a poor one at that. If you need the distilled version, start with his Black Hat presentation, especially page 5.

2. Define the weirdest query you can. What I mean is, if Google has a zillion results and you can only have 600 of them, or if they have 600 and you want to read 20, not all of them, do not ask the most generic version of your question.

Let’s go back to the Bin Laden example (yes I know he’s dead, but for years it was all any govvie I talked to wanted to use as their example.)

If you had, a few months ago, typed in “Bin Laden whereabouts” or “Bin Laden location” you’d get millions of results. Here’s the crazy thing: While I don’t think it’s likely, is it possible, based on Problems #2 and #3 that there actually WAS a page saying “Hey, I heard a rumor OBL is living in Abottabad. Anyone snooped around that big new house they built there yet?” Yep.

If that page/author/blog was, in the godlike wisdom of PageRank, considered an unpopular crank and therefore placed in just, say, position 902 out of five million results, you’d NEVER see it, even though it was sitting in there for years. Why? Results cut off at 856.

However, if the assumption was, “Well, the guy’s getting messages out somehow, and living someplace, probably pretty secure” then you have a much better way to query. How about this:

Bin Laden +

Compound +fence +security +rumor
messages +Pakistan +(Courier OR messenger)
Videotape +al Jazeera +(courier OR messenger)

Would these have led right to OBL? Maybe not, but at least you’re slicing the available index of material much more granularly and intelligently.

Intelligence is about pulling tenous threads and connecting sometimes-nearly-invisible dots. Results to a query like this might surface a datum, or even just trigger a thought in a talented analyst’s mind, that could lead somewhere useful.

3. Randomly “split” your queries: Here’s a neat trick. Even if you don’t have additional terms like the cases above, you can do something I call splitting your queries. Stick in random words like “Bin Laden + Thursday” or “Bin Laden + Baseball”.

Bin Laden + Baseball? Seriously?

Twenty eight million results.

Seriously.

Will this get you the 28 million results? Nope. Will it get you a different 658 than “Bin Laden + Thursday”? Yep.  And that’s how you can carve out a different slice of what Google has but normally wouldn’t show you.

Try months, days of the week and colors.

If you find it useful to keep going, stick in sports and the last twenty US Presidents.

You’d be surprised how different the results are.

Throw in some of the available Google restriction parameters (date, blogs vs. normal Web pages, file type) and you can suck ever bigger slices of what they have on ever more granular axes.

That’s probably more than anyone would want on the subject, but like I said, these are just three of about nine important limitations and attributes of traditional search engines that you should be aware of when using them for Intelligence or Law Enforcement.

I’ll try to post more later on the other six.

Disclaimer: The views expressed on this blog are mine alone, and do not represent the views, policies or positions of Cyveillance, Inc. or its parent, QinetiQ-North America.  I speak here only for myself and no postings made on this blog should be interpreted as communications by, for or on behalf of, Cyveillance (though I may occasionally plug the extremely cool work we do and the fascinating, if occasionally frightening, research we openly publish.)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: