Skynet, Smugglers and The Gift of Fear: What we can learn from snap judgements, and machines can learn from us

So, in the day or two since I posted the piece about “Big Filter”, I’ve gotten several calls, comments and emails that all seemed to focus on the scary notion of “machines that think like us”.  Some folks went all “isn’t that what Skynet and The Matrix, and (if you’re older, like me) The Forbin Project and WOPR were on about?”  If machines start to think like us, doesn’t that mean all kinds of bad things for humanity?

Actually, what I said was, “We have to focus on technologies that can encapsulate how people, people who know what they’re doing on a given topic, can inform those systems… We need to teach the machines to think like us, at least about the specific problem at hand.”  Unlike some people, I have neither unrealistic expectations for the grand possibilities of “smart machines”, nor do I fear that they will somehow take over the world and render us all dead or irrelevant.  (Anyone who has ever tried to keep a Windows machine from crashing or bogging down or “acting weird” after about age 2 should share my comfort in knowing that machines can’t even keep themselves stable, relevant or serviceable for very long.) 

No, what I was talking about, to use a terribly out-of-date phrase, was what used to be known as “Expert Systems”, a term out of favor now, but that doesn’t mean the basic idea is wrong. I was talking about systems that are “taught” how someone who knows a very specific topic or field of knowledge thinks about a very specific problem.  If, and this is a big if, you can ring-fence the explicit question you’re trying to answer, then it is, I believe, possible, to teach a machine to replicate the basic decision tree that will get you to a clear, and correct, answer most of the time.  (I’m a huge believer in the Pareto Principle or “80-20 rule” and most of the time is more than good enough to save gobs and gobs of time and money on many many things.  More on that in a moment.) 

A few years ago now, I read a book called “The Gift of Fear” by Gavin de Becker, an entertaining and easy read for anyone interested in psychology, crime fighting, or the stuff I’m talking about.  The very basic premise of that book, among other keen insights, is that our rational minds can get in the way of our limbic or caveman brains telling us things we already “know”, the kind of instantaneous, can’t-explain-it-but-I-know-I’m-right, in-our-gut knowledge that our rational brains sometimes override or interfere with, occasionally to our great harm.  (See the opening chapter of The Gift of Fear, in which a woman whose “little voice”, as I call it, told her there was something wrong with that guy, but she didn’t listen, and was assaulted as a result.  Spoiler alert: she did, however, also escape that man, who intended to kill her, using the same intuition. Give it a read.)

De Becker, himself a survivor of abuse and violence, went on to study the evil that men do in great detail, and from there, to codify a set of principles and metrics that, encoded into a piece of software, enabled his firm to evaluate risk and “take-it-seriously-or-not-ness” for threats against the battered spouses, movie stars and celebrities his Physical Security firm often protects.  Is this Skynet taking over NORAD and annihilating humanity? Of course not.  What it is, however, is the codification of often-hard-won experience and painful learning, the systematizing of smarts.

I was thinking about all this in part because, in addition to the comments on my last post, I’m in the middle of re-reading “Blink” (sorry, I appear to be on a Malcolm Gladwell kick these days.)  It’s about snap decision making and the part of our brain that decides things in two seconds without rational input or logical thought.  A few years ago, as some of you know, my good friend Nick Selby of (among many other capes and costumes) the Police Led Intelligence Blog, decided he was so passionate about applying technology to making the world better and communities safer that he both founded a software company (streetcred software – Congrats on winning the Code for America competition this year!) and became a police officer to gain that expertise he and his partner would encode into the software.  He told me a story from his days at the Police Academy.  I may have the details wrong on this bit of apocrypha, but you’ll get the point. 

During training outside of Dallas, there was an experienced veteran who would sometimes spend time helping catch smugglers running north through Texas from the Mexican border.  “Magic Mike”, as I call this guy (I can’t remember his real name), could stand on an overpass and tell the rookies, “Watch this.”  He’d watch the traffic flowing by beneath him, pick out one car seemingly at random and say, “That one.” (Note that, viewed at 60 mph and looking at the roof from above, the age, gender, race or other “profiling” characteristics of the occupants are essentially a non-issue here.)

Another officer would pull over the car in question a bit down the road, and, with shocking regularity, Magic Mike was exactly right.  How does that happen?!  And can we capture it?  My argument from yesterday is that we can, and should.  We’re not teaching intelligent machines in any kind of scary, Turing-Test kind of way.  No, it’s much clearer and more focused than that.  Whatever went on in Magic Mike’s head – the instantaneous Mulligan Stew of car make, model, year, speed, pattern of motion, state of license plate, condition etc. – if it can be extracted, codified and automated, then we can catch a lot more bad guys. 

I personally led a similar effort in Cyber space.  Some years ago, AOL decided that member safety was a costly luxury and started laying off lots of people who knew an awful lot about Phishing and spoof sites.  Among those in the groups being RIF’ed was a kid named Brian, who had spent untold hours sitting in a cube looking at Web pages that appeared to be banks, or PayPal or whatever, saying, “That one’s real. That one’s fake.  That one’s real, that one’s fake.”  He could do it in seconds. So, we hired him, locked him in an office and said, “You can’t go to the bathroom til you write down how you do that.”

He said it was no big deal – over the years he’d developed a 27-step process so he could teach it to new guys on the team.  Just one of those steps turned out to be “does it look like any of the thousands of fake sites I’ve gotten to know over the years?”  Encapsulating Brian’s 27 steps in a form a machine could understand took 400 algorithms and nearly 5,000 individual steps.  But… so what?  When those weeks of effort were done, we had the world’s most experienced Phish-spotter built into a machine that thought the way he did, and worked 24×7 with no bathroom breaks.  We moved this very bright person on to other useful things, while a machine now did what AOL used to pay a team of people to do, and it did it based not on simple queries or keywords, but by mimicking the complex thought process of the best guy there was.
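To make “writing down how he does it” a little more concrete, here is a toy sketch in Python. To be clear, these are NOT Brian’s 27 steps (those aren’t reproduced here); the three rules, the brand list and the threshold are all made up for illustration. The only point is the shape of the thing: each piece of expert judgment becomes a small, testable rule, and the machine runs all of them on every page, every time.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical brand list, for illustration only.
BRAND_DOMAINS = {"paypal": "paypal.com", "examplebank": "examplebank.com"}

def uses_raw_ip(url):
    """Red flag: the link points at a bare IP address instead of a name."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def brand_on_wrong_domain(url):
    """Red flag: a known brand name appears in a host the brand doesn't own."""
    host = (urlparse(url).hostname or "").lower()
    for brand, real in BRAND_DOMAINS.items():
        if brand in host and not (host == real or host.endswith("." + real)):
            return True
    return False

def no_https(url):
    """Red flag: a supposed login page served without HTTPS."""
    return urlparse(url).scheme != "https"

# The "expert" is just an ordered list of rules (made-up ones, in this sketch).
RULES = [uses_raw_ip, brand_on_wrong_domain, no_https]

def red_flags(url):
    """Names of every rule the URL trips."""
    return [rule.__name__ for rule in RULES if rule(url)]

def looks_fake(url, threshold=2):
    """Call it fake when enough rules fire; the threshold is an assumption."""
    return len(red_flags(url)) >= threshold

print(red_flags("https://www.paypal.com/signin"))
print(red_flags("http://paypal.com.account-verify.example/login"))
```

The first URL trips no rules; the second trips two (the brand name sitting on somebody else’s domain, and no HTTPS). A real system would need far more rules than this, which is exactly why 27 human steps ballooned into thousands of machine ones.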

If we can sit with Brian, who can spot a Phishing site, or De Becker who can spot a serious threat among the celebrity-stalker wannabes, or Magic Mike who can spot a smuggler’s car from an overpass at 70 miles an hour, when we can understand how they know what they know in those instant flashes of insight or experience, then we can teach machines to produce an outcome based not just on simple rules but by modeling the thoughts of the best in the business.  Whatever that business is – catching bad guys, spotting fraudulent Web sites, diagnosing cancer early or tracking terrorist financing through the banking system, that (to me) is not Skynet, or WOPR, or Colossus.  That’s a way to better communities, better policing, better healthcare, and a better world. 

Corny? Sure.  Naive? Probably.  Worth doing?  Definitely.  

How to Hack Like Homer Simpson…

A few weeks ago, I gave a talk to a room full of police chiefs. I was talking about the goods, bads and unknowns of Social Media use by and for Law Enforcement (#LESM or #SM4LE).

One of the slides looked like this:

Image

It shows how, unless you explicitly change the default settings, in many cases everything from Tweets to photos are tagged with a variety of metadata.  In some cases this can include geotags for the location of the device that produced the photo, tweet or update, the model number and make of the camera or phone, etc.

I suppose if you flip the “goods” and the “bads” I could have given the same speech to hackers, but of course they are way too tech savvy to need any such guidance.

Well, most of them. There’s always the exception:

http://www.informationweek.com/news/security/government/232900329

I couldn’t help but smile.  A hacker implicated in the recent Texas DPS breach, in painfully cliche fashion, decided that a bit of geek chest thumping was in order.  In a bugs-bunny-esque “you’ll never catch me coppers! Mwah hah hah!” moment, he decided to post pics on Social Media of his girlfriend holding signs taunting law enforcement.

The only problem?  Hacker-genius-computer-expert guy neglected to remove the geotagging from the photos, which were taken in her back yard. Police took the arcane and Star-Treky step of reading the lat/long coordinates on the files and looking them up on a map.
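For the curious, the “arcane” step really is about this simple. Photo metadata typically stores position as degrees, minutes and seconds plus a hemisphere letter (I’m assuming the usual EXIF GPS fields here: GPSLatitude, GPSLatitudeRef, GPSLongitude and GPSLongitudeRef; the article doesn’t say exactly what the investigators read). Turning that into something you can paste into a map is one line of arithmetic. The sample coordinates below are invented, not the ones from the case.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees.

    South and West are negative by convention.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

# Made-up example values, purely for illustration.
latitude = dms_to_decimal(30, 15, 0, "N")
longitude = dms_to_decimal(97, 45, 0, "W")
print(latitude, longitude)  # 30.25 -97.75
```

Type those two numbers into any mapping site and you are looking at a back yard.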

What I wouldn’t have given to be a fly on the wall when he was told how they got him.

Spontaneous Help For Law Enforcement – Baltimore PD gets Social Media tips it didn’t even ask for.

So one of the things cops ask me about all the time is, “Can I go out and search Twitter and Facebook when investigating a crime?”  The answer is “it depends”.  It depends on what your department’s rules are.  It depends on what your DA thinks about admissibility of evidence you proactively gather from SM.  It depends on whether you were (please say you weren’t) actually logged into your own account vs. searching publicly accessible posts.  It depends on…, and it depends on… and it depends.

However, there is a model under which you don’t need to worry about any of that.  It’s the inbound model of “SM4LE” as I call it, where the public brings information in to you via social media.  In yesterday’s post, I noted cases where material posted on Department SM feeds and pages brought in responses, tips and definitive IDs on various criminals.

Here’s a different variant, one I actually found kind of heartening.  Not long ago, not one but two different videos surfaced online of a man being publicly beaten, stripped and humiliated in Baltimore.  There has since been an arrest thanks to material gleaned from Social Media.  So what’s different about this case?  Unlike the Utica and Texas cases noted yesterday, this wasn’t a video the police put out on SM asking for help, it was a video that was going viral on various video sharing sites.  What the press reports indicate is that the video so incensed some of its viewers, they spontaneously worked through social media to identify the men in the video and voluntarily notify BPD of their findings without being asked.

People just saw injustice and stepped up of their own accord.  That’s kind of cool, given the original video could make one wonder about people. As the Baltimore Police Commissioner says in the article:

“It’s easy to [see a video like that] and think, ‘Damn, what’s happening to the fabric of our society.’ But to come in the next day and know that we’ve got leads on who the suspect is — just when you think we’ve left the rails, people help bring you back. That’s enormously gratifying.”

OK Coppers, Let’s see some SM4LE in action!

So I recently posted a few example cases of how police departments were successfully catching criminals with the help of information put out to the community via Social Media (the outbound aspect of what I called SM4LE or “Social Media for Law Enforcement”).

Let’s see if SM4LE plays a role here. http://on.msnbc.com/I0FuIw

Genius Boy tails a woman from his car, overtly menaces and follows her, is dumb enough to get out and chase her on foot, all while she is holding my oft-mentioned camera phone.  I give this guy 48 hours :), less, obviously, if she snapped the tags on his vehicle.  Yay technology and Stupid Criminals!  Oh what the hell…. and even stupider criminals!

 

Inventing My Own Hashtag… #SM4LE

So I recently spoke at a gathering of Police Chiefs on the topic of Social Media and Law Enforcement.  I covered a bunch of topics over the course of my bit (it was a full hour speaking slot, which allowed us to cover a lot of ground), but one came out of some of the things I learned while preparing.

  1.  Something like 90% of all US law enforcement agencies have fewer than 50 employees.
  2. Nearly all US municipalities are facing declining tax revenues and police departments are under more pressure than ever to “do more with less”.
  3. A significant portion of non-violent crimes that don’t involve drugs are never even pursued.  Consider this example: in Spokane, the Police Department actually told the press that only 5% of property crimes will even be investigated! Not solved… looked at.  In 95% of all cases involving simple property theft, the victim filing the police report (usually done for their insurance company) is the end of the process.  I’m not knocking them – there just aren’t enough resources to go round.

 

Remember this post though?  It was on a more grandiose topic than State and local Law Enforcement, but it closed with an important notion.

“From cults to political parties to hate organizations to repressive regimes, the daylight is coming to shine on you.  If you can’t say it out loud and in public, know that your days are numbered.  I think, whether in months or years, the end is nigh, and your doom will come not from jackbooted troops, police SWAT teams or even intrepid reporters, but in the form of the individual with a conscience and the cheap, ubiquitous camera phone.”

And here’s what I’ve found.  Local Police Departments, those less-than-50-people, short-of-money agencies I talked about, are getting super creative in taking advantage of Social Media, and the power of both free technology platforms and free labor in the form of the engaged citizenry.  Here are just a few of my favorites.

The Utica NY police department (hey, I’m a Hamilton grad, what can I say?) started using its Facebook, Youtube and Twitter accounts to post information about crimes, including the “dead end” ones that no one could afford to spend paid labor to investigate.

The most amazing thing happened.  Shortly after the UPD posted store camera video of a liquor thief on its FB and YT channels, they received multiple calls ID’ing her by name, AND she turned herself in before UPD even tasked an officer to pick up the suspect.  My favorite part was her explanation when she showed up.  When asked why, she said “I wanted my face off Facebook and Youtube!” and her phone was “blowing up” with calls from angry family and friends. As I said before, this generation has a relationship to their phones and social networks that us “old people” (by which I mean, over 30) simply do not appreciate.

So a crime that a few years ago would fall into the “never even investigated pile” was now solved, in 24 hours complete with confession, at a total cost to the department of roughly NOTHING. They didn’t even spend the gas to have a cruiser pick her up.  She drove herself to the police station.  Awesome.

In another case described by the Chief of the UPD, there was a series of high-value thefts during one night between 11 and 3.  In the ensuing 24 hours, thanks to quick action on social media channels, they got two dozen tips, every single one of which named the same two guys. 

In LA, social media information was critical to catching a serial arsonist that had set more than 30 fires.  And this posted just this week – Police in Denton, Texas have caught eight of their Most-Wanted local criminals since February, thanks to their Facebook page.

What’s my point?  Two things – first, when every citizen is a free employee armed with a sophisticated, interlinked sensor (i.e. the camera phone and SMS), it’s amazing what under-staffed, under-funded and low-tech Law Enforcement Agencies  can do with a mouse, social media and an engaged community.

Second – This was about a one-minute part of an hour long discussion.  There are just so many aspects of social media and Open Source Intelligence that touch on state and local Law Enforcement. All  the money and fancy systems and “big data” being discussed at the Federal level? None of that trickles down well to the local LEAs whose concerns are the daily block and tackle of community policing.  I talked about the pitfalls, legal issues, and problems, yes, but also about the incredible opportunities that Social Media presents for improving outcomes and actually improving quality of life in communities large and small.  This is a broad, rich and fascinating topic I hope will get a lot more coverage in the future.

In the meantime, since I’m back on the blog and Twitter’s 140 character limit is not conducive to “#SocialMediaForLawEnforcement” I’m going to use #SM4LE (it didn’t already exist, I checked) as shorthand and hope I can help start a conversation.  I’ve met so many great cops, and dedicated civil servants who are trying so hard to work with what little they have in their own communities.  Maybe some of what I know about Open Source and Social Media can help the folks too small, too short-handed or too underfunded to enjoy the solutions that benefit the national agencies and major cities.

See you in the Twitterverse.

Eric  (@DigitalH20)

A really smart guy blows it completely…Malcolm Gladwell isn’t exactly wrong, he just missed the point.

So let me start with a couple of quick disclaimers.

  1. Malcolm Gladwell is a really smart guy, I respect a lot of his ideas, and I really liked several of his books.
  2. I’m not trying to pick a fight with someone famous just to elevate my blog.  What might I accomplish? Doubling the two dozen people who read it?  This isn’t gratuitous, and (as evidenced by my spotty posting record) I’m obviously not trying to make this blog a platform for fame or visibility.
  3. He’s also a lot more famous, rich and brainy than I am, so if those are the metrics that serve as proxy for right and wrong, maybe I should shut up. That said… Yeah, he totally blew it.

So a while back, I wrote a post called “Tech Coup 2.0 – The Revolution Will Be Twittervised…”, one in a long list of plays on the original title, poem and song, (The Revolution Will Not Be Televised, Gil Scott-Heron, 1970).

Unbeknownst to me at the time, Gladwell had written a piece a few months before for the New Yorker called “Small Change: The Revolution Will Not Be Tweeted”.  The reason I’m taking this on now, when the question might seem oh-so-totally-six-months-ago is not just to defend my position, but because I think this is going to be the question of 2012 far more than it was the question of 2010.  First let’s talk about how he’s missed the point, then I’ll touch on why I think the impact of this is going to reach far beyond the past year’s “Arab Spring”.

Gladwell’s argument, as well as those of several learned and impressive people he cites, including Golnaz Esfandiari’s excellent piece in Foreign Policy “Misreading Tehran: The Twitter Devolution”, is that social media and virtual networks have fundamental flaws as a tool for organizing revolution.  I won’t recap his whole argument here, but citing examples from East Germany to the US civil rights movement, he explains that, among other things, social uprising against the status quo requires two very important elements.

The first is what he calls “strong ties”.  It’s easy, he argues, to “join a cause” by clicking the “Like” button on Facebook or giving a dollar via a Web site, but when we’re talking about rising up against a regime or authority with the ability and willingness to use coercion and force, it’s a different ballgame.  To be willing to stand in front of the proverbial tank or put flowers in a rifle barrel aimed at your head, true (that is, physically dangerous) stands against authority have traditionally required a personal connection to others involved.  Flash-mobbing Wall Street in New York, where the rule of law and one’s physical safety are essentially not in question (recent left-wing hysteria about pepper spray and fascism notwithstanding), is totally different than coming out of your house to face down Assad’s security forces because of a text message or Tweet.  People, he argues, put their asses on the line because people they know and care about are taking to the streets too, and/or have been victims of the condition against which they protest.

The second factor a true uprising requires to be sustained, he argues, is hierarchy and organizational control, the very antithesis of social, informal and virtual networks.  If one’s goal is just to create havoc, then sure, a loose confederation of like-minded individuals acting semi-autonomously is fine.  But if your goal is explicit, specific and clear policy and governmental changes, then (citing examples like the NAACP), a clearly structured organization and chain of command is explicitly required.

He also does a fine job of pointing out the flaws of, if not completely tearing down, the arguments for the power of social media that are made in Smith and Aaker’s “The Dragonfly Effect” and Shirky’s “Here Comes Everybody”.  My favorite nugget:

“ ‘Social networks are particularly effective at increasing motivation,’ Aaker and Smith write.  But that’s not true.  Social networks are effective at increasing participation – by lessening the level of motivation that ‘participation’ requires.” 

Again, I won’t restate his whole argument here; it’s really worth it to read Gladwell’s piece.  And I say that because (and here comes the potentially confusing part) I think he’s absolutely right. I think his critique of the whole “social media will change the world” view is dead on in terms of the flaws he exposes in social media as a tool of organization for large scale social or revolutionary change.

So… HUH?  Didn’t I start this whole discussion saying Gladwell’s wrong?  Nope.  I said he missed the point.  Not the same thing at all.  He’s absolutely right that technology, social networks and the like will not likely play (and explicitly have NOT to date played) the role their cheerleaders have claimed.

Here’s the point I think he missed, and it was the core point of my own Revolution post, which perhaps I didn’t state explicitly enough.  Technology and social networks will not bring the tools and organization and strong ties required to bring people out in the face of the threat of physical force.  But let’s remember what gets people out in the street in the first place – a motivation to take the risk, something so inspiring, egregious or powerful it overcomes the collective inertia of not revolting.  And that is what technology can, and will, bring.

Gladwell is right that it took organization, strong ties, and deeply seated moral beliefs among both the black protesters and the white freedom riders and volunteers who eventually rose up to begin changing life for black Americans.  Twitter and YouTube can’t provide the ties, or the organizational structure.  What they can provide is the motivation, the evidence, the “why”.  How much sooner, and how many more, white supporters might have come, how many more black students might have sat in, if lynchings and beatings and rapes of black girls by white men had been caught on cellphone cameras and posted on YouTube.

What was the catalyst that started the Tunisian upheaval? One poor street vendor, despondent and disheartened to the point of self-immolation, became the (literal) match that lit the fuse of revolution.

Can Twitter or SMS really provide the organizational structure and the belief systems to make thousands turn out in the face of arrest, imprisonment or worse and keep them focused on a long term goal or societal change?  Not at all.  Does it provide the strong familial or social ties that get folks to link arms in front of a machine gun?  Nope.  But…

Can it, in seconds and nearly unstoppably, communicate out to a million people the photo, video, report or account of an atrocity, injustice or societal wrong that will get them in the streets and provide the motivation to organize, reach out and engage one’s close ties?

It has (flip phone vid of Saddam Hussein being hanged, anyone?), it can and it will.

Like I said, Gladwell wasn’t wrong; in fact I agree with his criticisms of the social-media evangelist set. Social media doesn’t play the role its cheerleaders claim.  On this, he’s right. I just think he’s arguing the wrong point.  I’ll close by repeating my own thought from the previous post, for whatever that’s worth.

If, and where, keeping the world from knowing “what’s really happening” is important to maintaining advantage, power or undeserved legitimacy,  the inability to keep the information genie in the bottle ever, at all, anywhere, is going to catch a whole lot of employers, governments and belief systems up short.

From cults to political parties to hate organizations to repressive regimes, the daylight is coming to shine on you and your beliefs.  If you can’t say it out loud and in public without losing support, money or legitimacy, know that your days are numbered.  I think, whether in months or years, the end is nigh, and your doom will come not from jackbooted troops, police SWAT teams or even intrepid reporters, but in the form of the individual with a conscience and the cheap, ubiquitous camera phone.

Disclaimer: The views expressed on this blog are mine alone, and do not represent the views, policies or positions of Cyveillance, Inc. or its parent, QinetiQ-North America.  I speak here only for myself and no postings made on this blog should be interpreted as communications by, for or on behalf of, Cyveillance (though I may occasionally plug the extremely cool work we do and the fascinating, if occasionally frightening, research we openly publish.)

Why Google is great, but not a complete solution, for Intel and Law Enforcement. (Part I)

I read Nick Selby’s piece on Police Led Intelligence this morning talking about more effective use of online search engines for police officers. Nick’s right – many in the Law Enforcement and Intelligence communities can do even more than they are by learning more about how search engines work, but there’s a second part to that story.

This has been on my mind a lot, especially since my company was acquired by a defense firm and I’ve been spending a lot more time with intelligence, law enforcement and other folks working in public health or safety.

Let me preface this by saying I’m not knocking the traditional search portals, they are extremely useful and powerful.

However, they do suffer from built-in biases, blind spots and restrictions that many analysts and law enforcement officers aren’t even aware are affecting them when they use a search engine.

By all means, use these fantastic tools to the fullest possible extent (see Nick’s comments and links to Google-hacking Jedi Master Johnny Long’s book and presentation).

Just understand that what they offer is not search of “the Internet” or “The Web”, they offer search of that portion of the online content world that is in their index, and then give you an even smaller slice of that.

With Johnny’s expansive help, and perhaps one or two easy tricks from yours truly, you can get to a decent portion of whatever they have. Just understand that what they have, what you can get to of what they have, and what they don’t have at all, is an important part of using them effectively.

The first three problems with traditional search engines

There is a wide range of built in biases, problems and blind spots with the traditional search engines of the world, and understanding them will make you a better user and consumer of the great things they CAN do for you.

Problem 1 – They only find things that don’t mind being found

Whether it’s a full Web site, a specific user’s blog on WordPress or a subset of pages within a much larger set of content, it is usually a 10-second exercise to hide content from search engines.

Search engines harvest pages by “crawling”, that is downloading a page from an address (i.e. a full path URL, e.g. http://host.domain.com/page.html), finding links on the page and “crawling” or following the link by requesting the linked page, finding more links on that page and sequentially requesting and indexing those pages ad infinitum.
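If it helps to see that loop written down, here is a minimal sketch in Python. It is nothing like a production crawler: the four-page “web” is a made-up dictionary standing in for the network, and the names (`LinkExtractor`, `crawl`, `FAKE_WEB`) are mine, not anything a real search engine uses. What it does show is the follow-the-links cycle described above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(start_url, fetch, max_pages=100):
    """Request a page, index it, queue its links, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    index = {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)          # a real crawler would do an HTTP GET here
        if html is None:
            continue
        index[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# A made-up four-page web, so the sketch runs without a network.
FAKE_WEB = {
    "http://host.example/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://host.example/a.html": '<a href="/">home</a>',
    "http://host.example/b.html": "no links here",
    "http://host.example/hidden.html": "nobody links to me",
}

indexed = crawl("http://host.example/", FAKE_WEB.get)
print(sorted(indexed))
```

Note what the output leaves out: `hidden.html` exists, but no page links to it, so the crawler never learns it is there. That is a blind spot before anybody even tries to hide.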

The problem? On arrival at a site, the first page a crawler or spider will often request is called “robots.txt”, which essentially says “If you are an automated requester, i.e. not a person, please follow these links or help yourself to these pages, or all pages, or the entire domain” or whatever they want. (Honoring the file is voluntary, but the major search engines’ crawlers do.) To permanently say “Google, Yahoo et al, you are not welcome to see anything on my entire Web site” requires the extremely sophisticated programming below:

User-agent: *
Disallow: /

In people speak, this says, “If you’re a robot (crawler, spider, automated agent etc.) not a human’s browser program, screw off.”

That’s it. Terrorists, child pornographers, anti-government radicals or hate groups who want to hide from easy detection need fewer than 25 characters to ensure that an FBI or ATF agent can Google ’til the second coming, and their Web site will never show up in a traditional search engine.
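You can watch a well-behaved crawler make that decision using the robots.txt parser that ships in Python’s standard library (`urllib.robotparser`). This is only a sketch of the polite-crawler side of the conversation; the `/private/` folder and the `build_rules` helper are my own inventions for the example, and nothing here stops a crawler that simply chooses to ignore the file.

```python
from urllib.robotparser import RobotFileParser

def build_rules(lines):
    """Parse robots.txt content supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

# The two-line "go away" file from above.
hide_everything = build_rules(["User-agent: *", "Disallow: /"])

# A hypothetical variant that hides only one folder.
hide_one_folder = build_rules(["User-agent: *", "Disallow: /private/"])

URL = "http://host.domain.com/page.html"
print(hide_everything.can_fetch("Googlebot", URL))   # False
print(hide_one_folder.can_fetch("Googlebot", URL))   # True
print(hide_one_folder.can_fetch(
    "Googlebot", "http://host.domain.com/private/notes.html"))  # False
```

The same check works for any crawler name you pass in, since `User-agent: *` applies to all of them.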

On pre-built blogging platforms like WordPress, where the user is assumed to have NO technical or programming knowledge at all, it’s even easier. You check a box when you sign up that says “Do you want your posts to show up in search engines?” Check No, and you’re hidden from GoogleBot and its Bing-y, Yahoo!-y etc. cousins.

Yay for stupid criminals who don’t know this, and by all means let’s use Johnny-Fu to find what IS in those search engines’ indexes. Just understand that what Google or Bing has and what’s actually out there are not synonymous.

Problem 2 – We have nine BILLION results for your query…

“…but we’ll only show you 658 of them.”

Here’s another little-known and ill-understood factoid/problem. (By the way, to make this problem easy to see, I recommend you go to preferences or settings on Google and set your account to “show 100 answers per page”.)

Type a query into Google that might have lots of results, e.g. Osama Bin Laden. Google will say something insane like “We have 44 million results for your query.”

Great. Insane as it sounds, suppose I had a room full of analysts and I actually wanted them to spend the next five years eyeballing every one of those results. Go to the bottom of the page where you see all the “O’s” in Goooooogle representing the next ten pages of results (at 100 per page). Click next three or four times….

Did you see ten pages of “next” shrink to 9 or 8 or 7? Click it some more. Without fail, around page 7 or 8, they cut you off. So Google has 44 million pages about Bin Laden. You can actually have about 658 of them.

Try it.

Problem 3 – Here are those 658 results….

Here’s the third of today’s often-unknown problems with search engines. By the way, there are six or seven more, but I’m already getting long-winded here.

The brilliant original insight that made Google a zillion-dollar success was a notion called PageRank (actually named for Larry Page, one of its four authors, and not for “Web Page” as many people think).

PageRank essentially codifies the notion of Vox Populi Vox Dei. See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” if you feel like seeing where 100 Billion Dollars started.

Let me save you reading a Stanford PhD’s worth of math in seven words:

“That which is most popular is best.”

Google, not you, decides which 658 of the 44,000,000 results you get, and it does so by ranking them using today’s version of PageRank.

PageRank, for all of its enormous evolution and complexity since the mid-90s, still says basically, the most popular answer is the best answer.

If you are looking for who actually has those paparazzi-snapped photos of Britney Spears in a bikini, you can be pretty sure that the page with the best, highest resolution versions of those pics will have the most inbound links, comments etc.

If what you’re looking for is likely to be unloved, unpopular, private or hidden in a dark digital corner of the Web, you’d best hope it has some extremely unusual or deterministic keyword in it if you want it to appear in the top slice of results for your query.
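If you would rather see “most popular is best” as a few lines of code, here is a minimal sketch of the textbook power-iteration form of PageRank. It is emphatically not what Google runs today. The page names and link graph are invented for illustration, and the 0.85 damping value is just the commonly cited default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three fans link to a popular page; nobody links to the crank.
TOY_WEB = {
    "fan_a": ["popular_page"],
    "fan_b": ["popular_page"],
    "fan_c": ["popular_page"],
    "popular_page": ["fan_a"],
    "crank_blog": ["popular_page"],
}

for page, score in sorted(pagerank(TOY_WEB).items(), key=lambda item: -item[1]):
    print(f"{page:13s} {score:.3f}")
```

The crank's blog ends up at the bottom of the list, not because it is wrong, but because nobody links to it. That is the whole problem in miniature.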

So to recap:

  1. There’s a lot of stuff that isn’t in Google or Yahoo and they’re easy to hide from.
  2. The stuff that is in there is only accessible up to about the 700th result.
  3. The 700 or so results you actually get to see are chosen by popularity. This is a terrible prioritization scheme if what you are looking for or care about is not likely to be, or meant to be, found or popular.

Well, that all kind of sucks. So what can I do about it?

The first of these is the hardest to address and (sorry) really needs to be the subject of another post.

As for the second two, which are related, there are some good things you can do to get the best of what is in the index to be found. Here, in no particular order, are my favorite three things:

1. Read the works from Obi Wan Kenobi (Johnny Long) on Google Hacking. I am but a Padawan apprentice and a poor one at that. If you need the distilled version, start with his Black Hat presentation, especially page 5.

2. Define the weirdest query you can. What I mean is, if Google has a zillion results and you can only have 600 of them, or if they have 600 and you want to read 20, not all of them, do not ask the most generic version of your question.

Let’s go back to the Bin Laden example (yes I know he’s dead, but for years it was all any govvie I talked to wanted to use as their example.)

If you had, a few months ago, typed in “Bin Laden whereabouts” or “Bin Laden location” you’d get millions of results. Here’s the crazy thing: While I don’t think it’s likely, is it possible, based on Problems #2 and #3, that there actually WAS a page saying “Hey, I heard a rumor OBL is living in Abbottabad. Anyone snooped around that big new house they built there yet?” Yep.

If that page/author/blog was, in the godlike wisdom of PageRank, considered an unpopular crank and therefore placed in just, say, position 902 out of five million results, you’d NEVER see it, even though it was sitting in there for years. Why? Results cut off at 856.

However, if the assumption was, “Well, the guy’s getting messages out somehow, and living someplace, probably pretty secure” then you have a much better way to query. How about this:

Bin Laden +

Compound +fence +security +rumor
messages +Pakistan +(Courier OR messenger)
Videotape +al Jazeera +(courier OR messenger)

Would these have led right to OBL? Maybe not, but at least you’re slicing the available index of material much more granularly and intelligently.

Intelligence is about pulling tenuous threads and connecting sometimes-nearly-invisible dots. Results to a query like this might surface a datum, or even just trigger a thought in a talented analyst’s mind, that could lead somewhere useful.
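If you have more than a handful of hunches, it can help to generate the query variants rather than type them. Here is a rough sketch that rebuilds the three example queries above from a list of hypotheses. The function name and data layout are my own, and the "+" and "OR" syntax simply mirrors the notation used in this post; check how your engine of choice treats those operators today.

```python
def compound_queries(subject, refinements):
    """Build one query string per hypothesis: subject plus required terms."""
    queries = []
    for terms in refinements:
        parts = [subject]
        for term in terms:
            if isinstance(term, tuple):
                # A tuple means "any one of these words will do".
                parts.append("+(" + " OR ".join(term) + ")")
            else:
                parts.append("+" + term)
        queries.append(" ".join(parts))
    return queries


# The three hypotheses from the example above.
REFINEMENTS = [
    ["compound", "fence", "security", "rumor"],
    ["messages", "Pakistan", ("courier", "messenger")],
    ["videotape", "al Jazeera", ("courier", "messenger")],
]

for query in compound_queries("Bin Laden", REFINEMENTS):
    print(query)
```

Each list is one analyst assumption written down as search terms, so adding a new hunch is one more line in the list.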

3. Randomly “split” your queries: Here’s a neat trick. Even if you don’t have additional terms like the cases above, you can do something I call splitting your queries. Stick in random words like “Bin Laden + Thursday” or “Bin Laden + Baseball”.

Bin Laden + Baseball? Seriously?

Twenty-eight million results.

Seriously.

Will this get you the 28 million results? Nope. Will it get you a different 658 than “Bin Laden + Thursday”? Yep.  And that’s how you can carve out a different slice of what Google has but normally wouldn’t show you.

Try months, days of the week and colors.

If you find it useful to keep going, stick in sports and the last twenty US Presidents.

You’d be surprised how different the results are.

Throw in some of the available Google restriction parameters (date, blogs vs. normal Web pages, file type) and you can suck ever bigger slices of what they have on ever more granular axes.
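The splitting trick is easy to automate. Below is a minimal sketch that generates the split queries and, optionally, crosses them with a file type restriction. It only builds query strings; it does not call any search engine. The word lists are arbitrary picks of mine, and the `merge_unique` helper assumes you have already collected result URLs some other way.

```python
from itertools import product

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
COLORS = ["red", "orange", "yellow", "green", "blue", "purple"]  # arbitrary picks


def split_queries(subject, words, filetypes=(None,)):
    """Pair the subject with each unrelated word, optionally once per file type."""
    queries = []
    for word, filetype in product(words, filetypes):
        query = f"{subject} + {word}"
        if filetype:
            query += f" filetype:{filetype}"
        queries.append(query)
    return queries


def merge_unique(result_lists):
    """Combine several result lists into one, keeping first-seen order, no repeats."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


for query in split_queries("Bin Laden", DAYS + COLORS)[:3]:
    print(query)
```

Thirteen words times two file types is already 26 different slices of the index, and the merged, de-duplicated pile is what you hand to the analysts.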

That’s probably more than anyone would want on the subject, but like I said, these are just three of about nine important limitations and attributes of traditional search engines that you should be aware of when using them for Intelligence or Law Enforcement.

I’ll try to post more later on the other six.

Disclaimer: The views expressed on this blog are mine alone, and do not represent the views, policies or positions of Cyveillance, Inc. or its parent, QinetiQ-North America.  I speak here only for myself and no postings made on this blog should be interpreted as communications by, for or on behalf of, Cyveillance (though I may occasionally plug the extremely cool work we do and the fascinating, if occasionally frightening, research we openly publish.)
