Unsurprisingly, the ever-innovative Google is conducting intensive research into improving image search. The web giant’s mission, “to organize the world’s information and make it universally accessible and useful,” requires that its computers be able to interpret and index images in addition to text. To date, as Mike Arrington explains, computers have not been very good at this:
Today when we talk about search all we really mean is text search. That’s sort of like only being able to see in one color. And when we search for image, video and audio content, the only data that search engines use to do those searches is the text that is associated with those files. That’s like trying to describe the color green when you can only see in red.
One approach to solving this dilemma is giving humans an incentive to label images themselves (see my earlier post on human computation). Luis von Ahn, the brain behind Google Image Labeler (an addictive game that pairs users together to attribute labels to images), says that all the images on the web could be sufficiently labeled in a short amount of time with a critical mass of participants; to drive home his point, he often references the millions of potentially productive hours that are wasted on Solitaire each year.
There are two major shortcomings to this approach. First, it is still completely text-based: what happens when a certain image is labeled in only one language, or when pranksters “Google bomb” image results (imagine every result for “miserable failure” being the face of George W. Bush)? The second major shortcoming is that untold numbers of new images are uploaded to the Internet every day. Flickr alone gets as many as one million new photos from its users every 24 hours. Is a human-centric approach to putting images in context sustainable? Google doesn’t think so, and so it is beefing up its computer-based image search strategy.
The Research
The research paper written by Google employees Yushi Jing and Shumeet Baluja is available here, but the gist is that by analyzing images for their visual qualities (rather than their textual context) and comparing them to each other, a computer is able to determine which images are most original, in the sense of being a sort of seed image. As they put it in their conclusion, “those images that capture the common themes from many of the other images are those that will have higher rank.” The example they use in the paper is illustrated below:
The two largest images of the Mona Lisa in the center would have the highest PageRank (Google’s proprietary system for determining importance or relevance) because they are the common denominator of the other images, so to speak. The other images are distortions of those two, or cropped, or slightly recolored, or have text added, etc. Because the middle two are more or less exact reproductions of the original work by da Vinci, they have the highest PageRank value.
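For the curious, here is a rough sketch of that idea in Python. It is not the code from the paper: the similarity scores are made up for illustration, and `visual_rank`, the damping value, and the image names are my own assumptions. It simply runs a PageRank-style power iteration over a table of how similar each image is to every other image, so the image that most resembles all the others ends up on top.

```python
def visual_rank(similarity, damping=0.85, iterations=50):
    """PageRank-style power iteration over an image-similarity matrix.

    similarity[i][j] is a non-negative score for how alike images i and j
    look. The scores used below are invented; a real system would compute
    them from the pixels.
    """
    n = len(similarity)
    # Total similarity in each column, used to turn scores into vote shares.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n  # start with every image equally important
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            # Each image j passes rank to i in proportion to their similarity.
            incoming = sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            new_rank.append((1 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


# Hypothetical scores: every variant resembles the original more than
# it resembles the other variants.
names = ["original", "cropped", "recolored", "captioned"]
similarity = [
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.3, 0.2],
    [0.8, 0.3, 0.0, 0.2],
    [0.7, 0.2, 0.2, 0.0],
]

ranks = visual_rank(similarity)
best = names[max(range(len(names)), key=lambda i: ranks[i])]
print(best, [round(r, 3) for r in ranks])
```

With these invented numbers the faithful reproduction comes out ahead, because it is the one image that all the distorted versions have in common.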
The Takeaway
Sweet, right? It’s definitely cool to see Google thinking in such an innovative way about how to make it easier to find images. But, as with all things technologically awesome these days, there’s a certain scare factor that comes with it. Ready?
1. In 2006, Google acquired Neven Vision. According to the announcement on the Official Google Blog, “Neven Vision comes to Google with deep technology and expertise around automatically extracting information from a photo. It could be as simple as detecting whether or not a photo contains a person, or, one day, as complex as recognizing people, places, and objects.”
2. Many of you have probably played with Google Street View, the function found in Google Maps that allows you to explore cities as if you’re driving through them in your car. Pretty cool, but it raises lots of privacy concerns (see this link from one of Taylor’s past Monday Morning Links to see some examples). Street View has recently been rolled into Google Earth as well, albeit a little clumsily in my experience.
Add #1 and #2 to this research paper, and you will probably get a little Big Brother vibe from Google. For fun, throw in my preflection on augmented reality; in it, I imagine a world where ads are triggered based on what we look at, and where Facebook profiles are automatically displayed when we look at a friend. The ability of a computer to read an image and, by extension, to read “real life” is exciting but also a little unsettling.
First image used under a Creative Commons license, courtesy of Flickr user feastoffools. Second image extracted from the Google research paper, via TechCrunch.

Even though Microsoft’s released software is often bogged down by bureaucracy, its research labs produce some unbelievably awesome applications. One of my favorites is a program that indexes images and stitches them together. You can zoom out to see the big picture, and it lets you zoom in by seamlessly selecting a closer photograph.
I can’t do it justice, but check it out here. Make sure you stick with it through the 3D integration with some satellite imagery. I rarely say this about any program, but it is beyond cool.
My HTML sucks a lot, apparently. There’s a video here (new window)