Smart image and video search
Recently, I was searching the Web for images of spiders to help my 10-year-old son on a science project about arachnids. I used a number of different image search facilities on the Internet: Google, MSN, PicSearch, Ask, DogPile and Lycos. I received mixed responses to my query for images of spiders.
As some of the search results in Figure 1 on page 8 show, the search engines almost universally insisted that a spider was more than an arachnid—it was a car, a flow chart describing Web crawling software, and a species of primate … among other things. Of course, the computer was correct, because as far as it was concerned, a spider was all of those things and more. But I was interested in only one kind of spider, the eight-legged arachnid, and, without my guidance, those search engines would have missed the boat.
I asked myself why a two-year-old child can easily recognize a spider, while sophisticated and mature search engines such as Google, MSN, Lycos and others can't figure it out. It turns out that humans are extremely good at visual pattern recognition, and many cognitive scientists believe that our visual abilities may be what led us to cognizance in the first place.
Web search technologies are very good at understanding and identifying text. The problem arises in searches for images, because images are NOT text. As anyone who regularly works with computers can attest, recognizing what is in an image is not a computer’s strong suit—or at least not yet. There is very little cognizance, understanding or automatic context available in the Web-based image search tools you and I use every day.
It’s all about metadata
Today’s search engines employ image and video search systems that require textual metadata tags in order to describe the contents of an image or video clip. Those contents include not only the date, time, title, photographer, copyright and other mundane everyday metadata, but also contextual metadata describing what the picture is actually about. For example, YouTube, AOL Pictures (http://pictures.aol.com/), Live Leak and Yahoo’s flickr all provide multiple ways for users to upload images with additional descriptive textual information. As seen in Figure 2 on page 8, each of those upload methods asks the user to enter descriptive tags and information describing the content and its context.
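To make the idea concrete, here is a minimal sketch of what such a metadata record might look like and how a text-based engine would match a query against it. The field names and the matches_query helper are hypothetical, invented for illustration; they do not correspond to any particular site's API.

```python
# Hypothetical sketch of the textual metadata an upload form collects
# alongside an image. Field names are invented for illustration only.
image_metadata = {
    # "mundane" metadata, often filled in automatically (e.g., from EXIF)
    "date": "2007-03-14",
    "title": "Garden orb weaver",
    "photographer": "J. Smith",
    "copyright": "(c) 2007 J. Smith",
    # contextual metadata -- this is what the user must type in by hand,
    # and what the search engine ultimately matches queries against
    "tags": ["spider", "arachnid", "web", "garden", "macro"],
    "description": "An orb-weaver spider repairing its web after a storm.",
}

def matches_query(metadata, query):
    """Return True if the query word appears in the tags or the description."""
    query = query.lower()
    return query in (t.lower() for t in metadata["tags"]) \
        or query in metadata["description"].lower()

print(matches_query(image_metadata, "spider"))  # True
print(matches_query(image_metadata, "car"))     # False
```

Without that hand-entered "tags" and "description" text, the engine has nothing to match against, which is exactly the problem.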
Image and video content upload systems rely on simple things like asking the user to file the image under a category, or to click a set of checkboxes of descriptive tags, or to type in a one-sentence description that can later be automatically parsed by the system to generate detailed metadata. However, even those minimal requests for metadata can overburden users when they are uploading megabytes of image and video information.
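As a rough illustration of that "automatically parsed" step, here is a hedged sketch of how a one-sentence description might be boiled down to candidate tags. Real upload systems use far richer natural-language processing; the stop-word list and function name here are stand-ins, not any vendor's actual pipeline.

```python
# Hypothetical sketch: reduce a free-text, one-sentence description to a
# set of candidate metadata tags by stripping punctuation and stop words.
import string

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "its", "after", "is", "and"}

def description_to_tags(description):
    words = description.lower().translate(
        str.maketrans("", "", string.punctuation)
    ).split()
    return sorted({w for w in words if w not in STOP_WORDS})

print(description_to_tags("An orb-weaver spider repairing its web after a storm."))
# ['orbweaver', 'repairing', 'spider', 'storm', 'web']
```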
Nevertheless, useful, descriptive metadata is a standard requirement for almost all search engines. As a rule, the contextual metadata humans need for images is available in most search systems only when somebody has manually entered it. In other words, few systems available today can understand what an image is about without manually entered metadata. Innovative approaches and new technologies are emerging across the board to improve this process.
For example, Google has come up with an interesting approach to getting more descriptive information about images called Google Image Labeler. It is described as a way to “label random images to help improve the quality of Google’s image search results.” This somewhat game-like tool automatically pairs you with another online user to have you (and your partner) add as many labels to as many images as possible in a 90-second period. The more images and labels you add, the more points you get. The site lists that day’s and the all-time top-point winners. The points don’t earn you anything except the satisfaction that you are helping Google deliver more relevant search results.
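The mechanics behind that pairing are easy to picture in code. The sketch below, with invented names and scoring throughout, shows the essential idea of a Labeler-style round: only the labels that both partners suggest independently are kept as trusted metadata, which is what makes the paired, time-boxed format valuable.

```python
# Hypothetical sketch of the pairing idea behind a Labeler-style game:
# two players label the same image independently, and only the labels
# they agree on are kept. Scoring rules are invented for illustration.
def score_round(labels_player_a, labels_player_b, points_per_match=100):
    """Return (agreed_labels, points) for one round on one image."""
    agreed = {l.lower() for l in labels_player_a} & {l.lower() for l in labels_player_b}
    return agreed, len(agreed) * points_per_match

agreed, points = score_round(
    ["spider", "web", "garden", "macro"],
    ["Spider", "insect", "web"],
)
print(agreed)  # {'spider', 'web'}  (set order may vary)
print(points)  # 200
```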
If you are like me, you may find it perplexing to have to jump through hoops and even to play games to get better metadata for images and video. Why do we have to manually enter that information? Certainly, computers are smart enough to be able to “automatically see what is in an image” just like we do … aren’t they? It turns out that, on many fronts, we are headed toward a world of automatically created context, in which the computer precisely understands a query and recognizes what it is indexing without users having to intervene.
Enter Web 2.0, the world of automatic context
Every day, I read something new about Web 2.0 and how important it is to the future of Internet computing. Though definitions of Web 2.0 abound, I have finally decided that Web 2.0 means a smarter and more automatically context-aware World Wide Web. In my mind, Web 2.0 means that when I search for a term like “spider,” the search engine will seek context first and content second. It will still have a list of items that are more than just arachnids. However, the Web 2.0 search engine will ask me for clarification rather than relying on me to seek that clarification myself.
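One way to picture "context first, content second" is a simple disambiguation step: before fetching results, the engine checks whether a query term has several known senses and, if so, asks the user which one is meant. The sense inventory and function below are assumptions for illustration, not how any real engine is built.

```python
# Hypothetical sketch of a context-first search step: if a query term is
# ambiguous, ask the user for clarification before returning content.
# The sense inventory below is invented for illustration.
WORD_SENSES = {
    "spider": ["arachnid", "sports car", "Web-crawling software", "spider monkey"],
}

def search(query, ask_user):
    senses = WORD_SENSES.get(query.lower(), [])
    if len(senses) > 1:
        # context first: clarify which sense the user means
        chosen = ask_user(f"'{query}' can mean several things {senses}. Which one?")
    else:
        chosen = senses[0] if senses else query
    # content second: the engine can now filter results to the chosen sense
    return f"searching images of '{query}' in the sense of '{chosen}'"

print(search("spider", ask_user=lambda prompt: "arachnid"))
# searching images of 'spider' in the sense of 'arachnid'
```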
To achieve those kinds of capabilities, software engineers are increasingly turning to models of the human brain. Modeling the human brain to improve not only healthcare but also computer systems and human productivity makes a lot of sense—after all, why shouldn’t we go with what works after several million years of evolution? The key is that we can take what works well in the human brain and make it even better using our faster and more efficient computer technology.