18 July 2007

Seeing the Power of the Visual Commons

I've written before about Microsoft's Photosynth, which draws on the Net's visual commons - Flickr, typically - to create three-dimensional images. Here's another research project that's just as cool - and just as good a demonstration of why every contribution to a commons enriches us all:

"What can you do with a million images? In this paper we present a new image completion algorithm powered by a huge database of photographs gathered from the Web. The algorithm patches up holes in images by finding similar image regions in the database that are not only seamless but also semantically valid. Our chief insight is that while the space of images is effectively infinite, the space of semantically differentiable scenes is actually not that large. For many image completion tasks we are able to find similar scenes which contain image fragments that will convincingly complete the image. Our algorithm is entirely data-driven, requiring no annotations or labelling by the user."
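For the curious, the "finding similar scenes" step in that abstract can be pictured as a nearest-neighbour search over image descriptors. What follows is only a rough sketch under my own assumptions, not the paper's method: the descriptors are made-up tuples of numbers, the names `scene_distance` and `find_similar_scenes` are hypothetical, and the seamless blending of the matched fragment into the hole is left out entirely.

```python
import math


def scene_distance(a, b):
    """Euclidean distance between two scene descriptors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_similar_scenes(query, database, k=3):
    """Return the k (name, descriptor) entries closest to the query, nearest first.

    `database` is a list of (name, descriptor) pairs. Both the descriptor
    format and the distance metric are assumptions made for illustration.
    """
    ranked = sorted(database, key=lambda item: scene_distance(query, item[1]))
    return ranked[:k]


# Toy data: three "photos" described by two invented numbers each.
photos = [
    ("harbour", (3.0, 4.0)),
    ("beach", (1.0, 0.0)),
    ("meadow", (0.0, 2.0)),
]
matches = find_similar_scenes((0.0, 0.0), photos, k=2)
print([name for name, _ in matches])
```

The sketch also hints at why size matters: with more photos in `photos`, the nearest match can only get closer to the query, never further away.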

One of the most interesting discoveries was the following:

"It takes a large amount of data for our method to succeed. We saw dramatic improvement when moving from ten thousand to two million images. But two million is still a tiny fraction of the high quality photographs available on sites like Picasa or Flickr (which has approximately 500 million photos). The number of photos on the entire Internet is surely orders of magnitude larger still. Therefore, our approach would be an attractive web-based application. A user would submit an incomplete photo and a remote service would search a massive database, in parallel, and return results."

In other words, the bigger the commons, the more everyone benefits.


"Beyond the particular graphics application, the deeper question for all appearance-based data-driven methods is this: would it be possible to ever have enough data to represent the entire visual world? Clearly, attempting to gather all possible images of the world is a futile task, but what about collecting the set of all semantically differentiable scenes? That is, given any input image can we find a scene that is “similar enough” under some metric? The truly exciting (and surprising!) result of our work is that not only does it seem possible, but the number of required images might not be astronomically large. This paper, along with work by Torralba et al. [2007], suggest the feasibility of sampling from the entire space of scenes as a way of exhaustively modelling our visual world."

But that is only feasible if that "space of scenes" is a commons. (BTW, do check out the paper's sample images - they're amazing.)
