Last modified: 2014-09-29 09:57:56 UTC
Current image dumps include the full-resolution files, which makes it very difficult to download a large (> 10⁶) number of images. For example, the 2013-11-29 image grab (the latest available from https://archive.org/details/wikimediacommons) is 68 GB but contains only ~22k media files, i.e. ~3 MB per file. By comparison:

- images resized to 1024x800 typically use ~300 kB each;
- images resized to 800x600 typically use ~200 kB each;
- images resized to 500x375 typically use ~90 kB each.

It would be great if torrents were available for monthly or yearly dumps of resized media uploads.

Why would anyone need that? I am currently working on an open-source, scalable implementation of an image search engine. Given a query image, the engine returns the most similar images in a database. This is useful for casual browsing of an image collection, but also for detecting copyright infringement and duplicate images. Now that the engine is ready (based on a published, state-of-the-art method: http://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/?update=1 ), I am looking for a large (10⁶-10⁸ images) dataset to build a convincing demo. For that demo, 10⁷ images at 500x375 would require "just" ~1 TB, which is a tractable torrent download.

Of course, such a dataset would be relevant not just to me but to the computer vision community at large. Since category information is associated with each image, it would constitute a great resource for image recognition and classification research. Note that we are talking about a subset of all media files: e.g., not PDFs, animated GIFs, or video files.
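The size estimates above can be reproduced with a quick back-of-the-envelope calculation. This is just a sketch using the per-image averages quoted in the text (assumed typical values, not measurements), with decimal units (1 TB = 10⁹ kB):

```python
# Per-image average sizes (kB) at each resolution, as quoted above.
SIZES_KB = {
    "1024x800": 300,
    "800x600": 200,
    "500x375": 90,
}

def dataset_size_tb(n_images, resolution):
    """Estimated total dataset size in TB for n_images at the given resolution."""
    total_kb = n_images * SIZES_KB[resolution]
    return total_kb / 1e9  # kB -> TB (decimal units)

# 10^7 images at 500x375: ~0.9 TB, matching the "~1 TB" figure in the text.
print(round(dataset_size_tb(10**7, "500x375"), 2))
```

The same function shows why full-resolution files are impractical: at ~3 MB per file, the same 10⁷ images would be roughly 30 TB.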
I have met Régis at my coworking space. I believe a dump of 800px image thumbnails could be useful to a wide research audience :)