Bug 71405 - Medium-sized image dump
Status: UNCONFIRMED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Priority: Unprioritized
Severity: normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on:
Blocks:
Reported: 2014-09-29 09:49 UTC by Régis Behmo
Modified: 2014-09-29 09:57 UTC
CC: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Régis Behmo 2014-09-29 09:49:01 UTC
Current image dumps include the full-resolution files; as such, it is very difficult to download a large number (> 10⁶) of images. For example, the 2013-11-29 image grab (the latest available from https://archive.org/details/wikimediacommons) is 68 GB but contains only ~22k media files, i.e. ~3 MB per file.

Images resized to 1024x800 typically use ~300 kB/image.
Images resized to 800x600  typically use ~200 kB/image.
Images resized to 500x375  typically use  ~90 kB/image.

It would be great if torrents were available for monthly or yearly dumps of resized media uploads.
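
As a rough illustration, here is a minimal Pillow-based sketch of how such medium-resolution derivatives could be generated; the directory names, target size and JPEG quality are placeholder assumptions, not an existing dump pipeline:

  import os
  from PIL import Image  # Pillow

  SRC_DIR = "originals"    # placeholder: directory holding full-resolution files
  DST_DIR = "resized-800"  # placeholder: output directory for resized copies
  MAX_SIZE = (800, 600)    # bounding box; aspect ratio is preserved

  os.makedirs(DST_DIR, exist_ok=True)
  for name in os.listdir(SRC_DIR):
      try:
          with Image.open(os.path.join(SRC_DIR, name)) as img:
              img.thumbnail(MAX_SIZE)  # downscale in place
              out = os.path.join(DST_DIR, os.path.splitext(name)[0] + ".jpg")
              img.convert("RGB").save(out, "JPEG", quality=85)
      except OSError:
          pass  # skip anything Pillow cannot decode (PDF, video, ...)

At ~200 kB per 800x600 JPEG, the ~22k files of the 2013-11-29 grab would shrink from 68 GB to roughly 4-5 GB.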

So why would anyone need that? I am currently working on an open-source, scalable implementation of an image search engine. Given a query image, the engine returns the most similar images in a database. This is useful for casual browsing of an image database, but also for copyright infringement or duplicate image detection. Now that the engine is ready (based on a published, state-of-the-art method: http://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/?update=1 ), I am looking for a large dataset (10⁶-10⁸ images) to provide a convincing demo.

For that demo, 10⁷ images at 500x375 would "just" require ~1 TB, which is a tractable torrent download.
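
The back-of-the-envelope arithmetic behind these figures, using the per-image averages quoted above:

  # Rough storage estimates for 10^7 images
  n_images = 10**7
  full_res = n_images * 3e6    # ~3 MB/file  -> ~30 TB
  thumbs   = n_images * 90e3   # ~90 kB/file -> ~0.9 TB
  print(full_res / 1e12, thumbs / 1e12)  # prints: 30.0 0.9 (TB)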

Of course, this dataset would be relevant not just to me, but to the computer vision community at large. Since category information is associated with each image, it would also be a great resource for image recognition and classification.

It should be noted that we are talking about a subset of all media files: e.g., not PDF, animated GIF, or video files.
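
A sketch of the kind of file filter this implies; the extension whitelist below is an assumption, not a definitive inclusion policy:

  import os

  # Assumed whitelist of still-image formats; PDF, (animated) GIF and video
  # files would be excluded from the dump.
  INCLUDE_EXT = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

  def is_dump_candidate(filename):
      return os.path.splitext(filename.lower())[1] in INCLUDE_EXT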
Comment 1 Antoine "hashar" Musso (WMF) 2014-09-29 09:57:56 UTC
I met Régis at my coworking space. I believe a dump of 800px image thumbnails could be useful to a wide research audience :)


