Last modified: 2013-07-25 10:31:45 UTC
The monuments database should contain a table with all the images at Commons which have a valid identifier. The code to extract the identifiers is already available in the unused images bot (https://fisheye.toolserver.org/browse/erfgoed/erfgoedbot/unused_monument_images.py?hb=true). The bot should get valid template/tracker categories from the configuration (https://fisheye.toolserver.org/browse/erfgoed/erfgoedbot/monuments_config.py?hb=true) Loop over these and for each source get all the images + metadata.
The database should at least contain: * Filename * Monuments ID as in templates * Uploader * Upload date * Is there {{Wiki Loves Monuments yyyy}} and year Optional: * coordinates * categories * image resolution * file size
Probably best to start with a minimal implementation where wikitext parsing is not needed. Existing tools can be converted to make use of this table. Later this information can be extended.
Did a first implementation. CREATE TABLE `image` ( `country` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '', `id` varchar(25) NOT NULL DEFAULT '0', `img_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '', PRIMARY KEY (`country`,`id`,`img_name`), KEY `country_id` (`country`,`id`), KEY `img_name` (`img_name`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 mysql> SELECT COUNT(*) FROM image; +----------+ | COUNT(*) | +----------+ | 878926 | +----------+ 1 row in set (0.00 sec) Playing around with api at http://toolserver.org/~multichill/monapi/api.php?action=images&country=pl&id=MA/A-1028&format=json&width=2000&limit=999999
Great start, thx Maarten! Beside the additional wished fields I noticed some errors, most should be no problem for you ;) * IDs are filled with zeroes (28985 appears as 00028985) * IDs got uppercase (ArD-9-006 appears as ARD-9-006) * some IDs are the uppercase image name (ie 'WIGANDG 29.JPG') * Pictures with more than one ID-template are just once in the table, but should be as often as there are templates (ie http://commons.wikimedia.org/wiki/File:Murtalbahnbr%C3%BCcke_1.JPG)
I think I've fixed this zero problem already, just haven't updated the database yet. Same for the lowercase/uppercase thing. I'm struggling a bit with how to do the padding on Commons. Currently it's padded with '0', but this is causing problems in the USA. The NRHP uses the first two characters for the year so all the nominations in 2000 (00xxxx) get their two zero's chopped off. I use the categorylinks table for the id's. Multiple templates adding the same category just gives one entry, so that's what I'm using.