Last modified: 2013-05-24 08:12:42 UTC
I'm enhancing my dump indexing and extraction tools to work with multi-file dumps, as currently used only for English Wikipedia. One problem with developing these tools is the enormous size of the files. They take a lot of time and bandwidth to download and consume a lot of hard drive space. It also takes a long time for a tool to run over an entire dump of this size, which is a problem during testing and development. It would be great if we could have one of each type of multi-file XML dump provided specifically for testing purposes. They could be dumps of one of our actual smallest wikis, or dumps of a test wiki set up specifically for this purpose. Contents being in Latin script would be an advantage.
- One "pages-articles" multi-part dump, which has exactly 27 parts numbered 1-27.
- One "meta-history" multi-part dump, which has many more parts, with the numbers 1-27 occurring multiple times.
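To illustrate the part structure described above, here is a minimal sketch of how a tool might group multi-part dump files by part number. The filenames and the regex are assumptions modeled on the usual enwiki naming scheme (wiki-date-dumptype<part>.xml-p<start>p<end>.ext), not an official specification:

```python
import re
from collections import defaultdict

# Made-up example filenames: pages-articles has one file per part number,
# while meta-history may have several files sharing a part number
# (split by page-id range).
filenames = [
    "enwiki-20130403-pages-articles1.xml-p000000010p000010000.bz2",
    "enwiki-20130403-pages-articles27.xml-p027000000p027500000.bz2",
    "enwiki-20130403-pages-meta-history1.xml-p000000010p000002289.7z",
    "enwiki-20130403-pages-meta-history1.xml-p000002290p000004569.7z",
]

# Assumed pattern: <wiki>-<date>-<dumptype><part>.xml-p<start>p<end>.<ext>
PART_RE = re.compile(
    r"^(?P<wiki>\w+)-(?P<date>\d{8})-(?P<dumptype>[a-z-]+?)"
    r"(?P<part>\d+)\.xml-p(?P<start>\d+)p(?P<end>\d+)\."
)

def group_by_part(names):
    """Group dump filenames by (dump type, part number)."""
    groups = defaultdict(list)
    for name in names:
        m = PART_RE.match(name)
        if m:
            key = (m.group("dumptype"), int(m.group("part")))
            groups[key].append(name)
    return dict(groups)

groups = group_by_part(filenames)
```

With the sample names above, the pages-articles entries each map to a single file per part, while both meta-history files land under part 1 of "pages-meta-history".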
Bear in mind that having 27 pieces is not guaranteed; we'll likely move to more very soon (more cores!) and other large wikis will probably start being dumped with multiple processes (anywhere from 4 to 8 I imagine).
The xmldumps-test project [1] comes with some /really/ small wikis and known-good dumps. They're two-part rather than 27-part, but their file sizes are far more manageable. Maybe those suffice? Look, for example, at the dumps verified_dumps/prefetch_base and verified_dumps/prefetch.
[1] https://gerrit.wikimedia.org/r/#/admin/projects/operations/dumps/test
@Ariel: That's exactly the kind of reason it would be great to have test dumps so we have something real yet not unwieldy to base our dev work on instead of making assumptions.
I've put up a tarball of some experimental files based on tenwiki: relatively small, Latin script, but not 27 pieces. I hope 4 will do instead. They are produced from a local copy of the content rather than the live db, as I was doing testing of my own, but they should be OK for your purposes. See http://dumps.wikimedia.org/other/experimental/testfiles/
That does look useful, thank you.