Last modified: 2013-05-24 08:12:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T48912, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46912 - Provide one each small multi-part "pages-articles" and "meta-history" dump for testing purposes.
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on:
Blocks:

Reported: 2013-04-05 02:19 UTC by Andrew Dunbar
Modified: 2013-05-24 08:12 UTC
CC: 2 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Andrew Dunbar 2013-04-05 02:19:53 UTC
I'm enhancing my dump indexing and extraction tools to work with multi-file dumps, as currently used only for the English Wikipedia.

One problem with developing these tools is the enormous size of the files. They take a lot of time and bandwidth to download and consume a lot of hard drive space. It also takes a long time for a tool to run over an entire dump of this size, which is a problem during testing and development.

It would be great if we could have one of each type of multi-file XML dump provided specifically for testing purposes.

They could be dumps of one of our actual smallest wikis or they could be dumps of a test wiki set up specifically for this purpose. Contents being in Latin script would be an advantage.

- One "pages-articles" multi-part dump, which has exactly 27 parts numbered 1-27.
- One "meta-history" multi-part dump, which has many more parts with the numbers 1-27 occurring multiple times.
Comment 1 Ariel T. Glenn 2013-04-10 09:27:42 UTC
Bear in mind that having 27 pieces is not guaranteed; we'll likely move to more very soon (more cores!), and other large wikis will probably start being dumped with multiple processes (anywhere from 4 to 8, I imagine).
Comment 2 christian 2013-04-10 09:56:47 UTC
The xmldumps-test project [1] comes with some /really/ small wikis and
known-good dumps. They're not 27-part but 2-part, and they're much
more manageable in file size. Maybe those suffice?

Look for example at the dumps:
  verified_dumps/prefetch_base
  verified_dumps/prefetch

[1] https://gerrit.wikimedia.org/r/#/admin/projects/operations/dumps/test
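
A minimal consumer that could be pointed at one of those small known-good parts (or at any single pages-articles part) might simply stream the XML and count pages. This is only a sketch; whether a given part is bz2-compressed or plain XML is an assumption handled by the file extension check below.

  # Sketch only: count <page> elements in a single dump part,
  # e.g. a file from verified_dumps/prefetch in xmldumps-test.
  import bz2
  import sys
  import xml.etree.ElementTree as ET

  def count_pages(path):
      opener = bz2.open if path.endswith('.bz2') else open
      pages = 0
      with opener(path, 'rb') as stream:
          for _, elem in ET.iterparse(stream):
              # Compare the local tag name so an XML namespace,
              # if present, does not matter.
              if elem.tag.rsplit('}', 1)[-1] == 'page':
                  pages += 1
                  elem.clear()  # keep memory bounded on big parts
      return pages

  if __name__ == '__main__':
      print(count_pages(sys.argv[1]))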
Comment 3 Andrew Dunbar 2013-04-14 01:17:29 UTC
@Ariel: That's exactly the kind of reason it would be great to have test dumps: something real yet not unwieldy to base our dev work on, instead of making assumptions.
Comment 4 Ariel T. Glenn 2013-05-23 12:13:51 UTC
I've put up a tarball of some experimental files based on tenwiki: relatively small, Latin script, but not 27 pieces. Hope 4 will do instead. They are produced from a local copy of the content rather than the live DB, as I was doing testing of my own, but they should be OK for your purposes.

See http://dumps.wikimedia.org/other/experimental/testfiles/
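
To try those files out, one could fetch and unpack the tarball along these lines. The tarball name below is hypothetical; check the directory listing at the URL above for the actual file name.

  # Sketch only: download and unpack the experimental test files.
  # TARBALL is a hypothetical name; the real one has to be read
  # off the directory listing at dumps.wikimedia.org first.
  import tarfile
  import urllib.request

  BASE = 'http://dumps.wikimedia.org/other/experimental/testfiles/'
  TARBALL = 'tenwiki-testfiles.tar.gz'  # hypothetical file name

  def fetch_and_unpack(dest='testfiles'):
      local, _headers = urllib.request.urlretrieve(BASE + TARBALL, TARBALL)
      with tarfile.open(local) as tar:
          tar.extractall(dest)

  if __name__ == '__main__':
      fetch_and_unpack()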
Comment 5 Andrew Dunbar 2013-05-24 08:11:24 UTC
That does look useful, thank you.


