
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. Logging in is not possible, and apart from displaying bug reports and their history, links may be broken. See T47974, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45974 - Publish a metadata file for each multipart dump
Status: RESOLVED WORKSFORME
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on:
Blocks:
Reported: 2013-03-11 03:07 UTC by Andrew Dunbar
Modified: 2013-03-12 08:18 UTC
CC List: 1 user

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Andrew Dunbar 2013-03-11 03:07:17 UTC
Currently there is no way to programmatically determine the names of all the parts of multipart dump files.

As far as I know only the English Wikipedia currently employs multipart dump files.

Most such dumps are split into exactly 27 parts with names in the following format:

enwiki-20130204-pages-meta-current1.xml-p000000010p000010000.bz2

enwiki-20130204-pages-articles27.xml-p029625017p038424363.bz2

If we assume that there will only ever be exactly 27 parts to each such dump, we can still only predetermine the part of the dump name before the .xml suffix. We still have no way to know the part between the .xml suffix and the .bz2 suffix.

But then we have the full history dumps, for which each of the 27 parts is itself split into further parts. Examples:

enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2
enwiki-20130204-pages-meta-history1.xml-p000002142p000004315.bz2
enwiki-20130204-pages-meta-history1.xml-p000004318p000005912.bz2
enwiki-20130204-pages-meta-history1.xml-p000005913p000008179.bz2
enwiki-20130204-pages-meta-history1.xml-p000008180p000009875.bz2
enwiki-20130204-pages-meta-history1.xml-p000009877p000010000.bz2
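
For reference, the naming pattern seen in all of the examples above could be captured with a regular expression along the following lines. This is only a sketch of the observed pattern, not an officially documented format, and the group names are my own:

import re

# Matches e.g. enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2
# Groups: wiki, date, job, part number, and page-id range (the part number and
# page-id range are absent for dumps that are not split).
DUMP_NAME = re.compile(
    r"^(?P<wiki>[a-z_]+)-(?P<date>\d{8})-(?P<job>[a-z-]+?)"
    r"(?P<part>\d+)?\.xml"
    r"(?:-p(?P<first>\d+)p(?P<last>\d+))?"
    r"\.(?P<ext>gz|bz2|7z)$"
)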

The only way currently to automate the process of downloading all the parts of the dumps relies on parsing the HTML pages about the dumps, such as http://dumps.wikimedia.org/enwiki/20130204/

But this is not officially supported, and if we were to make it so, we would have to officially standardize the HTML format of those pages and ensure that it doesn't change.

It seems a much more stable and future-proof option would be to come up with some simple XML or other text-format file for each multipart dump, listing at least the full file name of each part, though it's conceivable that other helpful info could also be included.
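
Purely as an illustration of the idea (the element and attribute names here are hypothetical, not an existing or proposed standard), such a metadata file might look something like:

<!-- hypothetical example only; no such file is currently published -->
<multipart-dump wiki="enwiki" date="20130204" job="pages-meta-history" format="bz2">
  <part file="enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2"
        number="1" first-page-id="10" last-page-id="2141"/>
  <part file="enwiki-20130204-pages-meta-history1.xml-p000002142p000004315.bz2"
        number="1" first-page-id="2142" last-page-id="4315"/>
  <!-- ... one entry per file ... -->
</multipart-dump>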
Comment 1 Ariel T. Glenn 2013-03-11 17:27:16 UTC
Instead of parsing the XML, it would be better if you download the file of md5 sums (which you will want anyway to verify the files just downloaded). In the above example this would be at
http://dumps.wikimedia.org/enwiki/20130204/enwiki-20130204-md5sums.txt
The format is pretty boring and therefore good for machines: md5sum, space, filename. That format is not expected to change anytime soon, and if it were to change I am sure there would be a giant discussion about it on the various lists.

Assuming that you know which type of file you want (pages-meta-history, stub-articles, etc.) you can check for the existence in the md5 file of enwiki-date-filestring.xml.{gz,bz2,7z} and grab the compressed file of your choice if it's there. Otherwise look for enwiki-date-filestring[0-9]+.xml.{gz,bz2,7z} and get those; if you don't see those, look for enwiki-date-filestring[0-9]+.xml*{gz,bz2,7z} and get those instead.
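
A rough sketch of that selection logic in Python (the URL layout follows the example above; the function name and the exact patterns are assumptions for illustration, not an officially supported interface):

import re
import urllib.request

def list_dump_parts(wiki, date, job):
    # Fetch the md5sums file and keep just the filenames ("md5sum filename" per line).
    url = "http://dumps.wikimedia.org/%s/%s/%s-%s-md5sums.txt" % (wiki, date, wiki, date)
    with urllib.request.urlopen(url) as response:
        lines = response.read().decode("utf-8").splitlines()
    names = [line.split()[1] for line in lines if line.strip()]

    prefix = re.escape("%s-%s-%s" % (wiki, date, job))
    # Try the patterns in the order described above: single file, then numbered
    # parts, then numbered parts further split by page-id range.
    patterns = [
        re.compile(prefix + r"\.xml\.(gz|bz2|7z)$"),
        re.compile(prefix + r"[0-9]+\.xml\.(gz|bz2|7z)$"),
        re.compile(prefix + r"[0-9]+\.xml.*\.(gz|bz2|7z)$"),
    ]
    for pattern in patterns:
        matches = [n for n in names if pattern.match(n)]
        if matches:
            return matches
    return []

# e.g. list_dump_parts("enwiki", "20130204", "pages-meta-history")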

I think there are tools out there already for scripted download, you might poke folks on the xmldatadumps-l list about that.

As an aside, it's quite likely that we will go to multipart soon for a few of the other large projects, since they take so long to complete when run as one single job.
Comment 2 Andrew Dunbar 2013-03-12 08:00:54 UTC
Thanks. The md5sums file does look like a good enough metadata file for these purposes.
