Last modified: 2014-03-02 19:13:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and preserved for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T64109, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 62109 - Add canonical namespaces and aliases to XML dumps
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Export/Import
Version: 1.23.0
Hardware/OS: All / All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: easy
Depends on:
Blocks: 62111

Reported: 2014-03-01 19:43 UTC by Aaron Halfaker
Modified: 2014-03-02 19:13 UTC (History)
CC: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Aaron Halfaker 2014-03-01 19:43:25 UTC
The XML dump contains a siteinfo header with a <namespaces> tag that is very useful for processing the text in the dumps.  It looks something like this:

<mediawiki ...snip... >
  <siteinfo>
    <sitename>Վիքիպեդիա</sitename>
    <base>http://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB</base>
    <generator>MediaWiki 1.23wmf15</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Մեդիա</namespace>
      <namespace key="-1" case="first-letter">Սպասարկող</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Քննարկում</namespace>
      <namespace key="2" case="first-letter">Մասնակից</namespace>

  ...snip...

    </namespaces>
  </siteinfo>

Regretfully, this header does not include canonical namespace names or namespace aliases.  However, an API request for "meta=siteinfo" does include these bits.  For example, the call for http://hy.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases returns the following XML:

<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>

  ...snip...

    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>

The XML dump should be updated to include this important metadata about namespaces.
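For illustration only, one possible shape for the enhanced header, reusing the attribute names from the API response above and only the namespace data already shown (the `canonical` attribute and the `<namespacealiases>`/`<namespacealias>` elements are hypothetical, not part of any implemented export schema):

```xml
<namespaces>
  <namespace key="-2" case="first-letter" canonical="Media">Մեդիա</namespace>
  <namespace key="-1" case="first-letter" canonical="Special">Սպասարկող</namespace>
  <namespace key="0" case="first-letter" />
  <namespace key="1" case="first-letter" canonical="Talk">Քննարկում</namespace>
  <namespace key="2" case="first-letter" canonical="User">Մասնակից</namespace>
</namespaces>
<namespacealiases>
  <namespacealias key="6">Image</namespacealias>
  <namespacealias key="7">Image talk</namespacealias>
</namespacealiases>
```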
Comment 1 Jesús Martínez Novo (Ciencia Al Poder) 2014-03-01 19:51:21 UTC
What would be the use case of having this information in the dump?
Comment 2 MZMcBride 2014-03-01 20:22:54 UTC
(In reply to Jesús Martínez Novo (Ciencia Al Poder) from comment #1)
> What would be the use case of having this information in the dump?

As I understand it, the XML dumps are targeted for offline use.

(In reply to Aaron Halfaker from comment #0)
> Regretfully, this header does not include canonical namespace names or
> namespace aliases.  However, an API request for "meta=siteinfo" does include
> these bits.

This sounds as though people trying to re-use the dumps need to go online to get this information. I think this is a perfectly reasonable enhancement request.

I'm marking this ticket with the "easy" keyword because it shouldn't be very difficult to add this additional information to the XML dumps. The most challenging part here is figuring out whether it's the PHP or the Python maintenance scripts that generate these particular dumps. The actual output logic can probably be cribbed from the MediaWiki API.
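Until the dump carries this metadata, a re-user has to fetch it online from the API and merge it into their dump processing. A minimal sketch of that step, in Python, using a trimmed, hard-coded sample of the JSON that `action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json` returns (the sample entries are illustrative, not a full response):

```python
import json

# Trimmed sample of the siteinfo JSON response; in the 2014-era API the
# localized namespace name is carried under the "*" key.
SAMPLE = json.loads("""
{
  "query": {
    "namespaces": {
      "0": {"id": 0, "case": "first-letter", "*": ""},
      "4": {"id": 4, "case": "first-letter", "canonical": "Project", "*": "Wikipedia"}
    },
    "namespacealiases": [
      {"id": 4, "*": "WP"}
    ]
  }
}
""")

def build_prefix_map(siteinfo):
    """Map every known namespace name (localized, canonical, alias) to its id."""
    prefixes = {}
    for ns in siteinfo["query"]["namespaces"].values():
        if ns["*"]:  # skip the unnamed main namespace
            prefixes[ns["*"].lower()] = ns["id"]
        if "canonical" in ns:
            prefixes[ns["canonical"].lower()] = ns["id"]
    for alias in siteinfo["query"]["namespacealiases"]:
        prefixes[alias["*"].lower()] = alias["id"]
    return prefixes

print(build_prefix_map(SAMPLE))  # {'wikipedia': 4, 'project': 4, 'wp': 4}
```

If the dump header included the same attributes, this map could be built entirely offline from the `<siteinfo>` block.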
Comment 3 Aaron Halfaker 2014-03-01 21:25:16 UTC
Re. use case,

One common activity when processing wiki dumps is to extract historical link information -- something that can't be done with pagelinks.  Let's say I'm processing an enwiki dump and I encounter the following link:

[[WP:Foo]]

Without knowing that "WP" is an alias of ns=4 ("Project"/"Wikipedia") I'd have to assume that "WP:Foo" is the title of an ns=0 article.  

This is a problem for canonical namespace names too.  The following link would reference the same page:

[[Project:Foo]]
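To make the use case concrete, here is a minimal sketch (in Python; the function and the sample prefix map are hypothetical, not code from any existing dump processor) of resolving a link target against a namespace/alias map, so that "WP:Foo" and "Project:Foo" both land in ns=4 while unknown prefixes fall through to ns=0:

```python
def resolve_title(title, prefixes):
    """Split a link target into (namespace id, page name).

    `prefixes` maps lowercased namespace names, canonical names, and
    aliases to namespace ids. Titles with an unrecognized prefix are
    treated as main-namespace (ns=0) pages, as a dump processor must
    do today without alias information.
    """
    if ":" in title:
        prefix, rest = title.split(":", 1)
        ns = prefixes.get(prefix.strip().lower())
        if ns is not None:
            return ns, rest.strip()
    return 0, title

prefixes = {"wp": 4, "project": 4, "wikipedia": 4}
print(resolve_title("WP:Foo", prefixes))       # (4, 'Foo')
print(resolve_title("Project:Foo", prefixes))  # (4, 'Foo')
print(resolve_title("Foo", prefixes))          # (0, 'Foo')
```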
Comment 4 Jesús Martínez Novo (Ciencia Al Poder) 2014-03-02 19:13:04 UTC
What processing are you talking about? Do you have any script that handles the dump, other than importDump.php?

And what about interwiki links? Would you assume that [[commons:Foo]] would be also a page in the main namespace?
