Last modified: 2014-11-17 10:36:45 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T38432, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 36432 - Normalize titles and namespaces
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: Highest enhancement with 1 vote
Target Milestone: ---
Assigned To: Wikidata bugs
storypoints: 5
Keywords: testme
Depends on:
Blocks: 36986
Reported: 2012-05-02 12:46 UTC by denny vrandecic
Modified: 2014-11-17 10:36 UTC
CC List: 5 users

Description denny vrandecic 2012-05-02 12:46:21 UTC
On all accesses to Wikidata we need to normalize the page title of the client, especially including the namespace. This needs to be done when querying Wikidata but also when storing. Also consider the constraints and their rationales in the secondary storage system.
Comment 1 jeblad 2012-05-03 23:06:51 UTC
The normalization done in the API during lookup should probably be reported somehow. Also note that there are both normalization and redirects in the present system.

Normalization is always done, while redirects are something the normal API must be told to follow.

For example, look at the URL for "_noreg" at no.wp (http://no.wikipedia.org/w/api.php?action=query&prop=info&titles=_noreg&format=jsonfm&redirects) and what it reports back. It reports both a normalization of "_noreg" into "Noreg" and then a redirect from "Noreg" to "Norge" (the first form is Nynorsk, the last is Bokmål).

Actual output is:
{
  "query": {
    "normalized": [
      {
        "from": "_noreg",
        "to": "Noreg"
      }
    ],
    "redirects": [
      {
        "from": "Noreg",
        "to": "Norge"
      }
    ],
    "pages": {
      "728": {
        "pageid": 728,
        "ns": 0,
        "title": "Norge",
        "touched": "2012-05-03T00:05:03Z",
        "lastrevid": 10449638,
        "counter": "",
        "length": 59329
      }
    }
  }
}
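
A minimal sketch of how a client could walk the two mappings in a response like the one above. The helper name resolveTitle and the shape of its return value are hypothetical; it only assumes the "normalized" and "redirects" arrays shown in the output, either of which may be absent.

```javascript
// Hypothetical helper: follow the "normalized" mapping and then the
// "redirects" mapping of an action=query response for one requested title.
function resolveTitle(response, requested) {
  const query = (response && response.query) || {};
  const steps = [];
  let title = requested;

  // Normalization is applied at most once, before any redirect.
  for (const entry of query.normalized || []) {
    if (entry.from === title) {
      steps.push({ type: "normalized", from: entry.from, to: entry.to });
      title = entry.to;
      break;
    }
  }

  // Redirect entries only exist if the request asked for them.
  // The seen-set guards against loops in malformed input.
  const seen = new Set([title]);
  let moved = true;
  while (moved) {
    moved = false;
    for (const entry of query.redirects || []) {
      if (entry.from === title && !seen.has(entry.to)) {
        steps.push({ type: "redirect", from: entry.from, to: entry.to });
        title = entry.to;
        seen.add(title);
        moved = true;
        break;
      }
    }
  }
  return { title, steps };
}

// The no.wp response quoted above, reduced to the two mappings.
const noregResponse = {
  query: {
    normalized: [{ from: "_noreg", to: "Noreg" }],
    redirects: [{ from: "Noreg", to: "Norge" }]
  }
};
const resolved = resolveTitle(noregResponse, "_noreg");
console.log(resolved.title, resolved.steps.length);
```

Keeping the individual steps, rather than only the final title, is what would let the caller report the normalization separately from the redirect.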

Also note the difference for "WP:T". At no.wp (http://no.wikipedia.org/w/api.php?action=query&prop=info&titles=WP:T&format=jsonfm&redirects) it is a normal redirect.

Actual output is:
{
  "query": {
    "redirects": [
      {
        "from": "WP:T",
        "to": "Wikipedia:Tinget"
      }
    ],
    "pages": {
      "1230": {
        "pageid": 1230,
        "ns": 4,
        "title": "Wikipedia:Tinget",
        "touched": "2012-05-03T21:15:50Z",
        "lastrevid": 10454551,
        "counter": "",
        "length": 126570
      }
    }
  }
}

Then consider the same lookup at en.wp (http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=WP:T&format=jsonfm&redirects), which involves a namespace alias.

Actual output is:
{
  "query": {
    "normalized": [
      {
        "from": "WP:T",
        "to": "Wikipedia:T"
      }
    ],
    "redirects": [
      {
        "from": "Wikipedia:T",
        "to": "Wikipedia:Tutorial"
      }
    ],
    "pages": {
      "497846": {
        "pageid": 497846,
        "ns": 4,
        "title": "Wikipedia:Tutorial",
        "touched": "2012-05-02T14:28:59Z",
        "lastrevid": 484946903,
        "counter": "",
        "length": 4224
      }
    }
  }
}

If the requested site-title pair is not equal to the site-title pair in the found item, the normalized form should be reported in the API.

Not all wikis have the same set of namespace aliases, and the same namespace can have different names in different languages. There are also "canonical names" for the namespaces, which are reported and available to the browser. For example, a page in the "Bruker" ("User") namespace at no.wp will have the following definitions:
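
As a rough illustration of that rule, the comparison could look like the sketch below. The function name describeNormalization, the "site" and "title" field names, and the shape of the returned report are all assumptions made for this example, not the actual Wikidata API output.

```javascript
// Hypothetical helper: compare the requested site-title pair with the pair
// stored in the item that was found. Returns null when they are identical,
// otherwise an object describing what was requested and what was found.
function describeNormalization(requested, found) {
  const same =
    requested.site === found.site && requested.title === found.title;
  if (same) {
    return null; // nothing to report
  }
  return {
    from: { site: requested.site, title: requested.title },
    to: { site: found.site, title: found.title }
  };
}

// The en.wp namespace-alias case quoted above.
const report = describeNormalization(
  { site: "enwiki", title: "WP:T" },
  { site: "enwiki", title: "Wikipedia:T" }
);
console.log(JSON.stringify(report));
```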


mw.config.set({
  "wgCanonicalNamespace":"User",
  "wgCanonicalSpecialPageName":false,
  "wgNamespaceNumber":2,
  "wgPageName":"Bruker:John_Erling_Blad_(WMDE)",
  "wgTitle":"John Erling Blad (WMDE)",
  "wgCurRevisionId":0,
  "wgArticleId":0,
  "wgIsArticle":true,
  "wgAction":"view",
  "wgUserName":"John Erling Blad (WMDE)",
...
  "wgRelevantPageName":"Bruker:John_Erling_Blad_(WMDE)",
...
});

There are several values in there that can be interesting, but those are the ones I usually use.

Note also what happens if you follow "WP:T" at en.wp (en.wikipedia.org/wiki/WP:T):
mw.config.set({
  "wgCanonicalNamespace":"Project",
  "wgCanonicalSpecialPageName":false,
  "wgNamespaceNumber":4,
  "wgPageName":"Wikipedia:Tutorial",
  "wgTitle":"Tutorial",
...
  "wgRedirectedFrom":"Wikipedia:T",
...
});

In this case you will have the source of the redirect available. You will not, however, have the pre-normalized form.

If a page manipulates its title through {{DISPLAYTITLE}}, like "iPad" on en.wp (http://en.wikipedia.org/wiki/IPad), wgTitle is still the correct one for the page (note that wgPageTitle has an "invisible" namespace):
mw.config.set({
  "wgCanonicalNamespace":"",
  "wgCanonicalSpecialPageName":false,
  "wgNamespaceNumber":0,
  "wgPageName":"IPad",
  "wgTitle":"IPad",
...
});

The short answer seems to be to use "wgCanonicalNamespace" and "wgTitle" to form a new "wgCanonicalPageName" and use that as the page title for later requests from the client and browsers. I believe this will work even if there is no common canonical name among the wikis, but I have not checked. The important thing is to avoid "wgPageName" as it is now.
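
A small sketch of that combination, assuming a plain object holding the same keys as the mw.config.set() dumps above. The function name canonicalPageName is hypothetical, and "wgCanonicalPageName" itself is only the name proposed in this comment, not an existing config variable.

```javascript
// Hypothetical helper: join wgCanonicalNamespace and wgTitle into one
// canonical page name. The main namespace has an empty canonical name,
// so no "Namespace:" prefix is added in that case.
function canonicalPageName(config) {
  const ns = config.wgCanonicalNamespace;
  const title = config.wgTitle;
  return ns ? ns + ":" + title : title;
}

// Values copied from the no.wp user page and the en.wp "IPad" dumps above.
const userPage = canonicalPageName({
  wgCanonicalNamespace: "User",
  wgTitle: "John Erling Blad (WMDE)"
});
const article = canonicalPageName({
  wgCanonicalNamespace: "",
  wgTitle: "IPad"
});
console.log(userPage);
console.log(article);
```

In a browser the two values would presumably be read with mw.config.get() rather than passed in as a literal object.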
Comment 2 Nikola Smolenski 2012-05-04 09:07:59 UTC
An excellent analysis! I fully agree, except that I don't think wgCanonicalNamespace should be used, since:

1) If it is used, it will be displayed in link titles and on hover, so it will look uglier and might not even be understandable to all users; and,

2) I'm not sure that custom namespaces exist in this form.

Since we contact the local Wikipedia anyway when adding a link, in order to get the autocomplete list, we can request the canonical name at the same time with minimal overhead.

Another possibility is to make a function similar to Title::newFromText() that is locale-aware and can normalize links in any locale (but again, what to do with custom namespaces?).
Comment 3 denny vrandecic 2012-05-08 15:24:44 UTC
I would suggest using the normalized name, not wgCanonicalNamespace + wgTitle: just what we get in query.normalized.to from the API.

I am not sure about automatically resolving redirects. I guess we can leave this out for now and maybe consider it later. For now, this item means just "normalize".
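
A sketch of what "just normalize" could look like on the calling side, under the assumption that the request is the same action=query call used in comment 1 but without the "redirects" parameter. The function names buildNormalizeUrl and normalizedTitle are hypothetical, and format=json is used here instead of the pretty-printed jsonfm.

```javascript
// Hypothetical helper: build a query URL that asks for page info only.
// The "redirects" parameter is deliberately left out, so the API
// normalizes the title but does not follow redirects.
function buildNormalizeUrl(apiBase, title) {
  const params = new URLSearchParams({
    action: "query",
    prop: "info",
    titles: title,
    format: "json"
  });
  return apiBase + "?" + params.toString();
}

// Hypothetical helper: pick query.normalized[].to for the requested title.
// If the title was already in normal form there is no entry, so the
// requested title is returned unchanged.
function normalizedTitle(response, requested) {
  const list = (response.query && response.query.normalized) || [];
  const hit = list.find((entry) => entry.from === requested);
  return hit ? hit.to : requested;
}

const url = buildNormalizeUrl("http://en.wikipedia.org/w/api.php", "WP:T");
const enResponse = {
  query: { normalized: [{ from: "WP:T", to: "Wikipedia:T" }] }
};
console.log(url);
console.log(normalizedTitle(enResponse, "WP:T"));
```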
Comment 4 denny vrandecic 2012-06-21 12:37:01 UTC
Picked in Sprint 7.
Comment 5 Anja Jentzsch 2012-11-29 12:37:11 UTC
Verified in Wikidata demo time for sprint 8
Comment 6 Jarry1250 2013-01-21 14:38:00 UTC
Shouldn't https://wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Ren%C3%A9_Vautier work now then? Why isn't the underscore understood as a space?
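
For illustration only, the underscore handling this question refers to can be approximated locally. This is a simplified sketch with a hypothetical function name, not MediaWiki's implementation: it treats underscores as spaces, collapses and trims whitespace, and uppercases the first character, which matches the "_noreg" to "Noreg" case in comment 1 but ignores namespaces and per-wiki configuration.

```javascript
// Hypothetical, simplified title normalization: underscores become spaces,
// runs of whitespace collapse to one space, surrounding whitespace is
// trimmed, and the first character is uppercased.
function roughNormalize(title) {
  const spaced = title.replace(/_/g, " ").replace(/\s+/g, " ").trim();
  return spaced.charAt(0).toUpperCase() + spaced.slice(1);
}

console.log(roughNormalize("René_Vautier"));
console.log(roughNormalize("_noreg"));
```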
