Last modified: 2013-01-14 17:21:56 UTC
Some urls from the article texts are changed (urldecoded, I guess) in the externallinks table. Examples: [[de:Bahnhof Aachen Schanz]] contains link to http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz%2C_Aachen, but externallinks has http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz,_Aachen. [[de:Brandnew Oldies Volume 1]] contains http://www.kinokoma.de/?text:lang:carsten_bohn%27s_bandstand, externallinks http://www.kinokoma.de/?text:lang:carsten_bohn's_bandstand. [[de:Bahnstrecke Randers–Hadsund]] http://www.baner-omkring-aalborg.dk/?Randers%26nbsp%3BHadsund_jernbane / http://www.baner-omkring-aalborg.dk/?Randers %3BHadsund_jernbane. [[de:Einen Augenblick Zeit]] http://www.dasbiber.at/content/%2526quot%3Bdas-auge%2526quot%3B-verl%C3%A4sst-wien%3F / http://www.dasbiber.at/content/%26quot%3Bdas-auge%26quot%3B-verl%C3%A4sst-wien%3F.
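For the first two examples, the difference between wikitext and table is exactly percent-decoding. A quick check with plain Python (stdlib urllib, not MediaWiki's own normalization logic; in these simple cases full decoding happens to match, because none of the escapes MediaWiki keeps occur here):

```python
from urllib.parse import unquote

# Wikitext URL vs. the URL stored in externallinks, for the first
# two examples above.
pairs = [
    ("http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/"
     "Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz%2C_Aachen",
     "http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/"
     "Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz,_Aachen"),
    ("http://www.kinokoma.de/?text:lang:carsten_bohn%27s_bandstand",
     "http://www.kinokoma.de/?text:lang:carsten_bohn's_bandstand"),
]

for wikitext_url, stored_url in pairs:
    # The stored form is the fully percent-decoded wikitext form.
    assert unquote(wikitext_url) == stored_url
```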
This is intentional. There is a function in the parser, replaceUnusualEscapes, that normalizes the URL: all URL escapes not prescribed by RFC 1738 are dequoted, so only characters outside the ASCII range (32,127) and those that have some meaning in a URL (<>"#{}|\^~[]`;/?) stay escaped. https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/parser/Parser.php;h=59d379a06ea8bc16fe24dfa28754688d1b8d1247;hb=HEAD#l1624

The rationale is this:

* Convert unnecessary URL escape codes in external links to their equivalent character before doing anything with them. This prevents certain kinds of spam filter evasion. (Parser.php only)

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki%2Fcore.git&h=eb53cc08560721208e195c0f073809e7b3eee485

RFC 3986 defines :/?#[]@ as generic delimiters and !$&'()*+,;= as sub-delimiters that can be used by particular schemes (or for other purposes). It also defines "unreserved characters":

  2.3. Unreserved Characters

  Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

  URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

But only the externallinks table is affected here, which is useful for tracking links.
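The RFC-prescribed part of this normalization (decode escapes of unreserved characters only, leave everything else quoted) can be sketched as follows. This is an RFC 3986 section 2.3 sketch, not MediaWiki's actual code: replaceUnusualEscapes evidently decodes a broader set (the examples above show the sub-delimiters , and ' being decoded too), so treat the exact character set as an assumption:

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(url: str) -> str:
    """Decode %XX escapes of unreserved characters; keep all other
    escapes quoted (uppercasing their hex digits, as normalizers do)."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, url)
```

For example, decode_unreserved("%41%2fb") yields "A%2Fb": the escaped letter is decoded, while the escaped slash (a reserved character) stays quoted.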
The URLs in the wikitext are displayed and linked as they were. Therefore I am not sure this is a bug. What's the problem?
Sorry for the messed-up links. The current Parser.php code is here: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/parser/Parser.php#l1624 The change from 2006 that introduced external links and quoting is eb53cc08560721208e195c0f073809e7b3eee485. (The mangled link highlighting in Gerrit is http://code.google.com/p/gerrit/issues/detail?id=1451; it looks like Bugzilla has a very similar problem.)
(In reply to comment #1)
> But here only externallinks table is affected, which is useful for tracking
> links. The URLs in the wikitext are displayed and linked as they were.
> Therefore I am not sure this is a bug. What's the problem?

I am checking for dead links, and in the cases mentioned the decoded URLs are dead whereas the original URLs are fine. If this is not fixable on your side, I will consider it a bug there and make exceptions for those links.
Well, this was r12874.
I am not sure externallinks is designed for this purpose. It tries to canonicalize links to make blacklist matching better. weblinkchecker.py from the pywikipediabot package (http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/weblinkchecker.py?revision=10457&view=markup#l777) fetches article contents and checks URLs from the page, applying some heuristics.
(In reply to comment #5)
> I am not sure externallinks is designed for this purpose. It tries to
> canonicalize links to make blacklist matching better.
>
> weblinkchecker.py from the pywikipediabot package
> (http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/weblinkchecker.py?revision=10457&view=markup#l777)
> fetches article contents and checks URLs from the page, applying some
> heuristics.

That approach is totally outdated; the tool is meant to fetch the links from the API/database, where the problem reappears. Isn't blacklist matching only needed at editing time? Can't we compare the URLs then and store the correct URL in the database?
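Fetching links "from the API" here means something like prop=extlinks, which serves the contents of the externallinks table, so a link checker built on it sees the normalized URLs, not the wikitext ones. A minimal sketch of building such a query (the helper name and default limit are my own; the parameters are the standard action=query/prop=extlinks ones):

```python
from urllib.parse import urlencode

def extlinks_query_url(api_base: str, title: str, limit: int = 500) -> str:
    """Build a MediaWiki API URL listing the external links recorded
    for a page. Note: the API returns the *normalized* URLs from the
    externallinks table, not the raw wikitext URLs, which is exactly
    the discrepancy discussed in this report."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

url = extlinks_query_url("https://de.wikipedia.org/w/api.php",
                         "Bahnhof Aachen Schanz")
```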
Aside from spam filtering / blacklisting, normalization also makes it easier to find links through Special:LinkSearch.
I'd say the current behaviour of externallinks is correct given its function in MediaWiki. Reporting links for analysis is not that function; there should be another way to do that. I only wonder whether the API documentation shouldn't be explicit about this: one can get the impression that these are the external links exactly as stored in the article.