Last modified: 2013-01-14 17:21:56 UTC
Some urls from the article texts are changed (urldecoded, I guess) in the externallinks table. Examples: [[de:Bahnhof Aachen Schanz]] contains link to http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz%2C_Aachen, but externallinks has http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz,_Aachen. [[de:Brandnew Oldies Volume 1]] contains http://www.kinokoma.de/?text:lang:carsten_bohn%27s_bandstand, externallinks http://www.kinokoma.de/?text:lang:carsten_bohn's_bandstand. [[de:Bahnstrecke Randers–Hadsund]] http://www.baner-omkring-aalborg.dk/?Randers%26nbsp%3BHadsund_jernbane / http://www.baner-omkring-aalborg.dk/?Randers %3BHadsund_jernbane. [[de:Einen Augenblick Zeit]] http://www.dasbiber.at/content/%2526quot%3Bdas-auge%2526quot%3B-verl%C3%A4sst-wien%3F / http://www.dasbiber.at/content/%26quot%3Bdas-auge%26quot%3B-verl%C3%A4sst-wien%3F.
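For the first two examples, the difference between wikitext and table is exactly percent-decoding. A quick check with plain Python (stdlib urllib, not MediaWiki's own normalization logic; in these simple cases full decoding happens to match, because none of the escapes MediaWiki keeps occur here):

```python
from urllib.parse import unquote

# Wikitext URL vs. the URL stored in externallinks, for the first
# two examples above.
pairs = [
    ("http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/"
     "Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz%2C_Aachen",
     "http://www.aachen-kapstadt.de/?PROJEKTE/laufende_Projekte/"
     "Mural_Global_-_Wandmalprojekte/2005_Welthaus_-_Bahnhof_Schanz,_Aachen"),
    ("http://www.kinokoma.de/?text:lang:carsten_bohn%27s_bandstand",
     "http://www.kinokoma.de/?text:lang:carsten_bohn's_bandstand"),
]

for wikitext_url, stored_url in pairs:
    # The stored form is the fully percent-decoded wikitext form.
    assert unquote(wikitext_url) == stored_url
```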
This is intentional. There is a function in the parser, replaceUnusualEscapes, that normalizes the URL: all URL escapes not prescribed by RFC 1738 are dequoted, so only characters outside the ASCII range (32,127) and those that have some meaning in a URL (<>"#{}|\^~[]`;/?) stay escaped. https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/parser/Parser.php;h=59d379a06ea8bc16fe24dfa28754688d1b8d1247;hb=HEAD#l1624

The rationale is this:

* Convert unnecessary URL escape codes in external links to their equivalent character before doing anything with them. This prevents certain kinds of spam filter evasion. (Parser.php only)

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki%2Fcore.git&h=eb53cc08560721208e195c0f073809e7b3eee485

RFC 3986 defines :/?#[]@ as generic delimiters and !$&'()*+,;= as sub-delimiters that can be used by particular schemes (or for other purposes). It also defines "unreserved characters":

  2.3. Unreserved Characters

  Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

  URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

But only the externallinks table is affected here, which is useful for tracking links.
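The RFC-prescribed part of this normalization (decode escapes of unreserved characters only, leave everything else quoted) can be sketched as follows. This is an RFC 3986 section 2.3 sketch, not MediaWiki's actual code: replaceUnusualEscapes evidently decodes a broader set (the examples above show the sub-delimiters , and ' being decoded too), so treat the exact character set as an assumption:

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(url: str) -> str:
    """Decode %XX escapes of unreserved characters; keep all other
    escapes quoted (uppercasing their hex digits, as normalizers do)."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, url)
```

For example, decode_unreserved("%41%2fb") yields "A%2Fb": the escaped letter is decoded, while the escaped slash (a reserved character) stays quoted.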
The URLs in the wikitext are displayed and linked as they were. Therefore I am not sure this is a bug. What's the problem?
Sorry for the messed-up links. The current Parser.php code is here: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/parser/Parser.php#l1624 The change from 2006 that introduced external links and quoting is eb53cc08560721208e195c0f073809e7b3eee485. (The mangled link highlighting in Gerrit is http://code.google.com/p/gerrit/issues/detail?id=1451; it looks like Bugzilla has a very similar problem.)
(In reply to comment #1)
> But here only externallinks table is affected, which is useful for tracking
> links. The URLs in the wikitext are displayed and linked as they were.
> Therefore I am not sure this is a bug. What's the problem?

I am checking for dead links, and in the cases mentioned the decoded URLs are dead whereas the original URLs are fine. If this is not fixable on your side, I will consider it a bug there and make exceptions for those links.
Well, this was r12874.
I am not sure externallinks is designed for this purpose. It tries to canonicalize links to make blacklist matching better. weblinkchecker.py from the pywikipediabot package (http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/weblinkchecker.py?revision=10457&view=markup#l777) fetches article contents and checks URLs from the page, applying some heuristics.
(In reply to comment #5)
> I am not sure externallinks is designed for this purpose. It tries to
> canonicalize links to make blacklist matching better.
>
> weblinkchecker.py from the pywikipediabot package
> (http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/weblinkchecker.py?revision=10457&view=markup#l777)
> fetches article contents and checks URLs from the page, applying some
> heuristics.

That approach is totally outdated; the tool is meant to fetch the links from the API/database, where the problem reappears. Isn't blacklist matching only needed at editing time? Can't we compare the URLs then and store the correct URL in the database?
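Fetching links "from the API" here means something like prop=extlinks, which serves the contents of the externallinks table, so a link checker built on it sees the normalized URLs, not the wikitext ones. A minimal sketch of building such a query (the helper name and default limit are my own; the parameters are the standard action=query/prop=extlinks ones):

```python
from urllib.parse import urlencode

def extlinks_query_url(api_base: str, title: str, limit: int = 500) -> str:
    """Build a MediaWiki API URL listing the external links recorded
    for a page. Note: the API returns the *normalized* URLs from the
    externallinks table, not the raw wikitext URLs, which is exactly
    the discrepancy discussed in this report."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

url = extlinks_query_url("https://de.wikipedia.org/w/api.php",
                         "Bahnhof Aachen Schanz")
```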
Aside from spam filtering / blacklisting, normalization also makes it easier to find links through Special:LinkSearch.
I'd say the current behaviour of externallinks is correct given its function in MediaWiki. Reporting links for analysis is not that function; there should be another way to do that. I only wonder whether the API documentation shouldn't be explicit about this: one can get the impression that these are the external links exactly as stored in the article.