Last modified: 2014-03-23 23:50:31 UTC
The Apache or MediaWiki configuration for Wikipedia appears to decode percent-encoded "/"s and ":"s in URLs. This means that, for example, https://en.wikipedia.org/wiki/Wikipedia%3AAbout shows the same page as https://en.wikipedia.org/wiki/Wikipedia:About. Similarly, https://en.wikipedia.org/wiki/Wikipedia_talk%3AArbitration%2FRequests%2FCase%2FShakespeare_authorship_question%2FProposed_decision is a valid URL for https://en.wikipedia.org/wiki/Wikipedia_talk:Arbitration/Requests/Case/Shakespeare_authorship_question/Proposed_decision. The escaping results in incredibly verbose robots.txt rules (see https://en.wikipedia.org/robots.txt), but even our existing rules don't account for "%2F"s in place of "/"s. We should either redirect or reject these URLs.
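The equivalence described above can be checked with a quick sketch: once the percent-escapes are decoded, the two spellings are the same path, which is presumably why the server treats them as the same page. (This illustrates the behavior; it is not the actual server-side decoding code.)

```python
from urllib.parse import unquote

# Two spellings of the same page path: one with "%3A"/"%2F" escapes,
# one with literal ":" and "/".
encoded = "/wiki/Wikipedia_talk%3AArbitration%2FRequests%2FCase"
plain = "/wiki/Wikipedia_talk:Arbitration/Requests/Case"

# After percent-decoding, the two paths are byte-for-byte identical,
# which is why both URLs reach the same page.
print(unquote(encoded) == plain)  # → True
```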
(In reply to LFaraone from comment #0)
> The escaping results in incredibly-verbose robots.txt rules

...which is not a problem per se.

> but even our existing rules don't account for %2Fs in place of "/"s.

So these rules could be fixed to support that too?

> We should either redirect or reject these URLs.

I cannot yet follow the advantage of this proposal. What is the problem you would like to solve in this bug report?
LFaraone: Can you please answer comment 1?
(In reply to Andre Klapper from comment #1)
> > The escaping results in incredibly-verbose robots.txt rules
>
> ...which is not a problem per se.
>
> > but even our existing rules don't account for %2Fs in place of "/"s.
>
> So these rules could be fixed to support that too?
>
> > We should either redirect or reject these URLs.
>
> I cannot yet follow the advantage of this proposal.
>
> What is the problem you would like to solve in this bug report?

Yes, we could try to guess *every* *single* *possible* encoding of a URL and include it in robots.txt. So, for WT:BLP/N, that means we'd need these entries:

Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard/
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard%2F*
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons%2FNoticeboard/
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons%2FNoticeboard%2F*
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard/
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard%2F*
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons%2FNoticeboard/
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons%2FNoticeboard%2F*

This way lies madness. We want to express one specific thing: disallowing access to pages below an article-name path. To accomplish that, we need 8 (!!) rules. This makes the list hard to manage, especially since it is edited by hand. We'll almost certainly miss things.
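The blow-up above is combinatorial: each ":" or "/" in the title can appear literally or percent-encoded, so a path with three such characters has 2^3 = 8 spellings. A hypothetical helper (not part of any MediaWiki code) that enumerates them:

```python
import re
from itertools import product

def encodings(path):
    """Enumerate every percent-encoding variant of ':' and '/' in path.

    Each ':' may appear as ':' or '%3A', each '/' as '/' or '%2F';
    all other characters are left untouched.
    """
    # Split on ':' and '/', keeping the delimiters as list items.
    parts = re.split(r'([:/])', path)
    choices = []
    for p in parts:
        if p == ':':
            choices.append([':', '%3A'])
        elif p == '/':
            choices.append(['/', '%2F'])
        else:
            choices.append([p])
    # One variant per combination of delimiter spellings.
    return [''.join(combo) for combo in product(*choices)]

variants = encodings("Wikipedia_talk:Biographies_of_living_persons/Noticeboard/")
print(len(variants))  # → 8 (one ':' and two '/'s: 2**3 combinations)
```

Each added "/" or ":" in a title doubles the rule count, which is why enumerating variants by hand in robots.txt does not scale.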
Is there any reason to believe that more aggressive URL canonicalization will affect robots.txt entries? I'm not sure there's a valid use-case here.

In reply to comment 3, I'd suggest that you could turn each of those underscores into " " or "%20" or "__" and come up with thousands more permutations. :-)

Given that Squid caching is prefix-based, more aggressive URL canonicalization would have been (or would be) helpful in that context. That is, as I understand it, Squid viewed "/wiki/Wikipedia_talk%3AB" and "/wiki/Wikipedia_talk:B" as distinct URLs and would cache both separately. I'm not sure the same is true of Varnish (which is what Wikimedia wikis now use), though improving Squid behavior alone might make this a valid request.
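The canonicalization being discussed can be sketched as "decode everything, then re-encode with one fixed set of literal characters" — a minimal illustration, not the actual Squid/Varnish or wfUrlencode logic, and the `safe` set here is an assumption for the example:

```python
from urllib.parse import quote, unquote

def canonicalize(path, safe=":/_()!*;@&=$,-.~'"):
    # Fully percent-decode the path, then re-encode it with a single
    # fixed set of characters left literal. Any two encodings of the
    # same title then collapse to one string, so a prefix-based cache
    # (like Squid's) would see one URL instead of many.
    return quote(unquote(path), safe=safe)

a = canonicalize("/wiki/Wikipedia_talk%3AB")
b = canonicalize("/wiki/Wikipedia_talk:B")
print(a == b)  # → True: both collapse to "/wiki/Wikipedia_talk:B"
```

Without such a step, each spelling occupies its own cache entry and purges have to hit every variant.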
As the question was raised whether (our) Varnish can handle this: yes, it can. There's normalize_path (which essentially does the same as MediaWiki's wfUrlencode) in modules/varnish/templates/vcl/wikimedia.vcl.erb.