Last modified: 2014-03-23 23:50:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T63553, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 61553 - Percent-escaped slashes and colons should not be alternative page URLs
Status: UNCONFIRMED
Product: Wikimedia
Classification: Unclassified
Component: Apache configuration (Other open bugs)
Version: wmf-deployment
Hardware: All
OS: All
Importance: Low normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-02-19 18:49 UTC by LFaraone
Modified: 2014-03-23 23:50 UTC
CC: 5 users
See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description LFaraone 2014-02-19 18:49:44 UTC
The Apache or MediaWiki configuration for Wikipedia appears to decode percent-encoded "/"s and ":"s in URLs. 

This means that, for example, https://en.wikipedia.org/wiki/Wikipedia%3AAbout shows the same page as https://en.wikipedia.org/wiki/Wikipedia:About.

Similarly, https://en.wikipedia.org/wiki/Wikipedia_talk%3AArbitration%2FRequests%2FCase%2FShakespeare_authorship_question%2FProposed_decision is a valid URL for https://en.wikipedia.org/wiki/Wikipedia_talk:Arbitration/Requests/Case/Shakespeare_authorship_question/Proposed_decision.

The escaping results in incredibly-verbose robots.txt rules (see https://en.wikipedia.org/robots.txt), but even our existing rules don't account for %2Fs in place of "/"s.

We should either redirect or reject these URLs.
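A minimal way to see the aliasing (illustrative Python, not part of any Wikimedia component; only the two URLs are taken from the examples above):

from urllib.parse import unquote, urlsplit

# Two spellings of the same page, taken from the examples above.
encoded = "https://en.wikipedia.org/wiki/Wikipedia%3AAbout"
canonical = "https://en.wikipedia.org/wiki/Wikipedia:About"

# The server effectively percent-decodes the path before resolving the title,
# so both spellings end up at the same article.
assert unquote(urlsplit(encoded).path) == urlsplit(canonical).path
print(unquote(urlsplit(encoded).path))  # /wiki/Wikipedia:About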
Comment 1 Andre Klapper 2014-02-20 11:45:11 UTC
(In reply to LFaraone from comment #0)
> The escaping results in incredibly-verbose robots.txt rules

...which is not a problem per se.

> but even our existing rules don't account for %2Fs in place of "/"s. 

So these rules could be fixed to support that too?

> We should either redirect or reject these URLs.

I cannot yet follow the advantage of this proposal.

What is the problem you would like to solve in this bug report?
Comment 2 Andre Klapper 2014-02-28 16:01:28 UTC
LFaraone: Can you please answer comment 1?
Comment 3 LFaraone 2014-02-28 16:32:55 UTC
(In reply to Andre Klapper from comment #1)
> > The escaping results in incredibly-verbose robots.txt rules
> 
> ...which is not a problem per se.
> 
> > but even our existing rules don't account for %2Fs in place of "/"s. 
> 
> So these rules could be fixed to support that too?
> 
> > We should either redirect or reject these URLs.
> 
> I cannot yet follow the advantage of this proposal.
> 
> What is the problem you would like to solve in this bug report?

Yes, we could try and guess *every* *single* *possible* encoding of a URL and include it in robots.txt.

So, for WT:BLP/N, that means we'll need to have these entries:

Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard/
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard%2F*
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons%2FNoticeboard/
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons%2FNoticeboard%2F*
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard/
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard%2F*
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons%2FNoticeboard/
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons%2FNoticeboard%2F*


This way lies madness.

We want to express one specific thing: disallowing access to pages below an article-name path. To accomplish that, we need 8 (!!) rules. This makes the list hard to manage, especially since it is edited by hand; we'll almost certainly miss things.
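To make the combinatorics concrete, here is an illustrative Python sketch (not how the actual robots.txt is produced) that enumerates the encoded/unencoded spellings of ":" and "/" for one prefix; it ignores the trailing "*" wildcard used in the encoded rules above:

from itertools import product

# The separators that MediaWiki also accepts in percent-encoded form.
ENCODE = {":": "%3A", "/": "%2F"}

def disallow_variants(title):
    # Every encoded/unencoded combination of the ':' and '/' characters.
    positions = [i for i, ch in enumerate(title) if ch in ENCODE]
    for choice in product((False, True), repeat=len(positions)):
        chars = list(title)
        for pos, encoded in zip(positions, choice):
            if encoded:
                chars[pos] = ENCODE[title[pos]]
        yield "".join(chars)

title = "Wikipedia_talk:Biographies_of_living_persons/Noticeboard/"
for variant in disallow_variants(title):
    print("Disallow: /wiki/" + variant)
# one ':' and two '/' -> 2**3 == 8 rules; every extra separator doubles it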
Comment 4 MZMcBride 2014-03-23 23:33:57 UTC
Is there any reason to believe that more aggressive URL canonicalization will affect robots.txt entries? I'm not sure there's a valid use-case here.

In reply to comment 3, I'd suggest that you could turn each of those underscores into " " or "%20" or "__" and come up with thousands more permutations. :-)

Given that Squid caching is prefix-based, more aggressive URL canonicalization would have been (or would be) helpful in that context. That is, as I understand it, Squid viewed "/wiki/Wikipedia_talk%3AB" and "/wiki/Wikipedia_talk:B" as distinct URLs and would cache both separately.

I'm not sure the same is true of Varnish (which is what Wikimedia wikis now use), though improving Squid behavior alone might make this a valid request.
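A toy illustration of the cache-key effect described here (plain Python, unrelated to the actual Squid or Varnish code): with the raw path as the key, the two spellings occupy two cache entries, while a canonicalization step collapses them into one.

from urllib.parse import unquote

cache = {}

def fetch(path, normalize=False):
    # Key the toy cache on the raw path, or on a decoded (canonical) path.
    key = unquote(path) if normalize else path
    if key not in cache:
        cache[key] = "rendered page for " + unquote(path)  # pretend backend hit
    return cache[key]

for path in ("/wiki/Wikipedia_talk%3AB", "/wiki/Wikipedia_talk:B"):
    fetch(path)
print(len(cache))  # 2 -- the spellings are cached separately

cache.clear()
for path in ("/wiki/Wikipedia_talk%3AB", "/wiki/Wikipedia_talk:B"):
    fetch(path, normalize=True)
print(len(cache))  # 1 -- one entry once the path is canonicalized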
Comment 5 Marius Hoch 2014-03-23 23:50:31 UTC
As the question was raised whether (our) Varnish can handle this: yes, it can. There's normalize_path (which essentially does the same as MediaWiki's wfUrlencode) in modules/varnish/templates/vcl/wikimedia.vcl.erb.
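For reference, a rough Python approximation of that kind of path normalization: decode the path, then re-encode it roughly the way MediaWiki's wfUrlencode would. The safe-character set below is an assumption for illustration; the real logic lives in the VCL template and in MediaWiki itself.

from urllib.parse import quote, unquote

# Approximate set of characters left unescaped (assumption, wfUrlencode-like).
SAFE = ";@$!*(),/:~"

def normalize_path(path):
    # Decode the request path, then re-encode it so that "%3A"/":" and
    # "%2F"/"/" spellings collapse to one canonical form (and one cache key).
    return quote(unquote(path), safe=SAFE)

print(normalize_path("/wiki/Wikipedia%3AAbout"))  # /wiki/Wikipedia:About
print(normalize_path("/wiki/Wikipedia:About"))    # /wiki/Wikipedia:About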
