Last modified: 2014-07-21 15:48:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T60316, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 58316 - Javascript escapes in URLs ("\x" rather than "%") are not decoded


Summary:	Javascript escapes in URLs ("\x" rather than "%") are not decoded

Status:	PATCH_TO_REVIEW

Product:	MediaWiki
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-11 08:12 UTC by rybec
Modified:	2014-07-21 15:48 UTC (History)
CC List:	5 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
requests logged on 2012-06-09 for hour 19:00 (19.87 KB, text/plain) 2013-12-11 08:12 UTC, rybec	Details
logged requests for titles containing "Robinson_Can" (case-insensitive), from 18 November 2013 and the first hour of 19 November 2013 (from "zcat pagecounts-2013111*z | grep -i Robinson_Can") (6.85 KB, text/plain) 2013-12-11 19:51 UTC, rybec	Details

Description rybec 2013-12-11 08:12:44 UTC

Created attachment 14056 [details]
requests logged on 2012-06-09 for hour 19:00

Instead of HTML percent encodings, pages are sometimes requested through Javascript-encoded URLs. The difference is that "\x", rather than the "%" symbol, is used to indicate the start of an escape sequence. These requests are not decoded by the MediaWiki software. For example, a request for

https://en.wikipedia.org/w/index.php?title=Robinson_Can%C3%B3

is correctly decoded (the "%C3%B3" is transformed to an accented "o"), whereas a request for

https://en.wikipedia.org/w/index.php?title=Robinson_Can\xC3\xB3

is not decoded and we're told the page doesn't exist.

As I noted at https://en.wikipedia.org/wiki/Wikipedia:Redirects_for_discussion/Log/2013_December_9#.5Cx22Weird_Al.5Cx22_Yankovic there's been a tremendous increase in the amount of this traffic reaching the WMF projects, from about one request per hour in September 2011 to millions of requests per day in November 2013.

Perhaps it would be desirable to transform "\x" to "%" before passing URLs to rawurldecode() so that these requests will reach the intended pages.
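
The proposed transformation could be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: decode_title is a hypothetical helper, and urllib.parse.unquote stands in for PHP's rawurldecode().

```python
import re
from urllib.parse import unquote

# Matches a JavaScript-style hex escape such as \xC3
# (backslash, "x", exactly two hex digits).
_JS_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_title(raw_title: str) -> str:
    """Rewrite each \\xHH escape as %HH, then percent-decode as UTF-8."""
    percent_encoded = _JS_HEX_ESCAPE.sub(r"%\1", raw_title)
    # unquote() is used here in the role rawurldecode() has in PHP.
    return unquote(percent_encoded, encoding="utf-8", errors="replace")


# Both spellings of the example title decode to the same string.
print(decode_title(r"Robinson_Can\xC3\xB3"))
print(decode_title("Robinson_Can%C3%B3"))
```

One caveat with this approach: a title that legitimately contains a literal backslash followed by "x" and two hex digits would also be rewritten, so the substitution is not lossless.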

Comment 1 Derk-Jan Hartman 2013-12-11 09:30:13 UTC

Are you sure the requests are not being handled? Isn't it just that the log is written differently for those requests?

Comment 2 Derk-Jan Hartman 2013-12-11 09:57:07 UTC

I mean, I see people reasoning that https://en.wikipedia.org/w/index.php?title=Robinson_Can\xC3\xB3 should be reachable through their browser. But that is not correct, I think.

It is the technical representation of the input https://en.wikipedia.org/w/index.php?title=Robinson_Canó (a Unicode URL that is NOT percent encoded).

This technical representation is, however, not a valid input method in browser URL fields, if I remember correctly. I suspect people are making assumptions based on an incorrect interpretation of the logs.

Comment 3 Derk-Jan Hartman 2013-12-11 11:40:03 UTC

In summary:
* There are entries in the Apache log that look like Robinson_Can\xC3\xB3, which is a UTF-8 encoded byte sequence (likely a representation of the non-percent-encoded request containing Robinson_Canó, [possibly even an IRI request?])

* Log entries are NOT canonical on this front. A request for Robinson_Canó is logged differently than a request for Robinson_Can%C3%B3.

* The statistics of stats.grok.se might not handle these properly (collating them, ignoring them, or just not accessible ?)

* Someone else made a tool to detect red links, that does make the \x entries accessible/visible.

* Someone is making mass redirects of \x entries to what they consider to be 'proper' entries. This seems to have an effect on the statistics, but I would say that if the statistics/tools are broken, you are most likely only influencing the statistics, not actually fixing anything.

* There seems to have been a large increase of these kinds of requests (newer browsers or google/bing.com changing their defaults can easily account for this).

* You cannot input a UTF-8 escape sequence in the URL field of a browser (because there is no need for this, you would just input ó).

* People can't figure out who is wrong and who is right.

Does that sum it up a bit ?
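
To illustrate the first two points, here is a rough Python sketch of how a log writer that escapes non-printable bytes could produce the two different entries. The escape_for_log function and its escaping rule are assumptions made for illustration; the thread does not establish which software wrote these log lines or exactly how it escapes them.

```python
def escape_for_log(request_bytes: bytes) -> str:
    """Keep printable ASCII as-is and write every other byte as \\xHH.

    Hypothetical escaping rule, chosen only to mirror the log entries
    discussed above.
    """
    parts = []
    for byte in request_bytes:
        # Printable ASCII, except the backslash itself (0x5C), passes through.
        if 0x20 <= byte < 0x7F and byte != 0x5C:
            parts.append(chr(byte))
        else:
            parts.append(f"\\x{byte:02X}")
    return "".join(parts)


# A request carrying the raw UTF-8 bytes of "ó" (not percent-encoded).
raw_request = "Robinson_Canó".encode("utf-8")
# A request where the client already percent-encoded the same character.
percent_request = b"Robinson_Can%C3%B3"

print(escape_for_log(raw_request))      # logged with \x escapes
print(escape_for_log(percent_request))  # logged unchanged
```

Under this assumption, the \x form never has to be typed or sent by a client: it only appears when raw non-ASCII bytes are written to the log.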

Comment 4 Andre Klapper 2013-12-11 12:33:18 UTC

(In reply to comment #0)
> Created attachment 14056 [details]
> requests logged on 2012-06-09 for hour 19:00

If you logged something 18 months ago, why do you file a bug report now?

Comment 5 rybec 2013-12-11 19:51:12 UTC

Created attachment 14061 [details]
logged requests for titles containing "Robinson_Can" (case-insensitive), from 18 November 2013 and the first hour of 19 November 2013 (from "zcat pagecounts-2013111*z | grep -i Robinson_Can")

Comment 6 rybec 2013-12-11 19:54:16 UTC

The first attachment is an extract from http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-06/pagecounts-20120609-190000.gz , a log provided by the WMF of incoming requests for that hour. I've uploaded another attachment, which shows how requests for Robinson_Can\xC3\xB3, Robinson_Can%C3%B3 and Robinson_Canó appear as separate entries in the logs.
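
As a rough sketch of how a consumer of these files could collate the three spellings into one entry, the Python below normalizes each title before summing. It assumes each line has four space-separated fields (project, title, request count, bytes); normalize_title and collate are hypothetical helpers, and the sample lines use invented counts purely for illustration, not figures from the attachments.

```python
import re
from collections import Counter
from urllib.parse import unquote


def normalize_title(title: str) -> str:
    """Map the \\xHH and %HH spellings of a title to the decoded form."""
    as_percent = re.sub(r"\\x([0-9A-Fa-f]{2})", r"%\1", title)
    return unquote(as_percent, encoding="utf-8", errors="replace")


def collate(lines):
    """Sum request counts per (project, normalized title)."""
    totals = Counter()
    for line in lines:
        fields = line.split(" ")
        if len(fields) != 4:
            continue  # skip lines that do not fit the assumed layout
        project, title, count, _size = fields
        totals[(project, normalize_title(title))] += int(count)
    return totals


# Invented sample lines in the assumed "project title count bytes" layout.
sample = [
    r"en Robinson_Can\xC3\xB3 3 0",
    "en Robinson_Can%C3%B3 5 0",
    "en Robinson_Canó 2 0",
]

for (project, title), count in collate(sample).items():
    print(project, title, count)
```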

Comment 7 rybec 2013-12-13 06:01:06 UTC

Someone has put a redirect at my Robinson_Can\xC3\xB3 example page, but this bug can be confirmed by noting the "redirected from" or by comparing the responses to these two URLs:

https://commons.wikipedia.org/w/index.php?title=File:\x22Holy_Sheykh_Cotton\x22_\x281890\x29_-_TIMEA.jpg

https://commons.wikipedia.org/w/index.php?title=File:%22Holy_Sheykh_Cotton%22_%281890%29_-_TIMEA.jpg

The first brings up an error page, whereas the second gets decoded and brings up a content page.

Comment 8 MZMcBride 2013-12-18 07:29:45 UTC

There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto and such. It was a workaround (a hack) for browsers that used to handle Unicode poorly, as I recall. I'm reminded of it in this bug report.

I'm not sure this is a valid bug.

Comment 9 Gerrit Notification Bot 2013-12-22 22:25:44 UTC

Change 103241 had a related patch set uploaded by QChris:
Add test to guard against encoding mangling of filter

https://gerrit.wikimedia.org/r/103241

Comment 10 Bartosz Dziewoński 2013-12-22 22:44:47 UTC

(In reply to comment #8)
> There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto

That's true, but not related here.

Comment 11 christian 2013-12-22 23:06:15 UTC

(In reply to comment #0)
> Instead of HTML percent encodings, pages are sometimes requested through
> Javascript-encoded URLs.

There are indeed some requests to \x-encoded URLs.
But they are mostly confused bots/clients. They are far from being
page views, and they are really few.
For example, in October 2013 we had 20 such requests in total in the
sampled-1000 logs.

However, you are correct that we see a lot of \x encoded URLs in
webstatscollector output. Webstatscollector processes udp2log data
unaltered (see comment #9). It seems \x-encoded URLs all stem from
SSL endpoints, and it looks as if those SSL endpoints would throw
misencoded URL requests into udp2log stream. Since that is a
sufficiently different issue, I filed bug 58876 about it.

A solution of bug 58876 will not address the current call for
MediaWiki to decode \x-encoded URLs. But it will make \x-encoded URLs
disappear from the webstatscollector output (thereby also disappear
from stats.grok.se, and other consumers).

Comment 12 MZMcBride 2013-12-30 02:59:43 UTC

(In reply to comment #10)
> (In reply to comment #8)
>> There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto
> 
> That's true, but not related here.

True, I wasn't really replying to anyone in particular. I was just reminded of it here. :-)

This particular bug falls into the category of "should we try to catch various URL munging?" I think. For example, we probably get _a lot_ of requests that inappropriately omit a trailing ) or inappropriately include a trailing > or ,. Should we try to auto-correct those requests as well? Dunno.

Comment 13 christian 2014-01-14 19:36:59 UTC

(In reply to comment #11)
> A solution of bug 58876 will not address the current call for
> MediaWiki to decode \x-encoded URLs. But it will make \x-encoded URLs
> disappear from the webstatscollector output (thereby also disappear
> from stats.grok.se, and other consumers).

The fix for bug 58876 just went live, so \x-encoded URLs should soon
mostly disappear.

Comment 14 Gerrit Notification Bot 2014-07-21 15:48:29 UTC

Change 103241 merged by Ottomata:
Add test to guard against encoding mangling of filter

https://gerrit.wikimedia.org/r/103241
