Last modified: 2013-01-20 18:22:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44513, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 42513 - mediawiki.Uri crashes in pages with '@' in their url


Summary:	mediawiki.Uri crashes in pages with '@' in their url

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	JavaScript (Other open bugs)
Version:	1.19
Hardware:	All All

Importance:	High critical with 1 vote (vote)
Target Milestone:	1.21.0 release
Assigned To:	Krinkle

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	42306
	Show dependency tree / graph

Reported:	2012-11-28 18:24 UTC by Krinkle
Modified:	2013-01-20 18:22 UTC (History)
CC List:	4 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Krinkle 2012-11-28 18:24:55 UTC

It isn't just that this would throw an exception:

 new mw.Uri('http://krinkle.dev/mediawiki/alpha/index.php?title=VisualEditor:S@box&action=edit');

Although that is also bad (when done on non-@ pages), it is especially bad that on pages containing an @ in the url, the module won't even finish loading because of this line:

 defaultUri = new Uri( documentLocation );

So instead, mw.Uri stays undefined.

Comment 1 Andre Klapper 2012-12-17 01:40:59 UTC

Krinkle: As dependency bug 42306 is assigned to, do you work on this?

Comment 2 Krinkle 2012-12-17 17:11:17 UTC

Not yet, but if nobody does I will eventually.

Comment 3 Bartosz Dziewoński 2012-12-17 21:21:01 UTC

I googled a little for a reasonable JS URI parsing library that could replace those half-baked regexes, and found this: https://gist.github.com/2428561

It seems genius and is 100% correct. Is it a good idea to use something like this? If it is, I can implement it.

Comment 4 Tyler Romeo 2013-01-18 17:11:01 UTC

It should be noted that those are HTML5-specific properties.

Comment 5 Tyler Romeo 2013-01-18 17:36:06 UTC

Found the problem. Expect a patch in the near future.

Comment 6 Krinkle 2013-01-18 17:37:43 UTC

(In reply to comment #3)
> I googled a little for a reasonable JS URI parsing library that could replace
> those half-baked regexes, and found this: https://gist.github.com/2428561
> 
> It seems genius and is 100% correct. Is it a good idea to use something like
> this? If it is, I can implement it.

That's not a URI parsing library, that's an unreliable hack using <a href=""> parsing. See the comments there.

On further investigation and following links from that gist, it looks like the regexes in mw.Uri aren't so half-baked.

They actually come from here:
http://blog.stevenlevithan.com/archives/parseuri

Comment 7 Tyler Romeo 2013-01-18 17:51:20 UTC

(In reply to comment #6)
> (In reply to comment #3)
> > I googled a little for a reasonable JS URI parsing library that could replace
> > those half-baked regexes, and found this: https://gist.github.com/2428561
> > 
> > It seems genius and is 100% correct. Is it a good idea to use something like
> > this? If it is, I can implement it.
> 
> That's not a URI parsing library, that's an unreliable hack using <a href="">
> parsing. See the comments there.
> 
> On further investigation and following links from that gist, it looks like
> the
> regexes in mw.Uri aren't so half-baked.
> 
> They actually come from here:
> http://blog.stevenlevithan.com/archives/parseuri

Indeed. You'll also notice that the user and password aren't parsed at all in that implementation.

Also, https://gerrit.wikimedia.org/r/44665

Comment 8 Bartosz Dziewoński 2013-01-18 18:06:08 UTC

(In reply to comment #6)
> On further investigation and following links from that gist, it looks like
> the
> regexes in mw.Uri aren't so half-baked.
> 
> They actually come from here:
> http://blog.stevenlevithan.com/archives/parseuri

If they weren't half-baked, mw.Uri wouldn't crash on '@' in the URL. 

These regexes are already awful (177 characters? Seriously?), and are bound to get worse as we discover more edge-cases. 

What about password with a '@' in it? (Yeah, this is allowed as far as I know, and works.) What about whitespace in the authority part? (This isn't allowed, but those regexes match it just fine.)

While on second though the <a>-abuse I linked isn't a good idea, we really need a serious parsing library.

Comment 9 Krinkle 2013-01-18 20:59:30 UTC

(In reply to comment #8)
> (In reply to comment #6)
> > On further investigation and following links from that gist, it looks like
> > the
> > regexes in mw.Uri aren't so half-baked.
> > 
> > They actually come from here:
> > http://blog.stevenlevithan.com/archives/parseuri
> 
> If they weren't half-baked, mw.Uri wouldn't crash on '@' in the URL. 
> 
> These regexes are already awful (177 characters? Seriously?), and are bound
> to
> get worse as we discover more edge-cases. 
> 
> What about password with a '@' in it? (Yeah, this is allowed as far as I
> know,

No, those need to be url escaped afaik. At least in Chrome and in the nodejs modules I used it the password (especially the @ symbol) needs to be url escaped or it won't work.

> While on second though the <a>-abuse I linked isn't a good idea, we really
> need a serious parsing library.

I'm open to suggestions, though it'll need to be flexible and be replaceable in-place to allow smooth transition (same API and and unit tests, different internal implementation).

The module is generic enough to easily allow a different parser.

But there is some ambiguity on the fields to be extracted and their meaning[1], so even a "perfect" library may need modification to work for us and our interpretation of those terms.

[1] http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces

Comment 10 Tyler Romeo 2013-01-18 21:46:05 UTC

I wouldn't attempt to make a comprehensive URI parsing library in JavaScript. The first reason is because it's unnecessary. With the fix I just uploaded to Gerrit, there should not be any more bugs in this function. Secondly, the regular expression would be absurdly complicated. URIs have specific sets of characters that can be included only in certain sections. I wrote a regex for an official URI once when I was working on the PHP version of this class (still in Gerrit), and it was so confusing it took days to get it working properly.

Comment 11 Bartosz Dziewoński 2013-01-18 22:55:28 UTC

(In reply to comment #10)
> I wouldn't attempt to make a comprehensive URI parsing library in JavaScript.

> The first reason is because it's unnecessary. With the fix I just uploaded to
> Gerrit, there should not be any more bugs in this function. 

I just pointed out a glaring one above.

(In reply to comment #8)
> What about whitespace in the authority part? (This isn't allowed,
> but those regexes match it just fine.)

I'm sure there are more, waiting to be uncovered.


> Secondly, the
> regular expression would be absurdly complicated. URIs have specific sets of
> characters that can be included only in certain sections. I wrote a regex for
> an official URI once when I was working on the PHP version of this class
> (still
> in Gerrit), and it was so confusing it took days to get it working properly.

Of course, I know this. That's why I'm suggesting using the work of someone who has already spent time pulling his hair out to get this right; or at least just taking those crazy regexes in from some other library.

Comment 12 Tyler Romeo 2013-01-19 00:50:27 UTC

Yes, but is it truly necessary to validate that much? I mean, for example, take the (invalid) URI "https://www.google.com/search?q=test test". The URI is invalid because it has a space character in the query, but if you were to do something like "location.href = <that uri>", it would work perfectly fine, because the browser automatically percent-encodes the space.

Basically, I'm wondering for what the mw.Uri class is being used for that requires such strict validation.

Comment 13 Krinkle 2013-01-19 18:53:07 UTC

Change-Id: I4954bfd3750d5c990e91cc0dc8a175225ccbad1e

Comment 14 Tyler Romeo 2013-01-19 22:55:39 UTC

BTW, for whoever's curious, here is what an official URI regular expression would look like (it does't work, needs debugging; it's probably missing a parentheses or something):

/^[A-Za-z][A-Za-z0-9\+\-\.]*:(\/\/(([:A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})@)?(\[(|v[A-Fa-f]+\.[:A-Z0-9-\._~!\$&'\(\)\*\+,;=]+)\]|([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])|([A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))(:[0-9]*)?(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))+(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))+(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|)\?((([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})|[\/\?])*)?(#(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})|[\/\?])*)?$/

Comment 15 Bartosz Dziewoński 2013-01-20 13:26:12 UTC

Tyler, this looks way too short on the first glance and doesn't work. It also explicitly doesn't support protocol-relative URIs.

Here's one that works (this is used by Ruby's URI parsing library). Of course it's not hardcoded in the code like this, but constructed on the fly from regexes for specific URI parts. I could port this to JS in a reasonable way, if anybody except for me actually wants this.

/^(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)(?:\?(?:(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?|(?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)|(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*|(?:[\-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?)(?:\?(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)?)?(?:#(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)?$/

Comment 16 Tyler Romeo 2013-01-20 18:22:20 UTC

I know mine doesn't work, because I put it together in five minutes, but it's definitely proper. Here's the code I used to construct it: http://pastebin.com/ELJ4LdKy

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links