Last modified: 2013-01-20 18:22:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44513, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 42513 - mediawiki.Uri crashes in pages with '@' in their url
mediawiki.Uri crashes in pages with '@' in their url
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
JavaScript (Other open bugs)
1.19
All All
: High critical with 1 vote (vote)
: 1.21.0 release
Assigned To: Krinkle
:
Depends on:
Blocks: 42306
  Show dependency treegraph
 
Reported: 2012-11-28 18:24 UTC by Krinkle
Modified: 2013-01-20 18:22 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Krinkle 2012-11-28 18:24:55 UTC
It isn't just that this would throw an exception:

 new mw.Uri('http://krinkle.dev/mediawiki/alpha/index.php?title=VisualEditor:S@box&action=edit');

Although that is also bad (when done on non-@ pages), it is especially bad that on pages containing an @ in the url, the module won't even finish loading because of this line:

 defaultUri = new Uri( documentLocation );

So instead, mw.Uri stays undefined.
Comment 1 Andre Klapper 2012-12-17 01:40:59 UTC
Krinkle: As dependency bug 42306 is assigned to, do you work on this?
Comment 2 Krinkle 2012-12-17 17:11:17 UTC
Not yet, but if nobody does I will eventually.
Comment 3 Bartosz Dziewoński 2012-12-17 21:21:01 UTC
I googled a little for a reasonable JS URI parsing library that could replace those half-baked regexes, and found this: https://gist.github.com/2428561

It seems genius and is 100% correct. Is it a good idea to use something like this? If it is, I can implement it.
Comment 4 Tyler Romeo 2013-01-18 17:11:01 UTC
It should be noted that those are HTML5-specific properties.
Comment 5 Tyler Romeo 2013-01-18 17:36:06 UTC
Found the problem. Expect a patch in the near future.
Comment 6 Krinkle 2013-01-18 17:37:43 UTC
(In reply to comment #3)
> I googled a little for a reasonable JS URI parsing library that could replace
> those half-baked regexes, and found this: https://gist.github.com/2428561
> 
> It seems genius and is 100% correct. Is it a good idea to use something like
> this? If it is, I can implement it.

That's not a URI parsing library, that's an unreliable hack using <a href=""> parsing. See the comments there.

On further investigation and following links from that gist, it looks like the regexes in mw.Uri aren't so half-baked.

They actually come from here:
http://blog.stevenlevithan.com/archives/parseuri
Comment 7 Tyler Romeo 2013-01-18 17:51:20 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > I googled a little for a reasonable JS URI parsing library that could replace
> > those half-baked regexes, and found this: https://gist.github.com/2428561
> > 
> > It seems genius and is 100% correct. Is it a good idea to use something like
> > this? If it is, I can implement it.
> 
> That's not a URI parsing library, that's an unreliable hack using <a href="">
> parsing. See the comments there.
> 
> On further investigation and following links from that gist, it looks like
> the
> regexes in mw.Uri aren't so half-baked.
> 
> They actually come from here:
> http://blog.stevenlevithan.com/archives/parseuri

Indeed. You'll also notice that the user and password aren't parsed at all in that implementation.

Also, https://gerrit.wikimedia.org/r/44665
Comment 8 Bartosz Dziewoński 2013-01-18 18:06:08 UTC
(In reply to comment #6)
> On further investigation and following links from that gist, it looks like
> the
> regexes in mw.Uri aren't so half-baked.
> 
> They actually come from here:
> http://blog.stevenlevithan.com/archives/parseuri

If they weren't half-baked, mw.Uri wouldn't crash on '@' in the URL. 

These regexes are already awful (177 characters? Seriously?), and are bound to get worse as we discover more edge-cases. 

What about password with a '@' in it? (Yeah, this is allowed as far as I know, and works.) What about whitespace in the authority part? (This isn't allowed, but those regexes match it just fine.)

While on second though the <a>-abuse I linked isn't a good idea, we really need a serious parsing library.
Comment 9 Krinkle 2013-01-18 20:59:30 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > On further investigation and following links from that gist, it looks like
> > the
> > regexes in mw.Uri aren't so half-baked.
> > 
> > They actually come from here:
> > http://blog.stevenlevithan.com/archives/parseuri
> 
> If they weren't half-baked, mw.Uri wouldn't crash on '@' in the URL. 
> 
> These regexes are already awful (177 characters? Seriously?), and are bound
> to
> get worse as we discover more edge-cases. 
> 
> What about password with a '@' in it? (Yeah, this is allowed as far as I
> know,

No, those need to be url escaped afaik. At least in Chrome and in the nodejs modules I used it the password (especially the @ symbol) needs to be url escaped or it won't work.

> While on second though the <a>-abuse I linked isn't a good idea, we really
> need a serious parsing library.

I'm open to suggestions, though it'll need to be flexible and be replaceable in-place to allow smooth transition (same API and and unit tests, different internal implementation).

The module is generic enough to easily allow a different parser.

But there is some ambiguity on the fields to be extracted and their meaning[1], so even a "perfect" library may need modification to work for us and our interpretation of those terms.

[1] http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces
Comment 10 Tyler Romeo 2013-01-18 21:46:05 UTC
I wouldn't attempt to make a comprehensive URI parsing library in JavaScript. The first reason is because it's unnecessary. With the fix I just uploaded to Gerrit, there should not be any more bugs in this function. Secondly, the regular expression would be absurdly complicated. URIs have specific sets of characters that can be included only in certain sections. I wrote a regex for an official URI once when I was working on the PHP version of this class (still in Gerrit), and it was so confusing it took days to get it working properly.
Comment 11 Bartosz Dziewoński 2013-01-18 22:55:28 UTC
(In reply to comment #10)
> I wouldn't attempt to make a comprehensive URI parsing library in JavaScript.

> The first reason is because it's unnecessary. With the fix I just uploaded to
> Gerrit, there should not be any more bugs in this function. 

I just pointed out a glaring one above.

(In reply to comment #8)
> What about whitespace in the authority part? (This isn't allowed,
> but those regexes match it just fine.)

I'm sure there are more, waiting to be uncovered.


> Secondly, the
> regular expression would be absurdly complicated. URIs have specific sets of
> characters that can be included only in certain sections. I wrote a regex for
> an official URI once when I was working on the PHP version of this class
> (still
> in Gerrit), and it was so confusing it took days to get it working properly.

Of course, I know this. That's why I'm suggesting using the work of someone who has already spent time pulling his hair out to get this right; or at least just taking those crazy regexes in from some other library.
Comment 12 Tyler Romeo 2013-01-19 00:50:27 UTC
Yes, but is it truly necessary to validate that much? I mean, for example, take the (invalid) URI "https://www.google.com/search?q=test test". The URI is invalid because it has a space character in the query, but if you were to do something like "location.href = <that uri>", it would work perfectly fine, because the browser automatically percent-encodes the space.

Basically, I'm wondering for what the mw.Uri class is being used for that requires such strict validation.
Comment 13 Krinkle 2013-01-19 18:53:07 UTC
Change-Id: I4954bfd3750d5c990e91cc0dc8a175225ccbad1e
Comment 14 Tyler Romeo 2013-01-19 22:55:39 UTC
BTW, for whoever's curious, here is what an official URI regular expression would look like (it does't work, needs debugging; it's probably missing a parentheses or something):

/^[A-Za-z][A-Za-z0-9\+\-\.]*:(\/\/(([:A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})@)?(\[(|v[A-Fa-f]+\.[:A-Z0-9-\._~!\$&'\(\)\*\+,;=]+)\]|([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])|([A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))(:[0-9]*)?(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))+(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))+(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|)\?((([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})|[\/\?])*)?(#(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})|[\/\?])*)?$/
Comment 15 Bartosz Dziewoński 2013-01-20 13:26:12 UTC
Tyler, this looks way too short on the first glance and doesn't work. It also explicitly doesn't support protocol-relative URIs.

Here's one that works (this is used by Ruby's URI parsing library). Of course it's not hardcoded in the code like this, but constructed on the fly from regexes for specific URI parts. I could port this to JS in a reasonable way, if anybody except for me actually wants this.

/^(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)(?:\?(?:(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?|(?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)|(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*|(?:[\-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?)(?:\?(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)?)?(?:#(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)?$/
Comment 16 Tyler Romeo 2013-01-20 18:22:20 UTC
I know mine doesn't work, because I put it together in five minutes, but it's definitely proper. Here's the code I used to construct it: http://pastebin.com/ELJ4LdKy

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links