Last modified: 2011-08-24 23:25:06 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T32269, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 30269 - When a string such as foobar//barfoo is found, //barfoo should not be linked to http://barfoo
When a string such as foobar//barfoo is found, //barfoo should not be linked ...
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Parser (Other open bugs)
1.18.x
All All
: Normal normal (vote)
: ---
Assigned To: Nobody - You can work on this!
: parser
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-08-07 17:30 UTC by Gaëtan Landry
Modified: 2011-08-24 23:25 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Option 1: somehow remove '//' from the URL protocols list used by doMagicLinks() (920 bytes, patch)
2011-08-12 15:24 UTC, Roan Kattouw
Details
Option 2: add a lookbehind for ^|\W\b (737 bytes, patch)
2011-08-12 15:27 UTC, Roan Kattouw
Details

Description Gaëtan Landry 2011-08-07 17:30:34 UTC
Basically, when a string such as foobar//barfoo is found, //barfoo is linked to http://barfoo

I can't think of any reason for this behaviour.
Comment 1 Niklas Laxström 2011-08-07 17:36:07 UTC
This is most likely caused by the recent changes to enable protocol relative links.
Comment 2 Platonides 2011-08-07 18:25:10 UTC
It thinks this is a protocol-relative link.
Comment 3 Platonides 2011-08-07 18:29:55 UTC
Caused by r91663. Related to bug 29497
I don't think it should be converted to a link unless the double slash appears after a [
The case where people use {{SERVER}}/foo in the source is a bit harder. I think that -for a protocol-relative wiki- it should give a protocol-relative link but still prefix it with a http: protocol
Comment 4 Roan Kattouw 2011-08-10 15:56:21 UTC
This is very strange, because //barfoo doesn't trigger it, and I would expect the opposite behavior.

I'll investigate this today or tomorrow.
Comment 5 Roan Kattouw 2011-08-12 15:15:28 UTC
"easy, parser" is almost an oxymoron :D . Removing "easy"
Comment 6 Roan Kattouw 2011-08-12 15:24:10 UTC
Created attachment 8913 [details]
Option 1: somehow remove '//' from the URL protocols list used by doMagicLinks()
Comment 7 Roan Kattouw 2011-08-12 15:27:41 UTC
Created attachment 8914 [details]
Option 2: add a lookbehind for ^|\W\b
Comment 8 Roan Kattouw 2011-08-12 15:40:12 UTC
I've added two patches that fix this bug in different ways, neither of which I'm particularly happy with.

Option 1 is to create another version of $this->mUrlProtocols that somehow excludes '//'. The way I've done it in my patch is particularly hacky; it could be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel like making that effort today. I think this option, changed to exclude '//' in a more elegant way, is probably best.

Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free URLs are only picked up if they are either at the very beginning of the page, or are preceded by a non-word character and a word boundary. In practice, this ensures the URL starts with a word character (because it's preceded by \W\b), which happens to exclude all protocols except '//'. Not only is this hackier than option 1 in principle, it also lets protocol-relative URLs at the very start of the page through (oops!).

Thoughts?
Comment 9 MZMcBride 2011-08-12 22:18:06 UTC
(In reply to comment #8)
> Option 1 is to create another version of $this->mUrlProtocols that somehow
> excludes '//'. The way I've done it in my patch is particularly hacky; it could
> be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel
> like making that effort today. I think this option, changed to exclude '//' in
> a more elegant way, is probably best.

Yeah, it kinda sucks that wfUrlProtocols returns a string rather than an array. Otherwise you could use unset or array_diff here pretty cleanly.

> Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free
> URLs are only picked up if they are either at the very beginning of the page,
> or are preceded by a non-word character and a word boundary. In practice, this
> ensures the URL starts with a word character (because it's preceded by \W\b),
> which happens to exclude all protocols except '//'. Not only is this hackier
> than option 1 in principle, it also lets protocol-relative URLs at the very
> start of the page through (oops!).

To me, this option is a complete non-starter. It'll present all sorts of edge cases and weirdness. Imagine a template that contains only a protocol-relative URL being transcluded elsewhere. Will it be linked? Unlinked? Who knows.
Comment 10 Roan Kattouw 2011-08-13 11:34:19 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > Option 1 is to create another version of $this->mUrlProtocols that somehow
> > excludes '//'. The way I've done it in my patch is particularly hacky; it could
> > be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel
> > like making that effort today. I think this option, changed to exclude '//' in
> > a more elegant way, is probably best.
> 
> Yeah, it kinda sucks that wfUrlProtocols returns a string rather than an array.
> Otherwise you could use unset or array_diff here pretty cleanly.
> 
You could use array_diff() inside wfUrlProtocols() to massage $wgUrlProtocols before implode()ing it. That requires a bit more code, though (wfUrlProtocols() caches its return value, so you need to account for that too) and would've modified a second file, so I took the quick&dirty route for this so I could get the point across more quickly.

> > Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free
> > URLs are only picked up if they are either at the very beginning of the page,
> > or are preceded by a non-word character and a word boundary. In practice, this
> > ensures the URL starts with a word character (because it's preceded by \W\b),
> > which happens to exclude all protocols except '//'. Not only is this hackier
> > than option 1 in principle, it also lets protocol-relative URLs at the very
> > start of the page through (oops!).
> 
> To me, this option is a complete non-starter. It'll present all sorts of edge
> cases and weirdness. Imagine a template that contains only a protocol-relative
> URL being transcluded elsewhere. Will it be linked? Unlinked? Who knows.
Yeah, I didn't really realize how bad this was until I wrote the Bugzilla comment, and I didn't think about the template case.
Comment 11 Mark A. Hershberger 2011-08-13 17:50:18 UTC
So… option 1 is going to be applied?
Comment 12 Roan Kattouw 2011-08-14 09:36:16 UTC
(In reply to comment #11)
> So… option 1 is going to be applied?
Not in its current form (it needs to be made nicer, and I know how to do that; I was just lazy) and not before I get Tim to look at it.
Comment 13 Roan Kattouw 2011-08-15 12:20:40 UTC
Fixed in r94502 using a variant of option 1 after discussing with Tim on IRC. Will add a parser test shortly.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links