Last modified: 2011-08-24 23:25:06 UTC
Basically, when a string such as foobar//barfoo is found, //barfoo is linked to http://barfoo I can't think of any reason for this behaviour.
This is most likely caused by the recent changes to enable protocol relative links.
It thinks this is a protocol-relative link.
Caused by r91663. Related to bug 29497 I don't think it should be converted to a link unless the double slash appears after a [ The case where people use {{SERVER}}/foo in the source is a bit harder. I think that -for a protocol-relative wiki- it should give a protocol-relative link but still prefix it with a http: protocol
This is very strange, because //barfoo doesn't trigger it, and I would expect the opposite behavior. I'll investigate this today or tomorrow.
"easy, parser" is almost an oxymoron :D . Removing "easy"
Created attachment 8913 [details] Option 1: somehow remove '//' from the URL protocols list used by doMagicLinks()
Created attachment 8914 [details] Option 2: add a lookbehind for ^|\W\b
I've added two patches that fix this bug in different ways, neither of which I'm particularly happy with. Option 1 is to create another version of $this->mUrlProtocols that somehow excludes '//'. The way I've done it in my patch is particularly hacky; it could be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel like making that effort today. I think this option, changed to exclude '//' in a more elegant way, is probably best. Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free URLs are only picked up if they are either at the very beginning of the page, or are preceded by a non-word character and a word boundary. In practice, this ensures the URL starts with a word character (because it's preceded by \W\b), which happens to exclude all protocols except '//'. Not only is this hackier than option 1 in principle, it also lets protocol-relative URLs at the very start of the page through (oops!). Thoughts?
(In reply to comment #8) > Option 1 is to create another version of $this->mUrlProtocols that somehow > excludes '//'. The way I've done it in my patch is particularly hacky; it could > be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel > like making that effort today. I think this option, changed to exclude '//' in > a more elegant way, is probably best. Yeah, it kinda sucks that wfUrlProtocols returns a string rather than an array. Otherwise you could use unset or array_diff here pretty cleanly. > Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free > URLs are only picked up if they are either at the very beginning of the page, > or are preceded by a non-word character and a word boundary. In practice, this > ensures the URL starts with a word character (because it's preceded by \W\b), > which happens to exclude all protocols except '//'. Not only is this hackier > than option 1 in principle, it also lets protocol-relative URLs at the very > start of the page through (oops!). To me, this option is a complete non-starter. It'll present all sorts of edge cases and weirdness. Imagine a template that contains only a protocol-relative URL being transcluded elsewhere. Will it be linked? Unlinked? Who knows.
(In reply to comment #9) > (In reply to comment #8) > > Option 1 is to create another version of $this->mUrlProtocols that somehow > > excludes '//'. The way I've done it in my patch is particularly hacky; it could > > be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel > > like making that effort today. I think this option, changed to exclude '//' in > > a more elegant way, is probably best. > > Yeah, it kinda sucks that wfUrlProtocols returns a string rather than an array. > Otherwise you could use unset or array_diff here pretty cleanly. > You could use array_diff() inside wfUrlProtocols() to massage $wgUrlProtocols before implode()ing it. That requires a bit more code, though (wfUrlProtocols() caches its return value, so you need to account for that too) and would've modified a second file, so I took the quick&dirty route for this so I could get the point across more quickly. > > Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free > > URLs are only picked up if they are either at the very beginning of the page, > > or are preceded by a non-word character and a word boundary. In practice, this > > ensures the URL starts with a word character (because it's preceded by \W\b), > > which happens to exclude all protocols except '//'. Not only is this hackier > > than option 1 in principle, it also lets protocol-relative URLs at the very > > start of the page through (oops!). > > To me, this option is a complete non-starter. It'll present all sorts of edge > cases and weirdness. Imagine a template that contains only a protocol-relative > URL being transcluded elsewhere. Will it be linked? Unlinked? Who knows. Yeah, I didn't really realize how bad this was until I wrote the Bugzilla comment, and I didn't think about the template case.
So… option 1 is going to be applied?
(In reply to comment #11) > So… option 1 is going to be applied? Not in its current form (it needs to be made nicer, and I know how to do that; I was just lazy) and not before I get Tim to look at it.
Fixed in r94502 using a variant of option 1 after discussing with Tim on IRC. Will add a parser test shortly.