Last modified: 2014-09-23 23:07:40 UTC
http://zh.wikipedia.org/w/index.php?oldid=11465238
As described in RFC 2732: http://www.ietf.org/rfc/rfc2732.txt IPv6 addresses in URLs are surrounded in [ ] characters. For instance, in the wiki: http://[2404:130:0:1000::187:2]/index.php should produce a hyperlink, but currently does not. Surrounding it in brackets doesn't work either, e.g.: Click [http://[2404:130:0:1000::187:2]/index.php here] This should be a simple patch to the parser somewhere - I'll have a go some time in the next day or two and attach a patch. Cheers, J
Created attachment 6862 [details] This does not fix it, as it leaves the path off the IPv6 link. Needs a regex genius. Okay so this bug boils down to tweaking the mExtLinkBracketedRegex in includes/parser/Parser.php I'm not sure how to alter this to include either passive groups for the IPv6 address brackets, or use a pipe to supply an almost identical alternative. I'm continuing to fiddle, but there might be a regex genius out there who's seen this before. Cheers, J
A proper regex for IPv6 addresses - http://vernon.mauery.com/content/projects/linux/ipv6_regex :-P
Created attachment 7186 [details] This fixes the problem but might not be safe I have a very simple solution which consists in removing the brackets ('[' and ']') from the list of forbidden entities in a URL. This modifies 3 regular expressions: 2 in Parser.php and 1 in Sanitizer.php However, it should be checked by someone competent because I don't know the possible implications in terms of security (there might be a good reason why the brackets were forbidden). I also note that currently, we do not check whether a URL is valid before adding a link to it (eg. http://thisisan\invalidtest is transformed to a link). But I found no explanation on mediawiki.org about what should be a valid (or invalid) URL.
It is a bit of a kludge but the colons and brackets used for IPv6 addresses can be replaced with their full-width forms. The following code assumes that $hostName has been set. > $host = ( $hostName == '['.substr( $hostName, 1, strlen( $hostName ) - 2 ).']' ) : substr( $hostName, 1, strlen( $hostName ) - 2 ) ? $hostName; > if ( filter_var( $host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 ) ) { > str_replace( '[', '[', $hostName ); > str_replace( ':', ':', $hostName ); > str_replace( ']', ']', $hostName ); > } I the meantime this wiki template code may be useful. > [{{{1|}}}:{{{2|}}}:{{{3|1}}}{{#if: > {{{8|}}}|:{{{4|}}}:{{{5|}}}:{{{6|}}}:{{{7|}}}:{{{8}}}|{{#if: > {{{7|}}}|:{{{4|}}}:{{{5|}}}:{{{6|}}}:{{{7}}}|{{#if: > {{{6|}}}|:{{{4|}}}:{{{5|}}}:{{{6}}}|{{#if: > {{{5|}}}|:{{{4|}}}:{{{5}}}|{{#if: > {{{4|}}}|:{{{4}}} }} }} }} }} }}]
Created attachment 9543 [details] PHP script to demonstrate how an IPv6 hostname can be modified. The PHP code I posted in my previous comment was untested and defective. The attachment should show how an IPv6 address enclosed in brackets can be modified before being parsed for linking. The reverse can be done after the HTML markup has been generated. I hope someone can come up with a better way to do this as this seems to be a bit of a kludge.
That doesn't help, since you would still need to detect IPv6 addresses written with normal brackets.
The patch by Peter looks the way to go, but I'm afraid adding brackets to EXT_LINK_URL_CLASS would allow too much recursion. And that wouldn't stop at the ] of a url. The change to the sanitizer is needed to avoid converting http://[2404:130:0:1000::187:2]/index.php to http://%5B2404:130:0:1000::187:2%5D/index.php but although that doesn't break any parser test, I suspect there was a reason they were added as control characters in r14930, but I don't know which one (avoid treating the content like? that would no longer hold...). Note that the change to mExtLinkBracketedRegex won't apply, but it's now constructed from EXT_LINK_URL_CLASS so no problem there. I added a parser test in r104314.
(In reply to comment #7) > That doesn't help, since you would still need to detect IPv6 addresses written > with normal brackets. I am not sure what you mean. Are you referring to my php script? (Corrected lines 11-16 below.) > $host = ( $hostName == '['.substr( $hostName, 1, strlen( $hostName ) - 2 ).']' ) ? substr( $hostName, 1, strlen( $hostName ) - 2 ) : $hostName; > if ( filter_var( $host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 ) ) { > $hostName = str_replace( '[', '[', $hostName ); > $hostName = str_replace( ':', ':', $hostName ); > $hostName = str_replace( ']', ']', $hostName ); > } Or are you referring to the template code. (See http://www.mediawiki.org/wiki/User:Allen4names) (In reply to comment #8) > I added a parser test in r104314. You may want to check line 1211 of revision 104314 for missing brackets.
(In reply to comment #9) > > I am not sure what you mean. Are you referring to my php script? (Corrected > lines 11-16 below.) Wikitext is written in text. Replacing [ and ] with something else which doesn't conflict could work but... How do you choose which [ ] to replace? > (In reply to comment #8) > > I added a parser test in r104314. > > You may want to check line 1211 of revision 104314 for missing brackets. Thanks, they were missing in the <a> contents. Fixed in r105356.
(In reply to comment #10) > (In reply to comment #9) > Wikitext is written in text. Replacing [ and ] with something else which > doesn't conflict could work but... How do you choose which [ ] to replace? Let me start by rephrasing your question. How do you find the host name in a URL? In the examples below I have omitted the port number, path name, and query string. ftp://$hostname => ftp://[::1] http://$hostname => http://[::1] https://$hostname => https://[::1] mailto:$username@$hostname => mailto:$username@[::1] In each case the IPv6 address (enclosed in brackets) comes after two slashes "//" or an at sign "@". The filter_var() function is used in my script to test for an IPv6 address. No other brackets or colons should be replaced with their full width equivalents and these full width characters should be replaced by their ascii equivalents after the HTML links have been created. As I said before this is something of a kludge and I would prefer some better way of fixing this.
So you are proposing to replace [] when following // or @. We could add something like (\[(?:[0-9a-fA-F])|:){2,10}\])* to mExtLinkBracketedRegex after wfUrlProtocols() (that would still fail for usernames and passwords in the protocol, and matches too much, but it's a start).
(In reply to comment #12) > So you are proposing to replace [] when following // or @. It may not be enough to replace the brackets surrounding the IPv6 address because a colon is used to separate the host name from the port number so the colons in the IPv6 addresses (in host names) should be replaced as well. The reason I want the full width brackets and colons replaced by their ascii equivalents after the wiki text is converted to HTML is that Internet Explorer will not follow the resulting links if this is not done.
Marking patch reviewed because the author will need to revise patch and resubmit to take care of suggestions from the last several comments. Thanks for the patch!
A related side effect of this bug is that setting $wgServer in LocalSettings.php to an RFC2732 address breaks some auto-generated urls. The unknown page message has broken links for search and edit: "There is currently no text in this page. You can search for this page title in other pages, [http://[::1]/wiki/index.php?title=Special:Log&page=DOESNOTEXIST search the related logs], or [http://[::1]/wiki/index.php?title=DOESNOTEXIST&action=edit edit this page]."
(In reply to comment #15) > A related side effect of this bug is that setting $wgServer in > LocalSettings.php to an RFC2732 address breaks some auto-generated urls. > > The unknown page message has broken links for search and edit: > > "There is currently no text in this page. You can search for this page title in > other pages, [http://[::1]/wiki/index.php?title=Special:Log&page=DOESNOTEXIST > search the related logs], or > [http://[::1]/wiki/index.php?title=DOESNOTEXIST&action=edit edit this page]." So it is more important than a missing parser feature?
This will have to be fixed in parsoid as well as PHP; they share the same URL character classes. It would probably be better to special case a match at the front of the string against http://vernon.mauery.com/content/projects/linux/ipv6_regex (or something similar, that regexp looks way too complicated) rather than just add '[]' to the permitted character class, since brackets are rather special during parsing.