Last modified: 2014-09-23 23:07:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T23261, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 21261 - MediaWiki cannot recognize urls pointing to ipv6 hosts


Summary:	MediaWiki cannot recognize urls pointing to ipv6 hosts

Status:	NEW

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser (Other open bugs)
Version:	1.24rc
Hardware:	All All

Importance:	Normal normal with 2 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	http://zh.wikipedia.org/w/index.php?o...
Whiteboard:
Keywords:	ipv6, need-parsertest, parser, patch, patch-reviewed

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2009-10-24 11:25 UTC by Liangent
Modified:	2014-09-23 23:07 UTC (History)
CC List:	11 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
This does not fix it, as it leaves the path off the IPv6 link. Needs a regex genius. (665 bytes, patch) 2009-12-08 02:28 UTC, Jonathan Harker	Details
This fixes the problem but might not be safe (1.56 KB, patch) 2010-03-09 10:08 UTC, Peter Potrowl	Details
PHP script to demonstrate how an IPv6 hostname can be modified. (697 bytes, application/x-httpd-php) 2011-11-24 08:52 UTC, Allen Stambaugh	Details
Show Obsolete (1) Add an attachment (proposed patch, testcase, etc.)

Description Liangent 2009-10-24 11:25:23 UTC

http://zh.wikipedia.org/w/index.php?oldid=11465238

Comment 1 Jonathan Harker 2009-12-03 01:01:27 UTC

As described in RFC 2732: http://www.ietf.org/rfc/rfc2732.txt

IPv6 addresses in URLs are surrounded in [ ] characters. For instance, in the wiki:

http://[2404:130:0:1000::187:2]/index.php

should produce a hyperlink, but currently does not. Surrounding it in brackets doesn't work either, e.g.:

Click [http://[2404:130:0:1000::187:2]/index.php here]

This should be a simple patch to the parser somewhere - I'll have a go some time in the next day or two and attach a patch.

Cheers, J

Comment 2 Jonathan Harker 2009-12-08 02:28:55 UTC

Created attachment 6862 [details]
This does not fix it, as it leaves the path off the IPv6 link. Needs a regex genius.

Okay so this bug boils down to tweaking the mExtLinkBracketedRegex in includes/parser/Parser.php

I'm not sure how to alter this to include either passive groups for the IPv6 address brackets, or use a pipe to supply an almost identical alternative.

I'm continuing to fiddle, but there might be a regex genius out there who's seen this before.

Cheers, J

Comment 3 Jonathan Harker 2009-12-08 02:49:14 UTC

A proper regex for IPv6 addresses - http://vernon.mauery.com/content/projects/linux/ipv6_regex
:-P

Comment 4 Peter Potrowl 2010-03-09 10:08:32 UTC

Created attachment 7186 [details]
This fixes the problem but might not be safe

I have a very simple solution which consists in removing the brackets ('[' and ']') from the list of forbidden entities in a URL.

This modifies 3 regular expressions: 2 in Parser.php and 1 in Sanitizer.php

However, it should be checked by someone competent because I don't know the possible implications in terms of security (there might be a good reason why the brackets were forbidden).

I also note that currently, we do not check whether a URL is valid before adding a link to it (eg. http://thisisan\invalidtest is transformed to a link). But I found no explanation on mediawiki.org about what should be a valid (or invalid) URL.

Comment 5 Allen Stambaugh 2011-11-24 02:16:59 UTC

It is a bit of a kludge but the colons and brackets used for IPv6 addresses can be replaced with their full-width forms. The following code assumes that $hostName
has been set.

> $host = ( $hostName == '['.substr( $hostName, 1, strlen( $hostName ) - 2 ).']' ) : substr( $hostName, 1, strlen( $hostName ) - 2 ) ? $hostName;
> if ( filter_var( $host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 ) ) {
> str_replace( '[', '［', $hostName );
> str_replace( ':', '：', $hostName );
> str_replace( ']', '］', $hostName );
> }

I the meantime this wiki template code may be useful.

> ［{{{1|}}}：{{{2|}}}：{{{3|1}}}{{#if:
>   {{{8|}}}|：{{{4|}}}：{{{5|}}}：{{{6|}}}：{{{7|}}}：{{{8}}}|{{#if:
>   {{{7|}}}|：{{{4|}}}：{{{5|}}}：{{{6|}}}：{{{7}}}|{{#if:
>   {{{6|}}}|：{{{4|}}}：{{{5|}}}：{{{6}}}|{{#if:
>   {{{5|}}}|：{{{4|}}}：{{{5}}}|{{#if:
>   {{{4|}}}|：{{{4}}} }} }} }} }} }}］

Comment 6 Allen Stambaugh 2011-11-24 08:52:19 UTC

Created attachment 9543 [details]
PHP script to demonstrate how an IPv6 hostname can be modified.

The PHP code I posted in my previous comment was untested and defective. The attachment should show how an IPv6 address enclosed in brackets can be modified before being parsed for linking. The reverse can be done after the HTML markup has been generated. I hope someone can come up with a better way to do this as this seems to be a bit of a kludge.

Comment 7 Platonides 2011-11-26 20:28:40 UTC

That doesn't help, since you would still need to detect IPv6 addresses written with normal brackets.

Comment 8 Platonides 2011-11-26 21:47:53 UTC

The patch by Peter looks the way to go, but I'm afraid adding brackets to EXT_LINK_URL_CLASS would allow too much recursion. And that wouldn't stop at the ] of a url.
The change to the sanitizer is needed to avoid converting http://[2404:130:0:1000::187:2]/index.php to http://%5B2404:130:0:1000::187:2%5D/index.php but although that doesn't break any parser test, I suspect there was a reason they were added as control characters in r14930, but I don't know which one (avoid treating the content like? that would no longer hold...).

Note that the change to mExtLinkBracketedRegex won't apply, but it's now constructed from EXT_LINK_URL_CLASS so no problem there.

I added a parser test in r104314.

Comment 9 Allen Stambaugh 2011-12-03 04:36:16 UTC

(In reply to comment #7)
> That doesn't help, since you would still need to detect IPv6 addresses written
> with normal brackets.

I am not sure what you mean. Are you referring to my php script? (Corrected lines 11-16 below.)

> $host = ( $hostName == '['.substr( $hostName, 1, strlen( $hostName ) - 2 ).']' ) ? substr( $hostName, 1, strlen( $hostName ) - 2 ) : $hostName;
> if ( filter_var( $host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 ) ) {
> 	$hostName = str_replace( '[', '［', $hostName );
> 	$hostName = str_replace( ':', '：', $hostName );
> 	$hostName = str_replace( ']', '］', $hostName );
> }

Or are you referring to the template code. (See http://www.mediawiki.org/wiki/User:Allen4names)

(In reply to comment #8)
> I added a parser test in r104314.

You may want to check line 1211 of revision 104314 for missing brackets.

Comment 10 Platonides 2011-12-06 21:28:12 UTC

(In reply to comment #9)
> 
> I am not sure what you mean. Are you referring to my php script? (Corrected
> lines 11-16 below.)

Wikitext is written in text. Replacing [ and ] with something else which doesn't conflict could work but... How do you choose which [ ] to replace?


> (In reply to comment #8)
> > I added a parser test in r104314.
> 
> You may want to check line 1211 of revision 104314 for missing brackets.

Thanks, they were missing in the <a> contents. Fixed in r105356.

Comment 11 Allen Stambaugh 2011-12-07 02:32:02 UTC

(In reply to comment #10)
> (In reply to comment #9)
> Wikitext is written in text. Replacing [ and ] with something else which
> doesn't conflict could work but... How do you choose which [ ] to replace?

Let me start by rephrasing your question. How do you find the host name in a URL? In the examples below I have omitted the port number, path name, and query string.

ftp://$hostname => ftp://[::1]
http://$hostname => http://[::1]
https://$hostname => https://[::1]
mailto:$username@$hostname => mailto:$username@[::1]

In each case the IPv6 address (enclosed in brackets) comes after two slashes "//" or an at sign "@". The filter_var() function is used in my script to test for an IPv6 address. No other brackets or colons should be replaced with their full width equivalents and these full width characters should be replaced by their ascii equivalents after the HTML links have been created.

As I said before this is something of a kludge and I would prefer some better way of fixing this.

Comment 12 Platonides 2011-12-07 23:02:01 UTC

So you are proposing to replace [] when following // or @.

We could add something like (\[(?:[0-9a-fA-F])|:){2,10}\])* to mExtLinkBracketedRegex after wfUrlProtocols()
(that would still fail for usernames and passwords in the protocol, and matches too much, but it's a start).

Comment 13 Allen Stambaugh 2011-12-08 20:59:00 UTC

(In reply to comment #12)
> So you are proposing to replace [] when following // or @.

It may not be enough to replace the brackets surrounding the IPv6 address because a colon is used to separate the host name from the port number so the colons in the IPv6 addresses (in host names) should be replaced as well. The reason I want the full width brackets and colons replaced by their ascii equivalents after the wiki text is converted to HTML is that Internet Explorer will not follow the resulting links if this is not done.

Comment 14 Sumana Harihareswara 2012-01-23 19:48:06 UTC

Marking patch reviewed because the author will need to revise patch and resubmit to take care of suggestions from the last several comments.  Thanks for the patch!

Comment 15 Chris Steipp 2012-05-23 16:54:02 UTC

A related side effect of this bug is that setting $wgServer in LocalSettings.php to an RFC2732 address breaks some auto-generated urls.

The unknown page message has broken links for search and edit:

"There is currently no text in this page. You can search for this page title in other pages, [http://[::1]/wiki/index.php?title=Special:Log&page=DOESNOTEXIST search the related logs], or [http://[::1]/wiki/index.php?title=DOESNOTEXIST&action=edit edit this page]."

Comment 16 Liangent 2012-05-23 17:03:40 UTC

(In reply to comment #15)
> A related side effect of this bug is that setting $wgServer in
> LocalSettings.php to an RFC2732 address breaks some auto-generated urls.
> 
> The unknown page message has broken links for search and edit:
> 
> "There is currently no text in this page. You can search for this page title in
> other pages, [http://[::1]/wiki/index.php?title=Special:Log&page=DOESNOTEXIST
> search the related logs], or
> [http://[::1]/wiki/index.php?title=DOESNOTEXIST&action=edit edit this page]."

So it is more important than a missing parser feature?

Comment 17 C. Scott Ananian 2014-04-23 16:54:16 UTC

This will have to be fixed in parsoid as well as PHP; they share the same URL character classes.

It would probably be better to special case a match at the front of the string against http://vernon.mauery.com/content/projects/linux/ipv6_regex (or something similar, that regexp looks way too complicated) rather than just add '[]' to the permitted character class, since brackets are rather special during parsing.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links