Last modified: 2013-05-04 05:07:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T49304, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 47304 - SVG JavaScript detection bypass
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized normal
Target Milestone: 1.20.x release
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-04-17 03:46 UTC by Jan Schejbal
Modified: 2013-05-04 05:07 UTC
CC List: 7 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments
Workaround patch (1.13 KB, patch) - 2013-04-18 01:43 UTC, Chris Steipp
Whitelist known good xml encodings in svgs (2.19 KB, patch) - 2013-04-19 21:23 UTC, Chris Steipp
Whitelist and detect utf16/32 encodings (3.12 KB, patch) - 2013-04-23 21:44 UTC, Chris Steipp
Whitelist and detect utf16/32 encodings (4.07 KB, patch) - 2013-04-24 20:47 UTC, Chris Steipp

Description Jan Schejbal 2013-04-17 03:46:39 UTC
When an SVG file is uploaded, MediaWiki checks it to ensure that it does not contain JavaScript. Using UTF-7 (which is interpreted by the parser MediaWiki uses for validation, but not by browsers), it is possible to create a document where the script part is considered harmless text by the MediaWiki parser, but interpreted as a script tag by regular browsers (tested with Chrome and Firefox).

The proof-of-concept at <http://upload.wikimedia.org/wikipedia/test/archive/4/4f/20130415163012!Test-active.svg> does this by including a CDATA section with a UTF-7-encoded start marker (treated as text and thus ignored by non-UTF-7 parsers). As can be seen from the URL, I was able to upload this on the test wiki. At least Chrome and Firefox will happily execute the JavaScript when the button is clicked (giving feedback on the JavaScript console), since they don't see the CDATA start marker and thus render the button with the onclick handler.
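A minimal sketch of the mechanism (illustrative only, not the uploaded proof-of-concept; in UTF-7, "+ADw-" decodes to "<" and "+AF0AXQA+-" decodes to "]]>"):

    <?xml version="1.0" encoding="UTF-7"?>
    <svg xmlns="http://www.w3.org/2000/svg">
      <text y="20">+ADw-![CDATA[</text>   <!-- a UTF-7 parser: a CDATA section opens here -->
      <text y="40" onclick="alert(document.domain)">click me</text>
      <text y="60">+AF0AXQA+-</text>      <!-- a UTF-7 parser: the CDATA section closes here -->
    </svg>

A validating parser that honours the UTF-7 declaration sees the onclick element as inert CDATA text; a browser that parses the same bytes as UTF-8 sees only harmless literal text around a live onclick handler.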

UTF-7 will certainly not be the only way to cause MediaWiki's parser and browsers to interpret a file differently. http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#charset has a nice overview of some other problematic charsets. Blacklisting them would certainly be a good start; however, given the complexity of XML, I am not sure it would reliably solve the issue.

Since Wikipedia serves its images from a separate origin, it is probably not seriously affected. For smaller wikis that do not have this separation, this enables an XSS attack.

Please attribute this to "Jan Schejbal / Hatforce.com" in the release announcement.

Kind regards,
Jan Schejbal
Comment 1 Brion Vibber 2013-04-17 15:57:50 UTC
Nice... as I recall, the browsers took out UTF-7 support for security reasons; now it's biting us that libxml2 still understands it. :)

Good catch.
Comment 2 Chris Steipp 2013-04-17 20:21:23 UTC
Thanks for the report Jan, I've verified the issue.

You are right; the root cause seems to be that xml_parser_create_ns no longer allows us to force an input encoding ("Starting from PHP 5, the input encoding is automatically detected, so that the encoding parameter specifies only the output encoding."). I'm working on a couple of ways to get around this.
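A minimal demonstration of the problem (illustrative; assumes a PHP build whose xml extension is backed by libxml2 with iconv support, which is what makes it honour the UTF-7 declaration):

    // Even though 'UTF-8' is passed, that argument only sets the *output*
    // encoding; the input encoding is taken from the document itself.
    $parser = xml_parser_create_ns( 'UTF-8' );
    xml_set_character_data_handler( $parser, function ( $p, $data ) {
        // Should print "<script>": the parser decoded the UTF-7 content.
        echo $data, "\n";
    } );
    xml_parse( $parser, '<?xml version="1.0" encoding="UTF-7"?>' .
        '<root>+ADw-script+AD4-</root>', true );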
Comment 3 Chris Steipp 2013-04-18 01:43:53 UTC
Created attachment 12132 [details]
Workaround patch

This is a hacky patch that just rewrites the declared encoding to UTF-8 if something else is given.

PHP doesn't seem to have a way to check the encoding expat actually used, but the XML declaration should follow pretty strict formatting, so it may not be too crazy to use a regex like this for pulling out the specified encoding.

XMLReader seems to have the same issue.
Comment 4 Jan Schejbal 2013-04-18 04:12:09 UTC
(In reply to comment #3)
> This is a hacky patch to just update the encoding to utf8, if something else
> is given.

This will not solve the problem. The current problem is "the file is treated as UTF-8 by the browser and UTF-7 by the filter". This patch would just turn the attack into "the file is treated as $SpecifiedWeirdEncoding by the browser, but as UTF-8 by the filter" - the attack will certainly look different, but I would assume at least some of the encodings supported by current browsers allow making some kind of creative mess.

It took me a while to find an example, but the "hz" encoding (supported at least by Chrome 25) skips the sequence "~\n" (tilde followed by newline). Thus,
"<![CDATA[ ]~\n]>" is treated as "CDATA start, random boring text" by a parser using UTF-8, while a parser using HZ will treat it as "CDATA start, CDATA end". This results in basically the same attack, except that instead of hiding the CDATA start from the browser, the CDATA end is hidden from MediaWiki.


Additionally, I'm not sure this can be reliably hacked in via regex. For the suggested patch, I strongly suspect that if the word "encoding" is split across chunk boundaries, the regexp will not find it and the attack will still work. In general, getting a regexp right so that it cannot be bypassed is hard. For example, this string should bypass the regexp:

<?xml version="1.0" encoding="UTF-7" standalone="no"?><!-- encoding="UTF-8" ?> -->

The greedy regexp will happily grab "UTF-8" as the second group, even though it just sits in a comment.
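Concretely, with a naive greedy pattern (illustrative only, not necessarily the pattern in the patch):

    $xml = '<?xml version="1.0" encoding="UTF-7" standalone="no"?>' .
        '<!-- encoding="UTF-8" ?> -->';
    preg_match( '/<\?xml(.*)encoding="([^"]+)"(.*)\?>/', $xml, $m );
    echo $m[2]; // prints "UTF-8": the greedy .* strayed into the comment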


Could someone with access to the file repository maybe evaluate what charsets are actually used? I suspect that a whitelist approach may be the way to go here...
Comment 5 Chris Steipp 2013-04-18 16:03:48 UTC
I have about 10k random SVGs from Commons, and only one had an encoding defined that wasn't utf-8 or iso-8859-1, and that was iso-8859-2. So yes, a whitelist of acceptable encodings seems like it would be fine.
Comment 6 Jan Schejbal 2013-04-18 22:39:26 UTC
Since this is actually browsers misinterpreting the file (the file clearly states that it is UTF-7, yet they treat it as UTF-8), I have filed security bugs with Chromium and Mozilla.
Comment 7 Jan Schejbal 2013-04-19 16:32:24 UTC
Chromium decided to WONTFIX and set the issue to public.

https://code.google.com/p/chromium/issues/detail?id=233355

Sorry. I have requested that they hide it again, but so far they haven't. MediaWiki wasn't mentioned explicitly.


(In reply to comment #5)
> I have about 10k random svgs from commons, and only one had an encoding
> defined that wasn't utf-8 or iso-8859-1, and that was iso-8859-2.

Would it be possible to run such a check over *all* 670k SVGs in the repo? It may provide valuable data not only for us, but for others (e.g. it could help Mozilla decide if breaking certain exotic encodings is OK) and be interesting in general.
Comment 8 Chris Steipp 2013-04-19 17:27:51 UTC
I'm working on getting a way to access our full dataset to check. In the meantime, I may whitelist utf-8, iso-8859-1, and iso-8859-2 with a patch on the cluster, just to make sure no one tries to run the PoC you gave them.
Comment 9 Chris Steipp 2013-04-19 21:23:41 UTC
Created attachment 12148 [details]
Whitelist known good xml encodings in svgs

This checks that the XML encoding is one of 'UTF-8', 'ISO-8859-1', or 'ISO-8859-2', if an encoding is specified in the optional XML declaration.
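In outline, the check amounts to something like this (sketch; the variable names and the exact pattern are illustrative, not the patch verbatim):

    $whitelist = array( 'UTF-8', 'ISO-8859-1', 'ISO-8859-2' );
    if ( preg_match( '/<\?xml[^>]*\bencoding="([^"]+)"/', $head, $matches )
        && !in_array( strtoupper( $matches[1] ), $whitelist )
    ) {
        return false; // reject: declared encoding is not on the whitelist
    }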
Comment 10 Chris Steipp 2013-04-23 18:21:08 UTC
I've been watching the incoming SVGs since last week, and there have been no attempts to upload SVGs in any exotic encodings, so I think we're safe to whitelist the encodings.

A couple of files came in with encoding="utf-16" and were themselves UTF-16-encoded, so the encoding regex did not match. It looks like expat detects and converts files from ISO-8859-1, US-ASCII, UTF-8, and UTF-16 (BE/LE), so we should probably detect those at the PHP layer. New patch coming soon.
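Detecting those at the PHP layer can be done by sniffing the BOM before running the declaration regex; a sketch with a hypothetical helper name (BOM-less files could additionally be recognised by the null-byte pattern of "<?xml"):

    function wfSniffSvgEncoding( $head ) {
        // Order matters: the UTF-32LE BOM begins with the UTF-16LE BOM.
        $boms = array(
            'UTF-32BE' => "\x00\x00\xFE\xFF",
            'UTF-32LE' => "\xFF\xFE\x00\x00",
            'UTF-16BE' => "\xFE\xFF",
            'UTF-16LE' => "\xFF\xFE",
        );
        foreach ( $boms as $encoding => $bom ) {
            if ( strncmp( $head, $bom, strlen( $bom ) ) === 0 ) {
                return $encoding;
            }
        }
        return null;
    }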
Comment 11 Chris Steipp 2013-04-23 21:44:24 UTC
Created attachment 12163 [details]
Whitelist and detect utf16/32 encodings

This patch also attempts to convert the input from UTF-16/32 before looking for an encoding declaration. This check may not be necessary, as it looks like an expat-detected UTF-16/32 encoding overrides the encoding given in the XML declaration. However, it seems prudent to check in case other versions use a different approach.
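Roughly (sketch, reusing the hypothetical sniffing helper above; wfSuppressWarnings()/wfRestoreWarnings() are the usual MediaWiki wrappers):

    $encoding = wfSniffSvgEncoding( $head );
    if ( $encoding !== null ) {
        wfSuppressWarnings();
        $head = iconv( $encoding, 'UTF-8', $head );
        wfRestoreWarnings();
    }
    // ...then scan $head for the encoding declaration as before.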
Comment 12 Tim Starling 2013-04-24 04:28:02 UTC
(In reply to comment #11)
> Created attachment 12163 [details]
> Whitelist and detect utf16/32 encodings

If an XML declaration can be constructed which doesn't match your regex but does satisfy libxml's idea of a valid XML declaration, then a UTF-7 encoding could sneak through. I notice that libxml allows a space between "encoding" and "=".

php > print simplexml_load_string( '<?xml version="1.0" encoding="utf-7" ?><root>+-</root>' ) . "\n";
+
php > print simplexml_load_string( '<?xml version="1.0" encoding ="utf-7" ?><root>+-</root>' ) . "\n";
+

Other XML whitespace is also allowed; the XML 1.0 spec says as much:

EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" ) 
Eq ::= S? '=' S?

I think adding [ \t\n\r]* on either side of the equals sign will fix that.
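That is, something along the lines of (illustrative, not the patch verbatim; the bracketed class for the name follows the spec's EncName production):

    /<\?xml[^>]*encoding[ \t\n\r]*=[ \t\n\r]*["']([A-Za-z][A-Za-z0-9._-]*)["']/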

The "m" modifier in your regexes is not needed, since you have no start or end anchors.

You missed a wfRestoreWarnings() in the case where the function returns early. I suggest wrapping it around the iconv() only.

$wgSVGMetadataCutoff may be a vulnerability. The upload should be rejected if no end to the XML declaration is found before truncation by $wgSVGMetadataCutoff.
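For example (sketch; $head being the truncated chunk that gets scanned):

    // Reject if the declaration's "?>" was cut off by $wgSVGMetadataCutoff:
    if ( substr( $head, 0, 5 ) === '<?xml' && strpos( $head, '?>' ) === false ) {
        return false;
    }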

I don't think it is possible to make a file which is valid XML simultaneously in both UTF-7 and UTF-16 or UTF-32. The document has to start with <?xml, and that string is different in those three encodings. But I suppose the code you wrote is harmless.
Comment 13 Chris Steipp 2013-04-24 20:43:28 UTC
(In reply to comment #12)
> I don't think it is possible to make a file which is valid XML simultaneously
> in both UTF-7 and UTF-16 or UTF-32. The document has to start with <?xml, and
> that string is different in those three encodings. But I suppose the code you
> wrote is harmless.

I don't think it's possible, but I wanted to make sure we could detect (and limit via whitelist if needed) multi-byte encodings.

Thanks for the input! New patch to address the other items to follow.
Comment 14 Chris Steipp 2013-04-24 20:47:46 UTC
Created attachment 12170 [details]
Whitelist and detect utf16/32 encodings

Fixed the issues that Tim pointed out. Also, on reading the spec further, I realized that the XML could be EBCDIC-encoded, so I added an explicit test for that; I'm not adding EBCDIC to the whitelist for now.
Comment 15 Tim Starling 2013-04-26 04:05:52 UTC
Looks good now.
Comment 16 Gerrit Notification Bot 2013-04-30 20:20:20 UTC
Related URL: https://gerrit.wikimedia.org/r/61632 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)
Comment 17 Gerrit Notification Bot 2013-04-30 20:53:40 UTC
Related URL: https://gerrit.wikimedia.org/r/61640 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)
Comment 18 Gerrit Notification Bot 2013-04-30 20:58:30 UTC
Related URL: https://gerrit.wikimedia.org/r/61643 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)
Comment 19 Gerrit Notification Bot 2013-05-04 05:07:46 UTC
Related URL: https://gerrit.wikimedia.org/r/62215 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)
