Last modified: 2011-04-30 01:16:45 UTC
I have a downstream bug that I've been working on for an MW 1.14 installation. Certain users can't upload most of their MS Word documents. These are NOT the .docx XML-based formats but the older .doc type. It turns out that users receive the error even with a BLANK saved document. I've attached an example so that people can reproduce the bug.

The error received is the message: "The file is corrupt or has an incorrect extension. Please check the file and upload again."

Debugging locally has shown that these files are being identified as:

mime: <application/zip> extension: <doc>

Clearly, with MIME-type/extension verification turned on, this will fail and give the error they see. The question is: why is this file being detected as "application/zip"?

(Furthermore, the workaround I read about and had planned, using $wgMimeDetectorCommand to check the MIME type externally, is no good: MimeMagic::guessMimeType() calls doGuessMimeType() FIRST, and any (false) positive there means detectMimeType(), which seems to work correctly, is never called.)
Created attachment 6610 [details] test file for bug reproduction
What I didn't add is that I've identified what I believe is the problem, stemming from revision 39203: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/MimeMagic.php?r1=39203&r2=39202&pathrev=39203 Because the check reads the last ~64k of a file, it seems that the magic word for a valid (empty) ZIP file can also be found in certain Mac Word documents. I've verified this on the test document with bgrep (http://blog.thorx.net/2009/07/binary-grep/). (Also, this might not be the first time this bug has been noticed: http://www.mwusers.com/forums/showthread.php?t=4903 )
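To make the suspected mechanism concrete, here is a minimal Python sketch of a tail scan of this kind. It is not the MediaWiki code: the function name, the exact 64 KiB window, and the plain substring search are assumptions for illustration. The only fact taken from the ZIP format is the 4-byte end-of-central-directory signature "PK\x05\x06".

```python
# Sketch only: approximates "read the last ~64k and look for the ZIP magic".
# TAIL_SIZE and tail_contains_eocd() are hypothetical, not MediaWiki names.

EOCD_SIGNATURE = b"PK\x05\x06"  # ZIP end-of-central-directory record signature
TAIL_SIZE = 64 * 1024           # assumed window size ("~64k" in the report)


def tail_contains_eocd(data: bytes, tail_size: int = TAIL_SIZE) -> bool:
    """Return True if the last tail_size bytes contain the EOCD signature."""
    tail = data[-tail_size:]
    return EOCD_SIGNATURE in tail


def file_tail_contains_eocd(path: str, tail_size: int = TAIL_SIZE) -> bool:
    """Same check, reading only the end of a file on disk."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)                    # jump to end to learn the size
        size = handle.tell()
        handle.seek(max(0, size - tail_size))
        return EOCD_SIGNATURE in handle.read()
```

A check like this cannot tell a real archive from any file that merely contains those four bytes near its end, which would explain a .doc file being reported as application/zip.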
I've had this with a Windows Word document; unfortunately, it also happened to be the first file I tried to upload to a newly installed wiki. In theory, there's approximately a 1 in 65000 chance that the check will match somewhere in any 64k of arbitrary binary data, and this applies equally to any file type, including images. Ideally the detection needs to be improved; failing that, a clearer error message to the user would help.
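The "1 in 65000" figure can be checked with a short calculation. This is a rough sketch that assumes the check is a search for a 4-byte signature and that the bytes in the window are uniformly random and independent, which real files are not; the function name is made up for illustration.

```python
# Back-of-the-envelope estimate under a uniform-random-bytes assumption.

def false_positive_probability(window_bytes: int = 64 * 1024,
                               signature_len: int = 4) -> float:
    """Chance that a fixed signature appears at least once in a random window."""
    positions = window_bytes - signature_len + 1  # possible start offsets
    p_single = 256.0 ** -signature_len            # match chance at one offset
    return 1.0 - (1.0 - p_single) ** positions


if __name__ == "__main__":
    p = false_positive_probability()
    print(f"about 1 in {1 / p:.0f}")
```

With a 64 KiB window and a 4-byte signature the estimate is roughly 65536 / 2^32, i.e. on the order of 1 in 65000, matching the figure above. Files with structured content can hit the signature far more often than this.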
As I noted at bug 16583, this issue has security implications: the code in MimeMagic.php that triggers these occasional false positives is also what's protecting MediaWiki from things like the GIFAR exploit (a file which is simultaneously a valid GIF image and an executable Java archive). That said, the error reporting could be cleaner: in particular, rather than detecting these files as application/zip, we should ideally first run them through normal MIME type detection and only then check for any unexpected ZIP EOCDR markers and, if any are found, fail with a message something like: "This file, apparently of type foo/bar, contains a marker suggesting it might also be a valid ZIP archive. For security reasons, uploading such files has been disabled." Also, it might be possible to reduce the false positive rate for the ZIP file detection, but doing so safely would have to involve checking how existing ZIP decoders (in particular, the Info-ZIP decoder and Java's java.util.zip classes) do it, lest we accidentally allow through files which, though not necessarily valid according to the ZIP format spec, might still be accepted by these decoders.
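The ordering suggested here (normal type detection first, ZIP marker check second, with a specific error) could look roughly like the Python sketch below. Everything in it is hypothetical: the tiny magic-number table, the function names, and the window size are illustrative assumptions, and it makes no attempt to mirror how Info-ZIP or java.util.zip actually locate an archive.

```python
# Sketch of "detect type first, then look for unexpected ZIP EOCDR markers".
# All names and the magic table are hypothetical, for illustration only.
from typing import Optional, Tuple

EOCD_SIGNATURE = b"PK\x05\x06"
TAIL_SIZE = 64 * 1024  # assumed scan window

# Deliberately tiny table of leading magic numbers. The OLE2 container is
# labelled as Word here for simplicity, although other formats share it.
LEADING_MAGIC = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/msword"),
    (b"PK\x03\x04", "application/zip"),
    (b"PK\x05\x06", "application/zip"),  # empty archive
]


def detect_mime(data: bytes) -> str:
    """Step 1: ordinary detection from the start of the file."""
    for magic, mime in LEADING_MAGIC:
        if data.startswith(magic):
            return mime
    return "application/octet-stream"


def check_upload(data: bytes) -> Tuple[str, Optional[str]]:
    """Step 2: reject non-ZIP types that also carry a ZIP EOCDR marker."""
    mime = detect_mime(data)
    if mime != "application/zip" and EOCD_SIGNATURE in data[-TAIL_SIZE:]:
        return mime, (
            f"This file, apparently of type {mime}, contains a marker "
            "suggesting it might also be a valid ZIP archive. For security "
            "reasons, uploading such files has been disabled."
        )
    return mime, None
```

The file is still rejected, so the GIFAR protection is kept, but the user sees the detected type and the real reason instead of a generic "corrupt or incorrect extension" message.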
It's not matching it as a zip by pure chance. That file contains a zip-like structure embedded. For instance, 7-zip is able to "open" it, showing inside the files ObjectPool/, [5]SummaryInformation, WordDocument, [1]CompObj, 1Table and [5]DocumentSummaryInformation.
The list given by unzip -l seems more reliable:

warning:  6060 extra bytes at beginning or within zipfile
  (attempting to process anyway)
 Length   EAs  ACLs    Date    Time   Name
--------  ---  ----    ----    ----   ----
     540    0     0  01/01/80  00:00  [Content_Types].xml
     310    0     0  01/01/80  00:00  _rels/.rels
     138    0     0  01/01/80  00:00  theme/theme/themeManager.xml
    7559    0     0  01/01/80  00:00  theme/theme/theme1.xml
     283    0     0  01/01/80  00:00  theme/theme/_rels/themeManager.xml.rels
--------  -----  -----                -------
    8830      0      0                5 files

Those XML files contain OpenXML content. It looks like Microsoft Office is lying when told to save in the old format, since it still includes data in OpenXML format.
Per comment #6, this really is a zip file, so I'm closing this as invalid, since rejecting such files is the actual purpose of the check. (As a side note, perhaps we could have something like a zip stripper?)
I don't like the idea of changing the user's file, but it could perhaps be integrated into new-upload: "Your file seems rotated. Fix?", "Currently I cannot accept this file with zip data included. Strip?" This case would (should?) be easy to strip. Another Office app also likes to embed its own format into PNGs, which we could remove, and PNGs with garbage appended are common, too.