Last modified: 2014-09-06 14:43:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T59236, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 57236 - mwdumper fails to import English wikipedia dump: ArrayIndexOutOfBoundsException; error in SQL syntax
Status: NEW
Product: Utilities
Classification: Unclassified
Component: mwdumper
Version: unspecified
Hardware: All All
Importance: Normal critical with 2 votes
Target Milestone: ---
Assigned To: Brion Vibber
Keywords: upstream
Depends on:
Blocks:
Reported: 2013-11-19 10:13 UTC by piotr.jagielski
Modified: 2014-09-06 14:43 UTC
CC List: 3 users

Description piotr.jagielski 2013-11-19 10:13:48 UTC
Hello

I'm trying to use mwdumper to import the latest English Wikipedia dump (enwiki-20131104-pages-articles.xml). It fails with the following error:

10 045 000 pages (1 658,325/sec), 10 045 000 revs (1 658,325/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
ERROR 1064 (42000) at line 79047: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''{{Infobox military person\n|name=Alexander Holle\n|birth_date=27 February 1898\' at line 1
Comment 1 piotr.jagielski 2014-04-12 16:06:27 UTC
Why is it unconfirmed? I ran into it again with the latest dump. Do you need additional information to reproduce it?
Comment 2 Andre Klapper 2014-04-12 17:01:28 UTC
It'll be confirmed when a second person has reproduced it.
Comment 3 piotr.jagielski 2014-04-12 17:27:42 UTC
Was anyone here able to import the latest dump (20140402) with mwdumper? If there is a chance that it's an issue with my local environment I'd be glad to know.
Comment 4 piotr.jagielski 2014-04-18 21:50:06 UTC
Is there anyone here who uses mwdumper to import the English Wikipedia XML dump? I have tried several dumps from the past few months and I always run into some blocking issue.
Comment 5 mad2one48 2014-06-18 01:18:34 UTC
I have the same problem with the file enwiki-20140502-pages-articles.xml
Comment 6 mad2one48 2014-06-18 01:51:07 UTC
13,200,000 pages (5,538.948/sec), 13,200,000 revs (5,538.948/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192
 at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
 at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
 at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
 at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1747)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2957)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
 at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
 at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
 at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
 at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
 at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
 at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
 at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:96)
 at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Comment 7 mad2one48 2014-06-18 12:30:08 UTC
Piotr, did you find out what the problem is?
Comment 8 Chris Padfield 2014-08-22 15:18:53 UTC
This is a Xerces bug, documented at https://issues.apache.org/jira/browse/XERCESJ-1257

The workaround suggested is to use the JVM's UTF-8 reader instead of the Xerces UTF8Reader.
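
A minimal sketch of that workaround, assuming the parser is invoked through JAXP's SAXParser.parse as in the stack traces above. The class name JvmUtf8ReaderSketch and the helpers parseWithJvmDecoder and countPages are hypothetical names made up for illustration; this is not mwdumper's actual code or a tested patch for it. The idea is to hand the parser a character stream (a java.io.Reader) so that the JVM's InputStreamReader decodes UTF-8 and the parser's own byte-level UTF8Reader is not used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class JvmUtf8ReaderSketch {

    // Hypothetical helper: decode bytes with the JVM's UTF-8 decoder and
    // give the SAX parser characters instead of raw bytes.
    static void parseWithJvmDecoder(InputStream in, DefaultHandler handler) throws Exception {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        // An InputSource built from a Reader makes the parser read the
        // character stream, so it does not decode the bytes itself.
        InputSource source = new InputSource(reader);
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, handler);
    }

    // Hypothetical demo: count <page> elements in a small XML string.
    static int countPages(String xml) throws Exception {
        final int[] pages = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("page".equals(qName)) {
                    pages[0]++;
                }
            }
        };
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        parseWithJvmDecoder(new ByteArrayInputStream(bytes), handler);
        return pages[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page><page><title>B</title></page></mediawiki>";
        System.out.println(countPages(xml));
    }
}
```

With a Reader supplied, the parser relies on the caller's decoding, so the stream has to really be UTF-8. Whether this avoids the exception on a full dump has not been verified here.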
Comment 9 Chris Padfield 2014-08-22 15:20:42 UTC
And definitely confirmed:

649,000 pages (1,281.975/sec), 649,000 revs (1,281.975/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Comment 10 Andre Klapper 2014-08-22 16:12:07 UTC
Oh, thanks for finding out!
Comment 11 piotr.jagielski 2014-09-06 14:43:12 UTC
The only workaround I came up with is trying a different dump. I was able to import enwiki-20140707-pages-articles.xml.
