Last modified: 2014-09-06 14:43:12 UTC
Hello I'm trying to use mwdumper to import the latest English Wikipedia dump (enwiki-20131104-pages-articles.xml). It fails with the following error: 10á045á000 pages (1á658,325/sec), 10á045á000 revs (1á658,325/sec) Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048 at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source) at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unk nown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContent Dispatcher.dispatch(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Un known Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Sour ce) at javax.xml.parsers.SAXParser.parse(Unknown Source) at javax.xml.parsers.SAXParser.parse(Unknown Source) at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88) at org.mediawiki.dumper.Dumper.main(Dumper.java:142) ERROR 1064 (42000) at line 79047: You have an error in your SQL syntax; check th e manual that corresponds to your MySQL server version for the right syntax to u se near ''{{Infobox military person\n|name=Alexander Holle\n|birth_date=27 Febru ary 1898\' at line 1
Why is it unconfirmed? I run into it into again with the latest dump. Do you need additional information to reproduce it?
It'll be confirmed when a second person has reproduced it.
Was anyone here able to import the latest dump (20140402) with mwdumper? If there is a chance that it's an issue with my local environment I'd be glad to know.
Is there anyone here that uses mwdumper to import English Wikipedia XML dump? I tried several ones from past few months and I'm always running into some blocking issue.
I have the same problem with the file enwiki-20140502-pages-articles.xml
13,200,000 pages (5,538.948/sec), 13,200,000 revs (5,538.948/sec) Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192 at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546) at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753) at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629) at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1747) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2957) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333) at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:96) at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
piotr did you find out the problem ?
This is a Xerces bug, documented at https://issues.apache.org/jira/browse/XERCESJ-1257 The workaround suggested is to use the JVM's UTF-8 reader instead of the Xerces UTF8Reader.
And definitely confirmed: 649,000 pages (1,281.975/sec), 649,000 revs (1,281.975/sec) Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048 at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source) at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) at javax.xml.parsers.SAXParser.parse(SAXParser.java:392) at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88) at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Oh, thanks for finding out!
The only workaround I came up with is trying a different dump. I was able to import enwiki-20140707-pages-articles.xml.