Last modified: 2014-09-06 14:43:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T59236, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 57236 - mwdumper fails to import English wikipedia dump: ArrayIndexOutOfBoundsException; error in SQL syntax


Summary:	mwdumper fails to import English wikipedia dump: ArrayIndexOutOfBoundsExcepti...

Status:	NEW

Product:	Utilities
Classification:	Unclassified
Component:	mwdumper (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal critical with 2 votes
Target Milestone:	---
Assigned To:	Brion Vibber

URL:
Whiteboard:
Keywords:	upstream

Depends on:
Blocks:

Reported:	2013-11-19 10:13 UTC by piotr.jagielski
Modified:	2014-09-06 14:43 UTC (History)
CC List:	3 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description piotr.jagielski 2013-11-19 10:13:48 UTC

Hello

I'm trying to use mwdumper to import the latest English Wikipedia dump (enwiki-20131104-pages-articles.xml). It fails with the following error:

10 045 000 pages (1 658,325/sec), 10 045 000 revs (1 658,325/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
ERROR 1064 (42000) at line 79047: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''{{Infobox military person\n|name=Alexander Holle\n|birth_date=27 February 1898\' at line 1

Comment 1 piotr.jagielski 2014-04-12 16:06:27 UTC

Why is it unconfirmed? I ran into it again with the latest dump. Do you need additional information to reproduce it?

Comment 2 Andre Klapper 2014-04-12 17:01:28 UTC

It'll be confirmed when a second person has reproduced it.

Comment 3 piotr.jagielski 2014-04-12 17:27:42 UTC

Was anyone here able to import the latest dump (20140402) with mwdumper? If there is a chance that it's an issue with my local environment I'd be glad to know.

Comment 4 piotr.jagielski 2014-04-18 21:50:06 UTC

Is there anyone here who uses mwdumper to import the English Wikipedia XML dump? I have tried several dumps from the past few months and I always run into some blocking issue.

Comment 5 mad2one48 2014-06-18 01:18:34 UTC

I have the same problem with the file enwiki-20140502-pages-articles.xml

Comment 6 mad2one48 2014-06-18 01:51:07 UTC

13,200,000 pages (5,538.948/sec), 13,200,000 revs (5,538.948/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192
 at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
 at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
 at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
 at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1747)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2957)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
 at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
 at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
 at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
 at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
 at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
 at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
 at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
 at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:96)
 at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

Comment 7 mad2one48 2014-06-18 12:30:08 UTC

Piotr, did you find the cause of the problem?

Comment 8 Chris Padfield 2014-08-22 15:18:53 UTC

This is a Xerces bug, documented at https://issues.apache.org/jira/browse/XERCESJ-1257

The workaround suggested is to use the JVM's UTF-8 reader instead of the Xerces UTF8Reader.
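
A minimal sketch of what that workaround could look like in Java, assuming the parser is handed a character stream instead of raw bytes. When a SAX InputSource carries a Reader, the parser consumes already-decoded characters and does not need its own byte decoder. This is not mwdumper's actual code: the class name Utf8ReaderWorkaround, the method parseWithJvmDecoder, and the ElementCounter handler are hypothetical names used only for illustration, and how XmlDumpReader currently builds its input is not shown in this report.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Utf8ReaderWorkaround {

    /** Hypothetical handler that only counts start tags; stands in for a real dump handler. */
    static class ElementCounter extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            elements++;
        }
    }

    /**
     * Parses UTF-8 encoded XML, letting the JVM decode bytes to characters.
     * The parser receives a Reader, so its internal byte-level UTF-8 reader is not involved.
     */
    static void parseWithJvmDecoder(InputStream rawBytes, DefaultHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Decode with the JVM's UTF-8 decoder; buffering keeps reads on large dumps cheap.
        Reader chars = new BufferedReader(new InputStreamReader(rawBytes, StandardCharsets.UTF_8));
        parser.parse(new InputSource(chars), handler);
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in document with a character outside the Basic Multilingual Plane.
        byte[] xml = "<mediawiki><page><title>\uD83D\uDE00</title></page></mediawiki>"
                .getBytes(StandardCharsets.UTF_8);
        ElementCounter counter = new ElementCounter();
        parseWithJvmDecoder(new ByteArrayInputStream(xml), counter);
        System.out.println("elements seen: " + counter.elements);
    }
}
```

One trade-off to keep in mind: InputStreamReader silently substitutes U+FFFD for malformed byte sequences, so a genuinely corrupt dump would import with replacement characters instead of failing.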

Comment 9 Chris Padfield 2014-08-22 15:20:42 UTC

And definitely confirmed:

649,000 pages (1,281.975/sec), 649,000 revs (1,281.975/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

Comment 10 Andre Klapper 2014-08-22 16:12:07 UTC

Oh, thanks for finding out!

Comment 11 piotr.jagielski 2014-09-06 14:43:12 UTC

The only workaround I came up with is trying a different dump. I was able to import enwiki-20140707-pages-articles.xml.
