Last modified: 2013-08-19 14:01:07 UTC
The dump file I'm reading is enwiki-latest-pages-articles.xml.bz2 (Aug 08, 2011). I'm inserting the values into a MySQL DB according to the wiki SQL DB definition, after removing the tables' indexes and constraints. I would be more than glad to know if there's a way to work around this: ignore the problematic rows, continue reading, and write the rest of the file. Thanks.

2,260,000 pages (36.843/sec), 2,260,000 revs (36.843/sec)
2,261,000 pages (36.842/sec), 2,261,000 revs (36.842/sec)
2,262,000 pages (36.841/sec), 2,262,000 revs (36.841/sec)
2,263,000 pages (36.839/sec), 2,263,000 revs (36.839/sec)
2,264,000 pages (36.837/sec), 2,264,000 revs (36.837/sec)
2,265,000 pages (36.838/sec), 2,265,000 revs (36.838/sec)
java.io.IOException: java.sql.SQLException: Incorrect string value: '\xF0\x9D\x9E\xB1_\xF0...' for column 'page_title' at row 9
    at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
    at org.mediawiki.dumper.gui.DumperGui$1.run(DumperGui.java:206)
Caused by: org.xml.sax.SAXException: java.sql.SQLException: Incorrect string value: '\xF0\x9D\x9E\xB1_\xF0...' for column 'page_title' at row 9
    at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:227)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
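
The bytes quoted in the error, \xF0\x9D\x9E\xB1, form a four-byte UTF-8 sequence, meaning the title contains a character above U+FFFF. MySQL's `utf8` character set stores at most three bytes per character, so this is most likely what the server rejects, assuming the page_title column uses that character set. A minimal sketch of how such titles could be detected before the insert, so that a row can be logged or skipped; the class `TitleCheck` and its method are hypothetical and are not part of mwdumper:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of mwdumper.
final class TitleCheck {

    /** Returns true if s contains a code point above U+FFFF (four bytes in UTF-8). */
    static boolean hasSupplementaryChars(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // The first four bytes from the error message.
        byte[] raw = {(byte) 0xF0, (byte) 0x9D, (byte) 0x9E, (byte) 0xB1};
        String ch = new String(raw, StandardCharsets.UTF_8);

        System.out.printf("U+%X%n", ch.codePointAt(0));          // prints U+1D7B1
        System.out.println(hasSupplementaryChars(ch));           // prints true
        System.out.println(hasSupplementaryChars("Main_Page"));  // prints false
    }
}
```

A check like this only identifies the affected rows; whether to skip them or change the column type is a separate decision.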
I found a way to fix the problem. The SQL schema file provided by Wikipedia at http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql has some defaults in it that need to be changed. Steps for the solution:
1. In the page table definition, change the type of the page_title field from varchar to varbinary.
2. In the revision table, change the type of the rev_comment field from tinyblob to mediumblob. (This is not related to the exception I described above, but it is still necessary if you want to avoid future exceptions.)
That's it, you're good to go.
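
If the tables have already been created, the two changes above could also be applied with ALTER TABLE instead of editing tables.sql. The sketch below only builds the statements. The class `SchemaFix` is hypothetical, the field in step 2 is written as `rev_comment` (its name in the MediaWiki schema), and the column length 255 and the NOT NULL clauses are assumptions that should be checked against your own copy of the schema:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of mwdumper or MediaWiki.
final class SchemaFix {

    /**
     * Builds the two column changes described above.
     * Assumption: length 255 and NOT NULL match the existing definitions.
     */
    static List<String> alterStatements() {
        return Arrays.asList(
            "ALTER TABLE page MODIFY page_title VARBINARY(255) NOT NULL",
            "ALTER TABLE revision MODIFY rev_comment MEDIUMBLOB NOT NULL");
    }

    public static void main(String[] args) {
        // Print the statements so they can be reviewed before being run.
        for (String sql : alterStatements()) {
            System.out.println(sql + ";");
        }
    }
}
```

The printed statements can be pasted into the mysql client, or each string can be passed to `java.sql.Statement.executeUpdate` on an open connection.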
eyal: (In reply to comment #0)
> the dump file i'm reading is :
> enwiki-latest-pages-articles.xml.bz2(aug 08,2011)

The exact and full command you used for this is welcome (without any user password, of course). Did you use the old/outdated jar from http://download.wikimedia.org/tools/ or the source from trunk/master?
eyal: Could you answer comment 2 please?
Unfortunately closing this report as no further information has been provided. eyal: Please feel free to reopen this report if you can provide the information asked for and if this still happens. Thanks!