Last modified: 2014-02-12 23:40:01 UTC
I tried to run the GUI version of the newest revision (r60229) of mwdumper under Java 6 update 17 on an Intel Core i7 with 3.25 GB RAM and WinXP SP3, and it gave this error:

Exception in thread "Thread-8" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Unknown Source)
    at java.lang.StringCoding.safeTrim(Unknown Source)
    at java.lang.StringCoding.access$300(Unknown Source)
    at java.lang.StringCoding$StringEncoder.encode(Unknown Source)
    at java.lang.StringCoding.encode(Unknown Source)
    at java.lang.String.getBytes(Unknown Source)
    at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:493)
    at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:603)
    at com.mysql.jdbc.ByteArrayBuffer.writeStringNoNull(ByteArrayBuffer.java:544)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1638)
    at com.mysql.jdbc.Connection.execSQL(Connection.java:2972)
    at com.mysql.jdbc.Connection.execSQL(Connection.java:2902)
    at com.mysql.jdbc.Statement.execute(Statement.java:529)
    at org.mediawiki.importer.SqlServerStream.writeStatement(SqlServerStream.java:25)
    at org.mediawiki.importer.SqlWriter.flushInsertBuffer(SqlWriter.java:195)
    at org.mediawiki.importer.SqlWriter.bufferInsertRow(SqlWriter.java:184)
    at org.mediawiki.importer.SqlWriter15.writeRevision(SqlWriter15.java:68)
    at org.mediawiki.importer.PageFilter.writeRevision(PageFilter.java:67)
    at org.mediawiki.dumper.ProgressFilter.writeRevision(ProgressFilter.java:56)
    at org.mediawiki.importer.XmlDumpReader.closeRevision(XmlDumpReader.java:346)
    at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:204)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)

According to the Java docs, the default max heap size is 1/4 of the physical memory, that is, around 800M. Since a single revision is at most 2M, there is no reason for mwdumper to require that much space. (It ran on the huwiki full-history dump, writing directly to the database.)
After manually raising the max heap size, it ran smoothly, unlike the older versions available from download.wikimedia.org, which didn't even start. Is there any reason to recommend the broken old versions instead of a current one? ([[mw:MWDumper]] points to a third version, attached to a bug report, which didn't seem to work either.)
The solution seems to be to increase the size of the heap, as explained at http://www.mediawiki.org/wiki/Manual:MWDumper#Troubleshooting. I'll mark this bug as Resolved/Worksforme; if the bug reporter feels this is still an issue, please reopen the bug.
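For reference, the maximum heap can be raised with the JVM's -Xmx flag on the command line; the heap size, jar name, and file names below are illustrative placeholders, not taken from this report:

```shell
# Raise the JVM max heap (here to 600 MB) before running mwdumper;
# file names are placeholders.
java -Xmx600m -jar mwdumper.jar --format=sql:1.5 dump.xml > dump.sql
```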
As a bigger question, though: why does it need so much memory? Doesn't it interpret the dump a little at a time, and so shouldn't need all that much memory?
(In reply to comment #2)
> The solution seems to be to increase the size of the heap as explained on
> http://www.mediawiki.org/wiki/Manual:MWDumper#Troubleshooting

Yeah, I'm aware of that, since I was the one who added it there :) The point, as Bawolff said, is that MWDumper should not need a default heap size of ~1GB when the largest revision is below 2MB. Either there is a memory leak, or something is done really inefficiently.
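One plausible contributor, given that the stack trace dies in SqlWriter.flushInsertBuffer during String.getBytes(): buffering many rows into one large multi-row INSERT means that at flush time the whole statement is duplicated as a byte array before the driver sends it, so peak memory is roughly twice the buffer size. The sketch below is a hypothetical illustration of that pattern, not mwdumper's actual code; class and method names are invented:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch (NOT mwdumper's real implementation): rows are
// appended to one multi-row INSERT and flushed past a size threshold.
// getBytes() below allocates a second full copy of the statement, which
// is the kind of spike the reported stack trace points at.
class InsertBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private final int flushThreshold;
    long bytesFlushed = 0;

    InsertBuffer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void bufferRow(String row) {
        // First row starts the statement; later rows are comma-separated.
        buffer.append(buffer.length() == 0 ? "INSERT INTO page VALUES " : ",")
              .append(row);
        if (buffer.length() >= flushThreshold) {
            flush();
        }
    }

    void flush() {
        if (buffer.length() == 0) return;
        // The whole buffered statement is duplicated as bytes here,
        // doubling peak memory for one large flush.
        byte[] payload = buffer.toString().getBytes(StandardCharsets.UTF_8);
        bytesFlushed += payload.length;
        buffer.setLength(0);
    }
}
```

With a multi-megabyte revision in the buffer, one flush transiently needs both the String and its encoded byte copy, which is consistent with an OOM well below the total dump size.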