Last modified: 2011-09-18 07:22:28 UTC
$ md5sum itwiki-20110130-pages-articles.xml.bz2
7eac57c7c521bf6f36e9a5d7ec476562  itwiki-20110130-pages-articles.xml.bz2

which is fine, according to http://dumps.wikimedia.org/itwiki/20110130/itwiki-20110130-md5sums.txt

but...

$ bunzip2 itwiki-20110130-pages-articles.xml.bz2
bunzip2: Data integrity error when decompressing.
        Input file = itwiki-20110130-pages-articles.xml.bz2, output file = itwiki-20110130-pages-articles.xml
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
bunzip2: Deleting output file itwiki-20110130-pages-articles.xml, if it exists.

$ bunzip2 -tvv itwiki-20110130-pages-articles.xml.bz2
  itwiki-20110130-pages-articles.xml.bz2:
    [1: huff+mtf rt+rld]
    [2: huff+mtf rt+rld]
    [.... snip ....]
    [2510: huff+mtf rt+rld]
    [2511: huff+mtf rt+rld]
    [2512: huff+mtf data integrity (CRC) error in data
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
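For reference, the two manual steps above (checking the published md5, then test-decompressing) can be scripted so a bad download is caught before any import starts. This is only a sketch; `verify_dump` and its arguments are made-up names, not part of the dump tooling. `bzip2 -t` decompresses to nowhere and exits nonzero on a CRC error, which is exactly the failure shown above.

```shell
# Sketch of an automated pre-import check (hypothetical helper, not
# part of the Wikimedia dump scripts). Arguments: the .bz2 dump and a
# local copy of the published md5sums file.
verify_dump() {
    dump="$1"; sums="$2"
    # Pick this file's line out of the md5sums list and verify it.
    grep " $dump\$" "$sums" | md5sum -c - || return 1
    # Test-decompress without writing output; fails on CRC errors.
    bzip2 -t "$dump" || return 1
    echo "OK: $dump"
}
```

Note that the md5 check alone would not have caught this particular bug (the published checksum matched the broken file), which is why the `bzip2 -t` step is worth keeping as well.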
Got this too. Thanks for reporting it.
Rerunning this job from the command line. It should be done in a couple hours and I'll have a look. I've saved a copy of the old bad file elsewhere on the off chance that it's useful for comparison.
The new file looks normal, as far as I can tell. Can you check it, please?
Indeed, my import script got past the download and unzip stages. Thanks a lot, and good luck with the broken file.
New instance of the issue, this time with http://download.wikimedia.org/eswiki/20110511/eswiki-20110511-pages-articles.xml.bz2
The bzip2 compression step appears to die partway through once in a while. I'm going to have to add a check for that. So far I've failed to reproduce it on my laptop (probably because the files I generate there aren't large enough). I'll rerun that step so we have a good file in the meantime.
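A check like the one described could look roughly like the sketch below: after compressing, test-decompress the output and retry the compression step a bounded number of times. The function name and retry count are assumptions for illustration; the real dump scripts' job structure isn't shown in this thread.

```shell
# Hypothetical post-compression integrity check with retries (sketch,
# not the actual dump-script code). Compresses "$1" to "$1.bz2",
# keeping the source (-k), and verifies the result with bzip2 -t.
compress_checked() {
    src="$1"; out="$1.bz2"
    for attempt in 1 2 3; do
        bzip2 -k -f "$src" && bzip2 -t "$out" && return 0
        echo "attempt $attempt: $out failed integrity test, retrying" >&2
    done
    return 1
}
```

Keeping the source file around (`-k`) matters here: if the output fails the `bzip2 -t` check, the input is still available for the retry.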
I confirm the new eswiki file is OK.
Closing.