Last modified: 2011-01-25 00:31:02 UTC
Bug has been encountered on fr.wikisource: MediaWiki 1.16alpha-wmf (r58524), PHP 5.2.4-2ubuntu5.7wm1 (apache2handler), MySQL 4.0.40-wikimedia-log.

When the text layer of a DjVu file contains « ") », the MediaWiki parser produces an empty page, and the text layer is then shifted by one page relative to the images. An example of a problematic DjVu file can be found here: http://commons.wikimedia.org/w/index.php?title=File:Sima_qian_chavannes_memoires_historiques_v4.djvu&oldid=31865251

In particular, page 80 contains the following text (bad scan quality): « La quatrième année (.\"),*)()) ». The problem can be seen in the proofread version of this scan:
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit : the end of the text is missing
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit : no text layer
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit : text layer and image no longer match

I have been able to track down and fix the bug in my local MediaWiki installation (same branch, same revision as fr.wikisource). The problem is located in DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular expression treats any ") as the end-of-page marker, but a backslash before the double quote should prevent this interpretation. I replaced the current regular expression with this one, and the problem is now fixed:

$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );

Note on the regular expression: this is an adaptation of the standard pattern for matching text between double quotes with backslash as the escape character, which in Perl would be: "((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").) corresponds to the trivial [^"\\]; the problem is that, inside a PHP string, [^\"] and [^"] are not really the same thing…
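A minimal, self-contained sketch of the escaping issue, written in Python as an illustrative analogue rather than MediaWiki's actual code (the "fixed" pattern below uses the non-possessive form of the quoted-string idiom, which is equivalent for this purpose):

```python
import re

# A djvutxt page record looks roughly like: (page x1 y1 x2 y2 "text").
# The naive pattern ends the string at the first `"` that lets the rest
# of the pattern match, so an escaped \" inside the text can terminate
# the match early -- exactly the bug described above.
naive = re.compile(r'\(page\s+(?:[\d-]+\s+){4}"(.*?)"\s*\)', re.S)

# Escape-aware pattern: the string body is a run of either an escaped
# pair (backslash + any character) or a character that is neither a
# backslash nor a double quote -- the classic quoted-string idiom.
fixed = re.compile(r'\(page\s+(?:[\d-]+\s+){4}"((?:\\.|[^"\\])*)"\s*\)', re.S)

record = r'(page 0 0 100 100 "La quatrième année (.\"),*)()) ")'

print(naive.search(record).group(1))  # truncated at the escaped quote
print(fixed.search(record).group(1))  # full text, escape preserved
```

With the naive pattern the capture stops just before the escaped quote; with the escape-aware pattern the whole page text is captured, including everything after « \" ».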
In the DjVu file File:Post- och Inrikes Tidningar 1836-01-27.djvu, page 4 contains the two-character sequence ") properly escaped. Later on the same page is the word "Eskilstuna", which you can search for and find in djview if you download the DjVu file. But text extraction for the Wikisource ProofreadPage extension stops at the "). To verify this, go to http://en.wikisource.org/wiki/Page:Post-_och_Inrikes_Tidningar_1836-01-27.djvu/4 and click "create". (But don't actually create that page on the English Wikisource; it already exists on the Swedish Wikisource.)
To extract the OCR text (without pixel coordinates for each word) for page NNN, this command should do it:

djvused -e 'select NNN; print-pure-txt' FILENAME.djvu
Page /66 of commons:File:Östgötars_minne.djvu contains the two-character sequence ") and that is where the extracted text ends. For /67 the extracted text is empty. For /68, the extracted text is the one that belongs to the /67 image. All subsequent pages have the text layer off by one or more pages. The OCR quality is low (it comes from Google), so a new OCR should be generated before proofreading. But until then, this file is another test case for this bug. http://sv.wikisource.org/wiki/Index:%C3%96stg%C3%B6tars_minne.djvu
The proposed patch is a Perl-compatible regexp. I am not familiar with that syntax, which is why I have not committed it. Could someone have a look at it, or provide a POSIX regexp?
> or provide a posix regexp ?

That's not possible. Matching C-like quoted strings needs look-ahead and possessive operators, which are not available in POSIX syntax. But if you have any questions, feel free to contact me (I'm Sloonz on fr.wikisource).
I tested your patch on this djvu file: http://fr.wikisource.org/wiki/Livre:Revue_des_Romans_%281839%29.djvu The file does not have the bug; djvu text extraction works without the patch. With the patch, pages are no longer aligned with the text.
http://en.wikisource.org/wiki/Index:Blackwood%27s_Magazine_volume_003.djvu has a problem at page 122: http://en.wikisource.org/w/index.php?title=Page:Blackwood%27s_Magazine_volume_003.djvu/122
> With the patch, pages are no longer aligned with the text.

Strange; when I made the patch, I didn't see this problem. I'll look into it this week.
Created attachment 7557 [details]
Patch

Found the problem (I had dropped the empty-page case). Attached an updated patch that fixes it. By applying htmlspecialchars after the matching phase, it becomes possible to get rid of the unreadable look-ahead. And I commented the regexp using the /x modifier of PCRE. But it is still not possible to convert this into a POSIX regexp, since ereg_* doesn't have an equivalent of preg_replace_callback. Also, your file has a problem on page 8 (http://fr.wikisource.org/w/index.php?title=Page:Revue_des_Romans_%281839%29.djvu/8&action=edit). As a side effect, the patch fixes that too ;)
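The idea behind the updated patch — keep the regex simple and apply HTML escaping to the captured text in a callback, after matching — can be sketched in Python. This is an illustrative analogue of the approach, not the PHP code from the actual patch:

```python
import html
import re

# Escape-aware page-record pattern (see the report above); the `*`
# quantifier also covers the empty-page case "".
PAGE_RE = re.compile(r'\(page\s+(?:[\d-]+\s+){4}"((?:\\.|[^"\\])*)"\s*\)', re.S)

def page_to_xml(match):
    # Escaping happens here, after matching, so the pattern itself
    # needs no look-ahead tricks to cope with characters like < or &.
    return '<PAGE value="%s" />' % html.escape(match.group(1), quote=True)

txt = '(page 0 0 10 10 "a < b & c")(page 0 0 10 10 "")'
print(PAGE_RE.sub(page_to_xml, txt))
# → <PAGE value="a &lt; b &amp; c" /><PAGE value="" />
```

Separating matching from escaping is the same design choice the patch makes with preg_replace_callback: each phase stays readable on its own.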
Thanks for the patch and the detailed explanation. I committed it (r69139).
It would be nice if this bug fix could be considered for deployment to the Wikisource sites ahead of the scheduled updates (next full application review). It is a minor bug with major impediments for the works affected: it leaves a blank page, misaligns the text, and requires every subsequent page in a work to be moved forward incrementally. Simple arithmetic: even if we have only 20 broken works, and DjVu files are typically 200-500 pages in size, that would already equate to somewhere between 2000-8000 page moves. Thanks for any consideration that could be given to this request.
Well, in the meanwhile, it's still possible to manually fix the broken DjVu files; my own PDF-to-DjVu converter has these lines:

# Workaround for MediaWiki bug #21526
# see https://bugzilla.wikimedia.org/show_bug.cgi?id=21526
$text =~ s/"(?=\s*\))//g;

A quick look at man djvused gives me this simple command to fix a DjVu file (untested):

cp thefile.djvu thefile-fixed.djvu; djvused thefile.djvu -e output-all | perl -pe 's/"(?=\s*\))//g' | djvused thefile-fixed.djvu -s
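For illustration, the converter's substitution can be reproduced in Python: it deletes any double quote whose next non-whitespace character is a closing parenthesis, so the troublesome ") sequence never makes it into the DjVu text layer. This mirrors the Perl one-liner as applied to raw OCR text before escaping; whether the piped djvused command above behaves identically on already-escaped output is untested, as the comment says:

```python
import re

def strip_quote_before_paren(text):
    # Delete a double quote whenever the next non-space character is ')',
    # i.e. the two-character sequence that confused the MediaWiki parser.
    return re.sub(r'"(?=\s*\))', '', text)

print(strip_quote_before_paren('La quatrième année (.")'))
# → La quatrième année (.)
```

Quotes not followed by a parenthesis are left untouched, so ordinary quoted words in the OCR text survive the workaround.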
This has been reported as fixed for some time now, and even after asking nicely for it to be given some priority for the Wikisource sites, there has been neither action nor any evidence of it being noticed. Something, somewhere, somehow would be nice; even a rough indication of who needs to sleep with whom, and where we have to send the photographs, would be helpful. :-)
Deployed now. Note that the effect of create_function() is to create a global function with a random name and to return the name. Calling it in a loop will eventually use up all memory, because there is no way to delete global functions once they are created. For this reason alone, it shouldn't be used. But it is also slow, requiring a parse operation that is uncached by APC, and it's insecure in the sense that eval() is insecure: construction of PHP code can easily lead to arbitrary execution if user input is included in the code.
Many thanks to all. As a side note to Wikisourcerers: the files need to be purged at Commons to get them to reload the text layer properly.
@Tim Starling I wasn't aware of the performance issues of using create_function, sorry. But since the created function is static, it should be trivial to factor it out; I used create_function only because I'm used to using blocks in Ruby. The corresponding function should just be:

function convert_page_to_xml($matches) {
	return '<PAGE value="'.htmlspecialchars($matches[1]).'" />';
}

Anyway, since the text layer is computed only once and then cached, I don't think that's a big issue.
(In reply to comment #16)
Tim fixed the issue in r78046. The two revisions were then merged from trunk in r78047.