Last modified: 2013-02-06 20:35:35 UTC
The preview of the replacement made by the following regex does not match what is actually replaced.

Original text: Document Number=(POL|PRO) [0-9]+\.([0-9]+)
Replacement text: $2

The original texts are of the form:

Document Number=POL 1.23
Document Number=PRO 23.5
Document Number=PRO 2.9

and so on. The idea is to be left with:

Document Number=23
Document Number=5
Document Number=9

i.e. strip what are actually document types and manual section numbers, and leave only what's after the dot.

The regex I gave above, when previewed with these original values, only highlights:

Document Number=POL 1.2
Document Number=PRO 23.5
Document Number=PRO 2.9

That is, it is not being greedy about the numbers after the dot, but it is being so with the numbers before it.

I tried removing the ungreedy 'U' modifier in $targetStr = "/$target/U"; in extractContext() in SpecialReplaceText.php, and it correctly highlighted the whole thing. But I haven't really read through the ramifications of doing that.

Anyway, the point is that when I actually *run* the replacement (with unmodified code) the correct, greedy, replacement *is* made!

I hope this all makes sense. :-) Thanks!
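The behaviour described above can be reproduced outside the extension. The following is a minimal sketch, assuming plain PHP with PCRE; the variable names ($line, $lazy, $greedy, and so on) are illustrative and not taken from the extension. With the U modifier both [0-9]+ groups are lazy: the one before the dot is forced to grow until it reaches the dot, while the one after the dot stops at a single digit. The replacement still looks greedy because the unmatched trailing digit is simply left in place.

```php
<?php
// Pattern and replacement from the report; single quotes keep "\." and "$2" literal.
$target = 'Document Number=(POL|PRO) [0-9]+\.([0-9]+)';
$line = 'Document Number=POL 1.23';

// U modifier: quantifiers are lazy by default.
preg_match("/$target/U", $line, $lazy);
// No modifier: quantifiers are greedy by default.
preg_match("/$target/", $line, $greedy);

echo $lazy[0], "\n";   // Document Number=POL 1.2
echo $greedy[0], "\n"; // Document Number=POL 1.23

// Lazy: "Document Number=POL 1.2" is replaced by "2", and the trailing "3" stays.
$lazyResult = preg_replace("/$target/U", '$2', $line);
// Greedy: the whole line is replaced by "23".
$greedyResult = preg_replace("/$target/", '$2', $line);

echo $lazyResult, "\n";   // 23
echo $greedyResult, "\n"; // 23
```

So, for this particular pattern, the highlighted span differs between the two modes but the replaced text does not, which would be consistent with the preview and the actual result appearing to disagree.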
I'm not surprised that there are differences between the two, since one uses PHP's regex handling, and the other uses MySQL's (or whatever database system is being used) - actually, the surprising thing is that the two work as similarly as they do. It would probably take a lot of work to get the two to match each other more closely.
Well, I guess it's much faster this way, and one just needs to know to construct regexes that are compatible with both PHP and the DBMS.

But I don't think what I'm seeing here is an incompatibility between the regex engines: the lines are being found correctly, just highlighted wrongly. The process seems to be as follows....

To preview, in SpecialReplaceText.php:

1. Use the DB's regexp to find the pages -- regexCond():
   "$column $op " . $dbr->addQuotes( $regex );
2. Then find the lines in each page that match --
   preg_match_all("/$target/", $text, $matches, PREG_OFFSET_CAPTURE);
3. Then, for each matching line, highlight the result --
   $targetStr = "/$target/U";
   preg_replace( $targetStr, '<span class="searchmatch">\0</span>', $snippet);
4. Then create the job, saving the page name, regex, etc.

Then to replace, in ReplaceTextJob.php:

1. For each job, create new text --
   preg_replace( '/'.$target_str.'/U', $replacement_str, $article_text, -1, $num_matches );

So is it just a matter of removing the Ungreedy modifiers? That fixes the highlighting problem that I'm seeing, but I'm sure other people know better than I do what else that would break!

Thanks for taking the time to look at this. :-)
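As a rough illustration of the highlighting step listed above, and of what removing the modifier might change, here is a minimal sketch. The highlight() helper, its $modifiers parameter, and the [[A]] and [[B]] sample text are hypothetical and only mirror the preg_replace call quoted from extractContext(); this is not the extension's actual code.

```php
<?php
// Hypothetical helper mirroring the highlighting call; not the extension's code.
function highlight(string $target, string $snippet, string $modifiers = ''): string
{
    return preg_replace(
        "/$target/$modifiers",
        '<span class="searchmatch">\0</span>',
        $snippet
    );
}

$target = 'Document Number=(POL|PRO) [0-9]+\.([0-9]+)';
$snippet = 'Document Number=POL 1.23';

// With U the span closes before the final digit.
$withU = highlight($target, $snippet, 'U');
// Without U the span covers the whole number.
$withoutU = highlight($target, $snippet);

echo $withU, "\n";    // <span class="searchmatch">Document Number=POL 1.2</span>3
echo $withoutU, "\n"; // <span class="searchmatch">Document Number=POL 1.23</span>

// A pattern that relies on the lazy default behaves differently once U is dropped.
$wikiLink = '\[\[.*\]\]';
$text = '[[A]] and [[B]]';
preg_match("/$wikiLink/U", $text, $lazy);
preg_match("/$wikiLink/", $text, $greedy);

echo $lazy[0], "\n";   // [[A]]
echo $greedy[0], "\n"; // [[A]] and [[B]]
```

The last two matches hint at one thing that dropping the modifier could break: any existing pattern written with .* or .+ on the assumption of lazy matching would start consuming as much text as possible, in both the preview and the replacement job.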