Last modified: 2013-02-06 20:35:35 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T40944, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 38944 - Previewed regex differs from actual replacement (as one is PHP's handling, and the other is the database's)
Previewed regex differs from actual replacement (as one is PHP's handling, an...
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
ReplaceText (Other open bugs)
REL1_19-branch
All All
: Low normal (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-08-02 08:32 UTC by Sam Wilson
Modified: 2013-02-06 20:35 UTC (History)
1 user (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Sam Wilson 2012-08-02 08:32:47 UTC
The preview of the replacement to be made by the following
regex does not match what is actually replaced.

      Original text: Document Number=(POL|PRO) [0-9]+\.([0-9]+)
   Replacement text: $2

The original texts are of the form:

    Document Number=POL 1.23
    Document Number=PRO 23.5
    Document Number=PRO 2.9

and so on.  The idea is to be left with:

    Document Number=23
    Document Number=5
    Document Number=9

i.e. strip what are actually document types and manual section
numbers, and leave only what's after the dot.  The regex I gave
above, when previewed with these original values, only highlights:

    Document Number=POL 1.2
    Document Number=PRO 23.5
    Document Number=PRO 2.9

That is, it is not being greedy about the numbers after the dot, 
but it is being so with the numbers before it.  I tried removing 
the nongreedy-ing 'U' in

    $targetStr = "/$target/U";

in SpecialReplaceText.php extractContext(), and it correctly highlighted 
the whole thing.  But I haven't really read though the ramifications 
of doing that.

Anyway, the point is that when I actually *run* the replacement
(with unmodified code) the correct, greedy, replacement *is* made!

I hope this all makes sense.  :-)  Thanks!
Comment 1 Yaron Koren 2012-08-02 19:17:03 UTC
I'm not surprised that there are differences between the two, since one uses PHP's regex handling, and the other uses MySQL's (or whatever database system is being used) - actually, the surprising thing is that the two work as similarly as they do. It would probably take a lot of work to get the two to match each other more closely.
Comment 2 Sam Wilson 2012-08-03 06:18:17 UTC
Well, I guess it's much faster this way, and one just needs to know to construct regexes that are compatible with both PHP and the DBMS.  But I think what I'm seeing here is not about incompatibility between the regex engines: because the lines are being found correctly, but just highlighted wrongly.

The process seems to be as follows....

To preview, in SpecialReplaceText.php:

1. Use the DB's regexp to find the pages --
   regexCond(): "$column $op " . $dbr->addQuotes( $regex );
2. Then find the lines in each page that match --
   preg_match_all("/$target/", $text, $matches, PREG_OFFSET_CAPTURE);
3. Then, for each matching line, highlight the result --
   $targetStr = "/$target/U";
   preg_replace( $targetStr, '<span class="searchmatch">\0</span>', $snippet);
4. Then create the job, saving the page name, regex, etc.

Then to replace, ReplaceTextJob.php:

1. For each job, create new text --
   preg_replace( '/'.$target_str.'/U', $replacement_str, $article_text, -1, $num_matches );

So is it just a matter of removing the Ungreedy modifiers?  That fixes the highlighting problem that I'm seeing, but I'm sure other people know better than I about what else that would break!

Thanks for taking the time to look at this.  :-)

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links