Last modified: 2009-05-18 23:52:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T17099, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 15099 - Bad regexes make at least some of the blacklist get ignored
Bad regexes make at least some of the blacklist get ignored
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: Spam Blacklist (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal major (vote)
Target Milestone: ---
Assigned To: Tim Starling
URL: http://en.wikipedia.org/w/index.php?t...
Keywords:
Depends on:
Blocks:
Reported: 2008-08-09 18:22 UTC by Mike.lifeguard
Modified: 2009-05-18 23:52 UTC
CC: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Mike.lifeguard 2008-08-09 18:22:32 UTC
If a bad regex exists in the blacklist (you cannot enter new bad regexes, as the page will not be saved, but old ones may remain), at least part of the blacklist doesn't get applied properly. I had thought that the regexes from each line were joined together into one big regex, and that bad regexes were split out and ignored. However, that would mean all /good/ regexes should still be applied - that did not occur in this case.

You can see a supposedly-blacklisted domain added here: http://en.wikipedia.org/w/index.php?title=MediaWiki_talk:Spam-blacklist&diff=230441961&oldid=230437845
Comment 1 Siebrand Mazeland 2008-08-10 22:10:46 UTC
Assigned to extension author.
Comment 2 Brion Vibber 2008-08-11 01:10:41 UTC
As I recall, the way the spam blacklist regex construction works is:

1) Build all the individual regexes

2) Compile them into a list of a few reasonably-sized large chunks

3) Test each of the chunks

4) For any chunk that fails to parse, split it up into its individual component regexes, so that all the non-failing ones will still be applied.
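
A rough sketch of that build-and-fall-back approach (illustrative only: the function name, batch size, and surrounding URL pattern are assumptions, not the extension's actual code):

<?php
// Join blacklist lines into a few large alternation regexes, test each
// chunk, and fall back to per-line regexes for any chunk that fails to
// compile, so the valid entries still take effect.
function buildRegexChunks( array $lines, int $batchSize = 1000 ): array {
    $regexes = [];
    foreach ( array_chunk( $lines, $batchSize ) as $batch ) {
        $chunk = '/https?:\/\/[a-z0-9\-.]*(?:' . implode( '|', $batch ) . ')/i';
        if ( @preg_match( $chunk, '' ) !== false ) {
            // The whole chunk compiles; use it as one big regex.
            $regexes[] = $chunk;
            continue;
        }
        // The chunk is broken: keep only the lines that compile on their own.
        foreach ( $batch as $line ) {
            $single = '/https?:\/\/[a-z0-9\-.]*(?:' . $line . ')/i';
            if ( @preg_match( $single, '' ) !== false ) {
                $regexes[] = $single;
            }
        }
    }
    return $regexes;
}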


Note that a particular instance of a should-be-blacklisted URL sneaking through in an edit doesn't necessarily mean that regexes failed to parse; it could simply mean that fetching the regexes failed on that host at that particular time.
Comment 3 Brion Vibber 2008-08-11 01:33:00 UTC
Ok, the problem here is that the result is a valid regex, just not the desired regex. ;)

http://en.wikipedia.org/w/index.php?title=MediaWiki:Spam-blacklist&diff=230442308&oldid=230364450

The previous line ended with a stray \, which turned the group separator | into \|. The escaped separator matches just the literal text "|", fusing those two adjacent entries into one pattern.
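
For illustration, with two made-up entries the effect looks like this:

<?php
// The first (made-up) entry ends in a stray backslash, which escapes the
// "|" used to join the lines and fuses the two entries into one pattern.
$lines  = [ 'example\.com\\', 'badsite\.org' ];  // first entry ends in "\"
$joined = implode( '|', $lines );                // example\.com\|badsite\.org
$regex  = '/(' . $joined . ')/i';

var_dump( preg_match( $regex, 'http://badsite.org/' ) );      // int(0) - no longer blocked
var_dump( preg_match( $regex, 'example.com|badsite.org' ) );  // int(1) - only the fused literal matches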

An explicit check for final "\" should probably be enough to work around this, since other metachars should lead to explicit breakage... hopefully... :)

Perhaps a hack in SpamRegexBatch::buildRegexes() to ensure it doesn't break even on old errors; a hack in SpamRegexBatch::getBadLines() will do to warn when editing the blacklist.
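
A check of that kind might look roughly like this (a sketch with a made-up function name, not the change committed in r39114):

<?php
// Flag lines whose trailing backslash would escape the "|" separator when
// the lines are joined: an odd-length run of trailing backslashes leaves
// one backslash unpaired.
function hasStrayTrailingBackslash( string $line ): bool {
    return (bool)preg_match( '/(?<!\\\\)\\\\(\\\\\\\\)*$/', $line );
}

$lines = [ 'example\.com\\', 'badsite\.org' ];
var_dump( array_values( array_filter( $lines, 'hasStrayTrailingBackslash' ) ) );
// Only the first entry (ending in a stray "\") is flagged.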

Seems to work in local testing. :D Committed in r39114
Comment 4 Mike.lifeguard 2009-05-18 23:52:41 UTC
(In reply to comment #2)
> Note that a particular instance of a should-be-blacklisted URL sneaking through
> in an edit doesn't necessarily mean that regexes failed to parse, it could
> simply mean that the fetching of all the regexes failed on that host at that
> particular time.
> 

Could we make sure that if that happens, the last-fetched version of the blacklist is used? A network failure or something should not open the wikis to spamming simply because we couldn't get the blacklist from Meta the last time we tried. If we knew that wasn't the case, then it'd warrant its own bug.
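
Not necessarily how the extension behaves, but a sketch of the suggested fallback (the file-based cache and function name are made up for illustration):

<?php
// If fetching the blacklist fails, fall back to the most recently fetched
// copy instead of applying no blacklist at all.
function loadBlacklist( string $url, string $cacheFile ): array {
    $text = @file_get_contents( $url );
    if ( $text !== false ) {
        // Fetch succeeded: refresh the local copy.
        file_put_contents( $cacheFile, $text );
    } elseif ( is_readable( $cacheFile ) ) {
        // Fetch failed: reuse the stale cached copy.
        $text = file_get_contents( $cacheFile );
    } else {
        return [];
    }
    return array_values( array_filter( array_map( 'trim', explode( "\n", $text ) ) ) );
}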
