Last modified: 2014-10-07 21:04:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T69439, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 67439 - Deal with logging query spam on crawler 404 floods
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: 1.24rc
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Aaron Schulz
Depends on:
Blocks:
Reported: 2014-07-02 22:12 UTC by Aaron Schulz
Modified: 2014-10-07 21:04 UTC (History)
CC: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Aaron Schulz 2014-07-02 22:12:06 UTC
As seen at https://tendril.wikimedia.org/report/, we have a bunch of crawlers of various types hitting non-existent pages. We do a move/delete log query on such page views...which is fine except when lots of queries come in at once. They end up taking 16s to 18s.

Possible solution is to avoid calling the LogEventList method in showMissingArticle based on a Bloom Filter in Redis. This would be updated on the fly. Not sure how to estimate the set size to keep the false hit rate down.
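The gating idea described above can be sketched as follows. This is an illustrative sketch only, not the MediaWiki BloomCache code that was eventually merged: the `BloomFilter` class, the `show_missing_article`/`run_log_query` helpers, and the in-memory bit array are all hypothetical stand-ins. In production the bit array would live in Redis (e.g. via SETBIT/GETBIT on the same bit positions), so the membership check costs one round trip instead of a multi-second log query.

```python
import hashlib


class BloomFilter:
    """In-memory stand-in for a Redis-backed Bloom filter.

    A negative answer is definite; a positive answer may be a false
    positive, so the expensive query is still run on a hit and
    correctness is preserved either way.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes independent bit positions from one key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def show_missing_article(title, log_filter, run_log_query):
    """Hypothetical sketch of the gate: only run the expensive
    move/delete log query when the filter says the title might
    have a log entry."""
    if log_filter.might_contain(title):
        return run_log_query(title)
    return None  # crawler 404: skip the log query entirely
```

The filter would be populated from the `logging` table and updated with add() on new deletes and moves; a crawler hitting a never-logged title then falls through to the cheap negative path.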
Comment 1 Aaron Schulz 2014-07-03 17:53:55 UTC
(In reply to Aaron Schulz from comment #0)
> As seen at https://tendril.wikimedia.org/report/, we have a bunch of
> crawlers of various types hitting non-existent pages. We do a move/delete
> log query on such page views...which is fine except when lots of queries
> come in at once. They end up taking 16s to 18s.
> 
> Possible solution is to avoid calling the LogEventList method in
> showMissingArticle based on a Bloom Filter in Redis. This would be updated
> on the fly. Not sure how to estimate the set size to keep the false hit rate
> down.

Of course a Bloom filter requires scanning all of `logging` to populate it, plus calling add() for new deletes. This is problematic if the Redis server is not durable or goes down (since repopulation cannot happen on the fly). Maybe the rebuilding could be automatic and batched, switching the filter on only when done.
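On the open question of estimating the set size: the standard Bloom filter sizing formulas give the bit count m and hash count k for n expected entries at a target false-hit rate p. A quick sketch, where the n value is purely illustrative and not a measured count of `logging` rows:

```python
import math


def bloom_parameters(n, p):
    """Standard Bloom filter sizing:
    m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


# e.g. 10 million logged titles at a 1% false-hit rate
# needs roughly 96 million bits (~12 MB) and 7 hash functions.
m, k = bloom_parameters(10_000_000, 0.01)
```

Underestimating n inflates the actual false-hit rate, which here only costs extra log queries rather than wrong results, so erring on the large side for m is the safe choice.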
Comment 2 Aaron Schulz 2014-07-03 20:47:55 UTC
Also it might help to route non-user-based logging queries to all DBs rather than just db1055 (the partitioning of that table by user is necessary for this query).
Comment 3 Gerrit Notification Bot 2014-09-03 17:50:39 UTC
Change 143802 merged by jenkins-bot:
Added BloomCache classes

https://gerrit.wikimedia.org/r/143802
Comment 4 Aaron Schulz 2014-10-07 21:04:16 UTC
Deployed and populated (on enwiki, mostly automatically).


