Last modified: 2013-11-25 21:15:22 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T58798, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 56798 - CirrusSearch shouldn't use SQL to figure out page weight
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All
OS: All
Priority: Highest
Severity: normal
Target Milestone: ---
Assigned To: Nik Everett
Whiteboard: cirrus_reenable
Keywords: performance
Depends on:
Blocks:
Reported: 2013-11-08 19:41 UTC by Nik Everett
Modified: 2013-11-25 21:15 UTC
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Nik Everett 2013-11-08 19:41:14 UTC
CirrusSearch shouldn't use SQL to figure out page weight.  The queries are too slow.
Comment 1 Nik Everett 2013-11-09 19:19:49 UTC
While I'm thinking about this - we should probably go about this by pushing the outbound links into Elasticsearch and then counting them on page update.  That should be reasonably simple but will require a code push and a rebuild to get the links in, then another code push.

My only question is whether we continue to use the database method during initial recovery and population or switch to a two-pass approach for both.  I'm inclined to stick with the database-based method because it simplifies initial index loading a ton.  Because we control it, we can make sure it doesn't run rampant.  Also, it is less work, so we can get this out the door sooner.  We can always go with the two-pass approach later and we really won't have cost ourselves too much in the meantime.
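
A minimal sketch of the count-on-update idea, using the Python Elasticsearch client (7.x-style API). The index and field names here (wiki_content, outgoing_link, incoming_links) are assumptions for illustration, not the actual Cirrus schema:

# Count a page's incoming links from the link data already stored in
# Elasticsearch, instead of running a SQL query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def incoming_link_count(title):
    # Count documents whose stored outgoing links include this title.
    result = es.count(
        index="wiki_content",
        body={"query": {"term": {"outgoing_link": title}}},
    )
    return result["count"]

def update_page_weight(page_id, title):
    # On page update, write the fresh count back onto the page's own document.
    es.update(
        index="wiki_content",
        id=page_id,
        body={"doc": {"incoming_links": incoming_link_count(title)}},
    )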
Comment 2 Aaron Schulz 2013-11-09 19:25:40 UTC
It might also have to use SELECT 1 ... LIMIT X, where X is some reasonable number, to short-circuit the query.

Alternatively, one could page through them for the count (if this is for a population script). BacklinkCache::partition() already does that. Since maintenance scripts turn off DBO_TRX, each paging query would be a separate implicit server-side transaction, so row purging could happen in between each SELECT, unlike the huge COUNT(*) SELECT now.
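
A sketch of the short-circuit idea: fetch at most X marker rows and count those, so a heavily linked page never triggers an unbounded COUNT(*). The table and column names echo the hypothetical pseudocode below, and the %s placeholders assume a MySQL-style DB-API driver:

def capped_link_count(cursor, page_id, cap=1000):
    # Fetch at most `cap` rows; far cheaper than COUNT(*) over millions.
    cursor.execute(
        "SELECT 1 FROM page_link WHERE pl_page_id = %s LIMIT %s",
        (page_id, cap),
    )
    rows = cursor.fetchall()
    # If fewer than `cap` rows came back, this is the exact count;
    # otherwise we only know "at least cap", which is enough for weighting.
    return len(rows)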
Comment 3 Nik Everett 2013-11-09 19:37:04 UTC
Something like this:

while count > 0:
  SELECT MAX(pl_page_id), COUNT(*)
  FROM (
    SELECT pl_page_id FROM page_link
    WHERE pl_page_id > $last_max$
    ORDER BY pl_page_id
    LIMIT 10000
  ) AS batch
?


We're sure we'll get rid of the SQL-based counting in the normal update case, but in the population/outage recovery case (both in-process and job queue based) I was thinking of keeping it (or modifying it like you suggest).  The idea being that SQL-based counting will be right even if Elasticsearch is super out of date.  And it'll certainly be out of date in the population case.  Without it we'd need a second pass at populating Elasticsearch to count the links, which just seems complicated/burdensome/nasty.

I had a look at BacklinkCache a while ago but it looked like it was pulling all the backlinks into memory to count them.  That didn't seem pretty.
Comment 4 Aaron Schulz 2013-11-09 19:45:14 UTC
BacklinkCache::getLinks() buffers the whole query result in RAM and iterates over it for the Titles. BacklinkCache::partition() batches large queries and does not store the full query result nor the titles anywhere. It just stores small arrays of ranges (and row count integers).
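
A sketch of that partitioning pattern: walk the link rows in fixed-size batches and keep only (start, end, count) tuples, never the rows or Titles themselves. Table and column names again follow the hypothetical pseudocode above; each execute() is its own statement, so with DBO_TRX off each batch runs in a separate implicit transaction:

def partition_links(cursor, batch_size=10000):
    partitions = []  # only small (start, end, count) tuples are kept
    last_max = 0
    while True:
        cursor.execute(
            "SELECT MAX(pl_page_id), COUNT(*) FROM ("
            "  SELECT pl_page_id FROM page_link"
            "  WHERE pl_page_id > %s"
            "  ORDER BY pl_page_id LIMIT %s"
            ") AS batch",
            (last_max, batch_size),
        )
        new_max, count = cursor.fetchone()
        if not count:
            break
        partitions.append((last_max, new_max, count))
        last_max = new_max
    return partitions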
Comment 5 Nik Everett 2013-11-18 22:48:37 UTC
Now that we've disabled CirrusSearch in production we can actually do this without a full index rebuild.

So we still have to decide if we want to use BacklinkCache for maintenance scripts or use a two-pass build.  I kind of prefer a two-pass build because that has the minimal effect on the database.
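
A minimal sketch of what the second pass could look like, reusing the assumed index/field names from the earlier sketch (wiki_content, title, outgoing_link, incoming_links): pass one indexes page content and outgoing links without any SQL counting, then pass two streams over the index and fills in the counts from Elasticsearch itself.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

def second_pass():
    # scan() streams hits with a scroll cursor, so nothing large is
    # held in memory while we walk the whole index.
    for hit in scan(es, index="wiki_content",
                    query={"query": {"match_all": {}}}):
        title = hit["_source"]["title"]
        count = es.count(
            index="wiki_content",
            body={"query": {"term": {"outgoing_link": title}}},
        )["count"]
        es.update(index="wiki_content", id=hit["_id"],
                  body={"doc": {"incoming_links": count}})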
Comment 6 Chad H. 2013-11-21 19:30:19 UTC
(In reply to comment #5)
> So we still have to decide if we want to use BacklinkCache for maintenance
> scripts or use a two-pass build.  I kind of prefer a two-pass build because
> that has the minimal effect on the database.

Yes, let's. I think operations will appreciate it ;-)
Comment 7 Nik Everett 2013-11-22 21:38:21 UTC
Patch uploaded but gerrit didn't link it: https://gerrit.wikimedia.org/r/#/c/97002/
