Last modified: 2014-10-22 05:43:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T59176, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 57176 - ApiQueryExtLinksUsage::run query has crazy limit
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Database (Other open bugs)
Version: 1.23.0
Hardware: All
OS: All
Importance: High major with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: performance
Depends on:
Blocks:
Reported: 2013-11-18 08:59 UTC by Sean Pringle
Modified: 2014-10-22 05:43 UTC (History)
CC List: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Sean Pringle 2013-11-18 08:59:35 UTC
Crazy limit offsets like this are showing up in slow query logs for enwiki:

SELECT /* ApiQueryExtLinksUsage::run  */  page_id,page_namespace,page_title,el_to  FROM `page`,`externallinks`  WHERE (page_id=el_from) AND (el_index  LIKE 'http://gov.nih.nlm.ncbi.%' )  LIMIT 768500,501;

By itself it's not too bad, < 10 seconds, but they're appearing in spikes, presumably from something automated, and causing concurrency issues.

Suggest disallowing such page offsets or redesigning the pagination method.
Comment 1 Umherirrender 2013-11-18 18:21:33 UTC
Can you count how many rows the LIKE will match?
SELECT COUNT(*)
  FROM externallinks
 WHERE el_index LIKE 'http://gov.nih.nlm.ncbi.%'

If the result is a number higher than the offset, it is okay: someone is simply getting all external link usages for this URL.

A primary key was added to the externallinks table with Gerrit change #51675, which allows changing the pagination method of the API and [[Special:LinkSearch]] (bug 45237). Maybe it can also allow descending order on that table (bug 32386, bug 32401).
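
For illustration only: once that primary key exists, paging on it instead of using OFFSET might look roughly like this (el_id being the primary key column; the continue value 12345 is just a placeholder):

 SELECT page_id,page_namespace,page_title,el_to,el_id
   FROM page,externallinks
  WHERE page_id=el_from
    AND el_index LIKE 'http://gov.nih.nlm.ncbi.%'
    AND el_id > 12345  -- last el_id returned on the previous page
  ORDER BY el_id
  LIMIT 501;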
Comment 2 Sean Pringle 2013-11-19 09:43:42 UTC
enwiki> SELECT COUNT(*)
    ->   FROM externallinks
    ->  WHERE el_index LIKE 'http://gov.nih.nlm.ncbi.%';
+----------+
| COUNT(*) |
+----------+
|   951774 |
+----------+
1 row in set (3 min 57.90 sec)

The externallinks PK has been partly rolled out, but some larger wikis (including enwiki) are yet to be done.

For the time being, spikes of these queries may get killed if slaves get overloaded.
Comment 3 Aaron Schulz 2014-02-11 23:06:59 UTC
el_id would help when the el_index query prefix is the full el_index key length (60 chars) or more. Even if all results are actually < 60 chars, all rows would have to be scanned to know that. In any case, if the query prefix is < 60 chars, then you no longer have the property of easily getting a range of index records that have an equal first part and are ordered by the second part. Ordering by el_id in that case could be slower than the OFFSET query used now, since MySQL would have to quicksort all matches just to return the top X.

Option A:
Maybe we could have two or more extra el_index indexes, except they'd be on smaller prefixes (e.g. 30 and 15). The index with the largest prefix that is smaller than the query prefix could be used. Non-matching rows (where only the prefix in the index matched, not the whole el_index field) would just be eliminated via scanning.
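
For illustration, such extra indexes might be declared roughly like this (index names made up, using the 30 and 15 prefix lengths from above):

 CREATE INDEX el_index_30 ON externallinks (el_index(30));
 CREATE INDEX el_index_15 ON externallinks (el_index(15));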

Option B:
It could help to have a new field:

-- Bucket to page on for large prefix queries
el_bucket UNSIGNED INTEGER DEFAULT MOD(el_id,1024)

...and a new index:

el_bucket_index ON externallinks (el_bucket,el_index (60))

When a LIKE query is estimated to match many rows (100000+), the API query could page through el_bucket from 0 to 1023, doing the LIKE query for each bucket.
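
For illustration, each bucket's query would then look roughly like this (using the el_bucket column and el_bucket_index sketched above; bucket 0 here is just the first of the 1024 values):

 SELECT page_id,page_namespace,page_title,el_to
   FROM page,externallinks
  WHERE page_id=el_from
    AND el_bucket = 0  -- then 1, 2, ... up to 1023
    AND el_index LIKE 'http://gov.nih.nlm.ncbi.%'
  LIMIT 501;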
Comment 4 Sean Pringle 2014-02-26 06:06:32 UTC
Option A means using FORCE INDEX I guess. Since multiple indexes would increase the overall footprint it would need testing to see if there is a net improvement. My guess would be probably not much.

Option B sounds like application-side hash partitioning. Anything that breaks up queries would help, though that may not speed up the api-user experience. Along the same lines we could investigate server-side partitioning now that S1 api traffic is sent to specific slaves.
Comment 5 Aaron Schulz 2014-02-26 06:10:58 UTC
With B, the API would get the requested number of rows within a shard (moving to the next one as needed to respect the limit). The API continue parameter would include the shard and the next link. The shards just give a stable ordering to page on, rather than having to keep doing more and more work for each subsequent query of the same result size. It should be much faster for the api-user.
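
Rough sketch of how such a continue value could map to the next query (the parameter name and its encoding here are made up, not an existing API feature):

 -- client sent e.g. eucontinue=1|http://gov.nih.nlm.ncbi.xyz ("shard|next link")
 SELECT page_id,page_namespace,page_title,el_to,el_index
   FROM page,externallinks
  WHERE page_id=el_from
    AND el_bucket = 1                                -- shard from the continue value
    AND el_index LIKE 'http://gov.nih.nlm.ncbi.%'
    AND el_index >= 'http://gov.nih.nlm.ncbi.xyz'    -- next link from the continue value
  ORDER BY el_index
  LIMIT 501;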
Comment 6 Sean Pringle 2014-02-26 07:21:24 UTC
If the B method of paging works for the users, great. Let's see what Anomie thinks.
Comment 7 Brad Jorsch 2014-02-26 17:26:17 UTC
I thought I had posted an analysis of these API modules somewhere, but now I can't find it.

As noted already, the query for ApiQueryExtLinksUsage looks something like this:

 SELECT page_id,page_namespace,page_title,el_to FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%' ) LIMIT $OFFSET,501;

ApiQueryExternalLinks has the same issue, BTW. Its queries look something like this:

 SELECT el_from, el_to FROM externallinks WHERE (el_from IN (1,2,3,4,5,6,7,...,5000)) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%') ORDER BY el_from LIMIT $OFFSET,501;

Option B is to change the first to make queries like this for $NUM from 0 to 1023?

 SELECT page_id,page_namespace,page_title,el_to FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%' ) AND el_bucket = $NUM LIMIT $OFFSET,501;

It'd be doable, although we still have the offset in there so it's still ugly IMO. We'd also need another index to handle ApiQueryExternalLinks in a similar way: either (el_bucket, el_from, el_index(60)) or (el_from, el_bucket, el_index(60)) would work, I think.
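
For illustration, those two candidate indexes would be roughly (index names made up):

 CREATE INDEX el_bucket_from_index60 ON externallinks (el_bucket, el_from, el_index(60));
 CREATE INDEX el_from_bucket_index60 ON externallinks (el_from, el_bucket, el_index(60));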

I note that the API doesn't really have any clue as to whether the like is likely to match many rows, so we'd either have to do some sort of pre-query to test it (e.g. "SELECT COUNT(*) FROM (SELECT 1 FROM externallinks WHERE el_index LIKE $FOO LIMIT 5001) AS tmp") or else always do the bucket method.

Also, to avoid making clients do excessive numbers of queries to get back few rows from each, I'd prefer ApiQueryExtLinksUsage to do as many of the 1024 bucket queries it needs to fill up the requested limit. For example, if the 1000000 matching links were evenly spread across the buckets, ApiQueryExtLinksUsage called by someone with apihighlimits and eulimit=5000 would wind up making 6 queries: bucket 0 with limit 5001 getting about 977 rows, bucket 1 with limit 4024 getting another 977, bucket 2 with limit 3047, bucket 3 with limit 2070, bucket 4 with limit 1093, and bucket 5 with limit 116. And if the 1000000 matching links were all in bucket 999 (major hashing failure!), it'd do 1000 queries.


I like Option A better, since it would let us get rid of that offset entirely and also stop trusting the database not to arbitrarily change the row ordering when no ORDER BY is actually given. But couldn't we just use one long index field and do a range match if the query is shorter, much like Block::getRangeCond()?

-- This is el_index truncated to 60 bytes
el_index_60 VARBINARY(60) NOT NULL

CREATE INDEX /*i*/el_index_60 ON /*_*/externallinks (el_index_60, el_id);
CREATE INDEX /*i*/el_from_index_60 ON /*_*/externallinks (el_from, el_index_60, el_id);

Then the queries would look something like this:

 -- Note '/' is '.' + 1
 SELECT page_id,page_namespace,page_title,el_to,el_index_60,el_id FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index_60 >= 'http://gov.nih.nlm.ncbi.' AND el_index_60 < 'http://gov.nih.nlm.ncbi/' AND el_index LIKE 'http://gov.nih.nlm.ncbi.%') AND (el_index_60 > 'http://gov.nih.nlm.ncbi.XXXX' OR (el_index_60 = 'http://gov.nih.nlm.ncbi.XXXX' AND el_id >= $ID)) ORDER BY el_index_60, el_id LIMIT 501;

 SELECT el_from,el_to,el_index_60,el_id FROM externallinks WHERE (el_from IN (1,2,3,4,5,6,7,...,5000)) AND (el_index_60 >= 'http://gov.nih.nlm.ncbi.' AND el_index_60 < 'http://gov.nih.nlm.ncbi/' AND el_index LIKE 'http://gov.nih.nlm.ncbi.%') AND (el_from > $FROM OR (el_from = $FROM AND (el_index_60 > 'http://gov.nih.nlm.ncbi.XXXX' OR (el_index_60 = 'http://gov.nih.nlm.ncbi.XXXX' AND el_id >= $ID)))) ORDER BY el_from, el_index_60, el_id LIMIT 501;

 -- In these, the query is over 60 characters so el_index_60 is constant
 SELECT page_id,page_namespace,page_title,el_to,el_id FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index_60 = 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/ba' AND el_index LIKE 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/baz/%') AND (el_id >= $ID) ORDER BY el_id LIMIT 501;

 SELECT el_from,el_to,el_id FROM externallinks WHERE (el_from IN (1,2,3,4,5,6,7,...,5000)) AND (el_index_60 = 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/ba' AND el_index LIKE 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/baz/%') AND (el_from > $FROM OR (el_from = $FROM AND el_id >= $ID)) ORDER BY el_from, el_id LIMIT 501;
Comment 8 Aaron Schulz 2014-03-07 19:40:12 UTC
I'm OK with A too. The above approach looks reasonable. Indeed, we will need both extra columns with indexes, not just more prefix indexes, since even FORCE INDEX is not enough to make MySQL aware of the possible optimization.
Comment 9 Brad Jorsch 2014-06-18 14:50:06 UTC
Sean, how does my proposal in comment 7 sound to you?
Comment 10 Andre Klapper 2014-07-09 12:27:39 UTC
Sean: How does anomie's proposal in comment 7 sound?
Comment 11 Andre Klapper 2014-08-04 00:15:26 UTC
Sean: How does anomie's proposal in comment 7 sound?
Comment 12 Andre Klapper 2014-10-21 16:27:17 UTC
(In reply to Brad Jorsch from comment #9)
> Sean, how does my proposal in comment 7 sound to you?

Looks like we can't make Sean answer here. :(
Comment 13 Greg Grossmeier 2014-10-21 22:36:44 UTC
(suggestion: add it (the question for Sean) to the Scrum of Scrums)
Comment 14 Sean Pringle 2014-10-22 05:43:16 UTC
Ah, sorry. Dropped the ball.

Setting up a real test of both options on enwiki.externallinks now.
