Last modified: 2014-05-20 15:19:19 UTC
InfoAction::pageCounts() does these two queries: // Total number of distinct authors $authors = (int)$dbr->selectField( 'revision', 'COUNT(DISTINCT rev_user_text)', array( 'rev_page' => $id ), __METHOD__ ); SELECT COUNT(DISTINCT rev_user_text) FROM `revision` WHERE rev_page = '453' LIMIT 1; // Recent number of distinct authors $authors = (int)$dbr->selectField( 'revision', 'COUNT(DISTINCT rev_user_text)', array( 'rev_page' => $id, "rev_timestamp >= " . $dbr->addQuotes( $threshold ) ), __METHOD__ ); SELECT COUNT(DISTINCT rev_user_text) FROM `revision` WHERE rev_page = '453' AND rev_timestamp >= '20140415000000' LIMIT 1; Generally OK except for certain pages which have quite a lot of hits, such as the wikidata example id 443 above. Sometimes examples run for up to 300s if data is cold. Obviously these results are for caching, but they correspond to disk io spikes which would be nice to avoid. These versions using rev_user are faster: SELECT COUNT(DISTINCT rev_user) FROM `revision` WHERE rev_page = '453' LIMIT 1; SELECT COUNT(DISTINCT rev_user) FROM `revision` WHERE rev_page = '453' AND rev_timestamp >= '20140415000000' LIMIT 1; Two reasons: 1. Counting integers is marginally faster than strings. Small deal. 2. Optimizer choose to retrieve the results entirely from existing indexes without touching rows at all. Big deal!
I believe the reason it's using rev_user_text instead of rev_user is so that different IP addresses are considered to be different users. With rev_user, all IPs just show up are 0, so they'd be counted as one user.
(In reply to Sean Pringle from comment #0) > 2. Optimizer choose to retrieve the results entirely from existing indexes > without touching rows at all. Big deal! Given comment 1 (the desire to use rev_user_text), can we force indices? That seems like it could work here. Otherwise, what are our other options? Caching this data for a day? Storing the counts in the DB?
There isn't a useful index on rev_user_text /to/ force :) afaik these results are already cached so this doesn't need to be instant, just tolerable. I guess I need to trial adding a page_usertext_timestamp index and report back.
(In reply to Sean Pringle from comment #3) > There isn't a useful index on rev_user_text /to/ force :) Ah, fair enough. For the bug's reference, according to <https://git.wikimedia.org/raw/mediawiki%2fcore%2egit/HEAD/maintenance%2ftables%2esql> MediaWiki core has: CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp); CREATE INDEX /*i*/page_user_timestamp ON /*_*/revision (rev_page,rev_user,rev_timestamp); > afaik these results are already cached so this doesn't need to be instant, > just tolerable. I guess I need to trial adding a page_usertext_timestamp > index and report back. I guess this page_usertext_timestamp index would be a combination of usertext_timestamp and page_user_timestamp and would be a custom index for (all?) Wikimedia wikis. Won't this be a very large index for the very large revision tables? I'm not sure this feature is worth it, but this is most definitely your area of expertise and not mine. :-)
Yes, the potential index size isn't a thrilling prospect.