Last modified: 2014-05-20 15:19:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T67332, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 65332 - infoAction::pageCounts use rev_user instead of rev_user_text
infoAction::pageCounts use rev_user instead of rev_user_text
Status: NEW
Product: MediaWiki
Classification: Unclassified
Database (Other open bugs)
1.23.0
All All
: Normal normal (vote)
: ---
Assigned To: Sean Pringle
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-05-15 06:15 UTC by Sean Pringle
Modified: 2014-05-20 15:19 UTC (History)
3 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Sean Pringle 2014-05-15 06:15:28 UTC
InfoAction::pageCounts() does these two queries:

// Total number of distinct authors
$authors = (int)$dbr->selectField(
	'revision',
	'COUNT(DISTINCT rev_user_text)',
	array( 'rev_page' => $id ),
	__METHOD__
);

SELECT COUNT(DISTINCT rev_user_text) FROM `revision` WHERE rev_page = '453' LIMIT 1;

// Recent number of distinct authors
$authors = (int)$dbr->selectField(
	'revision',
	'COUNT(DISTINCT rev_user_text)',
	array(
		'rev_page' => $id,
		"rev_timestamp >= " . $dbr->addQuotes( $threshold )
	),
	__METHOD__
);

SELECT COUNT(DISTINCT rev_user_text) FROM `revision` WHERE rev_page = '453' AND rev_timestamp >= '20140415000000' LIMIT 1;

Generally OK except for certain pages which have quite a lot of hits, such as the wikidata example id 443 above. Sometimes examples run for up to 300s if data is cold. Obviously these results are for caching, but they correspond to disk io spikes which would be nice to avoid.

These versions using rev_user are faster:

SELECT COUNT(DISTINCT rev_user) FROM `revision` WHERE rev_page = '453' LIMIT 1;

SELECT COUNT(DISTINCT rev_user) FROM `revision` WHERE rev_page = '453' AND rev_timestamp >= '20140415000000' LIMIT 1;

Two reasons:

1. Counting integers is marginally faster than strings. Small deal.

2. Optimizer choose to retrieve the results entirely from existing indexes without touching rows at all. Big deal!
Comment 1 Kunal Mehta (Legoktm) 2014-05-15 06:46:32 UTC
I believe the reason it's using rev_user_text instead of rev_user is so that different IP addresses are considered to be different users. With rev_user, all IPs just show up are 0, so they'd be counted as one user.
Comment 2 MZMcBride 2014-05-16 00:15:46 UTC
(In reply to Sean Pringle from comment #0)
> 2. Optimizer choose to retrieve the results entirely from existing indexes
> without touching rows at all. Big deal!

Given comment 1 (the desire to use rev_user_text), can we force indices? That seems like it could work here.

Otherwise, what are our other options? Caching this data for a day? Storing the counts in the DB?
Comment 3 Sean Pringle 2014-05-16 01:17:39 UTC
There isn't a useful index on rev_user_text /to/ force :)

afaik these results are already cached so this doesn't need to be instant, just tolerable. I guess I need to trial adding a page_usertext_timestamp index and report back.
Comment 4 MZMcBride 2014-05-16 01:56:11 UTC
(In reply to Sean Pringle from comment #3)
> There isn't a useful index on rev_user_text /to/ force :)

Ah, fair enough.

For the bug's reference, according to <https://git.wikimedia.org/raw/mediawiki%2fcore%2egit/HEAD/maintenance%2ftables%2esql> MediaWiki core has:

CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp);
CREATE INDEX /*i*/page_user_timestamp ON /*_*/revision (rev_page,rev_user,rev_timestamp);

> afaik these results are already cached so this doesn't need to be instant,
> just tolerable. I guess I need to trial adding a page_usertext_timestamp
> index and report back.

I guess this page_usertext_timestamp index would be a combination of usertext_timestamp and page_user_timestamp and would be a custom index for (all?) Wikimedia wikis. Won't this be a very large index for the very large revision tables? I'm not sure this feature is worth it, but this is most definitely your area of expertise and not mine. :-)
Comment 5 Sean Pringle 2014-05-16 02:29:08 UTC
Yes, the potential index size isn't a thrilling prospect.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links