
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T54728, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 52728 - Optimize Special:SupportedLanguages

Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: Translate (Other open bugs)
Version: master
Hardware: All
OS: All
Priority: Normal
Severity: normal (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: performance
Duplicates: 59497 (view as bug list)
Depends on:
Blocks:

Reported: 2013-08-11 14:38 UTC by MZMcBride
Modified: 2014-09-16 11:09 UTC (History)
CC: 12 users

See Also:
Web browser: ---
Mobile Platform: ---



Description MZMcBride 2013-08-11 14:38:19 UTC
https://gerrit.wikimedia.org/r/78660

Special:SupportedLanguages had to be disabled on Wikimedia wikis due to database queries taking over 90,000 seconds. No idea which query or why. Perhaps a missing index?
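
(A minimal diagnostic sketch, assuming direct SQL access to the wiki's database; the suspect query here is the form reconstructed in comment 1 below, not the exact query that was killed:)

SHOW INDEX FROM revision;

EXPLAIN SELECT user_name, user_editcount, MAX(rev_timestamp) AS lastedit
FROM user, revision
WHERE user_id = rev_user
GROUP BY user_name;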
Comment 1 Tim Starling 2013-08-12 00:27:27 UTC
The query would have been of the form:

SELECT user_name, user_editcount, MAX(rev_timestamp) AS lastedit
FROM user, revision
WHERE user_id = rev_user
GROUP BY user_name;

MAX() is optimised in various contexts, but I guess this isn't one of them. EXPLAIN indicates that this query scans the entire revision table, which would explain the long query times. This ungrouped query is efficient:

SELECT MAX(rev_timestamp)
FROM revision, user
WHERE rev_user = user_id AND user_name = 'Tim Starling';

So I guess it is the grouping that stops it from hitting a special-case optimisation of MAX(). The traditional way to retrieve a row from the end of a range, which does not rely on MAX() optimisations, is with ORDER BY and LIMIT:

SELECT rev_timestamp
FROM revision, user
WHERE rev_user = user_id AND user_name = 'Tim Starling'
ORDER BY rev_timestamp DESC
LIMIT 1;

But even if that were done, it would still be extremely inefficient and would not work. All the special page wants to do is display statistics about the translators listed on pages like [[Portal:Fr/translators]] (for French translators). There are no such translator lists on Commons, so it is effectively calculating statistics about nobody in order to display an empty table. And to display that empty table, it is fetching edit count statistics for all 3 million users on the wiki. Even if it managed to get the queries done, the memcached set would fail because the value would be larger than 1 MB.

Obviously, it should do a single ungrouped query like the one above for each translator, not for every user on the wiki.
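
(As a sketch, the per-translator queries could also be expressed as one statement with a correlated subquery, so each row's MAX() can be answered from the (rev_user, rev_timestamp) index; the user names below are placeholders, and whether MySQL of this era executes the subquery as efficiently as separate per-user queries is an assumption:)

SELECT user_name, user_editcount,
       (SELECT MAX(rev_timestamp)
        FROM revision
        WHERE rev_user = user_id) AS lastedit
FROM user
WHERE user_name IN ('TranslatorA', 'TranslatorB');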
Comment 2 Gerrit Notification Bot 2014-01-06 19:58:32 UTC
Change 105736 had a related patch set uploaded by Nikerabbit:
Optimize Special:SupportedLanguages

https://gerrit.wikimedia.org/r/105736
Comment 3 Sean Pringle 2014-01-08 11:05:57 UTC
Even the solution without GROUP BY could be slow for very active users. The index used is probably always revision user_timestamp (rev_user, rev_timestamp). Since rev_timestamp is in second place, MAX() must still do a range scan over the user's portion of the B-tree.
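
(For reference, the index named above is defined in MediaWiki's schema roughly as follows; the exact DDL is an approximation:)

CREATE INDEX user_timestamp ON revision (rev_user, rev_timestamp);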

Since we're already proposing to query separately for each user, pulling the user data out first in bulk and removing the join from the individual per-user queries would be predictable and faster:

SELECT user_id, user_editcount FROM user WHERE user_name in (...);

foreach $user_id:
    SELECT MAX(rev_timestamp) FROM revision WHERE rev_user = $user_id;
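
(Expanded with placeholder values; the user names and ids here are hypothetical:)

SELECT user_id, user_editcount
FROM user
WHERE user_name IN ('TranslatorA', 'TranslatorB');

-- Suppose that returned user_ids 1 and 2:
SELECT MAX(rev_timestamp) FROM revision WHERE rev_user = 1;
SELECT MAX(rev_timestamp) FROM revision WHERE rev_user = 2;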

EXPLAIN for one such per-user MAX() query shows MySQL resolving it entirely during optimisation:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.29 sec)

Each user needs a single Handler_read_key hit on the revision user_timestamp (rev_user, rev_timestamp) index, regardless of how many revisions they have authored.
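
(A sketch of how to verify the handler counts, assuming direct SQL access; FLUSH STATUS zeroes the session counters first, and the user_id is hypothetical:)

FLUSH STATUS;
SELECT MAX(rev_timestamp) FROM revision WHERE rev_user = 1;
SHOW SESSION STATUS LIKE 'Handler_read%';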

Worth considering.
Comment 4 Sean Pringle 2014-01-08 11:11:00 UTC
Actually, s/range scan/ref access/, but still potentially thousands of Handler_read_next.
Comment 5 Niklas Laxström 2014-01-08 16:56:55 UTC
It is actually doing individual queries now, though still joining with the user table, as I didn't have user_id available without more refactoring.

I'm happy to do more refactoring to make it faster, but would like to avoid building dependent patchsets.
Comment 6 Gerrit Notification Bot 2014-01-22 22:41:49 UTC
Change 105736 merged by jenkins-bot:
Optimize Special:SupportedLanguages

https://gerrit.wikimedia.org/r/105736
Comment 7 Nemo 2014-01-23 20:18:53 UTC
This is not an enhancement; it's a bug, because it completely prevents the feature from being used on Wikimedia wikis.
Comment 8 Nemo 2014-01-24 12:39:49 UTC
On translatewiki.net, the time for a purge went from 60+ seconds (ending in a timeout) to about 30 seconds (actually completing).
http://www.webpagetest.org/result/140123_M7_FB4/
http://www.webpagetest.org/result/140124_8E_8RV/
Is the performance of the query used on the other wikis (i.e. without translator lists in portals) proportional to this? Maybe to re-enable it we only need to make sure the true worst case (90 000 s queries) is eliminated.
Comment 9 Niklas Laxström 2014-01-24 18:36:29 UTC
I was discussing this with Siebrand and we were planning to drop our special case of using portal pages, but currently that seems to push us over the 60-second limit again, so further work is needed. On the other hand, this is not at the top of my todo list.
Comment 10 Nemo 2014-09-16 11:09:47 UTC
*** Bug 59497 has been marked as a duplicate of this bug. ***
