Last modified: 2010-05-15 15:33:44 UTC
This causes a full table scan of several gigabytes of data, much of which isn't cached, and regularly takes more than a minute for de and hundreds of seconds for en. Please change it to use the query below, which uses an index that is normally very well cached because it's used for lots of article lookups.

mysql> select count(cur_namespace) from cur;
+----------------------+
| count(cur_namespace) |
+----------------------+
|              1292372 |
+----------------------+
1 row in set (6.29 sec)

mysql> explain select count(cur_namespace) from cur;
+-------+-------+---------------+----------------------+---------+------+---------+-------------+
| table | type  | possible_keys | key                  | key_len | ref  | rows    | Extra       |
+-------+-------+---------------+----------------------+---------+------+---------+-------------+
| cur   | index | NULL          | name_title_timestamp | 270     | NULL | 1444561 | Using index |
+-------+-------+---------------+----------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)

The next run takes 0.62 seconds.

The current code causes results like this on bacon:

1548197 wikiuser igoeje:53107 enwiki 402 Query /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS total FROM `cur`

or this on holbach:

23994523 wikiuser smellie:33404 dewiki 112 Query /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS total FROM `cur`

Because of the long run time and high disk activity, these queries are also a regular cause of replication lag. It's also fairly common to see several of them running at the same time for wikis like en and de. The smaller cur records in 1.5 will help these queries, but it will still be better to have this index used.
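The covering-index effect can be sketched with a toy schema. This is a minimal illustration using SQLite from Python (SQLite's planner differs from InnoDB's, and its rowid primary key isn't penalised the same way, but it still shows a count being answered from a narrow secondary index rather than the full table); the table and index names merely echo the ones above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cur(
    cur_id        INTEGER PRIMARY KEY,
    cur_namespace INTEGER NOT NULL,
    cur_title     TEXT NOT NULL,
    cur_text      TEXT              -- the large column we want to avoid scanning
);
CREATE INDEX name_title_timestamp ON cur(cur_namespace, cur_title);
""")

# Counting an indexed, NOT NULL column lets the planner satisfy the
# query from the small secondary index instead of the whole table.
plan = " ".join(
    row[-1] for row in
    con.execute("EXPLAIN QUERY PLAN SELECT COUNT(cur_namespace) FROM cur")
)
print(plan)  # e.g. "SCAN cur USING COVERING INDEX name_title_timestamp"
```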
I have a patch for SpecialStatistics (which I had forgotten to upload) that uses higher-level methods of the database class and removes some unnecessary code, but it used COUNT(*), based on some MySQL documentation claiming that as the optimised form for MyISAM and ISAM tables and as removing the need to check for NULL values. Do you have a timing for COUNT(*)?
COUNT(*) is the optimised form for the MyISAM and ISAM table types. For those non-transactional table types the row count is stored, so COUNT(*) returns it immediately, in a fraction of a second. Wikimedia uses the InnoDB table type, and so should other MediaWiki users when possible, because of its far better crash recovery and its transaction support, which isn't available in MyISAM.

mysql> explain select count(*) from cur;
+-------+-------+---------------+--------+---------+------+---------+-------------+
| table | type  | possible_keys | key    | key_len | ref  | rows    | Extra       |
+-------+-------+---------------+--------+---------+------+---------+-------------+
| cur   | index | NULL          | cur_id | 4       | NULL | 1463251 | Using index |
+-------+-------+---------------+--------+---------+------+---------+-------------+
1 row in set (0.00 sec)

As you can see, COUNT(*) uses the primary key, which is the clustered key in InnoDB, and so forces the same full table scan. The optimal form for InnoDB is a count using an index which is already in cache, or mostly in cache. That's the namespace, title and timestamp index, which is routinely used for many common queries.

A caution on the MySQL manual: it is often written assuming MyISAM and often doesn't say what is different for InnoDB. For InnoDB:

'InnoDB does not keep an internal count of rows in a table. (This would actually be somewhat complicated because of multi-versioning.) To process a SELECT COUNT(*) FROM T statement, InnoDB must scan an index of the table, which will take some time if the table is not entirely in the buffer pool. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If your table does not change often, using the MySQL query cache is a good solution. SHOW TABLE STATUS also can be used if an approximate row count is sufficient. See Section 15.12, "InnoDB Performance Tuning Tips".'
The "the table is not entirely in the buffer pool" part is in error: when scanning an index it's the index which must be in the buffer pool, not the table. I've submitted the correction. There's already a table for statistics, and any statistics reporting code should ideally just be asking an object for the count, with the object using and occasionally updating the statistics table, perhaps once every hour or every ten minutes. But for now, just changing the variable will be a major improvement. See http://dev.mysql.com/doc/mysql/en/innodb-restrictions.html
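As a sketch of that counter-table idea (not MediaWiki's actual code; the table, column and trigger names below are invented for illustration), here it is in SQLite via Python, with triggers keeping the count current so reads never scan the big table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cur(cur_id INTEGER PRIMARY KEY, cur_title TEXT NOT NULL);

-- Hypothetical statistics table: one row holding the maintained count.
CREATE TABLE site_stats(ss_total_pages INTEGER NOT NULL);
INSERT INTO site_stats VALUES (0);

-- Keep the counter in step with inserts and deletes on cur.
CREATE TRIGGER cur_ins AFTER INSERT ON cur BEGIN
    UPDATE site_stats SET ss_total_pages = ss_total_pages + 1;
END;
CREATE TRIGGER cur_del AFTER DELETE ON cur BEGIN
    UPDATE site_stats SET ss_total_pages = ss_total_pages - 1;
END;
""")

con.executemany("INSERT INTO cur(cur_title) VALUES (?)",
                [("Main_Page",), ("Sandbox",), ("Help",)])
con.execute("DELETE FROM cur WHERE cur_title = 'Sandbox'")

# O(1) read: no scan of cur at all.
(total,) = con.execute("SELECT ss_total_pages FROM site_stats").fetchone()
print(total)  # 2
```

In a real deployment the application code (or periodic job, as suggested above) would maintain the counter rather than triggers, but the read path is the same: a single-row fetch instead of an index scan.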
Created attachment 268 [details] REL1_4 patch for SpecialStatistics.php

Optimises string quotes, uses higher-level database class methods, removes a misplaced call to getSkin(), and optimises implicit db index usage by counting on columns that are not in the primary index where possible. A simple test was done locally.
Created attachment 269 [details] HEAD patch for SpecialStatistics.php

Uses higher-level database class methods, removes a misplaced call to getSkin(), and optimises implicit db index usage by counting on columns that are not in the primary index where possible. A simple test was done locally.
Those look good. Thanks. One extra note on HEAD, though it also applies to 1.4: LIKE '%anything' always causes a full table scan. On the "when time is available to do" list should be something like one row in user_rights per capability, with (ur_user, ur_right) as the primary key. Then the leading % can be dropped and it'll be a much faster indexed query. Shorter-term goodness would be forcing sysop to be the first user right if present, then scanning for 'sysop%'. The request for lists of sysops is frequent enough that it's worth some work to make that particular form use an index.
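The normalised layout could look something like this sketch (the schema and index names here are my own invention for illustration, not an actual MediaWiki schema), again in SQLite via Python: one row per (user, right), so finding all sysops becomes an indexed equality lookup instead of a leading-wildcard LIKE over a combined rights string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One row per capability instead of a combined rights string.
CREATE TABLE user_rights(
    ur_user  INTEGER NOT NULL,
    ur_right TEXT    NOT NULL,
    PRIMARY KEY (ur_user, ur_right)
);
-- Reverse index so "who has right X?" leads with ur_right.
CREATE INDEX ur_right_user ON user_rights(ur_right, ur_user);
""")

con.executemany("INSERT INTO user_rights VALUES (?, ?)",
                [(1, "sysop"), (1, "bureaucrat"), (2, "sysop"), (3, "bot")])

# No LIKE '%sysop%' full scan: a plain equality predicate the index can serve.
sysops = [u for (u,) in con.execute(
    "SELECT ur_user FROM user_rights WHERE ur_right = ? ORDER BY ur_user",
    ("sysop",))]
print(sysops)  # [1, 2]
```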
Changed in REL1_4 and HEAD to use COUNT(cur_name)/COUNT(page_name) on James's suggestion. I'm not applying the patch, as it makes unnecessary changes to use the new functions without fitting the existing coding style for use of those functions.