Last modified: 2014-07-31 07:44:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T62618, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 60618 - Very slow query for Special:WhatLinksHere limitted to a namespace when page has large number of backlinks
Very slow query for Special:WhatLinksHere limitted to a namespace when page h...
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Database (Other open bugs)
1.23.0
All All
: Normal normal (vote)
: ---
Assigned To: Aaron Schulz
: performance
: 60838 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-01-30 14:44 UTC by Jarek Tuszynski
Modified: 2014-07-31 07:44 UTC (History)
8 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Jarek Tuszynski 2014-01-30 14:44:52 UTC
Special:WhatLinksHere on Commons occasionally gives Database errors. For example:

https://commons.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Template%3ALocation%2Flayout&namespace=10 

and 

https://commons.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Module%3AFallbacklist&namespace=828 

gave me:

  Database error
  A database query error has occurred. This may indicate a bug in the software.
    Function: SpecialWhatLinksHere::showIndirectLinks
    Error: 0 

Antoine "hashar" Musso checked on that and wrote:
"The first URL worked for me, the second throws a database error.

Thu Jan 30 14:11:19 UTC 2014 /* query */ Thu Jan 30 14:19:33 UTC 2014 /* query */ Connection lost and reconnected after 59.754s".

This might be related to the fact that both pages within last month or two went through changes that affects update links tables on millions of pages, and those link tables are slow to update, resulting in a state when results of Special:WhatLinksHere do not match the actual linking dependencies.
Comment 1 Brad Jorsch 2014-01-30 15:31:08 UTC
The underlying cause seems to be the query killer mentioned in bug 58157 comment 5. The query here appears to be along the lines of:

 SELECT /*! STRAIGHT_JOIN */ page_id,page_namespace,page_title,rd_from  FROM `templatelinks`,`page` LEFT JOIN `redirect` ON ((rd_from = page_id) AND rd_namespace = '828' AND rd_title = 'Fallbacklist' AND (rd_interwiki = '' OR rd_interwiki IS NULL))  WHERE (page_id=tl_from) AND tl_namespace = '828' AND tl_title = 'Fallbacklist' AND page_namespace = '828'  ORDER BY tl_from LIMIT 51;

Presumably it's slow because Module:Fallbacklist probably has something like 25 million transclusions, few of which are in the Module namespace. What might be done about that, though, I have no idea.
Comment 2 Bawolff (Brian Wolff) 2014-01-30 15:57:16 UTC
Removing the /*! STRAIGHT_JOIN */ would probably help as there is only 163 pages in the module namespace in commons (On tool labs, query without straight join took 0.16 seconds), so much more efficient to join the other way around.

However that doesn't really help the general case of a template with millions of transclusions none of which are in the namespace being looked for, and that namespace also has millions of pages. I'm not sure if anything can help with that case short of duplicating the namespace of tl_from into template links table (Wouldn't it be nice if indexes could cross table boundaries?)

----


>https://commons.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Module%3AFallbacklist&namespace=828 

If you really wanted to know, the modules that link to Fallbacklist are:

+------------------+
| page_title       |
+------------------+
| File             |
| Coordinates      |
| Fallback         |
| Fallbacklist     |
| Fallbacklist/doc |
| Coordinates/doc  |
| Fallback/sandbox |
| File/doc         |
+------------------+
Comment 3 Bartosz Dziewoński 2014-02-04 19:49:48 UTC
*** Bug 60838 has been marked as a duplicate of this bug. ***
Comment 4 Gerrit Notification Bot 2014-02-18 21:06:29 UTC
Change 114070 had a related patch set uploaded by Aaron Schulz:
Removed STRAIGHT_JOIN; the is slower when a namespace has a few pages

https://gerrit.wikimedia.org/r/114070
Comment 5 Gerrit Notification Bot 2014-02-19 02:53:07 UTC
Change 114070 merged by jenkins-bot:
Removed STRAIGHT_JOIN; the is slower when a namespace has a few pages

https://gerrit.wikimedia.org/r/114070
Comment 6 Aaron Schulz 2014-03-05 06:36:00 UTC
(In reply to Gerrit Notification Bot from comment #5)
> Change 114070 merged by jenkins-bot:
> Removed STRAIGHT_JOIN; the is slower when a namespace has a few pages
> 
> https://gerrit.wikimedia.org/r/114070

This both helped and harmed some cases. More work is needed here.
Comment 7 Sean Pringle 2014-03-05 11:05:11 UTC
Aaron pinged me on IRC. Some observations made during that discussion follow.

The pain points are now mostly mid-sized wikis (eg, *wiktionary, metawiki) that have:

* large numbers of page or template links (hundreds of millions)
* comparatively few pages (less than 10 million)
* data skewed toward one or two namespaces
* data skewed toward a small set of titles

enwiktionary> select tl_namespace, count(*) as links
   from templatelinks group by tl_namespace;
+--------------+-----------+
| tl_namespace | links     |
+--------------+-----------+
|            0 |      1696 |
|            1 |        56 |
|            2 |     14901 |
|            3 |       129 |
|            4 |      1462 |
|            5 |        12 |
|            8 |       103 |
|           10 |  23967908 |
|           11 |         1 |
|           14 |         5 |
|           15 |         4 |
|           90 |         3 |
|          100 |        60 |
|          101 |         2 |
|          104 |      1546 |
|          106 |         6 |
|          110 |         1 |
|          828 | 138128211 |
+--------------+-----------+

When hitting 10 or 828 and a title with millions of links MariaDB may fall back on an index scan on `page`. These mid-sized wikis liked having STRAIGHT_JOIN even if others didn't :-)

Could we tolerate the possibility of stale links and skip the JOIN on `page` altogther? Might need a denormalised tl_from_namespace field. Then pull out the page fields in a second batch WHERE page_id IN (...), or use a sub-query to do the same thing:

SELECT 
   page_id, page_namespace, page_title, rd_from
FROM (
   SELECT tl_from, rd_from
   FROM `templatelinks`
   LEFT JOIN `redirect` 
      ON rd_from = tl_from
         AND rd_namespace = tl_namespace
         AND rd_title = tl_title
         AND (rd_interwiki = '' OR rd_interwiki IS NULL)
   WHERE tl_namespace = '828'
      AND tl_title = 'languages/data3/i'
   ORDER BY tl_from
   LIMIT 100
) tmp
JOIN page ON tl_from = page_id
ORDER BY page_id
LIMIT 51;

Increase the inner LIMIT if stale links are a problem; 500 or 1000 would be fine.
Comment 8 Aaron Schulz 2014-03-06 23:02:52 UTC
This works now as long as the NS selector is not used. If it is, then we'd need an rd_from_namespace field to make this work.
Comment 9 Aaron Schulz 2014-03-06 23:07:32 UTC
(In reply to Aaron Schulz from comment #8)
> This works now as long as the NS selector is not used. If it is, then we'd
> need an rd_from_namespace field to make this work.

Sorry not rd_from_namespace but rather *_from_namespace for templatelinks, imagelinks, and pagelinks.
Comment 10 Gerrit Notification Bot 2014-03-07 01:52:32 UTC
Change 117373 had a related patch set uploaded by Aaron Schulz:
Redid WhatLinksHere query and added a _from_namespace field

https://gerrit.wikimedia.org/r/117373
Comment 11 Gerrit Notification Bot 2014-03-07 22:03:05 UTC
Change 117373 had a related patch set uploaded by Krinkle:
Redo WhatLinksHere query and add a *_from_namespace field to link tables

https://gerrit.wikimedia.org/r/117373
Comment 12 Sean Pringle 2014-04-01 04:47:24 UTC
I am testing MariaDB 5.5.36 on db1034 with a patch to add a innodb_min_scan_time variable which allows ha_innodb::scan_time to be controlled, and by extension the apparent cost of an index scan (InnoDB tables scans are index scans on the clustered primary key). Historically this sort of thing could be done with MyISAM and max_seeks_for_key, but that isn't so effective for InnoDB.

SpecialWhatLinksHere::showIndirectLinks queries on skewed data is one group of the queries this setting helps. Also likely a number of other queries using FORCE INDEX could benefit too. So far no adverse impact on other traffic or disk IO patterns.

Not necessarily a reason to abandon Aaron's patch; just an update because the patch is blocked on a schema change that won't be scheduled until after the TechOps meet.
Comment 13 Gerrit Notification Bot 2014-07-21 19:10:25 UTC
Change 117373 merged by jenkins-bot:
Redo WhatLinksHere query and add a *_from_namespace field to link tables

https://gerrit.wikimedia.org/r/117373
Comment 14 db [inactive,noenotif] 2014-07-31 07:44:42 UTC
Status: Merged

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links