Last modified: 2014-01-03 15:22:38 UTC
This issue was converted from https://jira.toolserver.org/browse/DBQ-7. Summary: Shortpages without templates and a single contributor Issue type: Task - A task that needs to be done. Priority: Minor Status: Done Assignee: Andreas Gasser <andreas_gasser@yahoo.com> ------------------------------------------------------------------------------- From: MZMcBride <mzmcbride@gmail.com> Date: Sun, 06 Jan 2008 01:22:25 ------------------------------------------------------------------------------- I'd like a query run that can get a list of all shortpages on en.wiki that don't contain templates and only have a single contributor. Erwin provided me with the query he uses (below), however I've been told that it can probably be optimized or written in a more efficient manner. Thanks! /usr/bin/mysql -hsql-s1 --skip-column-names enwiki_p > /home/erwin85/public_html/dbq/shortpages_enwiki.txt <<EOF SELECT CONCAT(ns_name, ':', page_title, ' - ', page_len) FROM page LEFT JOIN toolserver.namespace ON page_namespace = ns_id WHERE page_len < 50 AND page_is_redirect = 0 AND page_namespace NOT IN (2, 3) AND (SELECT COUNT(1) FROM templatelinks WHERE tl_from = page_id) = 0 AND (SELECT COUNT(DISTINCT rev_user_text) FROM revision WHERE rev_page = page_id) = 1 AND dbname = 'enwiki_p' ORDER BY page_len ASC EOF
------------------------------------------------------------------------------- From: DaB. <dab@ts.wikimedia.org> Date: Sun, 06 Jan 2008 01:52:10 ------------------------------------------------------------------------------- SELECT page_namespace,page_title,page_len FROM revision,page LEFT JOIN templatelinks ON tl_from = page_id WHERE page_len<50 AND page_is_redirect = 0 AND page_namespace NOT IN (2, 3) AND rev_page=page_id AND tl_from IS NULL GROUP BY rev_user HAVING count(DISTINCT rev_user)<2 ORDER BY page_len LIMIT 20; seems to do what you want (the join with toolserver.namespace is missing, that's trivial ![][1]). But perhaps Duesentrieb knows a better way ![][2] Sincerly, DaB. P.S: on dewp the query needs 8 min 5.23 sec and 0.31 sec on cswp. [1]: https://jira.toolserver.org/images/icons/emoticons/wink.gif [2]: https://jira.toolserver.org/images/icons/emoticons/smile.gif
------------------------------------------------------------------------------- From: Daniel Kinzler <daniel@brightbyte.de> Date: Sun, 06 Jan 2008 14:16:01 ------------------------------------------------------------------------------- The result is at <http://tools.wikimedia.de/~daniel/misc/shortpages-enwiki.txt> (39441), details at <http://wiki.ts.wikimedia.org/view/Query_service/short_pages_enwiki>. I used the original query, only cleaned it up a bit. I got rid of the first subquery, but as it turns out, mysql was already smart enough to optimize it away anyway. @DaB: avoiding the second subquery at the cost of running a GROUP on all revisions on enwiki probably makes things worse, not better. Running group on such a huge table is expensive, and running a simple subquery on a set of relatively few pages is probably faster - this is true because page_len < 50 is a pretty strong limitation; if we had no such strong resatriction, grouping would probably be faster.
------------------------------------------------------------------------------- From: DaB. <dab@ts.wikimedia.org> Date: Sun, 06 Jan 2008 23:28:51 ------------------------------------------------------------------------------- mm, your query is a little bit slower then mine. Your query needs 10 min 26.73 sec on dewp and 1 min 24.52 sec on cswp. Any idea, why? (That's only a question to learn ![][1]) [1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif
------------------------------------------------------------------------------- From: Daniel Kinzler <daniel@brightbyte.de> Date: Mon, 07 Jan 2008 00:14:00 ------------------------------------------------------------------------------- when i tested, it was a bit faster ![][1] the problem is: caches. to get a reliable benchmark, you would have to try each on a cold cache. if you run either qiery twice, the second go is **much** faster. this may also make the other query faster. also, some query run by someone else just before you tried may have loaded the relevant index into the cache, etc... [1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif
------------------------------------------------------------------------------- From: Andreas Gasser <andreas_gasser@yahoo.com> Date: Tue, 28 Jun 2011 10:09:48 ------------------------------------------------------------------------------- http://toolserver.org/
This bug was imported as RESOLVED. The original assignee has therefore not been set, and the original reporters/responders have not been added as CC, to prevent bugspam. If you re-open this bug, please consider adding these people to the CC list: Original assignee: (none) CC list: b@mzmcbride.com, wikimedia-bugzilla@dabpunkt.eu, daniel.kinzler@wikimedia.de