Last modified: 2014-01-03 15:58:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T61374, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 59374 - DBQ-121 Sublist of the "List of pages of English Wikipedia with most interwikis"
DBQ-121 Sublist of the "List of pages of English Wikipedia with most interwikis"
Status: RESOLVED FIXED
Product: Tool Labs tools
Classification: Unclassified
Database Queries (Other open bugs)
unspecified
All All
: Unprioritized major
: ---
Assigned To: Bugzilla Bug Importer (valhallasw)
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-01-03 15:58 UTC by Bugzilla Bug Importer (valhallasw)
Modified: 2014-01-03 15:58 UTC (History)
0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:29 UTC
This issue was converted from https://jira.toolserver.org/browse/DBQ-121.
Summary: Sublist of the "List of pages of English Wikipedia with most interwikis"
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>

-------------------------------------------------------------------------------
From: Helder <helder.wiki@gmail.com>
Date: Wed, 26 Jan 2011 19:54:22
-------------------------------------------------------------------------------

Hi!

I would like to request two lists for use on Portuguese Wikipedia:

  * A list of 5000 pages of English Wikipedia with more interwikilinks;
  * The list of pages in the first list which doesn't have an interwiki to Portuguese Wikipedia (pt.wikipedia);

This request is analogous to <del>DBQ-91</del>, so I think it should be easy to generate.  
Please, could you also inform how long does it takes to get such lists from database?

Thank you very much
Comment 1 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:31 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Thu, 27 Jan 2011 15:38:07
-------------------------------------------------------------------------------

The first query needed 6m58.933s to execute, the second 1m12.616s.  
Further, the second only includes the page 2016, I don't know if that can be true, but I couldn't find an error in my Queries ![][1]

SQL:  
(first query)
    
    SELECT page.page_title as page FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY ll_from ORDER BY COUNT(*) DESC LIMIT 5000;
    

(second query)
    
    SELECT page.page_title as page FROM (SELECT * FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY ll_from ORDER BY COUNT(*) DESC LIMIT 5000) as langlinks INNER JOIN page ON langlinks.ll_from = page.page_id LEFT JOIN (SELECT COUNT(*) as bool, ll_from FROM langlinks WHERE ll_lang = 'pt' GROUP BY ll_from) as tmp ON tmp.ll_from = langlinks.ll_from WHERE isnull(tmp.bool);
    

Results:  
http://toolserver.org/~hoo/dbq/dbq-121.txt (first query)  
http://toolserver.org/~hoo/dbq/dbq-121_2.txt (second query)

   [1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif
Comment 2 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:32 UTC
-------------------------------------------------------------------------------
From: Helder <helder.wiki@gmail.com>
Date: Thu, 27 Jan 2011 17:26:46
-------------------------------------------------------------------------------

Hi!

Thank you Hoo man!

Since the second list has only one page, do you mind to expand it to some reasonable value above 5000 (10000 maybe?) so that we can get a list with more pages? If possible, could you also add a column with the number of interwiki links in the resulting list?
Comment 3 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:34 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Thu, 27 Jan 2011 19:23:00
-------------------------------------------------------------------------------

I reran the second query with the top 10,000 pages and it is displaying the number interwiki links now:  
http://toolserver.org/~hoo/dbq/dbq-121_3.txt

Fell free to ask, if you want even more.
Comment 4 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:36 UTC
-------------------------------------------------------------------------------
From: Helder <helder.wiki@gmail.com>
Date: Fri, 28 Jan 2011 10:31:15
-------------------------------------------------------------------------------

Hi again!

It seems that the criteria to filter the list is still too restrictive. Maybe one of these alternatives (both?) could give us better results:

  * A list of en.wp pages with more than 20 interwiki links and no interwiki to pt.wp
  * A sublist of articles in the "list of 5000 en.wp pages with more interwikis" having an interwiki to a "pt.wp page whose size is less than 3kb" (instead of having no interwiki to 'pt'). In this case, it would be good to sort by size of Portuguese article, so that the community can focus on improving the small articles. If possible, this list should have 4 columns: English title, number of interwikis, Portuguese title, size of Portuguese article.

Are those lists still feasible?

Thanks for helping us
Comment 5 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:37 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Fri, 28 Jan 2011 22:45:37
-------------------------------------------------------------------------------

Both done ![][1]

SQL:  
(first query)
    
    SELECT COUNT(*) as links, page.page_title as page FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY langlinks.ll_from HAVING links > 20 ORDER BY links DESC;
    

(second query)  
This one was a bit more complicated, I first had to pull the page list from enwiki_p in a user table, then dump that user table and import it to the server with ptwiki_p on. (MySQL doesn't support JOINS between different Servers yet)

This is the SELECT INSERT into the user table:
    
    INSERT INTO u_hoo.dbq121 SELECT langlinks.ll_title FROM (SELECT * FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY ll_from ORDER BY COUNT(*) DESC LIMIT 5000) as langlinks INNER JOIN page ON langlinks.ll_from = page.page_id LEFT JOIN (SELECT COUNT(*) as bool, ll_from FROM langlinks WHERE ll_lang = 'pt' GROUP BY ll_from) as tmp ON tmp.ll_from = langlinks.ll_from WHERE tmp.bool IS NOT NULL;
    

After the dump and import to sql 2:
    
    SELECT page.page_len as page_size, dbq121.page_title as page FROM u_hoo.dbq121 INNER JOIN page ON dbq121.page_title = page.page_title WHERE page.page_len < 3072 AND page.page_namespace = 0 ORDER BY page.page_len DESC;

Results:  
http://toolserver.org/~hoo/dbq/dbq-121_4.txt (first query)  
http://toolserver.org/~hoo/dbq/dbq-121_5.txt (second query)

   [1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif
Comment 6 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:39 UTC
-------------------------------------------------------------------------------
From: Helder <helder.wiki@gmail.com>
Date: Tue, 01 Feb 2011 15:33:16
-------------------------------------------------------------------------------

Thank you very much!

I just have one question: The last line of second list is  
17 Alfa  
but the article has 16 only characters:  
"#REDIRECT [[Α]]\n"  
as you can see here:  
http://pt.wikipedia.org/w/index.php?title=Alfa&action=edit

Is there a bug making that value wrong? How should we interpret the numbers in that column?
Comment 7 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:40 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Tue, 01 Feb 2011 15:51:52
-------------------------------------------------------------------------------

The numbers are right, they just show the page size in bytes, which may make them vary. Further the additional "\n" at the end (which stands for a new line), is another char ![][1]

   [1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif
Comment 8 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:42 UTC
-------------------------------------------------------------------------------
From: Helder <helder.wiki@gmail.com>
Date: Tue, 01 Feb 2011 16:58:51
-------------------------------------------------------------------------------

Yep, this is the problem: without the new line, there are only 15 characters in that wikitext, so it still doesn't seems to be correct. I've noticed this on larger pages as well, like the page "L" which has 3006 in the first column, but has 2892 characters (counted by putting  
javascript:alert(jQuery('#wpTextbox1').text().length)  
in address bar when editing the page:  
http://pt.wikipedia.org/w/index.php?title=L&oldid=22987621&action=edit  
)

But I think I understood: after checking the value of `PAGESIZE:{{FULLPAGENAME`}} for a page with "c" and other with "ç" I noticed this last character takes two bytes instead of only one. So, never mind... =)
Comment 9 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:58:44 UTC
This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: hoo@online.de
CC list: hoo@online.de

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links