Last modified: 2014-01-03 16:01:57 UTC
This issue was converted from https://jira.toolserver.org/browse/DBQ-137.
Summary: statistics of different languages
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>
-------------------------------------------------------------------------------
From: Minn Seok Choi <MinnSeok.Choi@gmail.com>
Date: Wed, 20 Apr 2011 19:21:49
-------------------------------------------------------------------------------
I am not sure whether it is possible to retrieve some data from the Wikipedia databases. If it is, I would like to get the following variables for each of the Wikipedias in the list below:

A. total pages in each namespace (excluding redirects)
(1) the number of article pages (i.e. main namespace pages)
(2) the number of talk pages
(3) the number of user pages
(4) the number of user talk pages
(5) the number of Wikipedia pages
(6) the number of Wikipedia talk pages
(7) the number of file pages
(8) the number of file talk pages
(9) the number of template pages
(10) the number of template talk pages
(11) the number of portal pages
(12) the number of portal talk pages
(13) the number of help pages
(14) the number of help talk pages

B. total edits to each namespace (excluding redirects)
(15) the number of article pages (i.e. main namespace pages)
(16) the number of talk pages
(17) the number of user pages
(18) the number of user talk pages
(19) the number of Wikipedia pages
(20) the number of Wikipedia talk pages
(21) the number of file pages
(22) the number of file talk pages
(23) the number of template pages
(24) the number of template talk pages
(25) the number of portal pages
(26) the number of portal talk pages
(27) the number of help pages
(28) the number of help talk pages

C. size of each namespace in bytes (excluding redirects)
(29) the size of article pages (i.e. main namespace pages)
(30) the size of talk pages
(31) the size of user pages
(32) the size of user talk pages
(33) the size of Wikipedia pages
(34) the size of Wikipedia talk pages
(35) the size of file pages
(36) the size of file talk pages
(37) the size of template pages
(38) the size of template talk pages
(39) the size of portal pages
(40) the size of portal talk pages
(41) the size of help pages
(42) the size of help talk pages

D. URL for certain pages
(43) the URL of the community portal page (if available)
(44) the URL of the village pump (if available)
(45) the URL of the help desk
(46) the URL of the Featured article portal

== the Wikipedia list (68 languages) ==
en English
de German
fr French
pl Polish
it Italian
ja Japanese
es Spanish
ru Russian
pt Portuguese
nl Dutch
sv Swedish
zh Chinese
ca Catalan
no Norwegian (Bokmål)
uk Ukrainian
fi Finnish
vi Vietnamese
cs Czech
hu Hungarian
ko Korean
ro Romanian
id Indonesian
tr Turkish
da Danish
ar Arabic
eo Esperanto
sr Serbian
lt Lithuanian
sk Slovak
he Hebrew
ms Malay
bg Bulgarian
sl Slovenian
hr Croatian
et Estonian
simple Simple English
th Thai
eu Basque
nn Norwegian (Nynorsk)
el Greek
az Azerbaijani
la Latin
tl Tagalog
te Telugu
ka Georgian
sh Serbo-Croatian
be-x-old Belarusian (Taraškievica)
lv Latvian
jv Javanese
sq Albanian
bs Bosnian
is Icelandic
ta Tamil
an Aragonese
oc Occitan
bn Bengali
ml Malayalam
af Afrikaans
ur Urdu
zh-yue Cantonese
ast Asturian
yo Yoruba
wa Walloon
yi Yiddish
uz Uzbek
li Limburgian
ia Interlingua
szl Silesian
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Fri, 22 Apr 2011 18:39:33
-------------------------------------------------------------------------------
The following is feasible: 1-14 and (maybe) 29-42. Please confirm that this data alone is useful to you, and please give me the language codes (like en for English, sq for Albanian) for the languages above (I'm too lazy to look them up myself :P).
-------------------------------------------------------------------------------
From: Minn Seok Choi <MinnSeok.Choi@gmail.com>
Date: Sat, 23 Apr 2011 08:54:04
-------------------------------------------------------------------------------
Thanks, Hoo man. 1-14 and 29-42 are useful for me. I updated my query request by adding the language codes, following your comment.
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Sun, 24 Apr 2011 18:08:20
-------------------------------------------------------------------------------
Ok, fine, thanks for the language codes :)

Code (did it in PHP because I once again was too lazy for bash :P):

#!/bin/php
<?php
// Database name prefixes of the 68 requested wikis (hyphens written as underscores).
$langcodes = array('en', 'de', 'fr', 'pl', 'it', 'ja', 'es', 'ru', 'pt', 'nl', 'sv', 'zh', 'ca', 'no', 'uk', 'fi', 'vi', 'cs', 'hu', 'ko', 'ro', 'id', 'tr', 'da', 'ar', 'eo', 'sr', 'lt', 'sk', 'he', 'ms', 'bg', 'sl', 'hr', 'et', 'simple', 'th', 'eu', 'nn', 'el', 'az', 'la', 'tl', 'te', 'ka', 'sh', 'be_x_old', 'lv', 'jv', 'sq', 'bs', 'is', 'ta', 'an', 'oc', 'bn', 'ml', 'af', 'ur', 'zh_yue', 'ast', 'yo', 'wa', 'yi', 'uz', 'li', 'ia', 'szl');
// The results of all wikis are appended to this one file.
$file = '../public_html/dbq/dbq-137.txt';
foreach($langcodes as $lang) {
    // Page count and summed page length per namespace, redirects excluded.
    $query = 'SELECT /* SLOW_OK */ \'' . $lang . '\' as lang, page_namespace, COUNT(*) as page_count, SUM(page_len) as namespace_size FROM page WHERE page_namespace IN(0,1,2,3,4,5,6,7,10,11,100,101,12,13) AND page_is_redirect = 0 GROUP BY page_namespace;';
    echo 'Executing "' . $query .'" on ' . $lang . "wiki_p\n";
    exec('mysql --host=' . $lang . 'wiki-p.rrdb.toolserver.org --database=' . $lang . 'wiki_p -e"' . $query . '" | cat >> ' . $file);
}
?>

Result:
http://toolserver.org/~hoo/dbq/dbq-137.txt (plain text)
http://toolserver.org/~hoo/dbq/dbq-137.csv (Excel readable csv)
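
The thread does not say how the plain text result was turned into the CSV file. The following is only a minimal sketch of one way to do it, in Python rather than the PHP used above, assuming the text file holds the tab-separated batch output that the mysql client writes when its output is piped, with the column-name row repeated once per wiki. The function name batch_output_to_csv, the NAMESPACE_NAMES table and the sample rows (with placeholder language codes xx and yy) are all made up for illustration. Namespaces 0 to 13 are the standard MediaWiki numbers; treating 100 and 101 as Portal and Portal talk is an assumption that holds on many Wikipedias but not necessarily on all 68.

```python
import csv
import io

# Namespace numbers from the query's IN(...) list.
# 0-13 are standard MediaWiki namespaces; 100/101 are assumed to be
# Portal / Portal talk, which is common but not guaranteed on every wiki.
NAMESPACE_NAMES = {
    0: "Article", 1: "Talk", 2: "User", 3: "User talk",
    4: "Wikipedia", 5: "Wikipedia talk", 6: "File", 7: "File talk",
    10: "Template", 11: "Template talk", 12: "Help", 13: "Help talk",
    100: "Portal", 101: "Portal talk",
}

# Column aliases used in the SELECT statement.
HEADER = ["lang", "page_namespace", "page_count", "namespace_size"]


def batch_output_to_csv(text):
    """Convert concatenated tab-separated mysql batch output to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER + ["namespace_name"])
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip blank or malformed lines and the repeated column-name rows.
        if len(fields) != len(HEADER) or fields == HEADER:
            continue
        lang, ns, count, size = fields
        name = NAMESPACE_NAMES.get(int(ns), "unknown")
        writer.writerow([lang, ns, count, size, name])
    return out.getvalue()


# Invented sample input; the numbers are not real results.
sample = (
    "lang\tpage_namespace\tpage_count\tnamespace_size\n"
    "xx\t0\t10\t2500\n"
    "xx\t1\t4\t800\n"
    "lang\tpage_namespace\tpage_count\tnamespace_size\n"
    "yy\t0\t3\t120\n"
)
csv_text = batch_output_to_csv(sample)
print(csv_text)
```

Adding the namespace_name column is a convenience for reading the file in a spreadsheet; the numeric page_namespace column is kept so the rows can still be matched against items 1-14 and 29-42 of the request.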
-------------------------------------------------------------------------------
From: Minn Seok Choi <MinnSeok.Choi@gmail.com>
Date: Mon, 25 Apr 2011 19:53:25
-------------------------------------------------------------------------------
Thank you so much, Hoo man.
This bug was imported as RESOLVED. The original assignee has therefore not been set, and the original reporters/responders have not been added as CC, to prevent bugspam. If you re-open this bug, please consider adding these people to the CC list: Original assignee: hoo@online.de CC list: hoo@online.de