
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T62826, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 60826 - Enable parallel processing of stub dump and full archive dump for same wiki.
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Wikistats
Version: unspecified
Hardware: All
OS: All
Priority: High
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-02-04 13:52 UTC by Erik Zachte
Modified: 2014-04-16 21:34 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---

Attachments: none

Description Erik Zachte 2014-02-04 13:52:21 UTC
Years ago Wikistats used to process the full archive dump for each wiki, the dump which contains the full text of every revision of every article. Only that type of dump file can yield word count, average article size and some other content-based metrics. For a list of affected metrics see all partially empty columns at e.g. http://stats.wikimedia.org/EN/TablesWikipediaEN.htm (first table).

As the dumps grew larger and larger this was no longer possible on a monthly schedule, at least for the largest Wikipedia wikis. Processing the English full archive dump takes more than a month now by itself. Some very heavy regexps are partially to blame. 

Many people have asked when the missing metrics will be revived. A pressing case was brought forward in the first days of 2014 in https://nl.wikipedia.org/wiki/Overleg_gebruiker:Erik_Zachte#Does German Wikipedia have a crisis? For example "Can you find out, if the growth of average size has significantly changed in 2013?" 

At the moment there is limited parallelism within Wikistats dump processing. Two wikis from different projects can be processed in parallel, as each project has its own set of input/output folders. But processing two Wikipedia wikis at the same time could bring interference problems, as there are some project-wide csv files. Not to mention processing stub and full archive dump for the same wiki at the same time, where all files for that wiki would be updated by two processes.

The simplest solution is to schedule full archive dump processing on a different server than stub dump processing (e.g. stat1 instead of stat1001?) and merge the few metrics that can only be collected from the full archive dumps into the csv files generated from the stub dumps.

This merge would require a separate script, which can fetch a csv file from one server and merge specific columns into the equivalent csv files on another server. 

This/these csv file(s) should be protected against concurrent access (semaphore? how?), or the merge step should be part of the round-robin job which processes dumps whenever they become available (the latter being slightly less safe, as there is a theoretical chance that a concurrent access could still occur, since there are on occasion manually scheduled extra runs).
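
A minimal sketch (in Python, purely for illustration) of the kind of concurrent-access protection hinted at above: a lock file acting as a crude semaphore around updates to a shared csv. The lock path, timeouts and the csv update step are assumptions, not the actual Wikistats setup.

# Hypothetical sketch: guard a shared csv against concurrent writers with a lock file.
# Paths, timeouts and the update step are illustrative, not the real Wikistats layout.
import errno
import os
import time

LOCK_PATH = "/tmp/StatisticsMonthly.csv.lock"  # assumed location

def acquire_lock(timeout=600, poll=5):
    """Create the lock file exclusively; wait up to `timeout` seconds for a running job."""
    deadline = time.time() + timeout
    while True:
        try:
            fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            if time.time() > deadline:
                return False
            time.sleep(poll)

def release_lock():
    os.unlink(LOCK_PATH)

if acquire_lock():
    try:
        pass  # update the shared csv file here
    finally:
        release_lock()
else:
    print("could not get lock; another run is updating the csv")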
Comment 1 Bingle 2014-02-04 14:00:44 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1429
Comment 2 Toby Negrin 2014-02-04 15:05:04 UTC
I'd really like to see if we can use hadoop for further processing of the dumps. 

We can easily set up a hadoop instance in labs -- anybody interested in taking a crack at this?

-Toby
Comment 3 Erik Zachte 2014-03-11 17:36:14 UTC
As discussed with Toby off-line, given the current functionality, replacing it with Hadoop will not be so simple. Possibly opportune, but some caution as to the ETA seems warranted.

The new job will need to incorporate several filters: in Wikistats countable namespaces are determined dynamically, redirects are filtered out with awareness of language-specific tags harvested from php files and WikiTranslate, and dumps need to be vetted for validity (ideally such housekeeping would be done by the dump process, but given the low bandwidth for dump maintenance for many years that might take a while, so right now the ugly approach of parsing html status files is used). Also word count is far from the straightforward function implemented in some languages: markup, headers, links etc. are first stripped, and the current approach is aware of ideographic languages and their different content density. This list is probably not exhaustive. Any rebuild will probably be less ambitious in some aspects (e.g. word count) but it will not be trivial.
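
To make the word count remark concrete, here is a deliberately simplified sketch (Python, illustration only): markup, headers and links are stripped first, and ideographic scripts are counted per character rather than per space-separated word. The regular expressions and the CJK character ranges are assumptions, not the actual Wikistats code.

# Hypothetical, much simplified word count in the spirit described above;
# not the actual Wikistats implementation.
import re

def word_count(wikitext, ideographic=False):
    text = re.sub(r"\{\{.*?\}\}", " ", wikitext, flags=re.S)        # drop templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # keep link labels only
    text = re.sub(r"<[^>]+>", " ", text)                            # strip html tags
    text = re.sub(r"^=+.*?=+\s*$", " ", text, flags=re.M)           # drop section headers
    text = re.sub(r"'{2,}", "", text)                               # bold/italic markup
    if ideographic:
        # crude assumption: count CJK characters instead of space-separated words
        return len(re.findall(r"[\u4e00-\u9fff\u3040-\u30ff]", text))
    return len(re.findall(r"\w+", text))

print(word_count("== Header ==\nSome '''bold''' text with a [[link|label]]."))  # prints 6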
Comment 4 Erik Zachte 2014-03-11 17:36:50 UTC
Set importance to high, as this is a widely deplored bug and I have been getting mails about it every few months since 2010.
Comment 5 Erik Zachte 2014-03-21 15:46:45 UTC
First step is done: 

==adapting wikistats scripts==

*new argument -F to force processing full archive dumps (regardless of dump size)
*Wikistats can now handle segmented dumps 
(which BTW differ in file name for wp:de and wp:en), 
e.g. see the first 100 lines or so in http://dumps.wikimedia.org/enwiki/20140304/ 
*Wikistats can now also detect, for segmented dumps, whether an error occurred during dump generation 
(by parsing the dump job status report 'index.html' and looking for 'failed' in the appropriate sections); 
if a failure is found, switch to the other dump format, 
and if neither dump format is valid, fall back to an older dump (see the sketch below)
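
A minimal sketch of the status check from the last bullet, assuming the dump job status report lists each job as an HTML list item containing a status word; the markup structure, URL and job name are illustrative, not the exact layout of dumps.wikimedia.org.

# Hypothetical sketch of checking a dump status page for failed jobs;
# the HTML structure assumed here is illustrative, not the exact dumps.wikimedia.org markup.
import re
import urllib.request

def dump_job_failed(status_url, job_keyword):
    """Return True if the section mentioning `job_keyword` also mentions 'failed'."""
    html = urllib.request.urlopen(status_url).read().decode("utf-8", "replace")
    for chunk in re.split(r"<li\b", html):   # assumed: one list item per dump job
        if job_keyword in chunk and "failed" in chunk.lower():
            return True
    return False

# Usage (illustrative URL and job name):
# if dump_job_failed("http://dumps.wikimedia.org/enwiki/20140304/", "pages-meta-history"):
#     switch to the other dump format, or to an older dump if neither is valid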

Second step has started

==collect counts from full archive dumps for Wikipedias only on stat1==

*this will run for several weeks probably
* see for progress http://stats.wikimedia.org/WikiCountsJobProgressCurrentStat1.html

Third step needs to be done

==merge results from stat1 into stat1002==

*make a small script that merges values (missing values only) from 
stat1:/[..]/StatisticsMonthly.csv into 
stat1002:/[..]/StatisticsMonthly.csv 
as part of the monthly Wikistats cycle stat1002:/[..]count_report_publish_wp.sh (see the sketch below)
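
A minimal sketch of such a merge script (Python, illustrative only): it fills empty cells in the target csv from the matching row in the source csv and never overwrites existing values. The key columns, delimiter and paths are assumptions about StatisticsMonthly.csv, not its actual layout.

# Hypothetical sketch: merge missing values only from one StatisticsMonthly.csv into another.
# Key columns, delimiter and paths are assumptions, not the actual file format.
import csv

def merge_missing(source_path, target_path, key_cols=2):
    """Fill empty cells in target rows with values from the matching source row."""
    with open(source_path, newline="") as f:
        source = {tuple(row[:key_cols]): row for row in csv.reader(f)}
    with open(target_path, newline="") as f:
        rows = list(csv.reader(f))
    for row in rows:
        src = source.get(tuple(row[:key_cols]))
        if not src:
            continue
        for i, val in enumerate(row):
            if val == "" and i < len(src) and src[i] != "":
                row[i] = src[i]   # fill missing value only, never overwrite existing data
    with open(target_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# usage (illustrative paths; in practice the source file would first be copied over from stat1)
# merge_missing("/srv/stat1/StatisticsMonthly.csv", "/srv/stat1002/StatisticsMonthly.csv")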
