
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T48206, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46206 - Dump stats: rework flow to allow for parallel data gathering for multiple wikis
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Wikistats (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Low
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2013-03-16 12:48 UTC by Erik Zachte
Modified: 2014-03-11 17:14 UTC
CC: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Erik Zachte 2013-03-16 12:48:08 UTC
Scripts access shared CSV files per project, e.g. StatisticsMonthly.csv. Either:
- use a separate file per wiki, and combine them later in the last count step (semaphore; sketched below), or
- use a separate file per wiki, and adapt the report step to read per-wiki files (semaphore).
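
A minimal sketch of the first option, in Python rather than the Perl that Wikistats actually uses: each worker process writes only its own per-wiki file, so no two processes ever update the same file, and a final merge step builds the shared CSV. The file names, row layout, and wiki list here are hypothetical.

import csv
import glob
import multiprocessing

def gather_wiki(wiki):
    # Stand-in for the real per-wiki counting job; it writes only to
    # this wiki's private file, so no locking is needed while gathering.
    with open('StatisticsMonthly_%s.csv' % wiki, 'w', newline='') as f:
        csv.writer(f).writerow([wiki, '2013-03', 0])  # dummy row

def merge(out_path='StatisticsMonthly.csv'):
    # Last count step: combine all per-wiki files into the shared file
    # that the report step expects.
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob('StatisticsMonthly_*.csv')):
            with open(path, newline='') as f:
                writer.writerows(csv.reader(f))

if __name__ == '__main__':
    wikis = ['enwiki', 'dewiki', 'frwiki']  # hypothetical wiki list
    with multiprocessing.Pool() as pool:
        pool.map(gather_wiki, wikis)
    merge()

With per-wiki files, the merge is the only place that touches the shared file, so the semaphore mentioned above would only be needed there, if at all.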
Comment 2 Nilesh Chakraborty 2014-01-28 16:03:44 UTC
Could you give some more info about the problem? I can see that this has something to do with parallelizing some task, but I can't be sure what exactly it entails. Some pointers, please?

Sorry if it's too obvious.
Comment 3 Erik Zachte 2014-03-11 17:14:05 UTC
See https://bugzilla.wikimedia.org/show_bug.cgi?id=60826, where a somewhat less elegant and less powerful but simpler solution is proposed, with more explanation.

Main difference: here the idea is to allow processing of any number of wikis in parallel, by removing any chance that files will be updated simultaneously by several threads. In 60826 the idea is to keep the limitation of processing one wiki at a time per project, but to process stub and full archive dumps round robin on different servers. Periodically a rather small script harvests the extra metrics from the key metrics file StatisticsMonthly.csv on the full archive server and updates the empty columns in the same file on the stub dump server. 60826 is less elegant, as it requires syncing between two servers, and less powerful, as it still doesn't allow ad hoc processing of dumps without suspending the round robin process, but it is much simpler to implement. This bug would require maintenance in tens of places in several source files, as the Wikistats counts job generates so many files.
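
A rough sketch of that harvesting step, assuming rows in both copies of StatisticsMonthly.csv share a key in the first column and that the stub-dump copy has empty fields where the full-archive metrics belong; the real column layout may differ, and Wikistats itself is Perl, not Python.

import csv

def harvest(full_archive_csv, stub_csv):
    # Index the full-archive rows by their key column.
    with open(full_archive_csv, newline='') as f:
        full_rows = {row[0]: row for row in csv.reader(f) if row}

    with open(stub_csv, newline='') as f:
        stub_rows = list(csv.reader(f))

    # Fill in fields that are still empty on the stub-dump side.
    for row in stub_rows:
        full = full_rows.get(row[0]) if row else None
        if full is None:
            continue
        for i in range(min(len(row), len(full))):
            if row[i] == '':
                row[i] = full[i]

    # Rewrite the stub-dump copy with the harvested values.
    with open(stub_csv, 'w', newline='') as f:
        csv.writer(f).writerows(stub_rows)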
