Last modified: 2014-03-11 13:32:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T48198, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46198 - Dump stats: switch to persistent stats rather than monthly regenerated stats
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Wikistats
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://lists.wikimedia.org/pipermail/...
Depends on:
Blocks: 46208
Reported: 2013-03-16 12:22 UTC by Erik Zachte
Modified: 2014-03-11 13:32 UTC
CC List: 1 user

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Erik Zachte 2013-03-16 12:22:55 UTC
On Wed, Mar 6, 2013 at 10:47 PM, Erik Zachte wrote:

I just realized that the YoY +2%, which LIMN shows and which I reported earlier, will probably be around 0% in the next report. As always, the latest active editor count shrinks by 1% to 2% in the subsequent month, as more quick deletions of Jan 2012 content will still happen.

ErikM March 13, 2013: 

I’d like to get this “deletion drift” issue on the mid-term agenda for the analytics team. While I understand why you chose this approach with WikiStats, I think the drawbacks outweigh the benefits. Having comparative data from month to month continually shift does materially impact our ability to plan and to understand trends in the data. Freezing the data at month-end and only making corrections in cases of genuine errors in measurement seems much preferable to me.

I think with such an approach, we can simply take as a given that some % of TAE are not making constructive contributions. That is a given anyway as we can’t account for quality of edits.

I realize this is very non-trivial given the way the data pipeline currently works, but I at least want to be on record that we should aim for numbers being frozen at measurement point and only corrected in case of measurement errors, as a design characteristic. That applies to article counts and other such measures as well—if we measured 3M articles in January 2012, and 2.5M in February 2012 because 500K articles were deleted (absurd example), that does not negate our 3M article measurement from January.
Comment 2 Erik Zachte 2014-03-11 12:17:46 UTC
Pros and cons of permanent stats have been discussed endlessly over the years. The current inclination among the analytics team and key users is to favor them. Updating historic stats due to new insights is deemed less important than giving the user a sense of stability. Updating due to bug fixing is still on the table.

Technically it could be added as a feature to the current Wikistats scripts, as follows: a runtime argument tells Wikistats whether to update all historic months or only a range of months (default: last month only).

Then the routines in WikiCountsOutput.pm that update the CSV files (all or some of them) need to be adjusted so that they add/replace data only for a given period, not all data for a given wiki.

A minimal implementation would be to do this only for key metrics in StatisticsMonthly.csv.
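A rough sketch of the period-limited update described above, in Python rather than the Perl of the actual Wikistats scripts. Everything here is an assumption made for illustration: the row layout (wiki code, month as "YYYY-MM", then metric values) is not the real StatisticsMonthly.csv format, and the function names `months_to_update` and `merge_monthly_rows` are hypothetical.

```python
def months_to_update(latest_month, how_many=1):
    """Return the set of 'YYYY-MM' months that may be overwritten.

    how_many=1 mirrors the proposed default (last month only).
    """
    year, month = map(int, latest_month.split("-"))
    months = set()
    for _ in range(how_many):
        months.add(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return months


def merge_monthly_rows(existing, fresh, update_months):
    """Keep existing rows frozen; take fresh rows only for update_months.

    Rows are lists whose first two fields are (wiki, month); this layout
    is assumed, not taken from the real CSV files.
    """
    merged = {(row[0], row[1]): row for row in existing}
    for row in fresh:
        if row[1] in update_months:
            merged[(row[0], row[1])] = row
    return [merged[key] for key in sorted(merged)]


# Previously published rows stay as they were; only 2013-03 is added.
existing = [["enwiki", "2013-01", "100"], ["enwiki", "2013-02", "90"]]
fresh = [
    ["enwiki", "2013-01", "98"],  # recount after deletions, ignored
    ["enwiki", "2013-02", "95"],  # recount after deletions, ignored
    ["enwiki", "2013-03", "80"],
]
result = merge_monthly_rows(existing, fresh, months_to_update("2013-03"))
```

Passing a larger `how_many` (or the full set of months) would correspond to the "update all historic months" mode, for example after a bug fix.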

As future of dump based Wikistats scripts is uncertain (Hadoop will likely take over) costs may outweigh benefits.
Comment 3 Nemo 2014-03-11 12:25:12 UTC
(In reply to Erik Zachte from comment #2)
> As future of dump based Wikistats scripts is uncertain [...]
> costs may outweigh benefits.

Did you consider the "archives" alternative, i.e. updating everything but archiving the old HTML instead of wiping it? If space is really an issue, I believe most space is taken by category trees and a few other things, so it would be enough to refrain from archiving those.
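A minimal sketch of that "archives" idea, assuming the generated HTML lives in a single directory tree and that the space-heavy category pages can be recognised by file name. The function name `archive_reports`, the directory arguments and the `Category*` pattern are all hypothetical, not taken from the Wikistats layout.

```python
import shutil
from pathlib import Path


def archive_reports(html_dir, archive_root, label, skip_patterns=("Category*",)):
    """Copy the current HTML output to archive_root/label before regeneration.

    Files and directories matching skip_patterns (assumed names for the
    large category trees) are left out of the archive to save space.
    """
    target = Path(archive_root) / label
    shutil.copytree(
        html_dir,
        target,
        ignore=shutil.ignore_patterns(*skip_patterns),
    )
    return target
```

Calling this once per run with a label such as the report month, before the regular regeneration overwrites the HTML, would keep each published snapshot available for later comparison.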
