
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T60754, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 58754 - Huge reports that could clog Wikimetrics may happen accidentally, add a warning
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Wikimetrics (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2013-12-20 18:19 UTC by Dan Andreescu
Modified: 2014-04-16 21:36 UTC
CC: 6 users
See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Attachments: (none)

Description Dan Andreescu 2013-12-20 18:19:53 UTC
Add a warning when a user tries to run a report that would return more than X data points, where X is sufficiently large.
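For illustration only, a minimal sketch of such a pre-run warning check, assuming a hypothetical size estimate (cohort size times metrics times time slices) and a made-up threshold; none of these names come from the Wikimetrics code base:

# Hypothetical sketch of the proposed warning; the names and the size
# estimate are assumptions, not actual Wikimetrics code.
WARN_THRESHOLD = 100_000  # "X": an assumed cutoff for "sufficiently large"

def estimated_data_points(cohort_size, metric_count, time_slices):
    """Rough upper bound on how many data points a report would return."""
    return cohort_size * metric_count * time_slices

def needs_warning(cohort_size, metric_count, time_slices, threshold=WARN_THRESHOLD):
    """True if the report is large enough to warrant a confirmation prompt."""
    return estimated_data_points(cohort_size, metric_count, time_slices) > threshold

# Example: a 50,000-user cohort, 3 metrics, 30 daily time slices
# -> 4,500,000 estimated data points, well above the threshold, so warn.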
Comment 1 Bingle 2013-12-20 18:27:08 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/1347
Comment 2 nuria 2014-01-17 11:28:04 UTC
Rather than a warning (which, in my experience, users often do not read), I think it would be better to determine the threshold "X" and not let users upload anything large enough to run into problems.
Comment 3 Dan Andreescu 2014-01-17 16:55:46 UTC
I suggested a hard limit and people like Dario and Jamie were against that.  It's possible that in some rare cases people might need to run very large reports.  I agree with you about the warning, but without getting into user roles and different privileges it's the only solution I see.
Comment 4 Philippe Verdy 2014-02-19 20:31:46 UTC
If there are huge cohorts, maybe the tool could automatically split them into subcohorts, schedule each subcohort separately, and create a temporary report storing results that would then be aggregated.
This is possible if the SQL queries contain only aggregatable items (all of them should be aggregatable, because Wikimetrics should only be used to generate aggregate data, to respect users' privacy).

So all data columns should specify the type of aggregate they use: COUNT, MIN, MAX, SUM.

Derived aggregates can be computed in a scheduled way using only these basic aggregates: this includes AVG (uses SUM and COUNT) and STDDEV or VAR (use SUM(data), SUM(data^2) and COUNT).
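
For illustration, a minimal sketch of combining per-subcohort partial aggregates into these derived aggregates; the data structure and function names are hypothetical, not Wikimetrics code:

import math

def combine(partials):
    # partials: one dict per subcohort with its partial COUNT, SUM and SUM(data^2)
    n = sum(p['count'] for p in partials)
    s = sum(p['sum'] for p in partials)
    s2 = sum(p['sum_sq'] for p in partials)
    avg = s / n
    var = s2 / n - avg ** 2          # population variance from the two sums
    return {'count': n, 'sum': s, 'avg': avg, 'var': var, 'stddev': math.sqrt(var)}

# Two subcohorts aggregated separately: data 1, 2, 3 and data 4, 5.
print(combine([
    {'count': 3, 'sum': 6.0, 'sum_sq': 14.0},
    {'count': 2, 'sum': 9.0, 'sum_sq': 41.0},
]))
# -> avg 3.0, var 2.0, stddev ~1.414, matching the full data set 1..5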

The scheduler would then report the status of each subcohort processed. If needed, it could be paused at any time once it has run for too long but has already generated enough data to create a valid report, and resumed later when the servers experience a lower workload. The scheduler should also be able to monitor the time or workload taken by each subcohort, in order to estimate and adjust the size of the next subcohort, or to insert variable delays before processing the next subcohort.

An SQL server admin could also kill an SQL query that takes too much time or too many resources: that query will fail, the scheduler will detect the failure and pause processing until the cohort parameters are adjusted and the scheduler is relaunched to restart the work from the last failed subcohort. This would allow manual tuning of these subcohort sizes.

(The cohort uploader may also consider splitting the cohort into multiple ones of reasonable size himself. The same cohort creator should not have multiple cohorts being processed at the same time, but he could schedule them in order.)
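
A minimal sketch of the size-adjustment part of the scheduling idea in this comment, assuming hypothetical run_subcohort() and combine() callables for the per-subcohort query and the aggregation step (not Wikimetrics code; pausing and resuming are omitted):

import time

TARGET_SECONDS = 60                 # assumed per-subcohort time budget
MIN_CHUNK, MAX_CHUNK = 500, 50_000  # assumed bounds on subcohort size

def run_report(user_ids, run_subcohort, combine, chunk=5_000):
    partials = []
    i = 0
    while i < len(user_ids):
        subcohort = user_ids[i:i + chunk]
        start = time.monotonic()
        partials.append(run_subcohort(subcohort))  # one aggregatable query
        elapsed = time.monotonic() - start
        i += chunk
        if elapsed > 0:
            # Grow or shrink the next subcohort toward the time budget.
            chunk = max(MIN_CHUNK, min(MAX_CHUNK, int(chunk * TARGET_SECONDS / elapsed)))
    return combine(partials)  # aggregate the partial results into one report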
