Last modified: 2014-05-21 09:23:21 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from pages displaying bug reports and their history, links might be broken. See T67500, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 65500 - Add sampling support in EventLogging
Status: NEW
Product: Analytics
Classification: Unclassified
Component: EventLogging
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-05-19 17:49 UTC by Dario Taraborelli
Modified: 2014-05-21 09:23 UTC

Description Dario Taraborelli 2014-05-19 17:49:41 UTC
Different teams have implemented ad-hoc solutions to introduce sampling in EventLogging, in order to measure feature usage in cases where a sample provides sufficient data to answer a research question.

In some cases, sampling needs to be applied to all events (so that, for example, only 1 out of 1000 events is logged). In other cases, unique clients need to be sampled by setting a session token so that only data for clients included in the sample is collected. 
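
The two modes described above can be sketched roughly as follows. This is only an illustration of the logic, written in Python for brevity; the function names `event_in_sample` and `client_in_sample`, and the choice of hashing the session token into buckets, are assumptions made for this example and not part of EventLogging.

```python
import hashlib
import random
from typing import Optional


def event_in_sample(rate: int, rng: Optional[random.Random] = None) -> bool:
    """Event-level sampling: keep roughly 1 out of `rate` events.

    Hypothetical helper; each call is an independent random draw.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    source = rng if rng is not None else random
    return source.randrange(rate) == 0


def client_in_sample(session_token: str, rate: int) -> bool:
    """Client-level sampling: keep all events from roughly 1 out of `rate` clients.

    Hypothetical helper; the decision is derived from a hash of the token,
    so the same client always gets the same answer for a given rate.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    digest = hashlib.sha256(session_token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % rate
    return bucket == 0
```

With event-level sampling a single client may have some events logged and others dropped; with client-level sampling a client is either fully in or fully out of the sample.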

This pattern is sufficiently common to justify the creation of a general purpose solution to the problem (the most recent request for sampled data is [1]). The desired sampling method and rate could be specified via a dedicated element of a JSON schema; by default no sampling would be applied.

[1] http://lists.wikimedia.org/pipermail/analytics/2014-May/002053.html
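
To make the schema-based proposal concrete, here is a minimal sketch of how a dedicated sampling element might be read, with no sampling as the default. The `sampling` key, its `method` and `rate` fields, and the method names are all hypothetical; nothing here reflects an existing EventLogging schema format.

```python
import json

# Hypothetical schema fragment: the "sampling" element and its keys are
# assumptions for illustration, not part of any existing EventLogging schema.
SCHEMA_JSON = """
{
  "description": "Example schema",
  "properties": {
    "action": {"type": "string", "required": true}
  },
  "sampling": {"method": "event", "rate": 1000}
}
"""

VALID_METHODS = {"none", "event", "client"}


def sampling_config(schema: dict) -> tuple:
    """Return (method, rate), defaulting to no sampling when unspecified."""
    element = schema.get("sampling")
    if element is None:
        return ("none", 1)
    method = element.get("method", "none")
    rate = element.get("rate", 1)
    if method not in VALID_METHODS:
        raise ValueError(f"unknown sampling method: {method!r}")
    if not isinstance(rate, int) or rate < 1:
        raise ValueError("rate must be a positive integer")
    return (method, rate)


print(sampling_config(json.loads(SCHEMA_JSON)))  # ('event', 1000)
print(sampling_config({"properties": {}}))       # ('none', 1)
```
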
Comment 1 nuria 2014-05-20 09:49:39 UTC
Just a note about wording, as the terminology in this bug is confusing. We are mixing a sampling ratio (1:100) with, let's say, a 'statistical sample' (a set of users/requests treated differently from the majority).

> In some cases, sampling needs to be applied to all events (so that, for example,
> only 1 out of 1000 events is logged)
Even if we keep track of the sampling ratio in the schema (up for discussion), that would likely just be an informative number in the short term. It likely will not be used to decide whether an event needs to be created and logged, as doing those types of checks every time an event is generated could turn out to be a performance bottleneck (this depends on caching policies and bootstrapping of schemas).
 

> In other cases, unique clients need to be sampled
> by setting a session token so that only data for
> clients included in the sample is collected.
I do not think we want EL in any way to keep track of users or sessions to decide whether data needs to be logged. EL is a lightweight system to keep track of events, and as such it is agnostic to the events being logged. I do not see us making any modifications in this regard to EL clients in the near future.
Comment 2 nuria 2014-05-21 09:23:21 UTC
Adding comments posted by ori on e-mail thread:

"to do this [sampling] in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient."
