Last modified: 2014-05-21 09:23:21 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T67500, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 65500 - Add sampling support in EventLogging


Summary:	Add sampling support in EventLogging

Status:	NEW

Product:	Analytics
Classification:	Unclassified
Component:	EventLogging (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-05-19 17:49 UTC by Dario Taraborelli
Modified:	2014-05-21 09:23 UTC (History)
CC List:	9 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Dario Taraborelli 2014-05-19 17:49:41 UTC

Different teams have implemented ad-hoc solutions to introduce sampling in EventLogging in order to perform measurements of the usage of features where a sample provides sufficient data to answer a research question.

In some cases, sampling needs to be applied to all events (so that, for example, only 1 out of 1000 events is logged). In other cases, unique clients need to be sampled by setting a session token so that only data for clients included in the sample is collected. 
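As an illustration, the two cases above could be sketched in JavaScript roughly as follows. The function names, the injectable random source, and the string hash are hypothetical choices made for this sketch and are not EventLogging APIs.

```javascript
// Hypothetical helpers for illustration; neither is part of EventLogging.

// Event-level sampling: each event is kept independently with
// probability 1/rate (rate = 1000 keeps roughly 1 in 1000 events).
// `random` is injectable so the behaviour can be checked deterministically.
function shouldLogEvent(rate, random = Math.random) {
  return random() < 1 / rate;
}

// Client-level sampling: hash the session token into a bucket so that a
// given client is either always in the sample or always out of it.
function isClientInSample(sessionToken, rate) {
  let hash = 0;
  for (let i = 0; i < sessionToken.length; i++) {
    // Simple unsigned 32-bit string hash (illustrative, not cryptographic).
    hash = (Math.imul(hash, 31) + sessionToken.charCodeAt(i)) >>> 0;
  }
  return hash % rate === 0;
}
```

With event-level sampling the same client may contribute some events and not others; with token-based sampling the decision is stable for as long as the token is.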

This pattern is sufficiently common to justify the creation of a general purpose solution to the problem (the most recent request for sampled data is [1]). The desired sampling method and rate could be specified via a dedicated element of a JSON schema; by default no sampling would be applied.
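A minimal sketch of what such a schema element might look like, with no sampling applied when it is absent. The `sampling` key, its `method` and `rate` fields, the method names, and the example schema are all assumptions made for illustration; existing EventLogging schemas define no such element.

```javascript
// Hypothetical: a "sampling" element is an assumption of this sketch.
const DEFAULT_SAMPLING = { method: 'none', rate: 1 };
const ALLOWED_METHODS = ['none', 'event', 'session'];

// Read the sampling settings from a schema object, falling back to
// "no sampling" when the schema does not declare any.
function getSamplingConfig(schema) {
  const declared = schema.sampling;
  if (!declared) {
    return { ...DEFAULT_SAMPLING };
  }
  if (!ALLOWED_METHODS.includes(declared.method)) {
    throw new Error('Unknown sampling method: ' + declared.method);
  }
  if (!Number.isInteger(declared.rate) || declared.rate < 1) {
    throw new Error('Sampling rate must be a positive integer');
  }
  return { method: declared.method, rate: declared.rate };
}

// Example schema (made up) asking for 1 in 1000 events to be logged.
const exampleSchema = {
  title: 'ExampleFeatureUsage',
  properties: { action: { type: 'string' } },
  sampling: { method: 'event', rate: 1000 }
};
```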

[1] http://lists.wikimedia.org/pipermail/analytics/2014-May/002053.html

Comment 1 nuria 2014-05-20 09:49:39 UTC

Just a note about wording, as the terminology in this bug is confusing. We are mixing a sampling ratio (1:100) with, let's say, a 'statistical sample' (a set of users/requests that gets a different treatment from the majority).

> In some cases, sampling needs to be applied to all events (so that, for example,
> only 1 out of 1000 events is logged)
Even if we keep track of the sampling ratio in the schema (up for discussion), that will likely just be an informative number in the short term. It likely will not be used to decide whether an event needs to be created and logged, as doing those types of checks every time an event is generated could turn out to be a performance bottleneck (this depends on caching policies and bootstrapping of schemas).
 

> In other cases, unique clients need to be sampled
> by setting a session token so that only data for
> clients included in the sample is collected.
I do not think we want EL in any way to keep track of users or sessions to decide whether data needs to be logged. EL is a light system to keep track of events and as such it is agnostic to the events being logged. I do not see us making any modifications in this regard to EL clients in the near future.

Comment 2 nuria 2014-05-21 09:23:21 UTC

Adding comments posted by ori on e-mail thread:

"to do this [sampling] in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient."
