Last modified: 2014-05-21 09:23:21 UTC
Different teams have implemented ad-hoc solutions to introduce sampling in EventLogging in order to perform measurements of the usage of features where a sample provides sufficient data to answer a research question.

In some cases, sampling needs to be applied to all events (so that, for example, only 1 out of 1000 events is logged). In other cases, unique clients need to be sampled by setting a session token so that only data for clients included in the sample is collected.

This pattern is sufficiently common to justify the creation of a general-purpose solution to the problem (the most recent request for sampled data is [1]). The desired sampling method and rate could be specified via a dedicated element of a JSON schema; by default no sampling would be applied.

[1] http://lists.wikimedia.org/pipermail/analytics/2014-May/002053.html
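To make the second case (sampling unique clients via a session token) concrete, here is a minimal sketch of how a team might do it on the client today. It is an illustration only, not an existing EventLogging feature: the function names, the `el-sample-token` key, and the injectable `storage` and `random` arguments are all hypothetical.

```javascript
// Hypothetical in-memory stand-in for window.sessionStorage, so the sketch
// also runs outside a browser.
function createMemoryStorage() {
  var data = {};
  return {
    getItem: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    setItem: function (key, value) {
      data[key] = String(value);
    }
  };
}

// Decide once per client whether it belongs to a 1-in-`populationSize` sample.
// The decision is stored as a token, so every later call for the same client
// returns the same answer for as long as the storage lives.
// `random` defaults to Math.random and is injectable for testing.
function isClientInSample(storage, populationSize, random) {
  random = random || Math.random;
  var key = 'el-sample-token'; // hypothetical key name
  var token = storage.getItem(key);
  if (token === null || token === undefined) {
    // Draw an integer in [0, populationSize) and remember it.
    token = String(Math.floor(random() * populationSize));
    storage.setItem(key, token);
  }
  // Only clients that drew 0 are in the sample.
  return Number(token) === 0;
}
```

In a browser, the caller would pass `window.sessionStorage` as `storage` and log events only when `isClientInSample(...)` returns `true`. Because the token is drawn once and reused, all events from a sampled client are kept, which is what distinguishes this from sampling individual events.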
Just a note about wording, as the terminology in this bug is confusing. We are mixing the sampling ratio (1:100) with, let's say, a 'statistical sample' (a set of users/requests that gets a different treatment from the majority).

> In some cases, sampling needs to be applied to all events (so that, for example,
> only 1 out of 1000 events is logged)

Even if we keep track of the sampling ratio in the schema (up for discussion), that would likely be just an informative number in the short term. It likely will not be used to decide whether an event needs to be created and logged, as doing those types of checks every time an event is generated could turn out to be a performance bottleneck (this depends on caching policies and bootstrapping of schemas).

> In other cases, unique clients need to be sampled
> by setting a session token so that only data for
> clients included in the sample is collected.

I do not think we want EL in any way to keep track of users or sessions to decide whether data needs to be logged. EL is a light system to keep track of events and as such it is agnostic to the events being logged. I do not see us making any modifications in this regard to EL clients in the near future.
Adding comments posted by ori on the e-mail thread: "to do this [sampling] in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient."
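For reference, a JavaScript helper of the kind suggested in that quote could be as small as the sketch below. This is only an assumption about what such a helper might look like, not code from the e-mail thread or from EventLogging: `inSample` and `logSampled` are hypothetical names, and the logging function is passed in by the caller so that nothing here depends on a particular EventLogging API.

```javascript
// Return true for roughly 1 out of every `rate` calls (simple random sampling).
// `random` defaults to Math.random and is injectable for testing.
function inSample(rate, random) {
  random = random || Math.random;
  // Reject rates that are not numbers >= 1 (this also catches NaN/undefined).
  if (!(rate >= 1)) {
    return false;
  }
  return Math.floor(random() * rate) === 0;
}

// Call `logFn(schemaName, event)` only when this event falls in the sample.
// Returns true if the event was passed to `logFn`, false otherwise.
function logSampled(rate, logFn, schemaName, event, random) {
  if (inSample(rate, random)) {
    logFn(schemaName, event);
    return true;
  }
  return false;
}
```

With this shape the sampling rate lives in the calling code rather than in the schema, and each event is sampled independently. A rate of 1 logs every event, and a rate of 1000 logs about 1 event in 1000.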