Last modified: 2014-11-04 13:50:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T74651, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 72651 - Spike: Assess feasibility and effort to add fields to webrequest logs
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: Refinery
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-10-28 22:49 UTC by Kevin Leduc
Modified: 2014-11-04 13:50 UTC (History)
Description Kevin Leduc 2014-10-28 22:49:05 UTC
Quote from Trello:
"Aaron, Oliver and [Dario] sat down to think of a number of ways in which we could facilitate parsing of the request logs and joining with data sqooped from the SQL slaves. This is not a final list of asks but we would like to run these questions past Dev/Ops (particularly Christian, Andrew and Jeff) at some point."
https://trello.com/b/k5N0ivoM/research-and-data

The fields are defined here:
http://etherpad.wikimedia.org/p/HadoopEtherpad
Comment 1 christian 2014-10-29 13:56:57 UTC
Not sure where to respond, since this discussion spans Trello,
Etherpad, email, IRC, and now Bugzilla. Responding in Bugzilla, since
this is at least a public medium that cannot be changed.

> New fields / headers
> * page_id

Several people want this. Even I want it :-)
It would be helpful for so many things, including "per page" pageviews.
It seems the to-be-written XAnalytics extension would be the place to
do it [1].

Feasible: Yes.
Effort: Once the XAnalytics extension is there, ~4 man-days.
(Only a few hours of coding)

> * unique id token
> ** Is it possible to move the unique app install id (currently
>    appended by Wikimedia apps to the URI requested) to a dedicated
>    key=> value in x-analytics?

It would be possible to do the rewriting on the varnishes.
We try to do as little processing on the varnishes as possible, so I
would not want to parse things out there.
We could do it in the ETL step, but ETL is not there yet, and we have
some tasks to do before we can start implementing it.

But we should not track people without their consent. So getting their
consent is more important to me.

Feasible: No, as the “user consent issue” is too big right now.
Effort: Once the ETL step is there, ~2 man-days.
(Only a few hours of coding)
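
A minimal sketch of what such an ETL rewrite could look like, for
illustration only. The query parameter name `appInstallID`, the
X-Analytics key `app_install_id`, the `key=value;key=value` layout of
X-Analytics, and the `-` placeholder for an empty field are all
assumptions, not details confirmed in this bug.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names: neither is confirmed by this bug report.
INSTALL_ID_PARAM = "appInstallID"   # query parameter the apps append
X_ANALYTICS_KEY = "app_install_id"  # key to use inside X-Analytics


def move_install_id(uri, x_analytics):
    """Move the install id from the URI into the X-Analytics string.

    Returns (uri_without_id, x_analytics_with_id). Both inputs are
    returned unchanged when the URI carries no install id.
    """
    parts = urlsplit(uri)
    params = parse_qsl(parts.query, keep_blank_values=True)
    install_ids = [v for k, v in params if k == INSTALL_ID_PARAM]
    if not install_ids:
        return uri, x_analytics

    # Rebuild the URI without the install id parameter.
    remaining = [(k, v) for k, v in params if k != INSTALL_ID_PARAM]
    new_uri = urlunsplit(parts._replace(query=urlencode(remaining)))

    # Assumed layout: "key=value;key=value", with "-" meaning empty.
    pairs = [p for p in x_analytics.split(";") if p and p != "-"]
    pairs.append(f"{X_ANALYTICS_KEY}={install_ids[0]}")
    return new_uri, ";".join(pairs)
```

Doing this in ETL rather than on the varnishes keeps the caches free of
extra parsing work, at the cost of the raw logs still carrying the id in
the URI.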

> * logged in flag

Since this information is (currently) sent only as a Cookie (and not
as a plain HTTP header), it would also need the assistance of, for
example, the to-be-written XAnalytics extension. See above.
(We could do the rewriting on varnish, but as we try to do as little
as possible on the varnishes, this does not sound too thrilling.)

(Note that this information is not sent to bits or upload, so it would
not allow tracking media consumption per user.)

Feasible: Yes.
Effort: Once the XAnalytics extension is there, ~3 man-days.
(Only a few hours of coding)

---------------------------

(The etherpad also asks about format changes. But since this bug is
about adding fields, I guess format changes are out of scope for this
bug.)


[1] https://gerrit.wikimedia.org/r/#/c/157841/
Comment 2 christian 2014-10-29 13:59:42 UTC
The estimates from comment #1 assume that having those fields in HDFS
(not udp2log) is sufficient.
Comment 3 ewulczyn 2014-10-30 18:03:52 UTC
Another thing that came up in our research group meeting today is to add the browser session cookie. I added this to the etherpad.
Comment 4 christian 2014-11-04 13:50:38 UTC
(In reply to ewulczyn from comment #3)
> Another thing that came up in our research group meeting today is to add the
> browser session cookie.

“browser session cookie” can mean two things:

* The whole HTTP Cookie header:

Adding the whole HTTP Cookie header would add more bytes to the log
lines than I'd be comfortable with.

(Just doing a back-of-the-envelope computation: we're currently
around 700 bytes per log line that goes through kafka. Adding the HTTP
Cookie header would add somewhere around 200-500 bytes [1] on top of
those 700 bytes for around 1/3 of requests. So that would be quite a
considerable increase.)
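
The estimate above can be spelled out as a short calculation. This is
only a sketch using the rough figures quoted in this comment (700 bytes
per line, 200-500 extra bytes, 1/3 of requests); none of the numbers are
measurements made here.

```python
BASE_BYTES = 700           # approximate current size of one log line
COOKIE_BYTES = (200, 500)  # estimated low and high size of the Cookie header
SHARE = 1 / 3              # approximate share of requests that send cookies


def average_increase(extra_bytes):
    """Return (bytes added per line on average, relative increase)."""
    added = extra_bytes * SHARE
    return added, added / BASE_BYTES


for extra in COOKIE_BYTES:
    added, relative = average_increase(extra)
    print(f"+{extra} B on 1/3 of lines: ~{added:.0f} B per line "
          f"on average ({relative:.1%})")
```

Averaged over all log lines, that works out to roughly 67 to 167 extra
bytes per line, or about 10% to 24% more data.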

* Really only the session identifier:

(So for example on enwiki, only the value of “enwikiSession”. No
centralnotice_* cookie values, no centralauth_* cookie values)

That would be less problematic in terms of data size. But it would need
to be extracted on the varnish machines themselves. So the same
objection as in comment #1 applies.
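
To illustrate what "only the session identifier" would mean, here is a
minimal sketch of the extraction logic. It is written in Python purely
to show the idea; the actual extraction would have to happen on the
varnish machines and would look different. The function name and the
sample cookie names and values are hypothetical; only the
“<wiki>Session” naming follows the enwiki example above.

```python
def session_id(cookie_header, wiki="enwiki"):
    """Return the value of the <wiki>Session cookie, or None if absent.

    All other cookies (centralnotice_*, centralauth_*, ...) are ignored.
    """
    wanted = f"{wiki}Session"
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name == wanted:
            return value
    return None


# Hypothetical header value, for illustration only.
header = "centralnotice_bucket=1; enwikiSession=abc123; centralauth_User=Foo"
print(session_id(header))
```

Only the short session value would end up in the log line, which is why
this variant adds far fewer bytes than logging the whole Cookie header.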




Regardless of which of the above interpretations you meant, the statement

  But we should not track people without their consent. So getting their
  consent is more important to me.

from comment #1 still stands for me.
