Last modified: 2014-04-21 18:47:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T62184, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 60184 - Analytics: Can we start quoting our logging fields?
Status: NEW
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-01-17 23:05 UTC by Oliver Keyes
Modified: 2014-04-21 18:47 UTC (History)

Description Oliver Keyes 2014-01-17 23:05:44 UTC
I'm sat here looking at a 6MB user agent field. It's not /actually/ a 6MB user agent field; it's a user agent field where some browser designer decided "let's put tabs in our UA, that won't cause anyone any problems!" The tab-separated files we store our logs in happily wrote those tabs out unescaped, so when the TSV was read in, the field overflowed.

In the absence of hunting down the people who made that decision at the browser end and forcing them to use the internet through an early and experimental IE version for all of time, could we start quoting the fields in the request logs? I'm not sure how Erik Z reads his files in, but if it's tab-sensitive we're potentially looking at a data loss issue with wikistats. If it's not, we're looking at a data loss issue with my work. Either is to be avoided ;p.

Obviously VK will solve for this once it's dealing with the whole firehose.
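
To make the request concrete, here is a minimal sketch of the failure and of what quoting would buy us. It uses Python's standard csv module purely for illustration; the sample record and its values are made up, and this is not how the request logs are actually written.

```python
import csv
import io

# Hypothetical log record: the values are invented for illustration only.
row = ["2014-01-17T23:05:44", "203.0.113.7", "Mozilla/5.0\t(tabbed\tUA)"]

# Naive TSV: join on tabs with no quoting or escaping.
naive_line = "\t".join(row)
naive_fields = naive_line.split("\t")  # the UA's tabs create extra columns

# Quoted TSV: csv.writer wraps any field containing the delimiter in quotes.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL).writerow(row)
quoted_line = buf.getvalue()

# Reading it back with the same settings recovers the original fields.
quoted_fields = next(csv.reader(io.StringIO(quoted_line), delimiter="\t"))
```

With the naive join the three-field record comes back as five fields; with quoting it comes back as the original three. The catch is that every reader has to parse the quotes too, rather than just splitting on tabs.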
Comment 1 christian 2014-01-19 20:48:35 UTC
(In reply to comment #0)
> I'm sat here looking at a 6MB user agent field.

Interesting.
I was under the impression that requests >8K get truncated.
That's obviously wrong then :-)

Where can I find this user agent field?

> I'm not sure how Erik Z reads his files in, but if it's tab-sensitive we're
> potentially looking at a data loss issue with wikistats.

Although files may come with the wrong number of columns, it's actually
only a minor problem. For example, in December 2013 only about 0.0028%
of the rows in the sampled-1000 stream had a wrong column count. In
January 2014 the figure so far is 0.0029%.
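
A share like this can be measured with something along the following
lines. This is a minimal sketch and not the script used for the figures
above; the function name wrong_column_share, the sample data, and the
expected column count of 3 are assumptions for illustration.

```python
import io

def wrong_column_share(lines, expected_columns):
    """Return (bad_rows, total_rows, percent_bad) for tab-separated lines."""
    total = 0
    bad = 0
    for line in lines:
        total += 1
        # Strip only the line terminator so empty trailing fields still count.
        if len(line.rstrip("\r\n").split("\t")) != expected_columns:
            bad += 1
    percent = 100.0 * bad / total if total else 0.0
    return bad, total, percent

# Tiny made-up sample: the third row has a stray tab inside its last field.
sample = io.StringIO("a\tb\tc\nd\te\tf\ng\th\twith\ttab\n")
bad, total, percent = wrong_column_share(sample, expected_columns=3)
```

The function streams line by line, so it can be pointed at an open log
file instead of the in-memory sample.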

Adding escaping to the files would require changes throughout our
infrastructure (e.g. Wikipedia Zero), which I'd prefer to avoid.

To put that 0.0029% into perspective: Udp2log dropped 0.4% of the
packets in December, over a hundred times that share. And when
comparing with historical values, we see that this is an exceptionally
low packet drop rate:
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm

> Obviously VK will solve for this once it's dealing with the whole firehose.

VK being varnishkafka?
If so ... yes, I'd say waiting for Hadoop with the new JSON data
structures would be a good solution :-)
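
As a rough illustration of why JSON sidesteps the problem: a JSON
encoder escapes control characters inside strings, so a tab in a user
agent cannot be mistaken for a field boundary. The sketch below uses
Python's json module; the field names are hypothetical and not the
actual varnishkafka schema.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {
    "dt": "2014-01-17T23:05:44",
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0\t(tabbed\tUA)",
}

# The literal tab is written as the two characters backslash and "t".
line = json.dumps(record)

# Decoding restores the original value, tabs included.
decoded = json.loads(line)
```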
Comment 2 Bingle 2014-01-28 22:00:35 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1395
