Last modified: 2014-02-26 12:55:35 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T62555, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 60555 - Echo logs broken since Jan 16, 2014
Echo logs broken since Jan 16, 2014
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
EventLogging (Other open bugs)
unspecified
All All
: High normal
: ---
Assigned To: Ori Livneh
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-01-29 01:22 UTC by Dario Taraborelli
Modified: 2014-02-26 12:55 UTC (History)
15 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Dario Taraborelli 2014-01-29 01:22:18 UTC
It appears that we stopped accurately logging most Echo events on January 16, 2014

See: http://ee-dashboard.wmflabs.org/dashboards/enwiki-features

SELECT MAX(timestamp) FROM `Echo_6081131` WHERE wiki = 'enwiki' AND event_notificationType = 'reverted';

20140116190505

SELECT MAX(timestamp) FROM `Echo_6081131` WHERE wiki = 'enwiki' AND event_notificationType = 'edit-thank';

20140116190354
Comment 1 bsitu 2014-01-29 01:37:06 UTC
There was no PHP code merged since January 1st.  I also did a test against revision 6081131 locally and it seemed to be fine:
{"event":{"version":"1.5","eventId":2418,"notificationType":"edit-thank","notificationGroup":"positive","sender":"1","recipientUserId":2,"recipientEditCount":911,"eventSource":"history","deliveryMethod":"web","revisionId":5348},"schema":"Echo","revision":6081131,"clientValidated":true,"wiki":"git_my_wiki","recvFrom":"ThinkPad-X220","timestamp":1390958511,"webHost":"git-core","userAgent":"Mozilla/5.0\u0020(X11;\u0020Ubuntu;\u0020Linux\u0020x86_64;\u0020rv:16.0)\u0020Gecko/20100101\u0020Firefox/16.0"}

We should look at production raw log to identify the error.
Comment 2 Dario Taraborelli 2014-01-29 01:59:04 UTC
We are definitely still collecting raw events but they are not written into the log DB. 

zgrep 'edit-thank' all-events.log-20140128.gz | wc -l
9812

zgrep 'edit-thank' all-events.log-20140128.gz | grep enwiki | wc -l
3166
Comment 3 Ryan Kaldari 2014-01-29 20:10:04 UTC
Looks like whatever change caused this was part of 1.23wmf10.
Comment 4 Ori Livneh 2014-01-30 08:04:42 UTC
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlogging-0.6_20131120-py2.7.egg/EGG-INFO/scripts/eventlogging-processor", line 61, in <module>
    validate(event)
  File "/usr/local/lib/python2.7/dist-packages/eventlogging-0.6_20131120-py2.7.egg/eventlogging/schema.py", line 76, in validate
    jsonschema.Draft3Validator(schema).validate(capsule)
  File "/usr/lib/python2.7/dist-packages/jsonschema.py", line 227, in validate
    :exc:`ValidationError``\s to signal failed validation.
ValidationError: Additional properties are not allowed (u'userAgent' was unexpected)
Comment 5 Ori Livneh 2014-01-30 08:11:22 UTC
We can either bump the schema module to reflect the added field (in which case UAs will go into the DB) or we can remove the userAgent field from server-side events.  Analytics can choose. I can assist w/recovering events from the raw logs.
Comment 6 nuria 2014-01-31 08:42:21 UTC
A third option is to make validation not fail when there are additional fields I think (which I thought it was the case).

Let's talk about this today, of the two options removing the log of UA server side 
should be the easiest one.

Side question: Where are logs for events?
Comment 7 nuria 2014-01-31 08:53:16 UTC
(In reply to comment #6)
> A third option is to make validation not fail when there are additional
> fields
> I think (which I thought it was the case).
> 
> Let's talk about this today, of the two options removing the log of UA server
> side 
> should be the easiest one.
> 
> Side question: Where are logs for events?

Found logs.
Comment 8 Ori Livneh 2014-01-31 10:56:58 UTC
Copying my reply to Nuria here, so that the notes are public.

> - where are validation traces failures

Validation failures trigger exceptions, which are written to standard error. Stderr is captured by Upstart, so the failures are at /var/log/upstart/eventlogging-processor.log. 

> - why some events are affected but not others (I would expect all server side events to be affected equally)

Me too. But I don't remember if we pushed the change to both production branches at the same time. So it may be that one branch was generating events that were passing validation, and the other, not.

> - whether changing validation not to fail upon adding additional fields seems a good idea.

It beats ALTER TABLE `...` ADD COLUMN `random_field` ...

The events need to adhere to the table definition. But it goes beyond that. This isn't simply a trade-off that comes with using a relational database as (one of several) storage backends; it's one of the core design decisions of the entire platform. The setup EventLogging replaced allowed free-form output and as a result the datasets used every possible ad-hoc format you can think of, with the result that the data was usually logged incorrectly and no one knew how to analyze it a week after it was collected because the "wisdom" of how to interpret it was lost.

If you program something that handles EventLogging data, you will see the simplicity that comes with knowing definitively that your input matches your expectations. You don't need to handle edge case after edge case.

Events that fail validation are still recorded in the raw logs, so they are not lost; they merely don't get first-class treatment.

I can see why bug 60550 might lead you to question this design. But let me tell you: this is probably the worst screw-up in the >1 year that this has been operational; the previous norm was to have the system running for months at a time without any intervention from me or anyone else. The mistake in this case was mine (I hacked the code to pop the 'userAgent' field locally, but then clobbered it by deploying an unpatched version on top of it).

In hindsight, I should have insisted on data sterility for the entirety of the process. When the validation failure has to do with the event object and not the capsule, the developer has the responsibility (and the means) for finding out, but this isn't the case with EventCapsule schema changes. But I really didn't expect this to take so long -- I thought we'd be done with the user-agent change within a day of deploying the patch at the most. If it doesn't happen now, it may be best to revert the change, drop the userAgent field from the capsule schema, and remove the column from the database -- it's not responsible of us to keep the system in this halfway state, and not surprising that it leads to mistakes and confusion.
Comment 9 Steven Walling 2014-02-08 00:50:56 UTC
This also seems to be impacting other schemas. ServerSideAccountCreation_5487345 has no data since 20140204000427 in the log db.
Comment 10 Dario Taraborelli 2014-02-20 19:33:05 UTC
I am closing this bug as the recovery of ServerSide events is underway (per https://bugzilla.wikimedia.org/show_bug.cgi?id=60550) while the separate issue with validating new event types has been moved to https://bugzilla.wikimedia.org/show_bug.cgi?id=61698
Comment 11 Andre Klapper 2014-02-26 12:55:35 UTC
[moving from MediaWiki extensions to Analytics product - see bug 61946]

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links