Last modified: 2014-09-28 17:50:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the bug reports and their history, links might be broken. See T70931, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68931 - Cleaning up of some (?) EventLogging schemata for Growth
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: EventLogging
Version: unspecified
Hardware: All All
Importance: Highest normal
Target Milestone: ---
Assigned To: christian
Whiteboard: u=Growth c=EventLogging p=0 s=2014-08-07
Duplicates: 68978
Reported: 2014-07-31 17:04 UTC by christian
Modified: 2014-09-28 17:50 UTC (History)

Description christian 2014-07-31 17:04:43 UTC
Following the thread at

  http://lists.wikimedia.org/pipermail/analytics/2014-July/002351.html

it seems some EventLogging schemas need to be purged.

-----------------------------

The names of the schemas are not yet fully clear, but the OP in one part
said:

  we can probably just wholesale
  remove the associated schemas listed at
  https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_register#Schemas

Removal should happen before 2014-08-04, but (as discussed in private communication)
only after 2014-08-01.

I made it clear in private communication that we probably cannot meet that
deadline.

If I understood OP correctly, Sean will handle the database cleanup.

I pushed back on cleanup of raw logs.
Comment 1 christian 2014-08-01 06:25:01 UTC
*** Bug 68978 has been marked as a duplicate of this bug. ***
Comment 2 christian 2014-08-01 06:52:28 UTC
> I pushed back on cleanup of raw logs.

Steven clarified on-list that they have an agreement with legal to remove
the data. So we should do it.
Comment 3 christian 2014-08-01 07:53:32 UTC
On-list [1] Kevin said
> Christian: before I prioritize it, can you scope out how much work
> would be required?

The items that immediately come to mind are:

* Clarify which schemas are meant to get purged.

* Clarify how to handle future data (We're still seeing those events
  getting logged). We have no machinery in place to guard against data
  entering raw-logs.

* Clarify whether or not purging EventLogging's “raw-logs” is sufficient
  (Since the relevant part of the data flow starts at the caches, it
  goes through both the udp2log and Kafka pipelines)

* Clarify if the event data got sent to universities (through udp2log
  forwards).

* If the event data got sent to universities (see above item), clarify
  how to proceed there.

* Get data removed from database
  (Either we get access, or we need to discuss with Sean or Ops)

* Get data removed from all relevant files in
     vanadium:/var/log/eventlogging/...

* Make sure the cleansed files from vanadium get rsynced over to
  stats1002, and stats1003.

* If necessary (see 3rd item), remove the data from Kafka consumers
  (Might be easier to just nuke current data, as we repaved Hadoop
  some days ago anyway)

* If necessary (see 3rd item), remove the data from udp2log consumers
  (Not sure. Might turn out that effectively no udp2log filter is
  actually selecting this data)

Taking a quick look, it seems data-collection might have started in
April 2014.

The 2nd and 3rd item probably need more discussion with Steven
(probably also legal, as some items are costly).

As our team lacks the required access for most of those parts, we
either need to get access [2], or consume more Ops time (which
requires more preparations on our end).

As the above list has some “Clarify” and “If” items, it's
hard to give an estimate. If those items do not resolve to much extra
work: maybe 1-2 weeks total wall-clock time. But most of this time
will be waiting time, so maybe one or two man-days of actual work.




[1] http://lists.wikimedia.org/pipermail/analytics/2014-August/002367.html

[2] I already applied when receiving Steven's first email, and Toby
approved. But those items require three days of waiting.
Comment 4 christian 2014-08-01 10:26:05 UTC
ahalfak said in private communication that he has finished the things he needed
to do, so we're good to get things moving from their end.
Comment 5 christian 2014-08-08 12:20:06 UTC
As discussed in private emails between Steven, Aaron and me, the request is
only for the following schemas:

  SignupExpAccountCreationComplete
  SignupExpAccountCreationImpression
  SignupExpCTAButtonClick
  SignupExpCTAImpression
  SignupExpPageLinkClick
  TrackedPageContentSaveComplete

Removal of future data is beyond the scope of this request.
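
As a rough illustration of what removing these six schemas from
line-oriented log files could look like, here is a minimal sketch. It
assumes each log line is a JSON object with a top-level "schema" key,
which this report does not confirm. The function names are
hypothetical, and this is not the script that was actually used.

```python
import json

# Schema names copied from the list in this comment.
PURGED_SCHEMAS = frozenset([
    "SignupExpAccountCreationComplete",
    "SignupExpAccountCreationImpression",
    "SignupExpCTAButtonClick",
    "SignupExpCTAImpression",
    "SignupExpPageLinkClick",
    "TrackedPageContentSaveComplete",
])


def keep_line(line):
    """Return True if the log line should survive the purge.

    Assumption: a line is a JSON object whose "schema" key names the
    EventLogging schema. Lines that cannot be parsed are kept, so they
    would need a manual review in a real cleanup.
    """
    try:
        event = json.loads(line)
    except ValueError:
        return True
    if not isinstance(event, dict):
        return True
    return event.get("schema") not in PURGED_SCHEMAS


def purge_lines(lines):
    """Return only the lines that do not belong to a purged schema."""
    return [line for line in lines if keep_line(line)]
```

Writing the filtered output to a new file and comparing line counts
before replacing the original would make such a cleanup easier to
verify.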
Comment 6 christian 2014-08-08 12:20:34 UTC
The tables to be purged from the log database are

  SignupExpAccountCreationComplete_8539421
  SignupExpAccountCreationImpression_8539445
  SignupExpCTAButtonClick_8102619
  SignupExpCTAButtonClick_8965028
  SignupExpCTAImpression_8101716
  SignupExpCTAImpression_8965023
  SignupExpPageLinkClick_8101692
  SignupExpPageLinkClick_8965014
  TrackedPageContentSaveComplete_7872558
  TrackedPageContentSaveComplete_8535426

On-list announcement about the upcoming purge is at

  http://lists.wikimedia.org/pipermail/analytics/2014-August/002382.html
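
The table names above appear to follow the pattern of schema name,
underscore, revision number. The following minimal sketch uses that
pattern to check the table list against the six schemas from Comment 5
and to print one statement per table. Dropping whole tables and the
database name `log` are assumptions made for illustration; this report
does not say how the purge was actually carried out.

```python
# Table names copied from the list in this comment.
TABLES = [
    "SignupExpAccountCreationComplete_8539421",
    "SignupExpAccountCreationImpression_8539445",
    "SignupExpCTAButtonClick_8102619",
    "SignupExpCTAButtonClick_8965028",
    "SignupExpCTAImpression_8101716",
    "SignupExpCTAImpression_8965023",
    "SignupExpPageLinkClick_8101692",
    "SignupExpPageLinkClick_8965014",
    "TrackedPageContentSaveComplete_7872558",
    "TrackedPageContentSaveComplete_8535426",
]


def split_table(name):
    """Split 'Schema_12345' into ('Schema', 12345)."""
    schema, _, revision = name.rpartition("_")
    return schema, int(revision)


def schemas_covered(tables):
    """Return the set of schema names that the tables belong to."""
    return {split_table(name)[0] for name in tables}


def drop_statements(tables, database="log"):
    """Build one DROP TABLE statement per table (hypothetical approach)."""
    return [
        "DROP TABLE IF EXISTS `{}`.`{}`;".format(database, name)
        for name in tables
    ]


if __name__ == "__main__":
    # Print the statements for review instead of executing them.
    for statement in drop_statements(TABLES):
        print(statement)
```

Printing the statements rather than running them keeps a human review
step between generating the list and touching the database.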
