Last modified: 2014-07-25 13:41:14 UTC
scheduled via Oozie
The duplicate monitoring task covers detecting duplicate events in the Kafka logs.
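The idea behind the sequence_stats Hive query mentioned below is that each cache host stamps its events with a monotonically increasing sequence number, so comparing the actual row count per host with the span of sequence numbers reveals duplicates and holes. A minimal Python sketch of that logic (the function and field names here are illustrative, not the actual HQL column names):

```python
from collections import defaultdict

def sequence_stats(events):
    """Per-host sequence statistics over (host, sequence_number) pairs.

    For each host, the expected event count is max(seq) - min(seq) + 1.
    More rows than distinct sequence numbers means duplicates; fewer
    distinct sequence numbers than expected means holes.
    """
    by_host = defaultdict(list)
    for host, seq in events:
        by_host[host].append(seq)

    stats = {}
    for host, seqs in by_host.items():
        expected = max(seqs) - min(seqs) + 1
        distinct = len(set(seqs))
        stats[host] = {
            "count_actual": len(seqs),
            "count_expected": expected,
            "count_duplicate": len(seqs) - distinct,
            "count_missing": expected - distinct,
        }
    return stats
```

For example, a host that emitted sequences 1, 2, 2, 4 would show one duplicate (the repeated 2) and one hole (the missing 3).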
The Hive query for this is already written. The table will need to be created manually, and then the Hive query needs to be scheduled regularly by Oozie. Hive query: https://github.com/wikimedia/analytics-refinery/blob/master/hive/webrequest/sequence_stats.hql The Oozie layout has been refactored by Christian and me, but it remains in kraken. The directory structure needs to be moved over from there. Go ahead and bring the kraken oozie directory over into analytics/refinery: https://github.com/wikimedia/kraken/tree/master/oozie You should omit the 'archive' directory; don't bring that into refinery. You can then add your Oozie configs in something like oozie/webrequest/sequence_stats (or similar; not sure this is the best layout).
Change 143336 had a related patch set uploaded by Milimetric: Migrate oozie folder from Kraken minus archive https://gerrit.wikimedia.org/r/143336
Change 143336 merged by Ottomata: Migrate oozie folder from Kraken minus archive https://gerrit.wikimedia.org/r/143336
Change 143486 had a related patch set uploaded by Milimetric: [WIP] Oozify sequence_stats hive script https://gerrit.wikimedia.org/r/143486
Moving story to the next sprint since it has not been completed this sprint.
Change 144909 had a related patch set uploaded by QChris: Drop unneeded parts of oozie import https://gerrit.wikimedia.org/r/144909
Change 144909 merged by Ottomata: Drop unneeded partition dropping part of oozie import https://gerrit.wikimedia.org/r/144909
Discussing the solution to this item a bit more with ottomata last week, it turned out that it might be better to incorporate the duplicate checking into partition adding, and to turn the aggregated statistics into a means to set a "done" flag for data sets that do not suffer obvious holes/duplicates. That would help the general pipeline, as it allows triggering further parts of the pipeline based on the done flag instead of encoding the same timing heuristic again and again in each pipeline. However, partition adding is currently not working (it is still centered around the precursor of refinery), so we need to fix partition adding first. That is needed anyway to get webrequest ingestion working in refinery, so it is not wasted effort. The new requirements are:
* Fix the partition adding jobs
* Integrate the duplicate monitoring there
* Tag data sets as done (dependent on the outcome of the statistics computations)
With those changed requirements, this bug has been re-estimated.
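The "done"-flag decision described above could be sketched as follows. This is a hypothetical illustration, not the merged implementation: the function name, input shape, and tolerance threshold are all assumptions. The general mechanism is real, though: Oozie data sets can declare a done-flag file, so downstream coordinators wait for that file rather than re-encoding timing heuristics.

```python
def can_mark_done(per_host_counts, tolerance=0.01):
    """Decide whether a data set partition may be tagged 'done'.

    per_host_counts maps host -> (actual_count, expected_count), where
    expected_count is derived from the sequence numbers (max - min + 1).
    The partition passes only if every host's actual count is within
    `tolerance` (as a fraction of expected) -- i.e. no obvious holes
    or duplicates.
    """
    for host, (actual, expected) in per_host_counts.items():
        if expected == 0:
            continue  # no events expected; nothing to verify for this host
        if abs(actual - expected) / expected > tolerance:
            return False
    return True
```

A partition where one host delivered 95 of 100 expected events would fail a 1% tolerance and would not get the done flag, so downstream jobs would not be triggered on incomplete data.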
Change 148650 had a related patch set uploaded by QChris: Add pipeline for basic verification of webrequest logs https://gerrit.wikimedia.org/r/148650
Change 148650 merged by Ottomata: Add pipeline for basic verification of webrequest logs https://gerrit.wikimedia.org/r/148650
Change 143486 abandoned by QChris: Coordinate computing sequence statistics through Oozie Reason: Different approach was implemented at Ie34f09a671a2ce341daabd8822d27e6b993d2e3e and got merged meanwhile. All comments in this change have been addressed, or been carried over to be tracked in bugzilla. https://gerrit.wikimedia.org/r/143486