Last modified: 2014-09-19 20:09:26 UTC
While the kafkatee zero files and udp2log zero files previously more or less agreed in number of lines, the kafkatee zero files currently have ~10% less lines than udp2log zero files. As the files start to run appart for the 2014-03-27 file [1], it seems to have started around 2014-03-26. Is this difference expected? P.S.: Running zgrep '^cp3014.*http://en\.m\.wikipedia\.org/wiki/Mamunur_Rashid' /a/{squid/archive,log/webrequest}/zero/zero.tsv.log-20140419.gz on stat1002 will give you a harmless looking log line that's only in the udp2log stream, and not in the kafka stream. [1] ___________________________________________________________ qchris@stat1002 // 0 // 11:44:00 cwd: ~ for i in /a/squid/archive/zero/zero.tsv.log-20140[34]*.gz ; do echo "$i: $(zcat $i | wc -l)" ; done [...] /a/squid/archive/zero/zero.tsv.log-20140324.gz: 4092113 /a/squid/archive/zero/zero.tsv.log-20140325.gz: 3953345 /a/squid/archive/zero/zero.tsv.log-20140326.gz: 3944141 /a/squid/archive/zero/zero.tsv.log-20140327.gz: 4145574 /a/squid/archive/zero/zero.tsv.log-20140328.gz: 6191020 /a/squid/archive/zero/zero.tsv.log-20140329.gz: 8338602 /a/squid/archive/zero/zero.tsv.log-20140330.gz: 10344328 /a/squid/archive/zero/zero.tsv.log-20140331.gz: 11867766 /a/squid/archive/zero/zero.tsv.log-20140401.gz: 12191221 /a/squid/archive/zero/zero.tsv.log-20140402.gz: 13204630 /a/squid/archive/zero/zero.tsv.log-20140403.gz: 13793041 [...] ___________________________________________________________ qchris@stat1002 // 0 // 11:44:15 cwd: ~ for i in /a/log/webrequest/zero/zero.tsv.log-20140[34]*.gz ; do echo "$i: $(zcat $i | wc -l)" ; done [...] /a/log/webrequest/zero/zero.tsv.log-20140324.gz: 4087376 /a/log/webrequest/zero/zero.tsv.log-20140325.gz: 3946964 /a/log/webrequest/zero/zero.tsv.log-20140326.gz: 3939808 /a/log/webrequest/zero/zero.tsv.log-20140327.gz: 3918716 /a/log/webrequest/zero/zero.tsv.log-20140328.gz: 5545742 /a/log/webrequest/zero/zero.tsv.log-20140329.gz: 7500078 /a/log/webrequest/zero/zero.tsv.log-20140330.gz: 9303477 /a/log/webrequest/zero/zero.tsv.log-20140331.gz: 10671821 /a/log/webrequest/zero/zero.tsv.log-20140401.gz: 10971750 /a/log/webrequest/zero/zero.tsv.log-20140402.gz: 11885403 /a/log/webrequest/zero/zero.tsv.log-20140403.gz: 12417002 [...]
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1560
I found a line that was in udp2log output, but not in kafkatee output: cp1046.eqiad.wmnet2014-04-19T07:00:350.000164270X.X.X.Xhit/20017375GEThttp://en.m.wikipedia.org/wiki/Mobile_application_software-text/html; charset=UTF-8--XXXXenzero=410-01 (Note that the sequence number has been removed from this line.) I checked the webrequest_mobile Hive table in hdfs for this line, and it does exist. That means that the kafkatee missing logs were in the kafka log buffer, as they were consumed properly by Camus into HDFS. Still sleuthing...
Also of interest, the kafkatee process on analytics1003 has only been running since March 26th.
Let's prioritize this -- the kafka-tee stuff is pretty critical to get right.
Since it was constantly missing 10% and we have 10 partitions, the assumption is that some partition did not get properly fed into varnishkafka. Ottomata restarted varnishkafka, and the problem disappeared: (...-$DATE.gz is filename, 2014-04-20T$HOUR:xx:xx is the extracted hour, and difference is NUMBER_OF_LINES_IN_KAFKATEE_FILE * 100 / NUMBER_OF_LINES_IN_ UDP2LOG_FILE - 100 (we want difference to be 0) ) ...-20140421.gz: 2014-04-20T07:xx:xx -11% difference ...-20140421.gz: 2014-04-20T08:xx:xx -11% difference ...-20140421.gz: 2014-04-20T09:xx:xx -11% difference ...-20140421.gz: 2014-04-20T10:xx:xx -11% difference ...-20140421.gz: 2014-04-20T11:xx:xx -11% difference ...-20140421.gz: 2014-04-20T12:xx:xx -11% difference ...-20140421.gz: 2014-04-20T13:xx:xx -10% difference ...-20140421.gz: 2014-04-20T14:xx:xx -11% difference ...-20140421.gz: 2014-04-20T15:xx:xx -10% difference ...-20140421.gz: 2014-04-20T16:xx:xx -11% difference ...-20140421.gz: 2014-04-20T17:xx:xx -11% difference ...-20140421.gz: 2014-04-20T18:xx:xx -11% difference ...-20140421.gz: 2014-04-20T19:xx:xx -10% difference ...-20140421.gz: 2014-04-20T20:xx:xx -10% difference ...-20140421.gz: 2014-04-20T21:xx:xx -10% difference ...-20140421.gz: 2014-04-20T22:xx:xx -11% difference ...-20140421.gz: 2014-04-20T23:xx:xx -11% difference ...-20140421.gz: 2014-04-21T00:xx:xx -11% difference ...-20140421.gz: 2014-04-21T01:xx:xx -11% difference ...-20140421.gz: 2014-04-21T02:xx:xx -10% difference ...-20140421.gz: 2014-04-21T03:xx:xx -10% difference ...-20140421.gz: 2014-04-21T04:xx:xx -11% difference ...-20140421.gz: 2014-04-21T05.xx:xx -11% difference ...-20140421.gz: 2014-04-21T06:xx:xx -25% difference <-- log rotation ...-20140422.gz: 2014-04-21T06:xx:xx -4% difference <-- log rotation ...-20140422.gz: 2014-04-21T07:xx:xx -11% difference ...-20140422.gz: 2014-04-21T08:xx:xx -10% difference ...-20140422.gz: 2014-04-21T09:xx:xx -11% difference ...-20140422.gz: 2014-04-21T10:xx:xx -11% difference ...-20140422.gz: 2014-04-21T11:xx:xx -6% difference <-- ? ...-20140422.gz: 2014-04-21T12:xx:xx -11% difference ...-20140422.gz: 2014-04-21T13:xx:xx -10% difference ...-20140422.gz: 2014-04-21T14:xx:xx -10% difference ...-20140422.gz: 2014-04-21T15:xx:xx -10% difference ...-20140422.gz: 2014-04-21T16:xx:xx -10% difference ...-20140422.gz: 2014-04-21T17:xx:xx -10% difference ...-20140422.gz: 2014-04-21T18:xx:xx -9% difference <-- ottomata restarting kafkatee ~18:30 ...-20140422.gz: 2014-04-21T19:xx:xx -48% difference ...-20140422.gz: 2014-04-21T20:xx:xx 0% difference <-- back to normal. ...-20140422.gz: 2014-04-21T21:xx:xx 0% difference ...-20140422.gz: 2014-04-21T22:xx:xx -1% difference ...-20140422.gz: 2014-04-21T23:xx:xx -1% difference ...-20140422.gz: 2014-04-22T00:xx:xx -1% difference ...-20140422.gz: 2014-04-22T01:xx:xx -1% difference ...-20140422.gz: 2014-04-22T02:xx:xx -1% difference ...-20140422.gz: 2014-04-22T03:xx:xx -1% difference ...-20140422.gz: 2014-04-22T04:xx:xx -1% difference ...-20140422.gz: 2014-04-22T05:xx:xx -1% difference How can we guard against partitions not being fed into varnishkafka in future?
We have other kafka issues we need to resolve before we can address this problem.