Last modified: 2014-10-17 16:04:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T73056, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 71056 - kafkatee not consuming for some partitions
Status: NEW
Product: Analytics
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 71290

Reported: 2014-09-19 18:18 UTC by Andrew Otto
Modified: 2014-10-17 16:04 UTC
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Andrew Otto 2014-09-19 18:18:22 UTC
Below is a paste of an email I sent to Christian and Magnus in an effort to describe this problem.  I will add a comment with further insights.

---

Christian and I are ramping up to use kafkatee more in production.  We’re generating more output using it, and are comparing the output against udp2log’s.

We recently noticed that for one of our topics (webrequest_upload), there was 8.3% more data in the udp2log output.  We currently have 12 partitions, and 1/12 is 8.3%.  This made us suspect that one of the partitions was not being consumed properly.  We investigated, and now have lots of weirdness to report.  Apologies if this is disorganized; there are a lot of moving pieces here.  I’ll try to report as best I can.
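(To make the comparison concrete, here is a rough sketch of the kind of line-count check involved; the file paths are made up for illustration, not the real ones:)

udp_count=$(zcat /a/log/udp2log/webrequest_upload.tsv.log-20140917.gz | wc -l)
tee_count=$(zcat /a/log/kafkatee/webrequest_upload.tsv.log-20140917.gz | wc -l)
# 12 partitions, so losing one of them should show up as roughly 1/12 = 8.3% missing:
awk -v a="$udp_count" -v b="$tee_count" \
    'BEGIN { printf "udp2log: %d  kafkatee: %d  missing: %.1f%%\n", a, b, 100 * (a - b) / a }'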


= Problem 1. Some offset files are not updating. =

I restarted kafkatee shortly before running the following command:

$ date && ls -ltr /var/cache/kafkatee/offsets/
Wed Sep 17 22:05:31 UTC 2014
total 192
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-9.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-8.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-7.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-6.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-5.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-3.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-4.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-2.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-1.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:32 webrequest_bits-10.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:32 webrequest_bits-9.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:38 webrequest_upload-10.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:47 webrequest_text-10.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Sep 17 21:48 webrequest_mobile-10.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-4.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-2.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-6.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-8.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-4.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-6.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-2.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-7.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-3.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-8.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-4.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Sep 17 22:02 webrequest_mobile-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-3.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-8.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-6.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-2.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-9.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-1.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-1.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-5.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-7.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-5.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-1.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-5.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-9.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-3.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-7.offset

You can see here that some of the offset files have older mtimes.  There seem to be two different cases where the files are not updated.
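(A quick way to spot the stale ones without eyeballing ls output, as a sketch; the 10-minute threshold is arbitrary:)

$ find /var/cache/kafkatee/offsets/ -name '*.offset' -mmin +10 -ls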


== Case 1a:  Very old offsets for mobile files, but still consuming ==

10 out of 12 of the mobile offset files have not been updated since June.  I turned on verbose kafkatee logging and tried to find some relevant output during kafkatee startup:

%7|1410991470.795|OFFSET|kafkatee#consumer-0| webrequest_mobile [2] using offset file /var/cache/kafkatee/offsets/webrequest_mobile-2.offset
%7|1410991470.795|OFFSET|kafkatee#consumer-0| webrequest_mobile [2]: Read offset 7486778007 from offset file (/var/cache/kafkatee/offsets/webrequest_mobile-2.offset)
%7|1410991471.224|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: Fetch topic webrequest_mobile [2] at offset 7486778007
%7|1410991471.429|OFFSET|kafkatee#consumer-0| webrequest_mobile [2]: offset reset (at offset 7486778007) to -1: Fetch response: Broker: Offset out of range
%7|1410991472.537|OFFSET|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: OffsetRequest (-1) for topic webrequest_mobile [2]
%7|1410991472.537|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: Skipping topic webrequest_mobile [2] in state offset-wait
%7|1410991472.544|OFFSET|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: OffsetReply for topic webrequest_mobile [2]: offset 3759198618: activating fetch
%7|1410991472.544|OFFSET|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: OffsetReply for topic webrequest_mobile [2]: Success
%7|1410991473.254|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: Fetch topic webrequest_mobile [2] at offset 3759199122
%7|1410991474.441|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21: Fetch topic webrequest_mobile [2] at offset 3759199454
…

Those are probably not interesting, though, as the old offset is clearly no longer relevant.  Hm…

But, even though these mobile partitions are not updating their offset files, they are consuming and updating next_offset in kafkatee.stats.json just fine.
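(If the stored offset really predates what the brokers still retain, that would explain the "Offset out of range" reset above. A sketch of how to check with Kafka's GetOffsetShell; script names/paths may differ in our packaging:)

# earliest offset still on the broker:
$ kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list analytics1021.eqiad.wmnet:9092 \
    --topic webrequest_mobile --partitions 2 --time -2
# latest offset:
$ kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list analytics1021.eqiad.wmnet:9092 \
    --topic webrequest_mobile --partitions 2 --time -1
# the stale offset kafkatee had stored:
$ cat /var/cache/kafkatee/offsets/webrequest_mobile-2.offset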

== Case 1b:  Not so old offsets, but not consuming(?) ==

During runtime, kafkatee stops updating a few of the offset files.  This might not be bad in itself, except that we see missing data proportional to the number of offset files that are not updating.  E.g. if 1 out of 12 of the offset files is not updating, then we miss 1/12th of the output data for upload.

When this happens, kafkatee.stats.json fails to report stats for the partitions that ARE updating.  As I write this email, I see that the webrequest_upload-11.offset file has not updated in 10 minutes.  If I look at the partition statistics for webrequest_upload, I only see information for a single (real) partition, partition 11:

$ tail -n 1 kafkatee.stats.json | jq '.kafka.topics.webrequest_upload.partitions'
{
"-1": {
  "txbytes": 0,
  "xmit_msgq_bytes": 0,
  "xmit_msgq_cnt": 0,
  "msgq_bytes": 0,
  "msgq_cnt": 0,
  "unknown": false,
  "desired": false,
  "leader": -1,
  "partition": -1,
  "fetchq_cnt": 0,
  "fetch_state": "none",
  "query_offset": 0,
  "next_offset": 0,
  "app_offset": 0,
  "commited_offset": 0,
  "eof_offset": 0,
  "txmsgs": 0
},
"11": {
  "txbytes": 0,
  "xmit_msgq_bytes": 0,
  "xmit_msgq_cnt": 0,
  "msgq_bytes": 0,
  "msgq_cnt": 0,
  "unknown": true,
  "desired": true,
  "leader": -1,
  "partition": 11,
  "fetchq_cnt": 0,
  "fetch_state": "active",
  "query_offset": 0,
  "next_offset": 31744699547,
  "app_offset": 0,
  "commited_offset": 31744699547,
  "eof_offset": 0,
  "txmsgs": 0
}
}

It isn’t consuming anything!  Curious that the leader is -1, eh?  What does that mean?
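(A one-liner to dump leader and fetch state for every partition kafkatee reports, against the same stats file as above — just a jq sketch:)

$ tail -n 1 kafkatee.stats.json | jq '.kafka.topics.webrequest_upload.partitions | to_entries
    | map({partition: .key, leader: .value.leader, fetch_state: .value.fetch_state, next_offset: .value.next_offset})'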



= Problem 2. kafkatee.stats.json not reporting all partition stats =

Welp, I guess I just described this problem above.

$ for topic in mobile bits text upload; do
    echo -n "reported $topic partitions: "
    tail -n 1 kafkatee.stats.json | jq ".kafka.topics.webrequest_$topic.partitions" | grep '"fetch_state": "active"' | wc -l
  done

reported mobile partitions: 1
reported bits partitions: 12
reported text partitions: 12
reported upload partitions: 1

Both mobile and upload have a single offset file that hasn't been updated in many minutes (excluding those mobile files that haven't been updated since June).  For mobile and upload, kafkatee.stats.json only reports partition statistics for these two stuck partitions (one per topic).  It does not report any statistics about the partitions that are actively updating their offset files.
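(The same count can be done entirely in jq, which avoids grep depending on the pretty-printed formatting — a sketch against the same stats file:)

$ for topic in mobile bits text upload; do
    echo -n "reported $topic partitions: "
    tail -n 1 kafkatee.stats.json \
      | jq "[.kafka.topics.webrequest_${topic}.partitions[] | select(.fetch_state == \"active\")] | length"
  done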

= Problem 3: Some partitions are not writing at all =

This has also been described, but is a really big problem!  As far as we can tell, the 2 weirdo partitions that won’t update their offset files do not actually consume (or output?) any data at all.

Which partitions go weirdo after a kafkatee restart is not predictable, but we have not yet seen this happen to any webrequest_text partitions.  Maybe that's just a fluke, though.
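(One way to confirm a weirdo partition is being produced to but never consumed: compare kafkatee's stored offset against the broker's latest offset over time. A sketch; script names/paths may differ in our packaging:)

$ cat /var/cache/kafkatee/offsets/webrequest_upload-11.offset
$ kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list analytics1021.eqiad.wmnet:9092 \
    --topic webrequest_upload --partitions 11 --time -1
# If the broker's latest offset keeps growing while the .offset file above stays put,
# the partition has new data that kafkatee never consumes.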




Hm, after writing all of the above, it seems that there is some bug where some partitions can't consume, which causes kafkatee.stats.json to get all wonky.  Phew, I'm still confused about all of these moving parts.  I'll re-read this tomorrow and see if it makes more sense.  Magnus, if you are around tomorrow, let's chat on IRC about this, eh?

Thanks!
_Ao
Comment 1 Andrew Otto 2014-09-19 18:28:49 UTC
Magnus and I worked to try to figure out what was going on.  We have upgraded librdkafka to 0.8.4 on analytics1003 (and also attempted to use broker offset storage).

By only including the webrequest_upload topic as input to kafkatee, I was able to reproduce this problem in a simplified setting (sketched below).  Evidence points to analytics1021 as being the cause of this problem (again).
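(The simplified setup was just a kafkatee config with a single input line; something like the following, assuming the input-line syntax from the kafkatee README — the exact line used may differ:)

input [encoding=string] kafka topic webrequest_upload partition 0-11 from stored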

Likely related are these bugs:
  https://github.com/edenhill/librdkafka/issues/147
  https://issues.apache.org/jira/browse/KAFKA-1367

I restarted analytics1021 and issued a preferred-replica-election.  This did not solve the broker/zookeeper metadata mismatch (described in those bugs), but it did solve the problem of kafkatee not consuming from all partitions.
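(For reference, the election can be triggered with the stock Kafka tool; the ZooKeeper hostname below is a placeholder, not our real one:)

$ kafka-preferred-replica-election.sh --zookeeper zookeeper1001.eqiad.wmnet:2181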

I'm not entirely sure how to move on from here.  I'm going to re-add kafkatee consuming from all topics, and see how things go over the weekend.  I wonder if this has something to do with a metadata refresh bug in librdkafka/kafka + the weird analytics1021 kafka<->zookeeper timeout bug[1] that we have been struggling with.

If these issues persist, I think we should consider dropping analytics1021 from our Kafka cluster.  It's hard to say whether we have problems because of this machine, because of the rack/network it is in, or because of a fluke.


[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=69667
Comment 2 Andrew Otto 2014-09-19 19:09:15 UTC
Hm, when I restarted kafkatee with all webrequest topics as input, it exhibited the same behavior as described in this bug's description.  About 4 different partitions had leader: -1 and would not consume.

So, I decided to demote analytics1021 to be only a follower.  I issued another controlled-shutdown, restarted kafka, and waited for the ISRs to catch back up.  I then restarted kafkatee.
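(Leader and ISR state during all of this can be watched with kafka-topics.sh — assuming Kafka 0.8.1's tooling; the ZooKeeper hostname is again a placeholder:)

$ watch -n 10 'kafka-topics.sh --describe --zookeeper zookeeper1001.eqiad.wmnet:2181 --topic webrequest_upload'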

The behavior is unaffected.  Various seemingly random partitions do not get assigned leaders when kafkatee starts, which causes them not to consume any messages.

I then shut down analytics1021 completely, to verify 100% whether it is related to this problem.  Even with analytics1021 offline, the same behavior was exhibited.

I believe I can conclude that analytics1021 is not the source of these kafkatee problems.
Comment 3 christian 2014-09-19 20:09:26 UTC
(In reply to Andrew Otto from comment #2)
> I believe I can conclude that analytics1021 is not the source of these
> kafkatee problems.

That's great!

(But analytics1021 would have been such a nice scapegoat :-( )
Comment 4 Andrew Otto 2014-10-17 16:04:44 UTC
Ok!  Magnus says that he found the bug and fixed it.  

https://github.com/edenhill/librdkafka/commit/7c151bcac2230c957b40816f3fb333d729ee3dc7

I've started kafkatee back up again, running with the latest librdkakfa (compiled in my homedir on analytics1003).  I'll let this run over the weekend, and we can vet the files it creates next week.

:D :D
