Last modified: 2014-10-29 14:21:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71667, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69667 - Kafka broker analytics1021 not receiving messages every now and then
Status: PATCH_TO_REVIEW
Product: Analytics
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All   OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on: 72028 69244 69665 69666 71425 72252 72550
Blocks: 70087
Reported: 2014-08-17 15:42 UTC by christian
Modified: 2014-10-29 14:21 UTC
CC: 8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
analytics1021-2014-09-04T15-26-03.tar.gz (1.44 MB, application/x-gzip)
2014-09-04 22:12 UTC, christian

Description christian 2014-08-17 15:42:36 UTC
Every few days, analytics1021 gets kicked out of its partition leader
role.

The last few occurrences are:
2014-08-06: bug 69244
2014-08-13: bug 69665
2014-08-16: bug 69666
Comment 1 Toby Negrin 2014-08-17 15:46:20 UTC
Thanks for bugging this, Christian.

IMHO, this has been a fairly regular problem with this host and attempts to reimage/fix have not been successful. Andrew/Gage -- can you update?
Comment 2 Andrew Otto 2014-08-18 14:00:24 UTC
The core of this issue is a timeout of the Zookeeper connection, which neither Gage nor I have been able to solve.

Quick summary:  Kafka brokers need to maintain a live connection with Zookeeper in order to remain in the ISR.  Brokers set a timeout.  If a broker can't talk to Zookeeper within this timeout, it will close the connection it has and attempt to open a new one, most likely to a different Zookeeper host than it had before.  Zookeeper notices when it closes this connection, and then tells all of the other brokers that this broker has left the ISR.  Its leadership for any partitions is then demoted.  It usually takes this broker less than a second to reconnect to another Zookeeper host and rejoin the cluster.  Thus far, when this happens, Gage or I have logged in and manually started a preferred replica election.  This brings the offending broker's leadership status back to normal.
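
For reference, the manual step is the stock preferred-replica-election tool
that ships with Kafka, roughly the following (ZK_HOST is a placeholder,
/kafka/eqiad is the cluster's ZooKeeper chroot; the wrapper/path on our
brokers may differ):

  # Without --path-to-json-file this triggers a preferred leader election
  # for all partitions.
  kafka-preferred-replica-election.sh --zookeeper ZK_HOST:2181/kafka/eqiad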

This would only be an annoyance if it weren't for the small varnishkafka hiccups it causes.  Ideally, when a broker loses partition leadership, producers would be notified of the metadata change quickly enough that their buffers don't fill up.  We have noticed that for some higher volume partitions (upload and/or bits), some varnishkafkas drop messages during the short time that leadership metadata is being propagated.


Action items:
- Solve zookeeper timeout.
  Not sure how to replicate or do this right now.

- Keep varnishkafka from dropping messages on metadata change.
  There is likely some tuning we can do to make sure we don't
  drop messages when partition leadership changes.

- Investigate auto partition balancing.
  This is a new feature in Kafka 0.8.1.  This would eliminate
  the manual step of starting a preferred replica election.
  This won't solve either of the above problems, but would
  allow the cluster to rebalance itself without manual
  intervention when this happens.
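
  For the third item, the broker-side settings involved (new in Kafka 0.8.1)
  would be roughly these; a sketch using upstream defaults, not something we
  have tested yet:

    # server.properties on the brokers
    auto.leader.rebalance.enable = true
    # how much leadership imbalance is tolerated, and how often the
    # controller checks for it (upstream defaults)
    leader.imbalance.per.broker.percentage = 10
    leader.imbalance.check.interval.seconds = 300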

See also:
https://rt.wikimedia.org/Ticket/Display.html?id=6877
Comment 3 Andrew Otto 2014-08-25 21:49:37 UTC
Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds.  Leader election can take up to 3 seconds.  Hopefully this will solve the second bullet point above.  We'll have to wait until the next time a leader election happens and see if we miss any data from it.
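
For reference, in the varnishkafka config that is a single line (the file
path below is an assumption; kafka.* properties are passed through to
librdkafka):

  # /etc/varnishkafka/varnishkafka.conf (path assumed)
  kafka.queue.buffering.max.ms = 5000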
Comment 4 christian 2014-09-03 08:50:42 UTC
It happened again on:

  2014-08-24 ~16:00 (with recovery for the next reading in
      ganglia. Since ganglia shows a decrease of volume for that time,
      it might align with the kafkatee changes on analytics1003. So it
      might actually be an artifact.)
  2014-08-30 ~03:30 (with recovery ~06:00)
  2014-09-02 ~10:50 (not yet recovered)

(I haven't yet checked if those instances also caused packet loss)
Comment 5 christian 2014-09-03 08:56:18 UTC
(In reply to christian from comment #4)
>   2014-08-24 ~16:00 (with recovery for the next reading in

Wrong day. It should read

   2014-08-25 ~16:00 (with recovery for the next reading in

. Sorry.
Comment 6 christian 2014-09-03 13:45:24 UTC
(In reply to Andrew Otto from comment #3)
> Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds. 
> Leader election can take up to 3 seconds.  Hopefully this will solve the
> second bullet point above.  We'll have to wait until the next time a leader
> election happens and see if we miss any data from it.

The re-elections from 
  2014-08-30 ~03:30
  2014-09-02 ~10:50
(see comment #5) happened after the buffering increase, but both still exhibit
missing sequence numbers. So bumping the buffering time did not do the trick :-(
Comment 7 Andrew Otto 2014-09-04 21:05:08 UTC
I have done some more sleuthing on the zookeeper timeouts today, and I want to record my thoughts.

- The timeout can happen when the broker is connected to any zookeeper.

- Timeouts happen more often than timeout-induced leader changes.  That is, a Zookeeper session timeout does not necessarily mean produce errors.

- When there are timeout-induced leader changes, they seem to be caused by the zookeeper leader expiring a zookeeper session, rather than by a broker reconnecting to zookeeper because of a timeout.  That is, both the zookeeper server and the kafka broker (zookeeper client) set a timeout of 16 seconds.  (I believe they negotiate to the lower timeout if the settings don't match.)  If kafka notices the timeout first, it will just close the connection and reconnect with the same session id.  If the timeout happens to be noticed by zookeeper before kafka, then zookeeper expires the session.  Kafka's own timeout will trigger after the session has been expired by zookeeper, and when it attempts to reconnect, it will be told it has an expired session, which causes it to have to reconnect a second time to ask for a new session.

- This also seems related to timeouts occurring when the broker (usually analytics1021) is also the controller for the kafka cluster.  (The Controller is the one in charge of intra-cluster things like ISRs and leaders... I think.)  As far as I can tell, all of the timeout-induced leader changes that we see are also accompanied by (on the broker)

  kafka.controller.KafkaController$SessionExpirationListener  - [SessionExpirationListener on 21], ZK expired; shut down all controller components and try to re-elect

and (on zookeeper leader)

  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x46fd72a6d4243f type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/kafka/eqiad/controller Error:KeeperErrorCode = NodeExists for /kafka/eqiad/controller

error messages.  That is, when the timed-out broker is finally able to reconnect (with a new session id), it attempts to re-register its previous controller status with zookeeper, only to find that another broker has taken over as controller.
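
A quick way to double-check the two timeouts in question on both sides (the
config paths below are the stock package locations and are an assumption
here):

  # Kafka broker side: the session timeout it requests from ZooKeeper
  grep -E 'zookeeper\.(connection|session)\.timeout\.ms' /etc/kafka/server.properties

  # ZooKeeper server side: the bounds a requested session timeout gets
  # clamped into (defaults: min = 2*tickTime, max = 20*tickTime)
  grep -E 'tickTime|SessionTimeout' /etc/zookeeper/conf/zoo.cfg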


I just went back to my email thread to the Kafka Users group and noticed that someone had replied[1] to my sleuthing about broker GCs back in July!  I never saw this email response!  Gah!  I will look into this suggestion tomorrow.

[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201407.mbox/%3CCAFbh0Q2f71qgs5JDNFxkm7SSdZyYMH=ZpEOxotuEQfKqeXQHfw@mail.gmail.com%3E
Comment 8 christian 2014-09-04 22:00:27 UTC
Adding some more records:

To rule out temporary basic network problems, I started a script to
ping all zookeepers from analytics1021 in an endless loop.
That produced results already:

  There was no output for ~10 seconds between 15:26:04 and 15:26:14

But RTT times were low before and afterwards. Since we hardly have
data to compare against, I collected the analytics1021 ganglia graphs
that showed an exceptional increase or decrease during that period,
and added the relevant parts of the kafka log and kafka gc log.
(See attachment analytics1021-2014-09-04T15-26-03.tar.gz)

It seems most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
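
The ping monitor itself was nothing fancy; roughly the following, with ZK1,
ZK2, ZK3 as placeholders for the actual zookeeper hostnames:

  while true; do
    for zk in ZK1 ZK2 ZK3; do
      # one timestamped ping per host per round; a gap in the output means
      # the loop itself did not get to run
      echo "$(date -u '+%F %T') $zk $(ping -c 1 -w 2 "$zk" | grep 'time=')"
    done
    sleep 1
  done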

--------------------------------------------

Since both the pings and kafka gc stopped during that time, it might
be that it's not network related, but rather the machine itself
preempting processes and blocking them for some time.

So I started another monitor that (in an endless loop) logs the
timestamp, and then waits a second.
That already produced relevant results too:

  There was no output for ~13 seconds between 20:47:18 and 20:47:31

.

Again most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
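
For completeness, the second monitor was just a timestamp heartbeat, roughly:

  # print a timestamp once per second; a gap in the output means the shell
  # was blocked for that long (the log file name is made up here)
  while true; do date -u '+%F %T'; sleep 1; done | tee -a heartbeat.log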

--------------------------------------------

Since I lack access to the system's logs, do they show anything
interesting around
  15:26:04 -- 15:26:14, or
  20:47:18 -- 20:47:31
?
Comment 9 christian 2014-09-04 22:12:37 UTC
Created attachment 16376 [details]
analytics1021-2014-09-04T15-26-03.tar.gz
Comment 10 christian 2014-09-04 22:27:42 UTC
Trying to understand why it only happens on analytics1021 and not the
other brokers, the only difference that I could find is that
analytics1021 is the only broker that does not have $ganglia_aggregator
set to true. I am not sure if that is relevant, but seeing the quoted
email from comment #7 ... could the $ganglia_aggregator setting change
the machine's characteristics enough to warrant analytics1021
exhibiting this strange behaviour?
Comment 11 Andrew Otto 2014-09-05 14:42:27 UTC
Yeah, strange indeed that this only happens on analytics1021.  I *think* we have seen this elsewhere before, but not often.  And, I think not since the cluster reinstall in July.

> root@palladium:~# salt -E 'analytics10(12|18|21|22).eqiad.wmnet' cmd.run 'grep -i "timed out" /var/log/kafka/kafka.log | wc -l'
> analytics1012.eqiad.wmnet:
>     0
> analytics1018.eqiad.wmnet:
>     0
> analytics1021.eqiad.wmnet:
>     80
> analytics1022.eqiad.wmnet:
>     4
Comment 12 Andrew Otto 2014-09-08 17:44:36 UTC
This happened again on 2014-09-06 around 19:43:22.  Here are some relevant sar captures:

https://gist.github.com/ottomata/ae372649afc914a6f606

Something is happening with the paging stats, but it is unclear if it is a symptom or a cause.

Starting around 19:43:11, pages are being freed up really fast.  Then, at 19:43:22, activity drops out, as if it is blocked by something.  At 19:43:33, a ton more pages are freed all at once.  Then it seems to go back to normal.


Seeing as the actual block starts at around :22, and there is weird behavior before that, maybe paging is part of the cause.
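
For anyone digging further: the paging numbers are presumably from sar's
paging report (-B), i.e. something like this, with the sysstat archive path
assumed:

  # -B = paging statistics; saDD is the day's sysstat archive
  sar -B -f /var/log/sysstat/sa06 -s 19:42:00 -e 19:45:00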
Comment 13 Gerrit Notification Bot 2014-09-29 21:53:50 UTC
Change 163744 had a related patch set uploaded by QChris:
Force ACKs from all in-sync kafka replicas

https://gerrit.wikimedia.org/r/163744
Comment 14 christian 2014-10-14 10:18:40 UTC
Happened again on 2014-10-13 around 13:37:31.
Comment 15 Toby Negrin 2014-10-14 17:50:58 UTC
Hi Christian - would you say that the fix did not solve this problem?

thanks,

-Toby
Comment 16 christian 2014-10-14 22:42:11 UTC
(In reply to Toby Negrin from comment #15)
> would you say that the fix did not solve this problem?

If you are referring to the change from comment 13, it totally did the trick.
For the machines it addresses, the issue is gone.
Those machines no longer lose data.

While the fix helps with the issue of losing data, it does not yet prevent
analytics1021 from dropping out.
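
For reference, what the change from comment 13 adjusts is the producer-side
acknowledgement setting in the varnishkafka config; the exact key name below
follows varnishkafka's kafka.* passthrough convention and is an assumption:

  # request.required.acks = -1 makes the producer wait until all in-sync
  # replicas have acknowledged a message, so a leader change no longer loses
  # what only the old leader had received
  kafka.topic.request.required.acks = -1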
Comment 17 Gerrit Notification Bot 2014-10-20 11:44:13 UTC
Change 167550 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for mobile caches

https://gerrit.wikimedia.org/r/167550
Comment 18 Gerrit Notification Bot 2014-10-20 11:44:19 UTC
Change 167551 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for text caches

https://gerrit.wikimedia.org/r/167551
Comment 19 Gerrit Notification Bot 2014-10-20 11:44:24 UTC
Change 167552 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for bits caches

https://gerrit.wikimedia.org/r/167552
Comment 20 Gerrit Notification Bot 2014-10-20 11:44:30 UTC
Change 167553 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers per default

https://gerrit.wikimedia.org/r/167553
Comment 21 Gerrit Notification Bot 2014-10-21 13:40:39 UTC
Change 167550 merged by Ottomata:
Require 2 ACKs from kafka brokers for mobile caches

https://gerrit.wikimedia.org/r/167550
Comment 22 Andrew Otto 2014-10-21 14:28:30 UTC
Set vm.dirty_writeback_centisecs = 200 (was 500)
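
For the record, that amounts to roughly the following (in practice this is
presumably puppet-managed; the sysctl.d file name is made up here):

  # apply immediately
  sysctl -w vm.dirty_writeback_centisecs=200
  # persist across reboots
  echo 'vm.dirty_writeback_centisecs = 200' > /etc/sysctl.d/60-dirty-writeback.conf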
Comment 23 Gerrit Notification Bot 2014-10-22 15:30:54 UTC
Change 167551 merged by Ottomata:
Require 2 ACKs from kafka brokers for text caches

https://gerrit.wikimedia.org/r/167551
Comment 24 Gerrit Notification Bot 2014-10-23 14:23:38 UTC
Change 167552 merged by Ottomata:
Require 2 ACKs from kafka brokers for bits caches

https://gerrit.wikimedia.org/r/167552
Comment 25 Gerrit Notification Bot 2014-10-29 14:21:19 UTC
Change 167553 merged by Ottomata:
Require 2 ACKs from kafka brokers per default

https://gerrit.wikimedia.org/r/167553
