Last modified: 2014-08-27 12:39:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71244, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69244 - Kafka broker analytics1021 not receiving messages since 2014-08-06 ~1:44
Status: RESOLVED WORKSFORME
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 69667
Reported: 2014-08-07 15:02 UTC by christian
Modified: 2014-08-27 12:39 UTC (History)
CC List: 7 users



Attachments
analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate (17.89 KB, image/png)
2014-08-07 15:02 UTC, christian
Cluster-MessagesInPerSec-OneMinuteRate (94.24 KB, image/png)
2014-08-07 15:03 UTC, christian
Cluster-RequestsPerSec-OneMinuteRate (23.28 KB, image/png)
2014-08-07 15:03 UTC, christian

Description christian 2014-08-07 15:02:31 UTC
Created attachment 16152 [details]
analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate

From http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140807.txt

[13:45:20] <mutante>	 analytics1021:
[13:45:22] <mutante>	 3/3 kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 7.42708492353e-59
[13:54:36] <tnegrin>	 gage?
[14:01:03] <tnegrin>	 mutante: andrew is out today -- is that alert repeating?
[14:01:50] <mutante>	 tnegrin: yes, it started a little over 1 day ago
[14:02:05] <tnegrin>	 hmm -- the graphs I look at all look normal
[14:02:06] <mutante>	 at wikimania but not sure how criticial it is
[14:03:06] <tnegrin>	 SF comes online in a few hours -- can you sleep it for 2 hours?
[14:03:12] <tnegrin>	 I will have gage look at it
[14:03:24] <tnegrin>	 (I don't think it's critical)
[14:04:14] <mutante>	 yes, i can
[14:04:18] <mutante>	 ok, thanks
[14:04:35] <tnegrin>	 thank
[14:04:37] <tnegrin>	 thanks
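
A note on the tiny non-zero value in the alert: the sketch below assumes the FifteenMinuteRate is an exponentially weighted moving average that decays towards zero while no messages arrive (as in the Yammer Metrics library that Kafka 0.8 reports through). That assumption is not stated in the log above, and `decayed_rate` and its parameters are hypothetical names used only for illustration.

```python
import math


def decayed_rate(rate_at_stop: float, idle_seconds: float,
                 window_minutes: float = 15.0) -> float:
    """Estimate an N-minute EWMA rate after a period with no events.

    Assumption: with zero incoming events the average shrinks by a
    factor of exp(-t / window) over t seconds, so it approaches zero
    but never reaches it exactly.
    """
    return rate_at_stop * math.exp(-idle_seconds / (window_minutes * 60.0))


if __name__ == "__main__":
    # Hypothetical starting rate of 1000 msg/s, idle for 36 hours.
    print(decayed_rate(1000.0, 36 * 3600))
```

Under this model a broker that has been idle for more than a day reports a rate many orders of magnitude below one message per second, which would fit a reading such as 7.4e-59.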

Ganglia shows the incoming message rate on analytics1021 dropping, with the
other brokers taking over.

(See attachments
  analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate.png
  Cluster-MessagesInPerSec-OneMinuteRate.png
  Cluster-RequestsPerSec-OneMinuteRate.png
)

It seems to have happened around 2014-08-07 01:44

At that time, according to /var/log/kafka/kafka.log on analytics1021, the
ZooKeeper session expired [1]:

  [...]
  [2014-08-06 01:44:36,974] 101327050 [main-EventThread] INFO  org.I0Itec.zkclient.ZkClient  - zookeeper state changed (Expired)
  [...]

and the broker could not reconnect to ZooKeeper using the old session:

  [...]
  [2014-08-06 01:44:37,061] 101327137 [main-SendThread(analytics1024.eqiad.wmnet:2181)] INFO  org.apache.zookeeper.ClientCnxn  - Unable to reconnect to ZooKeeper service, session 0x146fd72a83d0dbe has expired, closing socket connection
  [...]

Then, after reconnecting, a controller re-election took place:

[2014-08-06 01:44:37,215] 101327291 [ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO  kafka.controller.KafkaController$SessionExpirationListener  - [SessionExpirationListener on 21], ZK expired; shut down all controller components and try to re-elect
[2014-08-06 01:44:37,272] 101327348 [ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO  kafka.utils.ZkUtils$  - conflict in /controller data: {"version":1,"brokerid":21,"timestamp":"1407289477248"} stored data: {"version":1,"brokerid":22,"timestamp":"1407187809296"}
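
The "timestamp" fields in the /controller conflict line look like epoch milliseconds. A minimal sketch for decoding them, assuming that interpretation is correct (`controller_timestamp_to_utc` is a hypothetical helper name):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def controller_timestamp_to_utc(ms: str) -> datetime:
    """Convert an epoch-milliseconds string to an aware UTC datetime.

    Integer milliseconds are added to the epoch directly, which avoids
    floating-point rounding of the sub-second part.
    """
    return EPOCH + timedelta(milliseconds=int(ms))


if __name__ == "__main__":
    # Values copied from the conflict line above.
    for broker, ms in (("21", "1407289477248"), ("22", "1407187809296")):
        print(broker, controller_timestamp_to_utc(ms).isoformat())
```

Decoded this way, broker 21's entry is 2014-08-06 01:44:37 UTC, matching the log line, while the stored entry from broker 22 is from 2014-08-04 21:30:09 UTC. That would suggest broker 22 already held the controller role before the session expiry and that the log timestamps are in UTC.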


[1] Typically the state changes between Disconnected and SyncConnected, with only a few hundred ms spent in the Disconnected state.
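
To check how long the broker usually stays in the Disconnected state, the ZkClient state-change lines can be pulled out of kafka.log. This is a rough sketch, assuming every relevant line has the same layout as the excerpts above; the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2014-08-06 01:44:36,974] ... - zookeeper state changed (Expired)
STATE_RE = re.compile(
    r"^\s*\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\].*"
    r"zookeeper state changed \((?P<state>\w+)\)"
)


def zk_state_changes(lines):
    """Yield (timestamp, state) for each ZkClient state-change line."""
    for line in lines:
        m = STATE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group("state")


def disconnected_spans(changes):
    """Yield (start, seconds, next_state) for each Disconnected period."""
    start = None
    for ts, state in changes:
        if state == "Disconnected":
            start = ts
        elif start is not None:
            yield start, (ts - start).total_seconds(), state
            start = None


if __name__ == "__main__":
    with open("/var/log/kafka/kafka.log") as log:
        for start, seconds, state in disconnected_spans(zk_state_changes(log)):
            print(f"{start} disconnected for {seconds:.3f}s, then {state}")
```

Spans that end in SyncConnected after a few hundred ms would be the normal case described in [1]; a span ending in Expired is the failure reported here.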
Comment 1 christian 2014-08-07 15:03:27 UTC
Created attachment 16153 [details]
Cluster-MessagesInPerSec-OneMinuteRate
Comment 2 christian 2014-08-07 15:03:53 UTC
Created attachment 16154 [details]
Cluster-RequestsPerSec-OneMinuteRate
Comment 3 christian 2014-08-07 15:08:06 UTC
(In reply to christian from comment #0)
> It seems to have happened around 2014-08-07 01:44

Wrong day. That should be

  [...] around 2014-08-06 01:44
Comment 4 Toby Negrin 2014-08-07 16:31:50 UTC
Gage -- can you please take a look at this? It looks like a broker has died. At the very least we should disable the alarms.

thanks,

-Toby
Comment 5 Jeff Gage 2014-08-07 19:49:31 UTC
This is the same broker we've had timeout issues with in the past. We were hopeful that the upgrade to Kafka 0.8.1.1 might resolve them. During the upgrade we found a stale Kafka init script on analytics1021; again we hoped that fix would resolve this issue. Frustrating to see that it's still happening. 

On the one hand we could just reinstall the OS in order to resolve this, but on the other hand we have three other brokers so the service has remained available, and it would be nice to understand the root cause of this problem.

After acknowledging the alerts and confirming what Christian observed in the logs, I upgraded all packages on the host and rebooted into a new kernel (3.2.0-67-generic) by doing (essentially):

apt-get update && apt-get upgrade && apt-get dist-upgrade && reboot

After reboot I observed all partitions fully replicate, triggered a replica election, and confirmed traffic flow in Ganglia. Analytics1021 is now back in service. It remains to be seen whether the package upgrades will finally resolve the timeout problems.
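
For reference, the "fully replicated" and "replica election" checks can be sketched as a small parser. It assumes per-partition lines in the style printed by Kafka 0.8.1's topic describe tool ("Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ..."); that format and the function names are assumptions for illustration, not the commands actually run on the host.

```python
import re

PARTITION_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_partitions(lines):
    """Yield one dict per partition line; other lines are skipped."""
    for line in lines:
        m = PARTITION_RE.search(line)
        if m:
            yield {
                "topic": m.group("topic"),
                "partition": int(m.group("partition")),
                "leader": m.group("leader"),
                "replicas": m.group("replicas").split(","),
                "isr": [b for b in m.group("isr").split(",") if b],
            }


def under_replicated(partitions):
    """Return (topic, partition, missing brokers) where the ISR is incomplete."""
    result = []
    for p in partitions:
        missing = sorted(set(p["replicas"]) - set(p["isr"]))
        if missing:
            result.append((p["topic"], p["partition"], missing))
    return result


def non_preferred_leaders(partitions):
    """Return (topic, partition) where the leader is not the first replica."""
    return [(p["topic"], p["partition"]) for p in partitions
            if p["leader"] != p["replicas"][0]]
```

An empty result from `under_replicated` corresponds to all partitions being fully replicated; an empty result from `non_preferred_leaders` after the election would indicate that leadership has moved back to the preferred replicas.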


This is the list of upgraded packages:
accountsservice apt apt-transport-https apt-utils apt-xapian-index base-files bind9-host bsdutils ca-certificates consolekit curl dbus dbus-x11 dmidecode dmsetup dnsutils dpkg file gnupg gpgv grub-common grub-pc grub-pc-bin grub2-common icedtea-netx icedtea-netx-common ifupdown initramfs-tools initramfs-tools-bin iproute isc-dhcp-client isc-dhcp-common language-pack-en language-pack-en-base language-selector-common libaccountsservice0 libapt-inst1.4 libapt-pkg4.12 libasn1-8-heimdal libavahi-client3 libavahi-common-data libavahi-common3 libavahi-glib1 libbind9-80 libblkid1 libc-bin libc6 libck-connector0 libcups2 libcurl3 libcurl3-gnutls libdbus-1-3 libdevmapper-event1.02.1 libdevmapper1.02.1 libdns81 libdrm-intel1 libdrm-nouveau1a libdrm-radeon1 libdrm2 libgdk-pixbuf2.0-0 libgdk-pixbuf2.0-common libgl1-mesa-dri libgl1-mesa-glx libglapi-mesa libglib2.0-0 libgnutls26 libgssapi3-heimdal libgtk-3-0 libgtk-3-bin libgtk-3-common libgtk2.0-0 libgtk2.0-bin libgtk2.0-common libgudev-1.0-0 libhcrypto4-heimdal libheimbase1-heimdal libheimntlm0-heimdal libhx509-5-heimdal libisc83 libisccc80 libisccfg82 libjpeg-turbo8 libjson0 libkrb5-26-heimdal libldap-2.4-2 liblockfile-bin liblockfile1 liblvm2app2.2 liblwres80 libmagic1 libmount1 libmysqlclient18 libnspr4 libnss3 libnss3-1d libpam-ck-connector libparted0debian1 libperl5.14 libpixman-1-0 libpolkit-agent-1-0 libpolkit-backend-1-0 libpolkit-gobject-1-0 libpq5 libpulse0 libpython2.7 libroken18-heimdal libruby1.8 libservlet2.5-java libsnmp-base libsnmp15 libssl1.0.0 libtasn1-3 libtiff4 libudev0 libuuid1 libwbclient0 libwind0-heimdal libx11-6 libx11-data libx11-dev libx11-doc libx11-xcb1 libxfixes3 libxi6 libxml2 libyaml-0-2 linux-firmware mount multiarch-support mysql-common openjdk-6-jre openjdk-6-jre-headless openjdk-6-jre-lib openjdk-7-jre-lib openssl parted perl perl-base perl-modules policykit-1 procps python-apt python-apt-common python-jinja2 python2.7 python2.7-minimal ruby1.8 samba-common samba-common-bin smbclient sudo 
udev udisks update-manager-core util-linux uuid-runtime wget x11proto-input-dev xkb-data

Dist-upgrade took care of the kernel:
linux-headers-server linux-image-server linux-server
Comment 6 Toby Negrin 2014-08-08 10:30:39 UTC
Thanks Gage -- I'm really wondering whether there's a specific problem with this host, like a hardware issue.

-Toby
