Last modified: 2014-10-31 13:42:34 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T74550, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 72550 - analytics1021 getting kicked out of kafka partition leader role on 2014-10-27 ~07:12
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: Refinery
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on: 72679 72812
Blocks: 69667
Reported: 2014-10-27 08:39 UTC by christian
Modified: 2014-10-31 13:42 UTC (History)

Description christian 2014-10-27 08:39:06 UTC
analytics1021 again got kicked out of its kafka partition leader role
on 2014-10-27 ~07:12.

I am not running leader re-elections for now, as ottomata wanted to
run some further experiments if it happens to analytics1021 again.
Comment 1 christian 2014-10-28 19:56:14 UTC
I ran a leader re-election.
Analytics1021 is leader for a few partitions again.

(Still pending on check whether leader re-election caused loss/duplicates)
Comment 2 christian 2014-10-28 21:31:49 UTC
This bug is still missing the numbers of lost messages when
analytics1021 lost its partition leader role.

For the text cluster, it only affected
  amssq34
  amssq53.esams.wikimedia.org
  amssq56.esams.wikimedia.org
  cp4008.ulsfo.wmnet

The affected period was 2014-10-27T07:12:29/2014-10-27T07:12:32, and
in total 100 messages got lost, which is <<1 second's worth of data for
text.

For the upload cluster, it affected all caches in that cluster except
for cp4015.
The affected period was 2014-10-27T07:12:29/2014-10-27T07:12:46, and
in total ~51K messages got lost, which is <2 seconds' worth of data for
upload.

When analytics1021 lost its partition leader role, bits, mobile, and
text already had the ACK fix. upload hadn't. So seeing the lost
messages on upload is expected.

It is also expected to see no loss on bits and mobile.

However, I had expected to see no loss on text, as it already had the
ACK fix. It's strange to see exactly 100 lost messages on text.
100 is a suspiciously nice number.
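
The affected periods and loss counts above can be cross-checked with a few lines of Python. This is only a sketch: the helper names are made up for illustration, and 51_000 stands in for the approximate "~51K" figure. It computes how long each affected period lasted and the lower bound on the per-cluster message rate that the "worth of data" statements imply.

```python
from datetime import datetime


def interval_seconds(interval: str) -> float:
    """Length in seconds of an ISO 8601 'start/end' interval."""
    start, end = (datetime.fromisoformat(part) for part in interval.split("/"))
    return (end - start).total_seconds()


def implied_min_rate(lost: int, worth_seconds: float) -> float:
    """If `lost` messages are less than `worth_seconds` of traffic,
    the message rate must exceed lost / worth_seconds (messages per second)."""
    return lost / worth_seconds


# Affected periods as given in this comment.
text_window = interval_seconds("2014-10-27T07:12:29/2014-10-27T07:12:32")
upload_window = interval_seconds("2014-10-27T07:12:29/2014-10-27T07:12:46")

# Lower bounds on the message rates implied by "<<1 second" and "<2 second".
text_min_rate = implied_min_rate(100, 1)        # 100 messages lost on text
upload_min_rate = implied_min_rate(51_000, 2)   # ~51K messages lost on upload (approximate)

print(text_window, upload_window)
print(text_min_rate, upload_min_rate)
```

In both cases the lost messages amount to much less traffic than the affected period covers, so only part of the traffic in each period was lost.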
Comment 3 christian 2014-10-29 16:52:44 UTC
(In reply to christian from comment #1)
> (Still pending on check whether leader re-election caused loss/duplicates)

Bug 72679 has details on that.
