Last modified: 2014-10-31 12:53:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in; except for displaying bug reports and their history, links might be broken. See T73435, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 71435 - Duplicates/missing logs from esams bits for 2014-09-28T{18,19,20}:xx:xx
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Refinery (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 72809
Reported: 2014-09-29 22:10 UTC by christian
Modified: 2014-10-31 12:53 UTC (History)
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png (41.17 KB, image/png)
2014-10-15 16:36 UTC, christian

Description christian 2014-09-29 22:10:15 UTC
Between 2014-09-28T18:31:10 and 2014-09-28T20:06:34, all esams bits
caches saw both duplicate and missing lines.

Looking at the Ganglia graphs, it seems we'll see the same issue for
today (2014-09-29) as well.

While the issue was going on today, there was a discussion
about it in IRC [1].

It is not clear what happened.

The theory up to now is that, due to recent config changes around
varnishkafka, esams bits traffic can no longer be handled with 3
brokers (we're currently using only 3 out of 4 brokers).



[1] Starting at 19:04:03 at
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140929.txt
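The duplicate/missing classification above can be reproduced per host and hour with a small sketch like the following. This is a hypothetical illustration, not the tooling actually used: it assumes the webrequest log lines carry a monotonically increasing per-host sequence number, already extracted into a list.

```python
from collections import Counter

def classify_sequences(seqs):
    """Given the per-host sequence numbers seen in one hour of logs,
    report how many lines are duplicated and how many are missing."""
    counts = Counter(seqs)
    # Every occurrence beyond the first of a sequence number is a duplicate.
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    # Sequence numbers absent from the observed min..max range are missing.
    expected = max(counts) - min(counts) + 1
    missing = expected - len(counts)
    return duplicates, missing

# Example: sequence 3 appears twice (duplicate), 5 is absent (missing).
dups, miss = classify_sequences([1, 2, 3, 3, 4, 6])  # -> (1, 1)
```

A run over each cache host's hourly partition would then flag hours where either count is non-zero, matching the per-hour granularity reported in this bug.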
Comment 1 christian 2014-09-30 11:17:27 UTC
(In reply to christian from comment #0)
> Looking at the Ganglia graphs, it seems we'll see the same issue also
> for today (2014-09-29).

Yes, we did.
The affected period is 2014-09-29T18:41:48--2014-09-29T19:55:21.
Again, all esams bits caches, and only those.
Again, both duplicate and missing lines.

Ottomata restarted varnishkafka on cp3019 at 19:41, and cp3019
immediately recovered: its queues went back to normal and did not turn
critical again. No more losses on cp3019.

This nicely matches yesterday's theory that esams bits traffic spikes
are above what 3 brokers can take.
Comment 2 christian 2014-10-15 11:22:22 UTC
It happened again for the 5 bits partitions from
2014-10-14T16:xx:xx up to and including 2014-10-14T20:xx:xx.
Again only esams bits.

Since I was around when it happened, and historic Ganglia graphs
don't expose this: the kafka drerrs were not constant. They came in
~25-minute-long intervals, within which they grew, died off again,
and stayed off for the rest of the interval.
(See attachment kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png)

All affected caches showed this ~25-minute-long pattern.
But the pattern was not synchronous across machines.

While the drerrs showed this pattern, the outbuf_cnt did not show such
a pattern. It was high the whole time.
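The ~25-minute on/off pattern can be made precise from a per-second drerr series instead of eyeballing the graph. A hypothetical sketch, assuming the series has already been fetched as (timestamp, drerr_per_second) pairs:

```python
def burst_intervals(samples, threshold=0.0):
    """Collapse a per-second (timestamp, drerr) series into the
    (start, end) intervals during which drerrs exceeded threshold."""
    intervals = []
    start = None
    for ts, drerr in samples:
        if drerr > threshold:
            if start is None:
                start = ts  # burst begins
            end = ts        # extend current burst
        elif start is not None:
            intervals.append((start, end))  # burst ended
            start = None
    if start is not None:
        intervals.append((start, end))  # series ended mid-burst
    return intervals

# Toy series: two bursts separated by a quiet stretch.
series = [(0, 0), (1, 5), (2, 7), (3, 0), (4, 0), (5, 2), (6, 0)]
# burst_intervals(series) -> [(1, 2), (5, 5)]
```

Applied per cache host, the interval lengths would confirm the ~25-minute periodicity, and comparing the interval start times across hosts would show the pattern not being synchronous.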
Comment 3 christian 2014-10-15 16:36:18 UTC
Created attachment 16773 [details]
kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png
Comment 4 christian 2014-10-20 09:51:29 UTC
It happened again for 2014-10-16T17:xx:xx up to and including 2014-10-16T19:xx:xx
