Last modified: 2014-08-20 21:55:43 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in and, except for displaying bug reports and their history, links might be broken. See T71663, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69663 - Packet loss alarm on oxygen on 2014-08-16
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: christian
u=Community c=General/Unknown p=0 s=2...
Reported: 2014-08-17 13:11 UTC by christian
Modified: 2014-08-20 21:55 UTC (History)


Description christian 2014-08-17 13:11:26 UTC
We had a packet loss alert on oxygen:

  [23:03:09] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 54.8310445455

(see http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140816.txt )

It seems ottomata's restart of udp2log made the problem go away:

  [04:34:54] <ottomata> !log restarted udp2log on oxygen
  [...]
  [04:51:08] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 0.0  

(see http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140817.txt )

What happened?
How badly does it affect us?

(Was it related to bug 69661 ?)
Comment 1 nuria 2014-08-18 18:17:22 UTC
The root cause of the traffic drop is that we had to restart udp2log for our removal of one of the filters to take effect. Alarms were raised for what seemed a longer event, but the traffic drop lasted only a few minutes.

See the drop in requests around 22:45 on the 16th for mobile traffic:

2014-08-16T22:39 4482
2014-08-16T22:40 4391
2014-08-16T22:41 4408
2014-08-16T22:42 4354
2014-08-16T22:43 1628
2014-08-16T22:44 4
2014-08-16T22:46 1
2014-08-16T22:47 3
2014-08-16T22:48 631
2014-08-16T22:49 4419
2014-08-16T22:50 4460
2014-08-16T22:51 4312
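
The drop in the per-minute counts above can also be spotted mechanically. The following is only a minimal sketch, assuming input lines of the form `<timestamp> <count>` as pasted above. The names `parse_counts` and `find_drops` and the 50% threshold are illustrative assumptions, not part of the analytics tooling actually used here. Minutes with no line at all are treated as a count of 0.

```python
from datetime import datetime, timedelta
from statistics import median

# Per-minute mobile request counts, copied from the comment above.
SAMPLE = """
2014-08-16T22:39 4482
2014-08-16T22:40 4391
2014-08-16T22:41 4408
2014-08-16T22:42 4354
2014-08-16T22:43 1628
2014-08-16T22:44 4
2014-08-16T22:46 1
2014-08-16T22:47 3
2014-08-16T22:48 631
2014-08-16T22:49 4419
2014-08-16T22:50 4460
2014-08-16T22:51 4312
"""


def parse_counts(text):
    """Parse 'YYYY-MM-DDTHH:MM count' lines into a dict keyed by datetime."""
    counts = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        minute = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M")
        counts[minute] = int(parts[1])
    return counts


def find_drops(counts, threshold=0.5):
    """Return (minute, count) pairs that fall below threshold * median.

    Every minute between the first and last timestamp is checked;
    a minute without any line is reported with a count of 0.
    """
    baseline = median(counts.values())
    minute, end = min(counts), max(counts)
    drops = []
    while minute <= end:
        value = counts.get(minute, 0)
        if value < threshold * baseline:
            drops.append((minute, value))
        minute += timedelta(minutes=1)
    return drops


if __name__ == "__main__":
    for minute, value in find_drops(parse_counts(SAMPLE)):
        print(minute.strftime("%Y-%m-%dT%H:%M"), value)
```

On the sample above, this flags the minutes from 22:43 through 22:48, including 22:45, which has no line at all in the data.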

A graph that illustrates the same drop: http://i.imgur.com/7p96Wmh.png

Resolving bug and updating log of events that affect feeds on wikitech.
Comment 2 nuria 2014-08-19 14:34:46 UTC
Reopening, as it looks like other intervals were affected.
Comment 3 nuria 2014-08-20 06:32:23 UTC
We had alarms about traffic drop on Aug 16th 23:03 (packetloss), Aug 17th 00:11, Aug 17th 05:51, Aug 17th 07:51 (oxygen).

Recovery was sent on Aug 17th 04:51 (packetloss) and Aug 17th 13:17 (oxygen).
Comment 4 nuria 2014-08-20 07:47:14 UTC
Looked at the files for the 17th and 18th again, and the only event I can find with significant loss is printed below.

2014-08-17T06:24 3474
2014-08-17T06:25 3516
2014-08-17T06:26 1385 (*)
2014-08-17T06:29 3059
2014-08-17T06:30 3605

Assigning to Christian per his request to take all prod issues in the upcoming weeks.
Comment 5 christian 2014-08-20 21:28:32 UTC
Oxygen's alarms (see comment #3) around packet loss and udp2log from
2014-08-16 23:03:09 until 2014-08-17 07:51:08 were just artifacts of
bug 69661.

Loss in TSVs (from comment #1 and comment #4) is real though.

The loss on 2014-08-16 ~22:46 was due to the root mount effectively
getting full, hence services panicking and CPU usage jumping up.

The loss starting on 2014-08-17 ~06:25 was due to log rotation kicking
in and reshuffling some files on the root mount a bit. This freed up a
bit of disk space for <20 mins, and services recovered a bit until the
root mount got full again, with CPU usage going up further.
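
A root mount filling up, as described above, is the kind of condition a simple free-space check can flag early. The following is only a minimal sketch using Python's standard `shutil.disk_usage`. The function name `mount_is_nearly_full` and the 5% default threshold are assumptions for illustration, and this is not the monitoring check that was actually deployed on oxygen.

```python
import shutil


def mount_is_nearly_full(path="/", min_free_fraction=0.05):
    """Return True if the filesystem holding `path` has less free space
    than `min_free_fraction` of its total size.

    The 5% default is an arbitrary illustrative threshold.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total < min_free_fraction


if __name__ == "__main__":
    if mount_is_nearly_full("/"):
        print("WARNING: root mount is nearly full")
    else:
        print("OK: root mount has enough free space")
```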

The losses affected all the multicast udp2log filters on oxygen:
  zero tsvs
  edits tsvs
  mobile-sampled-100 tsvs
  5xx tsvs
  webstatscollector
