Last modified: 2014-07-29 22:25:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T70796, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 68796 - Packetloss was critical on 2014-07-29 ~2:00 for oxygen, analytics1003, erbium
Status: RESOLVED WONTFIX
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: christian
Whiteboard: u=Community c=General/Unknown p=0 s=2...
Reported: 2014-07-29 09:17 UTC by christian
Modified: 2014-07-29 22:25 UTC (History)

Description christian 2014-07-29 09:17:43 UTC
On 2014-07-29 ~02:00, there were packet loss alarms for oxygen, analytics1003, and erbium in the #wikimedia-operations IRC channel:

  [01:52:47] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 37.5854172414
  [01:56:47] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.0539559302326  
  [01:57:17] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 14.0737649167  
  [02:01:17] <icinga-wm> RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.17930608333  
  [02:02:57] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.18785825  
  [02:06:57] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.15079566667
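
To make the "short periods" easier to see, here is a minimal sketch that pairs each PROBLEM line with the following RECOVERY line for the same host and reports the time between them. The regular expression and the `critical_periods` helper are written for this illustration only and are not part of icinga or any Wikimedia tooling. The sketch assumes all timestamps fall on the same day and only matches the Packetloss_Average lines quoted above.

```python
import re
from datetime import datetime

# Sample lines copied from the IRC log quoted in this report.
LOG = """\
[01:52:47] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 37.5854172414
[01:56:47] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.0539559302326
[01:57:17] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 14.0737649167
[02:01:17] <icinga-wm> RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.17930608333
[02:02:57] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.18785825
[02:06:57] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.15079566667
"""

# Hypothetical pattern for this sketch; it targets only the line shape shown above.
LINE_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] <icinga-wm> "
    r"(?P<kind>PROBLEM|RECOVERY) - Packetloss_Average on (?P<host>\S+) is "
)


def critical_periods(log_text):
    """Map each host to the time between its PROBLEM and next RECOVERY line."""
    started = {}    # host -> time of the pending PROBLEM notification
    durations = {}  # host -> timedelta until the matching RECOVERY
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip unrelated channel traffic
        when = datetime.strptime(match["time"], "%H:%M:%S")
        host = match["host"]
        if match["kind"] == "PROBLEM":
            started[host] = when
        elif host in started:
            durations[host] = when - started.pop(host)
    return durations


durations = critical_periods(LOG)
for host, duration in durations.items():
    print(host, duration)
```

Note that this measures the gap between the alarm and recovery notifications in the channel, not the duration of the underlying packet loss.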

The packet loss periods were short, and there was a lot of monitoring noise in
the IRC channel around that time, so the alarms might have been flukes.
Comment 1 christian 2014-07-29 22:25:30 UTC
The issue was a flapping esams link [1], which (depending on the stream)
dropped between half and all of the esams traffic to the udp2log instances
between 2014-07-29T01:35:45 and 2014-07-29T01:42:00. Traffic from eqiad and
ulsfo was unaffected.

This issue affected all of our logging infrastructure, from TSVs to
webstatscollector to pagecounts.


[1] See
  http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140729.txt
  between [01:36:07] and [02:02:19]
