Last modified: 2014-07-29 22:25:30 UTC
On 2014-07-29 ~02:00, there were packet loss alarms for oxygen, analytics1003, and erbium in the #wikimedia-operations IRC channel:

[01:52:47] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 37.5854172414
[01:56:47] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.0539559302326
[01:57:17] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 14.0737649167
[02:01:17] <icinga-wm> RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.17930608333
[02:02:57] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.18785825
[02:06:57] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.15079566667

The packet loss periods were short, and there was a lot of monitoring noise in the IRC channel around that time, so those might have been flukes.
The issue was a flapping esams link [1], which, depending on the stream, dropped between half and all of the esams traffic to the udp2log instances between 2014-07-29T01:35:45 and 2014-07-29T01:42:00 (eqiad and ulsfo were unaffected). This issue affected all of our logging infrastructure, from TSVs to webstatscollector to pagecounts.

[1] See http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140729.txt between [01:36:07] and [02:02:19]
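
For anyone who needs to flag data from the affected window, the check could be sketched roughly as below. This is only an illustration: the function name `in_outage_window` is hypothetical, the timestamps are assumed to be UTC and in the `YYYY-MM-DDTHH:MM:SS` form used above, and a real filter would also have to restrict itself to esams-originated traffic, which this report does not describe how to identify.

```python
from datetime import datetime, timezone

# Affected window as stated in this report (assumed to be UTC).
OUTAGE_START = datetime(2014, 7, 29, 1, 35, 45, tzinfo=timezone.utc)
OUTAGE_END = datetime(2014, 7, 29, 1, 42, 0, tzinfo=timezone.utc)


def in_outage_window(timestamp: str) -> bool:
    """Return True if the timestamp falls inside the affected window.

    The timestamp is expected as 'YYYY-MM-DDTHH:MM:SS' without an offset
    and is interpreted as UTC (an assumption, not stated in the report).
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
    ts = ts.replace(tzinfo=timezone.utc)
    return OUTAGE_START <= ts <= OUTAGE_END


if __name__ == "__main__":
    for sample in ("2014-07-29T01:30:00", "2014-07-29T01:40:00"):
        print(sample, in_outage_window(sample))
```

Both window boundaries are treated as inclusive here, which errs on the side of flagging slightly more data rather than less.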