Last modified: 2014-10-22 12:01:24 UTC
Ops reported [1] a network issue between ulsfo and eqiad (according to IRC logs [2], alerts started around 2014-10-21 ~10:30). We did not see alerts on the udp2log pipeline. However, we saw alerts on the tighter monitoring of the kafka pipeline. Did the issue affect the udp2log pipeline too?

[1] https://lists.wikimedia.org/mailman/private/ops/2014-October/042427.html
[2] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20141021.txt
The udp2log pipeline showed the first sporadic ulsfo drop-outs at 2014-10-21T10:58 and continued to show ulsfo drop-outs until ulsfo got depooled at 2014-10-21T11:43 (Ifc2a1f1abb7d532e01782b05df764bf4cd072014). A per-host packet loss computation for the affected hour does not give a meaningful result, because the ulsfo depooling reduced the message volume from ulsfo too much.
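To illustrate why a low message volume makes a per-host loss figure unreliable, here is a minimal sketch of a sequence-number based estimate. It assumes each sending host stamps its messages with an increasing sequence number; the function name and the sample numbers are hypothetical and this is not the actual udp2log packet loss code.

```python
def packet_loss_percent(seqs):
    """Estimate loss for one host from the sequence numbers that arrived.

    Assumption: sequence numbers increase by one per message sent.
    Returns None when there are too few messages to estimate anything.
    """
    if len(seqs) < 2:
        return None
    # Messages the host must have sent between the first and last one seen.
    expected = max(seqs) - min(seqs) + 1
    received = len(set(seqs))
    return 100.0 * (expected - received) / expected


# Busy host (hypothetical data): every 10th message dropped out of ~1000.
busy = [s for s in range(1000, 2000) if s % 10 != 0]
busy_loss = packet_loss_percent(busy)

# Nearly idle host (hypothetical data): a single lost message
# already moves the estimate by tens of percent.
idle_loss = packet_loss_percent([1, 2, 4])
```

With only a handful of messages per host, as after a depool, a single lost message dominates the percentage, so the estimate says little about the real loss.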
(In reply to christian from comment #0)
> We did not see alerts on the udp2log pipeline.

That's wrong. There have been alerts [1]:

[11:54:29] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 9.11388505882
[12:02:12] <icinga-wm> PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 23.0722363964
[12:06:06] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 0.0
[12:21:25] <icinga-wm> RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.49366398305
[12:27:01] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.85878847458

[1] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20141021.txt
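For anyone who wants to pull such alerts out of the IRC log mechanically, the following is a rough sketch that parses the icinga-wm lines quoted above and pairs each PROBLEM with the next RECOVERY for the same host. The regular expression is inferred only from these five lines, and the function names are made up for illustration; other alert types in the log may not match.

```python
import re
from datetime import datetime

# Pattern inferred from the quoted lines only (an assumption, not a spec).
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] <icinga-wm> "
    r"(?P<state>PROBLEM|RECOVERY) - (?P<check>\S+) on (?P<host>\S+) is "
    r"(?P<level>\w+): \S+ \w+: (?P<value>[\d.]+)"
)

SAMPLE = """\
[11:54:29] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 9.11388505882
[12:02:12] <icinga-wm> PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 23.0722363964
[12:06:06] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 0.0
[12:21:25] <icinga-wm> RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.49366398305
[12:27:01] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.85878847458
"""


def parse_alerts(log_text):
    """Return one dict per alert line found in the log text."""
    return [m.groupdict() for m in ALERT_RE.finditer(log_text)]


def outage_seconds(alerts):
    """Map host -> seconds between a PROBLEM and its next RECOVERY.

    Assumes all timestamps fall on the same day. A RECOVERY without a
    preceding PROBLEM in the excerpt is skipped.
    """
    started, durations = {}, {}
    for alert in alerts:
        t = datetime.strptime(alert["time"], "%H:%M:%S")
        host = alert["host"]
        if alert["state"] == "PROBLEM":
            started[host] = t
        elif host in started:
            durations[host] = (t - started.pop(host)).total_seconds()
    return durations


alerts = parse_alerts(SAMPLE)
durations = outage_seconds(alerts)
```

In this excerpt, oxygen only has a RECOVERY line, so no duration is derived for it; its PROBLEM line would have to come from the full log.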