Last modified: 2014-10-22 12:03:31 UTC
Ops reported [1] a network issue between ulsfo and eqiad (2014-10-20 ~13:07).

We did not see alerts on the udp2log pipeline. However, we saw alerts on the tighter monitoring of the kafka pipeline.

Did the issue affect the udp2log pipeline too?

[1] https://lists.wikimedia.org/mailman/private/ops/2014-October/042274.html
(In reply to christian from comment #0)
> However, we saw alerts on the tighter monitoring of the kafka pipeline.

For the kafka pipeline, see bug 72296.
The udp2log pipeline seems to have been affected between 2014-10-20T13:06 and 2014-10-20T13:27. For the hour that covers the affected period, per-host packet loss on the ulsfo caches ranges from 6% to 47%.

+--------------------+--------------+
|                    | Per hour     |
|                    | packetloss   |
| Host               | (in percent) |
+--------------------+--------------+
| cp4005.ulsfo.wmnet |           46 |
| cp4006.ulsfo.wmnet |           12 |
| cp4007.ulsfo.wmnet |           47 |
| cp4008.ulsfo.wmnet |           42 |
| cp4009.ulsfo.wmnet |           38 |
| cp4010.ulsfo.wmnet |            8 |
| cp4011.ulsfo.wmnet |           36 |
| cp4012.ulsfo.wmnet |           37 |
| cp4013.ulsfo.wmnet |            6 |
| cp4014.ulsfo.wmnet |           44 |
| cp4015.ulsfo.wmnet |            7 |
| cp4016.ulsfo.wmnet |           22 |
| cp4017.ulsfo.wmnet |           40 |
| cp4018.ulsfo.wmnet |           12 |
| cp4019.ulsfo.wmnet |           45 |
| cp4020.ulsfo.wmnet |            9 |
+--------------------+--------------+

Non-ulsfo hosts don't show a drop/rise.
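As a quick sanity check of the quoted range, the per-host figures from the table above can be summarized with a small script. This is only a sketch: the numbers are copied from the table, while the name `PACKETLOSS_PERCENT`, the helper `summarize`, and the 20% cut-off used to split the hosts are illustrative assumptions and not part of the udp2log monitoring.

```python
# Per-hour packet loss (percent) per ulsfo cache, copied from the table above.
PACKETLOSS_PERCENT = {
    "cp4005.ulsfo.wmnet": 46, "cp4006.ulsfo.wmnet": 12,
    "cp4007.ulsfo.wmnet": 47, "cp4008.ulsfo.wmnet": 42,
    "cp4009.ulsfo.wmnet": 38, "cp4010.ulsfo.wmnet": 8,
    "cp4011.ulsfo.wmnet": 36, "cp4012.ulsfo.wmnet": 37,
    "cp4013.ulsfo.wmnet": 6, "cp4014.ulsfo.wmnet": 44,
    "cp4015.ulsfo.wmnet": 7, "cp4016.ulsfo.wmnet": 22,
    "cp4017.ulsfo.wmnet": 40, "cp4018.ulsfo.wmnet": 12,
    "cp4019.ulsfo.wmnet": 45, "cp4020.ulsfo.wmnet": 9,
}


def summarize(loss, threshold=20):
    """Return (min, max, hosts at or above threshold).

    The threshold is an arbitrary cut-off chosen for illustration only.
    """
    heavy = sorted(host for host, pct in loss.items() if pct >= threshold)
    return min(loss.values()), max(loss.values()), heavy


lowest, highest, heavy_hosts = summarize(PACKETLOSS_PERCENT)
print(f"range: {lowest}-{highest}%")
print(f"{len(heavy_hosts)} of {len(PACKETLOSS_PERCENT)} hosts at or above 20%")
```

The per-host values cannot simply be averaged into an overall loss figure, since the hosts need not carry the same amount of traffic.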
(In reply to christian from comment #0)
> We did not see alerts on the udp2log pipeline.

That's wrong. There were alerts [1]:

[13:19:04] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 13.2572885542
[13:27:37] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 25.0862913793
[13:29:40] <icinga-wm> PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 14.6411538136
[13:32:00] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 2.36820388235
[13:42:20] <icinga-wm> RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.73679050847
[13:46:30] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.89986423729

[1] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20141020.txt
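To see how long each udp2log host stayed in CRITICAL, the PROBLEM and RECOVERY lines quoted above can be paired up per host. The following is a minimal sketch that works only on the six lines from this comment; the regular expression, the function name `alert_durations`, and the assumption that all timestamps fall on the same day are mine and are not part of icinga or wm-bot.

```python
import re
from datetime import datetime

# The icinga-wm lines quoted in the comment above.
LOG = """\
[13:19:04] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 13.2572885542
[13:27:37] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 25.0862913793
[13:29:40] <icinga-wm> PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 14.6411538136
[13:32:00] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 2.36820388235
[13:42:20] <icinga-wm> RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.73679050847
[13:46:30] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.89986423729
"""

# Hypothetical pattern, written to match the line shape seen in this log only.
ALERT = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] <icinga-wm> (PROBLEM|RECOVERY) - "
    r"Packetloss_Average on (\S+) is"
)


def alert_durations(log):
    """Map host -> seconds between its PROBLEM and the next RECOVERY.

    Assumes all timestamps belong to the same day.
    """
    started = {}
    durations = {}
    for line in log.splitlines():
        match = ALERT.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        kind, host = match.group(2), match.group(3)
        if kind == "PROBLEM":
            started[host] = stamp
        elif host in started:
            durations[host] = (stamp - started.pop(host)).total_seconds()
    return durations


durations = alert_durations(LOG)
for host, seconds in sorted(durations.items()):
    print(f"{host}: {seconds / 60:.1f} min in CRITICAL")
```

These are alert durations as reported in IRC, so they lag the underlying packet loss by however long the Packetloss_Average check takes to cross its threshold in either direction.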