Last modified: 2014-08-20 21:55:43 UTC
We had a packet loss alert on oxygen:

  [23:03:09] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 54.8310445455 (see http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140816.txt )

It seems ottomata's restart of udp2log [1] made the problem go away:

  [04:34:54] <ottomata> !log restarted udp2log on oxygen
  [...]
  [04:51:08] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 0.0 (see http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140817.txt )

What happened? How severely does it affect us? (Was it related to bug 69661?)
The root cause of the traffic drop is that we had to restart udp2log to make our removal of one of the filters effective. Alarms were raised for what seemed like a longer event, but the traffic drop only lasted a few minutes. See the request drop around 22:45 on the 16th for mobile traffic (requests per minute):

  2014-08-16T22:39  4482
  2014-08-16T22:40  4391
  2014-08-16T22:41  4408
  2014-08-16T22:42  4354
  2014-08-16T22:43  1628
  2014-08-16T22:44  4
  2014-08-16T22:46  1
  2014-08-16T22:47  3
  2014-08-16T22:48  631
  2014-08-16T22:49  4419
  2014-08-16T22:50  4460
  2014-08-16T22:51  4312

A graph that illustrates the same drop: http://i.imgur.com/7p96Wmh.png

Resolving the bug and updating the log of events that affect feeds on wikitech.
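For reference, per-minute counts like the ones above can be derived from the timestamped TSV logs with a small script. This is a minimal sketch, not the tooling actually used for this report; it assumes each log line carries an ISO-8601 timestamp somewhere in the line, and the 50% drop threshold is an arbitrary choice for illustration.

```python
import re
from collections import Counter

TS_MINUTE = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}')

def per_minute_counts(lines):
    """Count requests per minute, truncating timestamps like
    2014-08-16T22:39:12 down to the minute."""
    counts = Counter()
    for line in lines:
        m = TS_MINUTE.search(line)
        if m:
            counts[m.group(0)] += 1
    return counts

def drops(counts, threshold=0.5):
    """Return minutes whose count falls below threshold * the
    previous minute's count (a crude drop detector)."""
    minutes = sorted(counts)
    return [cur for prev, cur in zip(minutes, minutes[1:])
            if counts[cur] < threshold * counts[prev]]
```

Running `drops(per_minute_counts(open("mobile-sampled-100.tsv")))` on the TSV for the 16th would flag minutes such as 22:43 and 22:44 above.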
Reopening, as it looks like there were other affected intervals.
We had alarms about traffic drops on Aug 16th 23:03 (packetloss) and on Aug 17th 00:11, 05:51, and 07:51 (oxygen). Recoveries were sent on Aug 17th 04:51 (packetloss) and Aug 17th 13:17 (oxygen).
Looked at the files for the 17th and 18th again, and the only event I can find with significant loss is printed below (requests per minute):

  2014-08-17T06:24  3474
  2014-08-17T06:25  3516
  2014-08-17T06:26  1385 (*)
  2014-08-17T06:29  3059
  2014-08-17T06:30  3605

Assigning to Christian per his request to take all prod issues in the upcoming weeks.
Oxygen's alarms (see comment #3) around packet loss and udp2log from 2014-08-16 23:03:09 until 2014-08-17 07:51:08 were just artifacts of bug 69661.

The loss in the TSVs (from comment #1 and comment #3) is real, though. The loss on 2014-08-16 ~22:46 was due to the root mount effectively getting full, hence services panicking and CPU usage jumping up. The loss starting on 2014-08-17 ~06:25 was due to log rotation kicking in and reshuffling some files on the root mount a bit. Thereby, a bit of disk space was freed up for <20 minutes, and services recovered a bit until the root mount got full again, with CPU usage going up further.

The losses affected all the multicast udp2log filters on oxygen:
  zero tsvs
  edits tsvs
  mobile-sampled-100 tsvs
  5xx tsvs
  webstatscollector
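Since the root cause was the root mount filling up, a simple free-space check could catch this condition before udp2log filters start losing data. A minimal sketch, assuming a plain threshold check is enough; the 10% free-space floor is an arbitrary illustrative value, not an existing monitoring setting.

```python
import shutil

def mount_nearly_full(path="/", min_free_fraction=0.10):
    """Return True when free space on the filesystem holding `path`
    drops below min_free_fraction of total capacity (10% by default,
    an arbitrary threshold for illustration)."""
    usage = shutil.disk_usage(path)
    return usage.free < min_free_fraction * usage.total
```

A cron job or Icinga check wrapping this would have alerted on the root mount well before 2014-08-16 ~22:46, when services started panicking.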