Last modified: 2014-08-01 12:59:47 UTC
The exact limit is something we can bikeshed/adjust as needed, but we need an alert when merges take "too long". We currently rely on the "someone complains on IRC" method, which isn't good customer service. Icinga is probably the right place for the alert.
Created attachment 15993 [details]
Gearman queue metrics

The Zuul/Jenkins interaction is handled over Gearman. The Gearman server is embedded in Zuul and reports metrics to Graphite via statsd (see the sketch after this comment). The attached graph shows, over a month:

- blue: overall size of Gearman functions pending/running
- green: functions running (i.e. a Jenkins job is proceeding)
- red: functions waiting

The graph URL is:
http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1405950653.273&from=-1month&target=zuul.geard.queue.total.value&target=zuul.geard.queue.running.value&target=zuul.geard.queue.waiting.value
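To make the statsd path concrete, here is a minimal sketch of how a gauge like the ones above could reach Graphite. This illustrates the plaintext statsd gauge format ("name:value|g" over UDP), not Zuul's actual reporting code; the statsd host and the gauge name are assumptions, the latter taken from the graph URL above.

    import socket

    STATSD = ("statsd.example.org", 8125)  # hypothetical statsd host, default port

    def send_gauge(name, value):
        """Send one statsd gauge datagram: '<name>:<value>|g' over UDP."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(("%s:%d|g" % (name, value)).encode("ascii"), STATSD)
        finally:
            sock.close()

    # Gauge name assumed from the graph URL above; the path actually stored
    # in Graphite may differ (statsd/Graphite can add prefixes or suffixes
    # such as the trailing ".value").
    send_gauge("zuul.geard.queue.waiting", 3)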
Created attachment 15994 [details]
Gearman queue on July 15th 2014

Attached are the states of the Gearman queues on July 15th:

- blue: overall size of Gearman functions pending/running
- green: functions running (i.e. a Jenkins job is proceeding)
- red: functions waiting

The huge spike on July 15th seems to have been some kind of incident, which is probably what caused this bug to be filed. We could use the Icinga check_graphite plugin or some confidence band to detect when there are too many jobs in the queue and alert whenever that is the case.

I am not sure what went wrong there. Most probably Jenkins jobs were no longer properly registered / runnable in the Gearman server, which caused Gearman to enqueue them and wait for a worker (i.e. Jenkins) to execute the job.

I guess we could alert whenever zuul.geard.queue.waiting.value is above 5 for more than 15 minutes; a sketch of such a check follows.
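A minimal sketch of that check, assuming the threshold and window proposed above. It is written as a standalone Nagios/Icinga-style plugin polling Graphite's standard JSON render API, rather than guessing at the exact flags of the existing check_graphite plugin; the Graphite host and metric name come from the graph URL above.

    #!/usr/bin/env python3
    """Hypothetical check: CRITICAL if zuul.geard.queue.waiting.value has
    stayed above THRESHOLD for the whole WINDOW."""
    import json
    import sys
    import urllib.request

    GRAPHITE = "http://graphite.wikimedia.org/render/"
    METRIC = "zuul.geard.queue.waiting.value"
    THRESHOLD = 5      # waiting jobs (value proposed above)
    WINDOW = "-15min"  # sustained duration before alerting (proposed above)

    url = "%s?target=%s&from=%s&format=json" % (GRAPHITE, METRIC, WINDOW)
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    points = []
    if series:
        # Graphite returns [{"target": ..., "datapoints": [[value, ts], ...]}]
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]

    if not points:
        print("UNKNOWN: no datapoints for %s" % METRIC)
        sys.exit(3)  # standard Nagios/Icinga UNKNOWN exit code
    if min(points) > THRESHOLD:
        # Every sample in the window exceeded the threshold: sustained backlog.
        print("CRITICAL: %s above %d for %s (min=%s)" % (METRIC, THRESHOLD, WINDOW, min(points)))
        sys.exit(2)  # CRITICAL
    print("OK: %s within bounds (max=%s)" % (METRIC, max(points)))
    sys.exit(0)  # OK

Icinga would run this on a schedule and map the standard exit codes (0=OK, 2=CRITICAL, 3=UNKNOWN) to alert states.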
Note: a similar graph is available on our Zuul homepage at https://integration.wikimedia.org/zuul/ , titled "Zuul Geard job queue (8 hours)". It showed exactly the same pattern as the graphs I attached previously: a huge spike of waiting jobs.