Last modified: 2014-08-01 12:59:47 UTC
The exact limit is something we can bikeshed/adjust as needed, but we need an alert when merges take "too long". We currently rely on the "someone complains on IRC" method, which isn't good customer service. Icinga is probably the right place for the alert.
Created attachment 15993 [details]
Gearman queue metrics

The Zuul/Jenkins interaction is handled over Gearman. The Gearman server is embedded in Zuul and reports metrics to Graphite via statsd (see the sketch after this comment). The attached graph shows, over a month:

- blue: overall size of Gearman functions pending/running
- green: functions running (i.e. a Jenkins job is proceeding)
- red: functions waiting

The graph URL is:
http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1405950653.273&from=-1month&target=zuul.geard.queue.total.value&target=zuul.geard.queue.running.value&target=zuul.geard.queue.waiting.value
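To make the statsd path concrete, here is a minimal sketch of how a gauge like the ones above could reach Graphite. This illustrates the plaintext statsd gauge format ("name:value|g" over UDP), not Zuul's actual reporting code; the statsd host and the gauge name are assumptions, the latter taken from the graph URL above.

    import socket

    STATSD = ("statsd.example.org", 8125)  # hypothetical statsd host, default port

    def send_gauge(name, value):
        """Send one statsd gauge datagram: '<name>:<value>|g' over UDP."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(("%s:%d|g" % (name, value)).encode("ascii"), STATSD)
        finally:
            sock.close()

    # Gauge name assumed from the graph URL above; the path actually stored
    # in Graphite may differ (statsd/Graphite can add prefixes or suffixes
    # such as the trailing ".value").
    send_gauge("zuul.geard.queue.waiting", 3)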
Created attachment 15994 [details]
Gearman queue on July 15th 2014

Attached are the states of the Gearman queues on July 15th:

- blue: overall size of Gearman functions pending/running
- green: functions running (i.e. a Jenkins job is proceeding)
- red: functions waiting

The huge spike on July 15th seems to have been some kind of incident, which is probably what caused this bug to be filed. We could use the Icinga check_graphite plugin or some confidence band to detect when there are too many jobs in the queue and alert whenever that is the case.

I am not sure what went wrong there. Most probably Jenkins jobs were no longer properly registered / runnable in the Gearman server, which caused Gearman to enqueue them and wait for a worker (i.e. Jenkins) to execute the job.

I guess we could alert whenever zuul.geard.queue.waiting.value is above 5 for more than 15 minutes; a sketch of such a check follows.
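A minimal sketch of that check, assuming the threshold and window proposed above. It is written as a standalone Nagios/Icinga-style plugin polling Graphite's standard JSON render API, rather than guessing at the exact flags of the existing check_graphite plugin; the Graphite host and metric name come from the graph URL above.

    #!/usr/bin/env python3
    """Hypothetical check: CRITICAL if zuul.geard.queue.waiting.value has
    stayed above THRESHOLD for the whole WINDOW."""
    import json
    import sys
    import urllib.request

    GRAPHITE = "http://graphite.wikimedia.org/render/"
    METRIC = "zuul.geard.queue.waiting.value"
    THRESHOLD = 5      # waiting jobs (value proposed above)
    WINDOW = "-15min"  # sustained duration before alerting (proposed above)

    url = "%s?target=%s&from=%s&format=json" % (GRAPHITE, METRIC, WINDOW)
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    points = []
    if series:
        # Graphite returns [{"target": ..., "datapoints": [[value, ts], ...]}]
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]

    if not points:
        print("UNKNOWN: no datapoints for %s" % METRIC)
        sys.exit(3)  # standard Nagios/Icinga UNKNOWN exit code
    if min(points) > THRESHOLD:
        # Every sample in the window exceeded the threshold: sustained backlog.
        print("CRITICAL: %s above %d for %s (min=%s)" % (METRIC, THRESHOLD, WINDOW, min(points)))
        sys.exit(2)  # CRITICAL
    print("OK: %s within bounds (max=%s)" % (METRIC, max(points)))
    sys.exit(0)  # OK

Icinga would run this on a schedule and map the standard exit codes (0=OK, 2=CRITICAL, 3=UNKNOWN) to alert states.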
Note: a similar graph is available on our Zuul homepage at https://integration.wikimedia.org/zuul/ , titled "Zuul Geard job queue (8 hours)". It showed exactly the same pattern as the graphs I attached previously: a huge spike of waiting jobs.