Last modified: 2014-09-04 20:54:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T67291, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 65291 - monitor dispatch stats
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Wikidata
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Daniel Zahn
Whiteboard: u=dev c=infrastructure p=13 s=2014-06-17
Keywords: ops
Depends on:
Blocks: 66070
Reported: 2014-05-14 14:57 UTC by Lydia Pintscher
Modified: 2014-09-04 20:54 UTC (History)
Description Lydia Pintscher 2014-05-14 14:57:15 UTC
We need a job that monitors dispatch stats on Wikidata and notifies us when the lags are too high.
Comment 1 Sam Reed (reedy) 2014-05-14 15:33:29 UTC
This should be doable with something similar to the job queue monitor in ganglia that reports to IRC
Comment 2 Christopher Johnson 2014-05-25 20:56:00 UTC
This is a Perl script for Nagios that can retrieve dispatch values from the API and output a short message. 

https://github.com/ChristopherHJohnson/check_dispatch

This should be reviewed on Gerrit somewhere and tested with Nagios.  Nagios should be able to report alerts to IRC.  Threshold for critical average lag should be established on production.
Comment 3 Lydia Pintscher 2014-06-03 13:47:58 UTC
We need to add a warning threshold at a median lag of 2 minutes.
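
The threshold logic discussed so far can be sketched as a small Nagios-style plugin function. This is a minimal illustration in Python, not the Perl script linked above. The payload shape (`median_lag_seconds`) and the critical threshold of 600 seconds are assumptions made for the example. Only the 2-minute warning threshold comes from this report. The exit codes 0 to 3 are the standard Nagios plugin return codes.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_dispatch_lag(payload, warn_seconds=120, crit_seconds=600):
    """Map a median dispatch lag to a plugin status code and a short message.

    `payload` is a JSON string with a hypothetical key `median_lag_seconds`.
    `warn_seconds` defaults to the 2-minute warning threshold requested here;
    `crit_seconds` is a placeholder, since no critical value has been agreed.
    """
    try:
        lag = float(json.loads(payload)["median_lag_seconds"])
    except (ValueError, KeyError, TypeError):
        # Unparseable or incomplete data is reported as UNKNOWN, not OK.
        return UNKNOWN, "DISPATCH UNKNOWN - could not read median lag"

    if lag >= crit_seconds:
        return CRITICAL, f"DISPATCH CRITICAL - median lag {lag:.0f}s"
    if lag >= warn_seconds:
        return WARNING, f"DISPATCH WARNING - median lag {lag:.0f}s"
    return OK, f"DISPATCH OK - median lag {lag:.0f}s"
```

A wrapper script would print the message and exit with the returned code, which is how Nagios and Icinga decide whether to notify.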
Comment 4 Christopher Johnson 2014-06-09 09:27:00 UTC
https://gerrit.wikimedia.org/r/#/c/136095/10
Comment 7 Daniel Zahn 2014-09-03 00:53:23 UTC
added an Icinga contact for aude to the private puppet repo.

icinga contact name "aude" can be used in contactgroups now, which are in the public puppet repo
Comment 8 Daniel Zahn 2014-09-03 01:15:06 UTC
i already tried a couple of ways to escape this to get around the error "command .. does not exit", but that didn't work yet.

unfortunately i see stuff like http://support.nagios.com/forum/viewtopic.php?t=10596&p=54166
Comment 9 Jan Zerebecki 2014-09-03 13:16:16 UTC
I don't think it is related to that forum post: the problem is not escaping within the regexp, since the command worked when run on the command line.
Meanwhile this didn't help either: https://gerrit.wikimedia.org/r/158081
Will try to reproduce the problem on labs.
Comment 10 Daniel Zahn 2014-09-03 15:37:12 UTC
it is definitely escaping, i tried manually to change the arguments to something without special characters, that made it work. and that forum post discusses problems with escaping. i wonder how to reproduce in labs without a labs icinga instance or even a class that could be applied to an instance :(
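
To illustrate the kind of escaping under discussion, here is a minimal Python sketch that builds a `check_command` value from arguments containing special characters. It relies on general Nagios conventions rather than on anything confirmed in this thread: `!` separates arguments, a literal `!` is written `\!`, and a literal `$` is written `$$` so that it is not read as a macro. The command name `check_example` is hypothetical, and this is not necessarily what the eventual workaround did.

```python
def escape_check_arg(arg):
    """Escape one argument for a Nagios-style check_command value.

    Assumed conventions: '$' starts a macro, so a literal one is doubled;
    '!' separates arguments, so a literal one is backslash-escaped.
    """
    return arg.replace("$", "$$").replace("!", "\\!")


def build_check_command(command_name, args):
    """Join a command name and its escaped arguments with '!' separators."""
    return "!".join([command_name] + [escape_check_arg(a) for a in args])


# Hypothetical command name and arguments, for illustration only.
example = build_check_command("check_example", ["lag$", "a!b"])
```

Characters that are special to the shell or to the regexp engine would need their own handling on top of this.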
Comment 11 Jan Zerebecki 2014-09-03 16:47:10 UTC
Sorry, you are right: that post is actually about escaping the regexp for the nagios/icinga config file syntax.
I'm trying unsuccessfully to apply icinga::monitor to a puppetmaster-self.
Meanwhile another try to fix the problem: https://gerrit.wikimedia.org/r/#/c/158119/
Comment 12 Jan Zerebecki 2014-09-03 19:49:42 UTC
That try didn't work either. Will try further in labs.
Comment 13 Daniel Zahn 2014-09-04 03:31:09 UTC
20:29 <+icinga-wm> RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1243 bytes in 0.699 second response time


work around in 
https://gerrit.wikimedia.org/r/#/c/158319/
Comment 14 Daniel Zahn 2014-09-04 03:32:14 UTC
yea, uhm, i worked around this annoying issue as shown above.. that fixed it for now. we can turn this into a template and pass parameters if we care...
Comment 15 Jan Zerebecki 2014-09-04 12:15:45 UTC
Thx.
Patch for mail notifications: https://gerrit.wikimedia.org/r/#/c/158362/
Comment 16 Daniel Zahn 2014-09-04 19:13:22 UTC
it was in "draft" status. i created the needed contacts in private puppet repo, then just published the draft and merged it. checked on neon. contacts have been created..

contactgroups.cfg:    contactgroup_name   wikidata
contactgroups.cfg:    members             wikidata-monitoring,aude,jzerebecki
contacts.cfg:        contact_name                    wikidata-monitoring
contacts.cfg:        email                           wikidata-monitoring..


etc..
Comment 17 Gerrit Notification Bot 2014-09-04 20:11:06 UTC
Change 158492 had a related patch set uploaded by Dzahn:
icinga-wm - configure to also serve #wikidata

https://gerrit.wikimedia.org/r/158492
Comment 18 Gerrit Notification Bot 2014-09-04 20:15:58 UTC
Change 158495 had a related patch set uploaded by Dzahn:
add irc-wikidata contact to wikidata services

https://gerrit.wikimedia.org/r/158495
Comment 19 Gerrit Notification Bot 2014-09-04 20:35:46 UTC
Change 158492 merged by Dzahn:
icinga-wm - configure to also serve #wikidata

https://gerrit.wikimedia.org/r/158492
Comment 20 Gerrit Notification Bot 2014-09-04 20:45:00 UTC
Change 158495 merged by Dzahn:
add irc-wikidata contact to wikidata services

https://gerrit.wikimedia.org/r/158495
Comment 21 Daniel Zahn 2014-09-04 20:53:10 UTC
via the last couple changes you now have an IRC bot (icinga-wm) in #wikidata

and it will output only stuff for the services it is a contact for .. :)
Comment 22 Daniel Zahn 2014-09-04 20:54:11 UTC
13:44 -!- icinga-wm [~icinga-wm@neon.wikimedia.org] has joined #wikidata

13:50 < icinga-wm> CUSTOM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1248 bytes in 0.907 second response time
13:50 < mutante> weee, it works

---

root@neon:/var/log/icinga# cat irc-wikidata.log 
CUSTOM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1248 bytes in 0.907 second response time
