Last modified: 2013-03-12 21:24:38 UTC

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in and, except for displaying bug reports and their history, links might be broken. See T48026, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46026 - icinga: cant monitor some instances
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: Other
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Zahn
Reported: 2013-03-12 13:55 UTC by Antoine "hashar" Musso (WMF)
Modified: 2013-03-12 21:24 UTC (History)

Description Antoine "hashar" Musso (WMF) 2013-03-12 13:55:08 UTC
Some instances of the deployment-prep project are not monitored by Icinga:

http://icinga.wmflabs.org/cgi-bin/icinga/status.cgi?hostgroup=deployment-prep&style=detail

The NRPE daemon is listening on port 5666.

The project has a default security rule to allow 5666 from 10.4.0.0/21.
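
For reference, 10.4.0.0/21 covers 10.4.0.0 through 10.4.7.255, so the rule only helps if the monitoring host's address falls in that range. Below is a minimal POSIX shell sketch for checking an address against the block. The helper names `ip_to_int` and `ip_in_cidr` and the two sample addresses are made up for illustration and are not taken from the Labs setup:

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
# The body runs in a subshell so the IFS change does not leak.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Hypothetical helper: succeed if address $1 lies inside CIDR block $2.
ip_in_cidr() (
    net=${2%/*}     # network part, e.g. 10.4.0.0
    bits=${2#*/}    # prefix length, e.g. 21
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
)

# Sample addresses are assumptions, not real Labs hosts.
for addr in 10.4.0.34 10.4.8.1; do
    if ip_in_cidr "$addr" 10.4.0.0/21; then
        echo "$addr is inside 10.4.0.0/21"
    else
        echo "$addr is outside 10.4.0.0/21"
    fi
done
```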
Comment 1 Damian Z 2013-03-12 19:52:50 UTC
It seems nrpe does not restart as expected: the old process never quits, so the service never really restarts.

Since the IP of the monitoring host changed, the config has been updated, but the running service is still using the old IP.

To resolve run `killall nrpe; /etc/init.d/nagios-nrpe-server start` on the instances; I'm trying to get Ryan to run this labs-wide via salt to clean up the currently alerting ones.
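
A rough sketch of the same workaround as a reusable function, assuming the init script accepts `stop` and that `pgrep` is available on the instance. The `run` wrapper, the `DRY_RUN` variable and the 5-second wait are illustrative choices, not part of the original fix:

```shell
#!/bin/sh
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop NRPE, make sure no old process survives, then start it again.
restart_nrpe() {
    run /etc/init.d/nagios-nrpe-server stop

    # Give the daemon up to 5 seconds (arbitrary) to exit on its own.
    i=0
    while [ "$i" -lt 5 ] && pgrep -x nrpe >/dev/null 2>&1; do
        sleep 1
        i=$((i + 1))
    done

    # Fall back to killall if an old process is still alive.
    if pgrep -x nrpe >/dev/null 2>&1; then
        run killall nrpe
    fi

    run /etc/init.d/nagios-nrpe-server start
}
```

Calling `restart_nrpe` with `DRY_RUN=1` set only prints the commands it would run, which is a cheap way to check it before pushing it out more widely.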
Comment 2 Daniel Zahn 2013-03-12 19:56:17 UTC
Yeah, sounds like a problem we had in production before: nagios-nrpe-server would have issues restarting correctly. It looked to me, though, as if this had been resolved after the switch to Icinga (we cleaned up, incl. getting rid of an old init script for nrpe server). In the past we attempted to fix that by adding a sleep command to the init script.
Comment 3 Daniel Zahn 2013-03-12 20:03:57 UTC
root@virt0:~# salt '*' cmd.run "killall nrpe; /etc/init.d/nagios-nrpe-server start"

killed and restarted on all instances
Comment 4 Antoine "hashar" Musso (WMF) 2013-03-12 21:24:38 UTC
Works for me now :-] Thanks Daniel!
