Last modified: 2014-10-07 15:45:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T72141, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70141 - Determine first pass list of icinga-alerting data from graphite.wmflabs
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Yuvi Panda
Depends on:
Blocks: 51497
Reported: 2014-08-28 22:05 UTC by Greg Grossmeier
Modified: 2014-10-07 15:45 UTC (History)


Description Greg Grossmeier 2014-08-28 22:05:12 UTC
Let's get some icinga alerts so we know when things are going sideways in Beta Cluster.
Comment 1 Yuvi Panda 2014-08-28 22:06:53 UTC
- No puppet run for more than 1h
- Presence of any puppet failures

What else?
Comment 2 Greg Grossmeier 2014-08-28 22:08:47 UTC
My first pass list (puppet fails on important vms):
* deployment-prep.deployment-bastion.puppetagent.failed_events.value > 0
* deployment-prep.deployment-mediawiki01.puppetagent.failed_events.value > 0
* deployment-prep.deployment-mediawiki02.puppetagent.failed_events.value > 0
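A rough sketch of how the "failed_events > 0" rule above could be evaluated, assuming the datapoints arrive in the shape of Graphite's JSON render output (a list of [value, timestamp] pairs where the value may be null). The helper names and the sample numbers are made up for illustration; this is not the check that was eventually deployed.

```python
from typing import Optional, Sequence, Tuple

# One Graphite datapoint in JSON render output: (value, unix_timestamp).
# The value is None when Graphite has no sample for that interval.
Datapoint = Tuple[Optional[float], int]


def latest_value(datapoints: Sequence[Datapoint]) -> Optional[float]:
    """Return the most recent non-null value, or None if there is none."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def puppet_failed(datapoints: Sequence[Datapoint]) -> bool:
    """Alert when the latest failed_events value is greater than 0."""
    value = latest_value(datapoints)
    return value is not None and value > 0


# Made-up sample data, not real Beta Cluster metrics.
sample = {
    "deployment-prep.deployment-bastion.puppetagent.failed_events.value": [
        (0.0, 1409263200),
        (None, 1409263260),
    ],
    "deployment-prep.deployment-mediawiki01.puppetagent.failed_events.value": [
        (0.0, 1409263200),
        (2.0, 1409263260),
    ],
}

alerting = sorted(name for name, points in sample.items() if puppet_failed(points))
print(alerting)
```

Skipping null datapoints avoids treating a gap in reporting as "no failures"; whether a gap should itself alert is a separate decision.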
Comment 4 Yuvi Panda 2014-08-28 22:29:41 UTC
I just realized that you can't hit Labs URLs from prod, and so we can't actually do this right now because of that :(

Two options:
1. File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
2. Wait for labmon1001 to be set up.

Unsure if Ops would be ok with (1), and (2) is blocked on the network config.
Comment 5 Greg Grossmeier 2014-08-28 22:34:20 UTC
* deployment-prep.deployment-mediawiki01.diskspace.root.byte_free.value < 2 gigs
* deployment-prep.deployment-mediawiki02.diskspace.root.byte_free.value < 2 gigs
* deployment-prep.deployment-mediawiki01.diskspace._var.byte_free.value < 1 gig
* deployment-prep.deployment-mediawiki02.diskspace._var.byte_free.value < 1 gig
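The disk space thresholds above could be written down as a small table, sketched here under two assumptions: "gig" is read as GiB (1024^3 bytes), and the latest value of each metric has already been fetched into a dict. The function name and the sample values are hypothetical.

```python
GIB = 1024 ** 3  # assumption: "gig" read as GiB; use 10**9 for decimal GB

# Thresholds from the list above: metric suffix -> minimum free bytes.
MIN_FREE_BYTES = {
    "diskspace.root.byte_free.value": 2 * GIB,
    "diskspace._var.byte_free.value": 1 * GIB,
}
HOSTS = ["deployment-mediawiki01", "deployment-mediawiki02"]


def low_space_alerts(latest):
    """Return metric names whose latest free-byte value is below its threshold.

    `latest` maps a full metric name to its latest value in bytes.
    Metrics missing from `latest` are skipped rather than alerted on.
    """
    alerts = []
    for host in HOSTS:
        for suffix, minimum in MIN_FREE_BYTES.items():
            metric = f"deployment-prep.{host}.{suffix}"
            value = latest.get(metric)
            if value is not None and value < minimum:
                alerts.append(metric)
    return alerts


# Made-up sample values, not real Beta Cluster metrics.
sample_latest = {
    "deployment-prep.deployment-mediawiki01.diskspace.root.byte_free.value": 1.5 * GIB,
    "deployment-prep.deployment-mediawiki01.diskspace._var.byte_free.value": 5 * GIB,
    "deployment-prep.deployment-mediawiki02.diskspace.root.byte_free.value": 10 * GIB,
    "deployment-prep.deployment-mediawiki02.diskspace._var.byte_free.value": 0.5 * GIB,
}

alerts = low_space_alerts(sample_latest)
print(alerts)
```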
Comment 6 Greg Grossmeier 2014-08-28 22:36:12 UTC
(In reply to Yuvi Panda from comment #4)
> Two options:
> 1. File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
> 2. Wait for labmon1001 to be set up.
> 
> Unsure if Ops would be ok with (1), and (2) is blocked on the network config.

(2) https://rt.wikimedia.org/Ticket/Display.html?id=8163
Comment 7 Greg Grossmeier 2014-08-29 16:35:51 UTC
(In reply to Greg Grossmeier from comment #6)
> (In reply to Yuvi Panda from comment #4)
> > Two options:
> > 1. File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
> > 2. Wait for labmon1001 to be set up.
> > 
> > Unsure if Ops would be ok with (1), and (2) is blocked on the network config.
> 
> (2) https://rt.wikimedia.org/Ticket/Display.html?id=8163

That RT is now done (thanks mark!). So now just waiting on labmon1001 to be set up, I presume.
Comment 8 Greg Grossmeier 2014-08-29 16:40:14 UTC
12:38 < YuviPanda> greg-g: labmon is setup - labmon.wmflabs.org :) Am sending metrics on to it now
12:38 < YuviPanda> I'll rename it to graphite.wmflabs.org soon
Comment 9 Yuvi Panda 2014-09-11 10:07:46 UTC
There now exists monitoring for puppet failures and disk space (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon). Puppet failures need to be tweaked further since they currently do not bail when puppet fails with a syntax error or something like that.
Comment 10 Yuvi Panda 2014-09-11 10:13:34 UTC
Note that the alerts are for all the machines in betalabs, not just for the ones listed. I added more features to our check_graphite script to make this kind of monitoring easy / possible.
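For illustration only: one way to cover every instance instead of a fixed list is a wildcard in the Graphite target, since the render API accepts `*` for a path component. The sketch below just builds such a URL and pulls the instance name out of a returned metric name. The https scheme, the 10 minute window, and the helper names are assumptions, and nothing here shows how check_graphite itself does it.

```python
from urllib.parse import urlencode

# Assumption: scheme and host; the thread only names graphite.wmflabs.org.
GRAPHITE_BASE = "https://graphite.wmflabs.org"


def render_url(target, window="-10min", base=GRAPHITE_BASE):
    """Build a Graphite render API URL that returns JSON for `target`."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base}/render?{query}"


def host_of(metric):
    """Extract the instance name from 'deployment-prep.<host>.<...>'."""
    return metric.split(".")[1]


# The wildcard stands in for every instance in the deployment-prep project.
all_hosts_target = "deployment-prep.*.puppetagent.failed_events.value"
url = render_url(all_hosts_target)
print(url)
```

Each series in the JSON response carries its full metric name in the `target` field, so `host_of` can be used to say which instance an alert belongs to.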
Comment 11 Gerrit Notification Bot 2014-09-11 10:25:12 UTC
Change 159694 had a related patch set uploaded by Yuvipanda:
labmon: Add low space check for / on betalabs

https://gerrit.wikimedia.org/r/159694
Comment 12 Gerrit Notification Bot 2014-09-11 11:09:58 UTC
Change 159701 had a related patch set uploaded by Yuvipanda:
labmon: Add puppet freshness check for betalabs

https://gerrit.wikimedia.org/r/159701
Comment 13 Yuvi Panda 2014-09-11 11:23:33 UTC
Also, who is responsible for fixing the errors that pop up? There are puppet failures on videoscaler-01 now, and I've no idea how to fix those.
Comment 14 Yuvi Panda 2014-09-11 11:31:39 UTC
(On that note, I'd also like to remove myself from the alert groups once the initial setup has stabilized)
Comment 15 Gerrit Notification Bot 2014-09-11 13:40:14 UTC
Change 159694 merged by Andrew Bogott:
labmon: Add low space check for / on betalabs

https://gerrit.wikimedia.org/r/159694
Comment 16 Gerrit Notification Bot 2014-09-11 13:43:44 UTC
Change 159701 merged by Andrew Bogott:
labmon: Add puppet freshness check for betalabs

https://gerrit.wikimedia.org/r/159701
Comment 17 Greg Grossmeier 2014-09-23 20:21:58 UTC
Yuvi: Thanks for the first pass work! Once you remove yourself from the list of people who get the alerts, feel free to close this bug (the "first pass" of this is done).
Comment 18 Greg Grossmeier 2014-10-07 15:45:52 UTC
(In reply to Greg Grossmeier from comment #17)
> Yuvi: Thanks for the first pass work! Once you remove yourself from the list
> of people who get the alerts, feel free to close this bug (the "first pass"
> of this is done).

Done waiting, closing for housekeeping reasons :)
