Last modified: 2014-09-23 22:56:48 UTC
Automated monitoring + alerts for tool users would be awesome, and would probably increase the reliability of toollabs a fair bit. No need to play the 'is this tool up or not?!' guessing game.
This should be a separate setup from what we have for production, and also from the one for critical infrastructure on toollabs (such as the mysql or apache daemons).
I am willing to be told I'm wrong - but I think this is a pretty important step in improving our own reliability, and in providing high-reassurance support to our users.
(In reply to comment #2)
> I am willing to be told I'm wrong - but I think this is a pretty important step in improving our own reliability, and in providing high-reassurance support to our users.
IIRC the scope of this bug is Icinga for users' tools; for Tools' reliability in general we have icinga.wmflabs.org, which currently has various shortcomings (if it is running at all) that should be addressed in a different bug. For the latter, I remember hashar being interested in using it more for beta as well.
Is this something I can work on?
Not just yet; we're currently at the stage where we are setting equipment aside for the task and doing our first round of specifications. I expect we'll spend some time at the Hackathon in London working on this; if you're around then, you'd be welcome to join us. Otherwise, once we return, we'll probably have something worth hacking on.
Handing off to Yuvi, who is the gatekeeper of labmon1001.
Good luck, Yuvi!