Last modified: 2014-08-03 20:59:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T60871, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 58871 - Set up Icinga monitoring for mail queue
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Unprioritized enhancement
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2013-12-22 20:15 UTC by Tim Landscheidt
Modified: 2014-08-03 20:59 UTC (History)

Description Tim Landscheidt 2013-12-22 20:15:41 UTC
We should add monitoring for a stuck mail queue.  A web search turned up various existing solutions for Nagios/Icinga.  We should set the thresholds (how many mails of what age may sit in the queue) fairly low.  On Toolserver, I believe ACC sent mail to addresses specified by users and was therefore prone to typos ("hotmaiil.com", etc.), but most mail on Tools should be delivered successfully within minutes.
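
To illustrate the kind of threshold meant here, the following is a minimal sketch of a Nagios/Icinga-style check, not an existing plugin. It assumes the queue listing comes from `exim -bp`, where each message header line starts with an age such as `3m`, `2h` or `4d`. The function names, the 10-minute age limit and the warning/critical counts are all hypothetical and only chosen to show "fairly low" thresholds.

```python
import re

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}

# Assumed shape of a message header line in `exim -bp` output:
#   " 2h  2.9K 1VuAbd-0001aC-Ef <sender@example.org>"
HEADER_RE = re.compile(r"^\s*(\d+)([mhd])\s+\S+\s+\S+\s+<")


def parse_ages(listing):
    """Return the age in seconds of every message in a queue listing."""
    ages = []
    for line in listing.splitlines():
        match = HEADER_RE.match(line)
        if match:
            ages.append(int(match.group(1)) * UNIT_SECONDS[match.group(2)])
    return ages


def check_queue(ages, max_age=600, warn=1, crit=5):
    """Count messages older than max_age seconds and map that to a status.

    max_age, warn and crit are illustrative defaults, not agreed values.
    """
    stale = [age for age in ages if age > max_age]
    if len(stale) >= crit:
        return CRITICAL, "CRITICAL: %d messages older than %ds" % (len(stale), max_age)
    if len(stale) >= warn:
        return WARNING, "WARNING: %d messages older than %ds" % (len(stale), max_age)
    return OK, "OK: %d messages queued, none older than %ds" % (len(ages), max_age)
```

In a real plugin the listing would come from running `exim -bp` on the host, and the script would print the message and exit with the returned code.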
Comment 1 Tim Landscheidt 2014-01-31 04:05:16 UTC
Some lessons learned from cleaning the stuck queues:

- We need to check the queues on *every* host, not just tools-mail.  We could use different thresholds for tools-mail and for the other hosts, as the latter only need to talk to tools-mail, but I don't think that's necessary.
- Sometimes there are leftovers in /var/spool/exim4/{input,msglog} that result from exim hiccups (OOM?) where only the -D or only the -H file exists.  Correlating them with the queue is hard.  An easier approach: check for any files there that are older than the Icinga threshold + x days.  This will not detect hiccups instantly, but it will not detect them too late either.
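
The "any files older than the threshold + x days" idea from the second point could be sketched roughly as below. This is only an illustration under assumptions: the function name is made up, the cutoff is passed in by the caller, and file age is taken from the modification time. It makes no attempt to pair -D and -H files.

```python
import os
import time


def old_spool_files(directories, max_age_seconds, now=None):
    """List files under the given spool directories older than the cutoff.

    directories would typically be /var/spool/exim4/input and
    /var/spool/exim4/msglog; missing directories are silently skipped.
    """
    now = time.time() if now is None else now
    found = []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                try:
                    mtime = os.stat(path).st_mtime
                except OSError:
                    # The message was delivered and removed while we walked.
                    continue
                if now - mtime > max_age_seconds:
                    found.append(path)
    return sorted(found)
```

A check built on this would report CRITICAL (or WARNING) whenever the returned list is non-empty and include the paths in its output so the leftovers can be inspected by hand.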
Comment 2 Tim Landscheidt 2014-02-05 22:16:06 UTC
Also, /var/log/exim4/panic should either be empty or not exist.
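
A check for this condition is small enough to sketch. The function name is hypothetical, and the default path is simply the one quoted in the comment above; the actual file name on a given host may differ, so it is left as a parameter.

```python
import os

OK, CRITICAL = 0, 2


def check_panic_log(path="/var/log/exim4/panic"):
    """Return (status, message): OK if the log is absent or empty."""
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return OK, "OK: %s does not exist" % path
    if size == 0:
        return OK, "OK: %s is empty" % path
    return CRITICAL, "CRITICAL: %s contains %d bytes" % (path, size)
```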
Comment 3 Gerrit Notification Bot 2014-06-30 20:39:59 UTC
Change 143111 had a related patch set uploaded by Yuvipanda:
toollabs: Send exim queue length to graphite

https://gerrit.wikimedia.org/r/143111
Comment 4 Gerrit Notification Bot 2014-06-30 21:26:47 UTC
Change 143111 merged by coren:
toollabs: Send exim queue length to graphite

https://gerrit.wikimedia.org/r/143111
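
For readers unfamiliar with the approach, sending a queue length to Graphite can look roughly like the sketch below. This is not the content of change 143111; it is an independent illustration. It assumes Graphite's plaintext protocol (one `path value timestamp` line per metric, conventionally on TCP port 2003) and uses `exim -bpc` to obtain the message count. The metric name and the Graphite host name are hypothetical.

```python
import socket
import subprocess
import time


def format_metric(path, value, timestamp):
    """Build one line of Graphite's plaintext protocol."""
    return "%s %s %d\n" % (path, value, timestamp)


def queue_length():
    """Return the number of queued messages as reported by `exim -bpc`."""
    output = subprocess.check_output(["exim", "-bpc"])
    return int(output.decode("ascii").strip())


def send_metric(host, port, line, timeout=5):
    """Send a single plaintext metric line to a Graphite/carbon listener."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path and host; adjust to the local setup.
    line = format_metric("tools.mail.exim.queue_length", queue_length(), time.time())
    send_metric("graphite.example.org", 2003, line)
```

Run from cron on each host, this only records the queue length; alerting on the recorded values would still need a separate threshold check.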
