Last modified: 2014-08-03 20:59:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T60871, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 58871 - Set up Icinga monitoring for mail queue


Summary:	Set up Icinga monitoring for mail queue

Status:	NEW

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	tools (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized enhancement
Target Milestone:	---
Assigned To:	Marc A. Pelletier

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-22 20:15 UTC by Tim Landscheidt
Modified:	2014-08-03 20:59 UTC
CC List:	2 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Tim Landscheidt 2013-12-22 20:15:41 UTC

We should add some monitoring for a stuck mail queue.  Googling revealed various solutions for Nagios/Icinga.  We should set the thresholds (how many mails of what age may be in the queue) fairly low.  I believe that on Toolserver, ACC sent mail to addresses specified by users and was thus prone to typos ("hotmaiil.com", etc.), but most mail on Tools should probably be delivered successfully within minutes.
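To make the threshold idea concrete, here is a minimal sketch of the decision logic such a check could use. It is not one of the existing Nagios/Icinga plugins mentioned above. The function names, the default thresholds and the age-token format ("25m", "4h", "3d", assumed to match the first column of `exim -bp` output) are all assumptions for illustration; only the exit codes follow the usual Nagios/Icinga plugin convention.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumed age suffixes; adjust if the queue listing uses other units.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_age(token):
    """Convert an age token such as '25m', '4h' or '3d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def check_queue(ages, max_age=900, warn_count=1, crit_count=5):
    """Evaluate queue health from a list of message ages in seconds.

    max_age, warn_count and crit_count are hypothetical placeholders:
    a message counts as stale once it has waited longer than max_age.
    """
    stale = [age for age in ages if age > max_age]
    if len(stale) >= crit_count:
        return CRITICAL, "CRITICAL: %d messages older than %ds" % (len(stale), max_age)
    if len(stale) >= warn_count:
        return WARNING, "WARNING: %d messages older than %ds" % (len(stale), max_age)
    return OK, "OK: %d queued, none older than %ds" % (len(ages), max_age)
```

A wrapper script would collect the ages (for example by parsing the queue listing), print the returned message and exit with the returned code. Keeping the decision separate from the data collection makes the thresholds easy to tune per host.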

Comment 1 Tim Landscheidt 2014-01-31 04:05:16 UTC

Some lessons learned from cleaning the stuck queues:

- We need to check the queues on *every* host, not just tools-mail.  We could use different thresholds for tools-mail and the rest as the latter only needs to talk to tools-mail, but I don't think that's necessary.
- Sometimes there are leftovers in /var/spool/exim4/{input,msglog} from exim hiccups (OOM?) where only the -D or only the -H file exists.  Correlating them with the queue is hard; it is easier to check for any files there that are older than the Icinga threshold + x days.  This will not detect hiccups instantly, but not too late either.
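The "any files older than the threshold" check from the second point could look roughly like the sketch below. The function name and the way the age limit is passed in are assumptions; the directories to scan would be the two spool paths named above, and the limit would be the Icinga threshold plus the chosen number of days.

```python
import os
import time


def find_old_files(directories, max_age_seconds, now=None):
    """Return a sorted list of files whose mtime is older than max_age_seconds.

    Subdirectories are walked too, in case the spool is split into
    subdirectories. Directories that do not exist are silently skipped.
    """
    now = time.time() if now is None else now
    old = []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                try:
                    mtime = os.stat(path).st_mtime
                except OSError:
                    # The file vanished between listing and stat,
                    # e.g. because the message was delivered meanwhile.
                    continue
                if now - mtime > max_age_seconds:
                    old.append(path)
    return old if not old else sorted(old)
```

Any non-empty result would be reported as a problem. This deliberately does not try to pair -D and -H files, since the comment above notes that correlating them with the queue is hard.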

Comment 2 Tim Landscheidt 2014-02-05 22:16:06 UTC

Also, /var/log/exim4/panic should either be empty or not exist.
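The "empty or not exist" condition is simple enough to sketch directly. The function name and the choice to report a non-empty log as CRITICAL are assumptions; the path is passed in by the caller rather than hard-coded.

```python
import os


def check_panic_log(path):
    """Return a (Nagios exit code, message) pair for the given panic log.

    A missing or empty file is OK (0); anything else is CRITICAL (2).
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return 0, "OK: %s does not exist" % path
    if size == 0:
        return 0, "OK: %s is empty" % path
    return 2, "CRITICAL: %s contains %d bytes" % (path, size)
```

It would be called with the panic log path mentioned above, and a wrapper would print the message and exit with the returned code.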

Comment 3 Gerrit Notification Bot 2014-06-30 20:39:59 UTC

Change 143111 had a related patch set uploaded by Yuvipanda:
toollabs: Send exim queue length to graphite

https://gerrit.wikimedia.org/r/143111

Comment 4 Gerrit Notification Bot 2014-06-30 21:26:47 UTC

Change 143111 merged by coren:
toollabs: Send exim queue length to graphite

https://gerrit.wikimedia.org/r/143111
