Last modified: 2014-08-03 20:59:10 UTC
We should add some monitoring for a stuck mail queue. Googling revealed various solutions for Nagios/Icinga. We should set the thresholds (how many mails of what age may be in the queue) fairly low. On Toolserver, I believe ACC sent mail to addresses specified by users and was thus prone to typos ("hotmaiil.com", etc.); on Tools, by contrast, most mail should be delivered successfully within minutes.
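To make "how many mails of what age" concrete, here is a minimal sketch of the threshold logic such a check could use. The function name and the default values (10 minutes, 1 and 10 mails) are assumptions for illustration only, not agreed thresholds, and collecting the message ages from exim is deliberately left out.

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_queue(ages_seconds, max_age=600, warn_count=1, crit_count=10):
    """Return (status, message) for a list of queued-message ages in seconds.

    All defaults are hypothetical: a mail counts as stale after max_age
    seconds; warn_count stale mails give WARNING, crit_count give CRITICAL.
    """
    stale = [age for age in ages_seconds if age > max_age]
    if len(stale) >= crit_count:
        status = CRITICAL
    elif len(stale) >= warn_count:
        status = WARNING
    else:
        status = OK
    message = "%s: %d of %d queued mails older than %ds" % (
        LABELS[status], len(stale), len(ages_seconds), max_age)
    return status, message
```

A plugin wrapper would print the message and exit with the status. Since the same thresholds are meant to apply on every host, the defaults would not need per-host overrides.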
Some lessons learned from cleaning the stuck queues:
- We need to check the queues on *every* host, not just tools-mail. We could use different thresholds for tools-mail and the rest, as the latter only need to talk to tools-mail, but I don't think that's necessary.
- Sometimes there are leftovers in /var/spool/exim4/{input,msglog} that result from exim hiccups (OOM?), where only the -D or only the -H file exists. Correlating them with the queue is hard. Easier: check for any files there that are older than the Icinga threshold + x days. This will not detect hiccups instantly, but not too late either.
Also, /var/log/exim4/panic should either be empty or not exist.
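The two file-based checks above (stale spool leftovers and the panic log) could be sketched roughly as follows. The function names are hypothetical, the cutoff is passed in rather than fixed, and the directories are searched recursively on the assumption that the spool may be split into subdirectories.

```python
import os
import time
from pathlib import Path


def stale_spool_files(directories, max_age_seconds, now=None):
    """Return files under the given spool directories whose mtime is older
    than max_age_seconds. Missing directories are skipped silently."""
    now = time.time() if now is None else now
    stale = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue
        for path in root.rglob("*"):
            if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def panic_log_is_clean(path):
    """True if the panic log does not exist or is empty."""
    log = Path(path)
    return not log.exists() or log.stat().st_size == 0
```

A check would call stale_spool_files() on /var/spool/exim4/input and /var/spool/exim4/msglog with a cutoff of the Icinga threshold plus x days, and panic_log_is_clean() on the panic log path, alerting if either reports a problem.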
Change 143111 had a related patch set uploaded by Yuvipanda: toollabs: Send exim queue length to graphite https://gerrit.wikimedia.org/r/143111
Change 143111 merged by coren: toollabs: Send exim queue length to graphite https://gerrit.wikimedia.org/r/143111