Last modified: 2014-06-30 13:09:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T67861, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 65861 - Mail notifications from fab.wmflabs.org delivered only days later (or not at all?)


Summary:	Mail notifications from fab.wmflabs.org delivered only days later (or not at ...

Status:	RESOLVED FIXED

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	General (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	High major
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2014-05-28 16:48 UTC by Andre Klapper
Modified:	2014-06-30 13:09 UTC (History)
CC List:	11 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Andre Klapper 2014-05-28 16:48:50 UTC

Copying from http://fab.wmflabs.org/T348 ; confirmed by other users

Notification mail from http://fab.wmflabs.org/ has lag (and I think I also miss notifications, so they get lost). Two examples:

My last bugmail notification from http://fab.wmflabs.org/T335 I received was from Fri, 16 May 2014 20:24:51 +0000:

"aklapper added a subscriber: aklapper."

so I am explicitly subscribed to it. But I only realized that there are new comments on T335 via the feed icon on http://fab.wmflabs.org itself. I should have received notifications. My spam folder is empty and searching for "335" does not show any newer messages. I've checked both in my desktop email client and via the browser on mail.google.com.

Furthermore, for my visibility change in http://fab.wmflabs.org/T12 I received a notification message on Thu, 22 May 2014 23:31:27 UTC when I slept already for a few hours:

Delivered-To: aklapper@wikimedia.org
Received: by 10.140.47.144 with SMTP id m16csp235382qga; Thu, 22 May 2014
16:31:29 -0700 (PDT)
X-Received: by 10.224.134.194 with SMTP id k2mr1301090qat.5.1400801489448;
Thu, 22 May 2014 16:31:29 -0700 (PDT)
Return-Path: <root@wikimedia.org>
Received: from mchenry.wikimedia.org (mchenry.wikimedia.org.
[208.80.152.186]) by mx.google.com with ESMTP id
f1si1648140qai.52.2014.05.22.16.31.29 for <aklapper@wikimedia.org>; Thu, 22
May 2014 16:31:29 -0700 (PDT)
Received-SPF: pass (google.com: domain of root@wikimedia.org designates
208.80.152.186 as permitted sender)
Authentication-Results: mx.google.com; spf=pass (google.com: domain of
root@wikimedia.org designates 208.80.152.186 as permitted sender)
smtp.mail=root@wikimedia.org
Received: from [10.68.16.110] (port=33970 helo=fab2.eqiad.wmflabs) by
mchenry.wikimedia.org with esmtp (Exim 4.69) (envelope-from
<root@wikimedia.org>) id 1WncSN-00018T-UL for aklapper@wikimedia.org; Thu,
22 May 2014 23:31:29 +0000
Received: from jenkins-bot by fab2.eqiad.wmflabs with local (Exim 4.76)
(envelope-from <root@wikimedia.org>) id 1WncSN-0000Qp-N6 for
aklapper@wikimedia.org; Thu, 22 May 2014 23:31:27 +0000
Date: Thu, 22 May 2014 23:31:27 +0000

However I made this change on Tue, May 20, 16:55 UTC according to the date stamp in http://fab.wmflabs.org/T12 .
So there seems to be a delivery delay of more than two days.

My attention workflow is mostly email based so this is really really distracting.


<Coren> andre: If I had to venture a guess, that's a problem with the mail setup on the phabricator instances and not related to mchenry -- but it might be a problem in the *default* mail handling setup on instances.
My exim4 skillz are limited - I'm more of a postfix dude - but I could take a look at it if you open a bz for me (sorry, the labs workflow is still on the old tools) :-)
andre: I should say the first thing to look at would be the mail logs on the instance the email originates from; in particular, you want to look for emails being deffered.

Comment 1 Andre Klapper 2014-05-28 16:53:00 UTC

and docs, for the records: https://secure.phabricator.com/book/phabricator/article/configuring_outbound_email/

Comment 2 Andre Klapper 2014-05-28 17:35:35 UTC

Another example from a few minutes ago: 
http://fab.wmflabs.org/T148#26 says "Via Web · Fri, May 23, 6:25 AM"; email received more than five days later: Wed, 28 May 2014 17:31:39 UTC.

Comment 3 Marc A. Pelletier 2014-05-28 17:51:08 UTC

One useful bit of data that would be important to have:  what instance is the email originally generated on?  (I.e., is it the public facing one, a backend, etc)

Comment 4 Andre Klapper 2014-05-29 13:55:25 UTC

Coren: Not sure if I understand the question correctly :-/

> Received: from jenkins-bot by fab2.eqiad.wmflabs with local (Exim 4.76)

Or does this refer to how Phabricator internally handles mail?

Comment 5 Marc A. Pelletier 2014-05-29 14:07:24 UTC

Ideally, I need to be able to follow the mail flow from where it first enters the system - unless there is a relay /within/ the phabricator project, then fab2 is probably it.  I'll start my investigation there.

Comment 6 Marc A. Pelletier 2014-05-29 14:40:16 UTC

The issue, for that instance at least, is that the MTA is running out of space for spooling mail (forcing them to be deferred) because /var is full.

In fab2's case, there are two proximate causes:

(a) While /var/log is properly on a separate volume, there exist two other log directories on the volume that take up several hundreds of megs of space (/var/log2 and /var/log3).  At first glance, neither of those have been in actual use since May 13.

(b) while a separate volume has been set aside for the mysql database, phabricator itself seems to also use /var/tmp for large, persistent storage.  Pointing it at (a subdirectory of) /mnt * would offer more resillience.

* I note that the extra disk space allocated to the volume has been mounted at /mnt which is deprecated practice (in fact, it was never *correct* to do so but that is indeed how Labs used to do it while in Tempa).  There is no necessity to change that now, but if the instance is rebuilt in the future it would be best if the "correct" mount point of /srv is used instead.

Comment 7 Daniel Zahn 2014-05-29 14:45:10 UTC

the instance is a completely manual setup and as has been pointed out before it really needs to be setup for real (via puppet, incl. log rotate etc., no manual mounts, enough space etc)  (before it's actually being used for tickets)

Comment 8 Marc A. Pelletier 2014-05-29 14:47:05 UTC

At any rate, a simple stopgap would be to delete or move aside /var/log2 and /var/log3.  This will free ~700M of space which will unclog email - at least as long as phabricator does not consume all of it).

Comment 9 Daniel Zahn 2014-05-29 14:49:35 UTC

afaik Chad has set it up but i doubt he will care about keeping those logs, so the suggestion sounds good

Comment 10 Marc A. Pelletier 2014-05-29 14:59:31 UTC

I've made tarballs of both directories to /mnt/log[23].tgz, freeing >600M of space.

Comment 11 Andre Klapper 2014-05-29 19:20:16 UTC

Thanks so much, Coren (and Daniel)!

Comment 12 Andre Klapper 2014-06-09 14:04:36 UTC

There seems to be a problem again for me. 
I have received no Phabricator notifications since Thursday (though I have actively commented). http://fab.wmflabs.org/T294#38 was the last notification I received.
Could somebody take a look? :-/

Comment 13 Marc A. Pelletier 2014-06-09 14:49:00 UTC

/var was full again, because /var/tmp/phd/log had a 1.4G logfile(!)

I've truncated it to the last 50K lines, but this is trying to bail the sea with a spoon; at the very least that logfile needs to be configured to go into /var/log (which has more space) until the box is properly puppetized.

Also, given the speed of growth of logs, the proper setup will need to have aggressive log rotation.

Comment 14 Andre Klapper 2014-06-09 15:00:05 UTC

Sigh. Thanks! Confirming I received mail now.

> at the very least /var/tmp/phd/log needs to be configured to go into /var/log
> (which has more space) until the box is properly puppetized.

Alright.

Comment 15 Matthew Flaschen 2014-06-09 20:19:29 UTC

(In reply to Marc A. Pelletier from comment #13)
> at the very least that logfile needs to be configured to go
> into /var/log (which has more space) until the box is properly puppetized.

Is that being tracked anywhere?

Comment 16 Matthew Flaschen 2014-06-09 20:20:12 UTC

The /var/log workaround, I mean.

Comment 17 Andre Klapper 2014-06-25 13:23:37 UTC

Problem happens again, as expected. I've asked on #labs.

CC'ing Chase because this sounds relevant for the final setup:

(In reply to Marc A. Pelletier from comment #13)
> /var was full again, because /var/tmp/phd/log had a 1.4G logfile(!)
> 
> I've truncated it to the last 50K lines, but this is trying to bail the sea
> with a spoon; at the very least that logfile needs to be configured to go
> into /var/log (which has more space) until the box is properly puppetized.
> 
> Also, given the speed of growth of logs, the proper setup will need to have
> aggressive log rotation.

Comment 18 Marc A. Pelletier 2014-06-30 13:09:25 UTC

Same logfile.  I've nuked it.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links