Last modified: 2014-04-19 08:41:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T58525, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 56525 - Improve spam filtering for Mailman mailing lists
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Mailing lists
Version: wmf-deployment
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2013-11-02 23:53 UTC by Risker
Modified: 2014-04-19 08:41 UTC
CC List: 15 users

Description Risker 2013-11-02 23:53:25 UTC
The volume of spam getting through to the mailing list moderators on multiple Mailman-based mailing lists is increasing very significantly. It is common to see the same email address spam multiple lists one after the other.  I moderate half a dozen lists, and in total am seeing about 150-200 spam emails being sent to the moderation queue on a daily basis, even after auto-discarding emails from the hundreds of addresses already on "auto-discard".  

The volume started increasing around August and is steadily rising.
Comment 1 Andre Klapper 2013-11-04 10:10:44 UTC
Does that mean that Mailman's /privacy/spam section is not sufficient? 
Wondering if http://jamesh.id.au/articles/mailman-spamassassin/ would be overkill.

I'm not sure what is expected by this report. :-/
Comment 2 Thehelpfulone 2013-11-06 02:15:45 UTC
We do have some sort of configuration of SpamAssassin on sodium, but I'm not sure how up to date the details at https://wikitech.wikimedia.org/wiki/Mailing_lists#SpamAssassin are. Currently I believe SA adds X-Spam-Score headers to the emails, which allow you to configure spam filters through the list admin interface (at https://lists.wikimedia.org/mailman/admin/list-name/privacy/spam).

Risker, do you think we should add some of this spam filtering as a default across all mailing lists? Please could you check some of the messages that you go through this week to see if they have an X-Spam-Score header (you can check this when you are going through the moderation queue by clicking on the link to read full details about each email), and if they do, note roughly what score they're given?
Comment 3 Daniel Zahn 2013-11-06 02:24:50 UTC
We do run SpamAssassin and it already scores the mails for you. You can see the score in the headers as X-Spam-Score. What is missing is activating the filtering on it via the Mailman UI, which is a per-list setting a list admin can change. Basically you put in a regex to filter by spam score. RobH just recently did that for the ops list, for example.
Comment 4 Daniel Zahn 2013-11-06 02:26:15 UTC
The main issue is finding the right score threshold to filter on, which might vary per list. Also, you can choose between "hold" and "discard" as the action. When experimenting with the spam filter I suggest you select "hold" first, which prevents the messages from being delivered while still letting you check them as a list admin/mod to make sure they are not false positives.
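
One way to compare candidate thresholds before switching from "hold" to "discard" is to tally messages a moderator has already classified by hand. The following is only a minimal sketch: the sample scores, the labels and the function name `evaluate_threshold` are hypothetical and not taken from any real list.

```python
def evaluate_threshold(samples, threshold):
    """Count outcomes if every message scoring >= threshold were filtered.

    samples: iterable of (score, is_spam) pairs labelled by a moderator.
    Returns (spam_caught, spam_missed, false_positives).
    """
    caught = missed = false_pos = 0
    for score, is_spam in samples:
        filtered = score >= threshold
        if is_spam and filtered:
            caught += 1
        elif is_spam:
            missed += 1
        elif filtered:
            false_pos += 1
    return caught, missed, false_pos


# Hypothetical hand-labelled sample: (X-Spam-Score, moderator says spam?)
SAMPLES = [
    (3.1, True), (5.4, True), (7.8, True), (12.0, True),
    (0.2, False), (1.5, False), (10.1, False),
]

for threshold in (3.0, 4.0, 5.0):
    print(threshold, evaluate_threshold(SAMPLES, threshold))
```

Any non-zero false-positive count at a given threshold is an argument for keeping the action on "hold" rather than "discard" at that level.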
Comment 5 Thehelpfulone 2013-11-06 02:28:55 UTC
Daniel, I imagine most lists already hold mail from non-members, so the "discard" option would probably be preferred; the difficulty is picking a threshold that's high enough to minimise the false positives. From the lists that I admin, this could be around 3+, but I'd like to see what scores other admins are getting.
Comment 6 Daniel Zahn 2013-11-06 02:32:33 UTC
try:   Privacy options... -> Spam filters -> Spam Filter Regexp -> put in the value "x-spam-status: yes" -> Select your "action"

This would be the default settings. If that doesn't prove effective there should be other ways to make more specific regexes, see:

https://lists.wikimedia.org/mailman/admin/<LISTNAME>/?VARHELP=privacy/spam/header_filter_rules
Comment 7 Thehelpfulone 2013-11-06 02:35:08 UTC
So far I've been using X-Spam-Score: \d{1,2}\.\d \(\+{3,}\) as the Regexp - does yours affect all messages that have any X-Spam-Status?
Comment 8 Daniel Zahn 2013-11-06 02:37:48 UTC
Thehelpfulone, ya, that example using X-Spam-Score is what I used before and what I meant by "other ways", basically. The way I described above is another option that RobH activated today, to try out the defaults and see how well it works, as opposed to specific values we'd have to pick.
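
The difference between the two rules from comments 6 and 7 can be checked offline with Python's `re` module. This is a rough sketch under two assumptions: that each rule is applied as a case-insensitive search over the header text (the flags Mailman itself uses may differ), and that the X-Spam-Status header starts with "Yes" or "No". The sample headers and the helper name `rule_matches` are made up for illustration.

```python
import re

# Patterns quoted in the comments above.
STATUS_RULE = r"x-spam-status: yes"
SCORE_RULE = r"X-Spam-Score: \d{1,2}\.\d \(\+{3,}\)"


def rule_matches(pattern, headers):
    """Return True if the pattern is found anywhere in the header text.

    Assumption: a plain case-insensitive search, which may not be
    exactly how Mailman evaluates its filter rules.
    """
    return re.search(pattern, headers, re.IGNORECASE) is not None


# Hypothetical header blocks; the X-Spam-Status wording is assumed.
flagged = "X-Spam-Status: Yes\nX-Spam-Score: 10.1 (++++++++++)"
borderline = "X-Spam-Status: No\nX-Spam-Score: 3.1 (+++)"
clean = "X-Spam-Status: No\nX-Spam-Score: 0.4 (/)"

for name, headers in [("flagged", flagged), ("borderline", borderline), ("clean", clean)]:
    print(name, rule_matches(STATUS_RULE, headers), rule_matches(SCORE_RULE, headers))
```

Under these assumptions the status rule only fires once SpamAssassin's own required score is reached, whereas the score rule fires on any message showing three or more plus signs, regardless of the status.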
Comment 9 Risker 2013-11-06 03:53:40 UTC
(In reply to comment #6)
> try:   Privacy options... -> Spam filters -> Spam Filter Regexp -> put in the
> value "x-spam-status: yes" -> Select your "action"
> 
> this would be default settings. if that doesn't prove effective there should
> be other ways to make more specific regexes, see;
> 
> https://lists.wikimedia.org/mailman/admin/<LISTNAME>/?VARHELP=privacy/spam/header_filter_rules

Thank you, Daniel, this might be the ticket.  If I put in "x-spam-status: yes" does it send *all* messages with a spam value to wherever I send it? (Likely it would be "discard".)

Just to give you a notion of the extent of the spam, there are 69 spam messages to functionaries-en-L in less than 24 hours; 26 to arbcom-en-appeals; 22 to arbcom-L. I've got a couple more on my list, but they're not being spammed nearly as badly, probably because they've had a very low number of people who've either been subscribers or have had posts accepted over the last 3-5 years.  

----
Example from one email that was definitely spam:

X-Spam-Score: 3.1 (+++)
X-Spam-Report: Spam detection software, running on the system "mchenry.wikimedia.org", has
	identified this incoming email as possible spam. If you have any
	questions, see the administrator of that system for details.
	Content analysis details:   (3.1 points, 4.0 required)

I note that other emails sent to the same mailing list replace "mchenry.wikimedia.org" with "sodium.wikimedia.org" - not sure if that is relevant.  

On scanning several emails to functionaries-en-L, most of them are well above the 4.0 spam score. 
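
To compare a message's score with the required score, the "Content analysis details" line of the report can be parsed. A minimal sketch, assuming the line always has the "(N points, M required)" shape shown in the example above; the names `DETAILS_RE` and `parse_details` are hypothetical.

```python
import re

# Matches e.g. "(3.1 points, 4.0 required)".
DETAILS_RE = re.compile(
    r"\((-?\d+(?:\.\d+)?) points, (-?\d+(?:\.\d+)?) required\)"
)


def parse_details(report):
    """Return (score, required) from a report, or None if the line is absent."""
    match = DETAILS_RE.search(report)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


REPORT_LINE = "Content analysis details:   (3.1 points, 4.0 required)"

score, required = parse_details(REPORT_LINE)
print(score, required, score >= required)
```

For the example email this gives 3.1 against a required 4.0, so a filter keyed only on SpamAssassin's own verdict would presumably not catch it, while a lower custom threshold would.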

I'm concerned about tweaking the "admin_immed_notify" settings so that list admins and moderators get fewer emails (right now we get one for every email sent to moderation, every one auto-rejected, and every one auto-discarded). It seems the alternatives are lots of emails, which wouldn't change even if we tighten the spam filters (they'd be auto-rejected), or no emails, which makes it less likely that we can recover legitimate emails that sometimes get missed in the haystack of spam.
Comment 10 Risker 2013-11-06 03:57:36 UTC
I frequently see the same email address spamming a lot of the lists.wikimedia.org mailing lists in a serial fashion (seeing the same email come up on 3-5 of the lists that I moderate, one after the other).  Is that something that can be filtered before the spam goes all the way through the system?
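
A rough sketch of how such serial senders could be spotted, assuming the held messages from several lists can be exported as (list name, sender) pairs. That export, the sample addresses and the function name `cross_list_senders` are all hypothetical.

```python
from collections import defaultdict


def cross_list_senders(held, min_lists=3):
    """Return senders whose held mail appears on at least min_lists lists."""
    lists_by_sender = defaultdict(set)
    for list_name, sender in held:
        # Compare addresses case-insensitively.
        lists_by_sender[sender.lower()].add(list_name)
    return sorted(
        sender
        for sender, lists in lists_by_sender.items()
        if len(lists) >= min_lists
    )


# Hypothetical moderation-queue entries: (list name, sender address).
HELD = [
    ("functionaries-en-L", "offers@example.com"),
    ("arbcom-en-appeals", "offers@example.com"),
    ("arbcom-L", "Offers@example.com"),
    ("arbcom-L", "newcomer@example.org"),
]

print(cross_list_senders(HELD))
```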
Comment 11 Kunal Mehta (Legoktm) 2013-11-07 16:37:59 UTC
(In reply to comment #10)
> I frequently see the same email address spamming a lot of the
> lists.wikimedia.org mailing lists in a serial fashion (seeing the same email
> come up on 3-5 of the lists that I moderate, one after the other).  Is that
> something that can be filtered before the spam goes all the way through the
> system?

I did a little bit of searching and found https://bugzilla.mozilla.org/show_bug.cgi?id=681460 which is duped to a WONTFIXED bug.

If the X-Spam-Score filter method works, maybe we just need sane defaults?
Comment 12 Tilman Bayer 2014-02-04 21:50:23 UTC
I looked into activating it for the Wikimediaannounce-l list a while ago, but unfortunately there are a lot of false positives, see the example below where SpamAssassin thought that I am trying to commit money fraud ;)

(Note that Mailman includes this report in the mail header, i.e. any list subscriber can look through past messages and help find such false positives.)

----

From: Tilman Bayer <tbayer@wikimedia.org>
Date: Thu, 14 Nov 2013 22:49:56 -0800
Message-ID: <CAPDdKA5q+Nr2J6XmEvXUR6Fg=HKZv=NKg_QpLG_FMru-k-CYFg@mail.gmail.com>
Subject: Wikimedia Foundation Report, October 2013
To: wikimediaannounce-l@lists.wikimedia.org
Cc: Staff All <wmfall@lists.wikimedia.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 10.1 (++++++++++)
X-Spam-Report: Spam detection software, running on the system "sodium.wikimedia.org", has
 identified this incoming email as possible spam.  The original message
 has been attached to this so you can view it (if it isn't spam) or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 
 Content preview:  Hi all, please find below the WMF report for October 2013,
    in plain text. As always, the editable and formatted version has been published
    on Meta: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Report,_October_2013
    [...] 
 
 Content analysis details:   (10.1 points, 4.0 required)
 
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -0.0 SPF_PASS               SPF: sender matches SPF record
  2.5 US_DOLLARS_3           BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
  0.0 WEIRD_PORT             URI: Uses non-standard port number for HTTP
  0.0 LOTS_OF_MONEY          Huge... sums of money
  0.0 T_DKIM_INVALID         DKIM-Signature header exists but is not valid
  3.6 MONEY_FRAUD_3          Lots of money and several fraud phrases
  4.0 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
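
To see which rules drive a false positive like this one, the rule table can be parsed and summed. A minimal sketch, assuming each rule line starts with a decimal score followed by an upper-case rule name, as in the report above; the names `RULE_LINE` and `parse_rules` are hypothetical.

```python
import re
from decimal import Decimal

# A rule line: optional sign, decimal points value, rule name, description.
RULE_LINE = re.compile(r"^\s*(-?\d+\.\d+)\s+([A-Z0-9_]+)\s+(.*)$")

REPORT = """\
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -0.0 SPF_PASS               SPF: sender matches SPF record
  2.5 US_DOLLARS_3           BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
  0.0 WEIRD_PORT             URI: Uses non-standard port number for HTTP
  0.0 LOTS_OF_MONEY          Huge... sums of money
  0.0 T_DKIM_INVALID         DKIM-Signature header exists but is not valid
  3.6 MONEY_FRAUD_3          Lots of money and several fraud phrases
  4.0 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
"""


def parse_rules(report):
    """Return a list of (points, rule_name) for every rule line in the report."""
    rules = []
    for line in report.splitlines():
        match = RULE_LINE.match(line)
        if match:  # the column header and the dashed separator do not match
            rules.append((Decimal(match.group(1)), match.group(2)))
    return rules


rules = parse_rules(REPORT)
total = sum(points for points, _ in rules)
top = sorted(rules, reverse=True)[:3]
print(total)
print([name for _, name in top])
```

The three money-related rules account for the entire 10.1 points, which suggests the financial figures in the report text triggered the classification.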
Comment 13 Nemo 2014-02-04 22:17:50 UTC
(In reply to comment #12)
> I looked into activating it for the Wikimediaannounce-l list a while ago, but
> unfortunately there are a lot of false positives, see the example below where
> SpamAssassin thought that I am trying to commit money fraud ;)

LOL, I assume we can't customise SpamAssassin rules per-list? I hope humans don't subconsciously discard those phrases as spam-sounding too, though. :p
Comment 14 Seb35 2014-04-19 08:36:04 UTC
I’ve been managing WMFR’s mailing lists for some years and we have quite a low level of spam in moderation. I don’t know much about Exim+Mailman (we use Postfix+Sympa) so I may be missing some things, but I wonder about three config strategies:

* throttling: I didn’t see many such config parameters, apart from "smtp_accept_max = 4000" and the settings around it; perhaps a finer config here, possibly per host, could limit the spam volume.

* DNSBL: this has proved very effective on WMFR’s mailing lists (with Zen-Spamhaus) and I guess it is much cheaper than SpamAssassin; I know there are some arguments about such methods, and if you don’t want to use it to reject connections, perhaps there is some program which could take it into account when computing the spam score.

* blacklists/whitelists: WMFR’s lists are generally whitelisted for members of the list, WMFR’s members, and members of an additional global whitelist, and in moderation for others. We have a low level of spam, so a global blacklist is not needed, but perhaps it would be worth using one for WMF mailing lists. I am thinking of some webpage (or something in the Mailman interface) where all list admins could add "addresses of known spammers", and this could be used either in SpamAssassin or directly in Exim. This would save time for all list admins. Or a reverse scenario could be to implement a global whitelist, manual or semi-automatic with members of any mailing list, so that the moderation queue would have fewer non-spam messages awaiting moderation.
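
For the DNSBL idea, a lookup works by reversing the octets of the connecting IPv4 address and querying that name under the blocklist's zone. The sketch below only builds the query name and does no network access; the function name `dnsbl_query_name` is hypothetical and the address is a documentation example.

```python
import ipaddress


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on invalid input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


print(dnsbl_query_name("192.0.2.1"))
```

Resolving the resulting name would then tell whether the address is listed: an address record in the answer means listed, a "no such domain" answer means not listed.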
