
Using CRM114 for spam filtering on Debian GNU/Linux

I've been using CRM114 as a spam filter for a while now, and I'm quite happy with it. Because of bug #529720 (incompatible upstream file format changes), though, I decided to start my setup from scratch with a recent CRM114 version from unstable. Here's a short HOWTO; I hope it's useful for others.

First you need to install crm114 and set up a few files in your $HOME directory.

  $ sudo apt-get install crm114
  $ mkdir ~/.crm114
  $ cd ~/.crm114
  $ cp /usr/share/doc/crm114/examples/mailfilter.cf.gz .
  $ gunzip mailfilter.cf.gz
  $ cp /usr/share/crm114/mailtrainer.crm .
  $ touch rewrites.mfp priolist.mfp

Edit ~/.crm114/mailfilter.cf and set the following variables (some are optional, but that's what I currently use):

  :spw: /mypassword/
  :add_verbose_stats: /no/
  :add_extra_stuff: /no/
  :rewrites_enabled: /no/
  :spam_flag_subject_string: //
  :unsure_flag_subject_string: //
  :log_to_allmail.txt: /no/

The :log_to_allmail.txt: option should probably stay at /yes/ for the first few days, until you have tested your setup and everything works OK; only then switch it to /no/ as shown above. The ~/.crm114/allmail.txt file will contain all your mails, in case something goes wrong.

Now set up empty spam and nonspam files like this:

  $ cssutil -b -r spam.css
  $ cssutil -b -r nonspam.css

Test the setup by invoking mailreaver.crm as follows, typing some test text and then pressing CTRL+d:

  $ /usr/share/crm114/mailreaver.crm -u ~/.crm114
  test
  [CTRL-d]
  ** ACCEPT: CRM114 PASS osb unique microgroom Matcher **
  CLASSIFY fails; success probability: 0.5000  pR: 0.0000
  Best match to file #0 (nonspam.css) prob: 0.5000  pR: 0.0000
  Total features in input file: 8
  #0 (nonspam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  #1 (spam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  X-CRM114-Version: 200904023-BlameSteveJobs ( TRE 0.7.6 (BSD) ) MF-35EB8B9A [pR: 0.0000]
  X-CRM114-CacheID: sfid-20090920_151224_574131_D290E589
  X-CRM114-Status: UNSURE (0.0000) This message is 'unsure'; please train it!

The output should look similar to the above. If there are errors instead, you should check your settings in ~/.crm114/mailfilter.cf.
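When you repeat this test with saved messages, it can be handy to pull just the verdict out of the headers that mailreaver adds. Here's a minimal shell sketch; crm_status is a hypothetical helper name (it is not part of CRM114), and it assumes the verdict is the first word after "X-CRM114-Status:", as in the sample output above.

```shell
# crm_status: hypothetical helper, not shipped with CRM114.
# Reads a filtered message on stdin and prints the first word of the
# first X-CRM114-Status header line (UNSURE in the sample output above).
crm_status() {
  awk '/^X-CRM114-Status:/ { print $2; exit }'
}
```

You would use it by piping the mailreaver.crm output into it, for example by appending "| crm_status" to the test command above and feeding it a saved message on stdin.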

Now you have to set up a procmail rule for crm114:

  :0fw: crm114.lock
  | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114

  :0:
  * ^X-CRM114-Status: SPAM.*
  IN.spam-crm114

In my case this rule is also followed by a spamassassin rule, so all my mail goes through two different spam filters (I'll probably look into dspam and bogofilter as well; the more the better).

Finally, I have the following macros in my .muttrc, so I can press SHIFT+x to mark a mail as spam and SHIFT+h to mark it as non-spam (ham):

macro index X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro index H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'
macro pager X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro pager H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'

Important: crm114 is most effective if you start with empty CSS files (as shown above) and only train it by marking mails as spam/ham when it gets them wrong. The process will take a few hours or maybe a day (depending on how many mails per day you get), then the misclassification rate gets very low...

Update 2009-09-23: Changed --spam/--nonspam to the correct options for mailreaver/mailtrainer, --spam/--good.

Converting Mailman "Gzip'd Text" archive files to proper mbox files

Mailman archives are often only available in the pretty useless "Gzip'd Text" format, which you cannot easily download and view locally (and threaded) in a MUA such as mutt. But that is exactly what I want to do from time to time (e.g. because I want to read the discussions of the past weeks on mailing lists where I'm newly subscribed).

After some searching I found one way to do it, which I stripped down to my needs:

 $ cat mailman2mbox
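 #!/usr/bin/perl
 use strict;
 use warnings;
 while (<STDIN>) {
   # Undo the address obfuscation: "user at example.org" -> "user@example.org".
   s/^(From:? .*) (at|en) /$1\@/;
   # Reorder the date fields; the timezone is not captured, so +0000 is filled in.
   s/^Date: ([A-Z][a-z][a-z]) +([A-Z][a-z][a-z]) +([0-9]+) +([0-9:]+) +([0-9]+)/Date: $1, $3 $2 $5 $4 +0000/;
   print;
 }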
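To make it clearer what the two substitutions do, here's a rough sed port of them as a shell function, run on three made-up header lines. This is only a sketch: mailman_to_mbox is a hypothetical name, the sample address is invented, and it assumes a sed that supports -E (GNU and BSD sed both do).

```shell
# mailman_to_mbox: hypothetical name; a sed -E port of the two Perl
# substitutions above. Reads the archive on stdin, writes mbox to stdout.
mailman_to_mbox() {
  sed -E \
    -e 's/^(From:? .*) (at|en) /\1@/' \
    -e 's/^Date: ([A-Z][a-z][a-z]) +([A-Z][a-z][a-z]) +([0-9]+) +([0-9:]+) +([0-9]+)/Date: \1, \3 \2 \5 \4 +0000/'
}

# Made-up sample input, just to show the effect of the rewrites.
printf '%s\n' \
  'From john at example.org  Mon Aug  3 10:15:00 2009' \
  'From: john at example.org (John Doe)' \
  'Date: Mon Aug  3 10:15:00 2009' | mailman_to_mbox
# Prints:
#   From john@example.org  Mon Aug  3 10:15:00 2009
#   From: john@example.org (John Doe)
#   Date: Mon, 3 Aug 2009 10:15:00 +0000
```

Note that only the Date: header is reordered; the date on the "From " separator line is left as it is.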

Example run on some random mail archive:

 $ wget http://participatoryculture.org/pipermail/develop/2009-August.txt.gz
 $ gunzip 2009-August.txt.gz
 $ chmod +x mailman2mbox
 $ ./mailman2mbox < 2009-August.txt > 2009-August.mbox

You can then view the mbox as usual in mutt:

 $ mutt -f 2009-August.mbox
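If mutt shows fewer messages than expected, a quick sanity check is to count the "From " separator lines that now carry a real address. A small sketch follows; count_mbox_messages is a hypothetical helper name, and the count is only approximate, since a body line that happens to start with "From " followed by an address would be counted too.

```shell
# count_mbox_messages: hypothetical helper.
# Counts lines that look like mbox separators ("From user@host ...")
# in the file given as the first argument.
count_mbox_messages() {
  grep -c '^From [^ ]*@' "$1"
}
```

For the example above you would call it as "count_mbox_messages 2009-August.mbox" and compare the number with what mutt lists.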

Suggestions for a simpler method to do this are highly welcome. Maybe some mbox-related Debian package already ships with a script to do this?

Drupal 4.5.8 / 4.6.6 / 4.7.0-beta6 fix four security issues!

New versions of Drupal are out for the 4.5.x, the 4.6.x and the 4.7.0-beta branches which fix 4 (in words: four) security issues from four different categories, namely: access control bypassing, cross-site scripting, session fixation, and mail header injection.

All the gory details are available in the release announcement and the four advisories: DRUPAL-SA-2006-001, DRUPAL-SA-2006-002, DRUPAL-SA-2006-003, and DRUPAL-SA-2006-004.

Upgrade now!

Warning: If you're using 4.5.x, the patches for DRUPAL-SA-2006-003 will not fix the security issue immediately. You have two options: a) upgrade to 4.6.6 instead of 4.5.8, or b) upgrade to PHP >= 4.3.2.

Unsubscribing from mailing lists and RSS feeds

Frederico Oliveira talks about some interesting issues regarding the information overload most of us are experiencing. He unsubscribed from several RSS feeds in order to cut down the mass of information.

I'm going to do the same thing. My first step was to reduce the amount of daily email, though. I have just finished unsubscribing from 16 mailing lists which I don't really read very often (this automatically reduces the amount of spam I get, too). I'm left with ca. 80 mailing lists now. I'll probably remove some more and concentrate on those which I really need and/or read regularly.

The next step is to unsubscribe from several RSS feeds, but that's not that much of an issue. I find tracking RSS feeds easier and more manageable than tracking mailing lists (not sure why). For the statistics freaks: I currently subscribe to ca. 320 feeds, but I read far more of them regularly than I do mailing lists...
