Using CRM114 for spam filtering on Debian GNU/Linux

I've been using CRM114 as a spam filter for a while now, and I'm quite happy with it. However, due to bug #529720 (incompatible upstream file format changes), I decided to redo my setup from scratch with a recent CRM114 version from unstable. Here's a short HOWTO; I hope it's useful for others.

First you need to install crm114 and set up a few files in your $HOME directory.

  $ sudo apt-get install crm114
  $ mkdir ~/.crm114
  $ cd ~/.crm114
  $ cp /usr/share/doc/crm114/examples/ .
  $ gunzip
  $ cp /usr/share/crm114/mailtrainer.crm .
  $ touch rewrites.mfp priolist.mfp

Edit ~/.crm114/ and set the following variables (some are optional, but that's what I currently use):

  :spw: /mypassword/
  :add_verbose_stats: /no/
  :add_extra_stuff: /no/
  :rewrites_enabled: /no/
  :spam_flag_subject_string: //
  :unsure_flag_subject_string: //
  :log_to_allmail.txt: /no/

The :log_to_allmail.txt: option should probably stay at "yes" for the first few days, until you have tested your setup and everything works OK. With "yes", the ~/.crm114/allmail.txt file will contain a copy of all your mails, in case something goes wrong.

Now set up empty spam and nonspam files like this:

  $ cssutil -b -r spam.css
  $ cssutil -b -r nonspam.css

Test the setup by invoking mailreaver.crm as follows, typing some test text and then pressing CTRL+d:

  $ /usr/share/crm114/mailreaver.crm -u ~/.crm114
  ** ACCEPT: CRM114 PASS osb unique microgroom Matcher **
  CLASSIFY fails; success probability: 0.5000  pR: 0.0000
  Best match to file #0 (nonspam.css) prob: 0.5000  pR: 0.0000
  Total features in input file: 8
  #0 (nonspam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  #1 (spam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  X-CRM114-Version: 200904023-BlameSteveJobs ( TRE 0.7.6 (BSD) ) MF-35EB8B9A [pR: 0.0000]
  X-CRM114-CacheID: sfid-20090920_151224_574131_D290E589
  X-CRM114-Status: UNSURE (0.0000) This message is 'unsure'; please train it!

The output should look similar to the above. If there are errors instead, you should check your settings in ~/.crm114/
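
If you just want the verdict rather than the full output, a small helper can pull it out of the X-CRM114-Status header. This is only a sketch: the function name crm114_status is made up for illustration, and the sample header is the one from the test run above.

```shell
#!/bin/sh
# Print the verdict word (e.g. SPAM or UNSURE) from the first
# X-CRM114-Status header found on stdin; print nothing if there is none.
# crm114_status is a hypothetical helper name, not part of CRM114.
crm114_status() {
    sed -n 's/^X-CRM114-Status: *\([A-Za-z]*\).*/\1/p' | head -n 1
}

# Sample headers copied from the test run above.
printf '%s\n' \
    "X-CRM114-CacheID: sfid-20090920_151224_574131_D290E589" \
    "X-CRM114-Status: UNSURE (0.0000) This message is 'unsure'; please train it!" \
    | crm114_status
```

For the sample above this prints UNSURE. In real use you would presumably pipe the output of mailreaver.crm into the function instead of the printf.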

Now you have to set up a procmail rule for crm114:

  :0fw: crm114.lock
  | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114

  * ^X-CRM114-Status: SPAM.*

In my case this rule is also followed by a spamassassin rule, so all my mail goes through two different spam filters. I guess I'll look into dspam and bogofilter as well; the more the better.

Finally, I have the following settings in my .muttrc, so I can press SHIFT+x to mark a mail as spam and SHIFT+h to mark it as non-spam (ham):

macro index X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro index H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'
macro pager X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro pager H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'

Important: crm114 is most effective if you start with empty CSS files (as shown above) and only train it by marking mails as spam/ham when it gets them wrong. The process will take a few hours or maybe a day (depending on how many mails you get per day); after that, the misclassification rate gets very low.
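
The "train only on errors" rule can be sketched as a tiny shell check. The function name should_train is hypothetical, and I'm assuming here that the verdict for ham is spelled GOOD (matched case-insensitively), to go with the --good training option; only SPAM and UNSURE actually appear in this post. An UNSURE verdict never matches the true class, so it always counts as worth training.

```shell
#!/bin/sh
# Return 0 (train) when the filter's verdict disagrees with the true class.
# $1: verdict word from X-CRM114-Status, compared case-insensitively
# $2: true class, "spam" or "good" (assumed names, matching --spam/--good)
# should_train is a hypothetical helper, not part of CRM114.
should_train() {
    verdict=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [ "$verdict" != "$2" ]
}

for pair in "SPAM spam" "UNSURE spam" "GOOD good" "SPAM good"; do
    set -- $pair    # split the pair into $1 (verdict) and $2 (true class)
    if should_train "$1" "$2"; then
        echo "$1 / actually $2: train"
    else
        echo "$1 / actually $2: leave alone"
    fi
done
```

Only the second and fourth pairs are reported as "train"; correctly classified mail is left alone, which is the whole point.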

Update 2009-09-23: Changed --spam/--nonspam to the correct options for mailreaver/mailtrainer, --spam/--good.

Converting Mailman "Gzip'd Text" archive files to proper mbox files

Mailman archives are often only available in the pretty useless "Gzip'd Text" format, which you cannot easily download and view locally (and threaded) in a MUA such as mutt. But that is exactly what I want to do from time to time (e.g. because I want to read the discussions of the past weeks on mailing lists where I'm newly subscribed).

After some searching I found one way to do it, which I stripped down to my needs:

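 $ cat mailman2mbox
 #!/usr/bin/perl
 # Undo Mailman's address obfuscation and rewrite Date headers for mbox use.
 while (<STDIN>) {
   s/^(From:? .*) (at|en) /$1\@/;
   s/^Date: ([A-Z][a-z][a-z]) +([A-Z][a-z][a-z]) +([0-9]+) +([0-9:]+) +([0-9]+)/Date: $1, $3 $2 $5 $4 +0000/;
   print;
 }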

Example run on some random mail archive:

 $ wget
 $ gunzip 2009-August.txt.gz
 $ ./mailman2mbox < 2009-August.txt > 2009-August.mbox

You can then view the mbox as usual in mutt:

 $ mutt -f 2009-August.mbox
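
To see what the two substitutions actually do without downloading an archive, here's a rough sed equivalent applied to a few sample lines. Treat it as a sketch: the function name mailman_fix and the sample lines are made up to match the patterns in the script, and it assumes a sed that supports -E.

```shell
#!/bin/sh
# sed sketch of the two substitutions in mailman2mbox (needs sed -E).
# mailman_fix is a hypothetical name used only for this example.
mailman_fix() {
    sed -E \
        -e 's/^(From:? .*) (at|en) /\1@/' \
        -e 's/^Date: ([A-Z][a-z][a-z]) +([A-Z][a-z][a-z]) +([0-9]+) +([0-9:]+) +([0-9]+)/Date: \1, \3 \2 \5 \4 +0000/'
}

# Made-up sample lines in the obfuscated "user at host" style.
printf '%s\n' \
    "From jane at example.org  Wed Aug 12 10:20:30 2009" \
    "From: jane at example.org (Jane Doe)" \
    "Date: Wed Aug 12 10:20:30 2009" \
    | mailman_fix
```

The first two lines come out with jane@example.org, and the Date line becomes "Date: Wed, 12 Aug 2009 10:20:30 +0000". Note that the timezone is simply hardcoded to +0000, as in the Perl script.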

Suggestions for a simpler method to do this are highly welcome. Maybe some mbox related Debian package already ships with a script to do this?
