
Using CRM114 for spam filtering on Debian GNU/Linux

I've been using CRM114 as a spam filter for a while now, and I'm quite happy with it. Because of bug #529720 (incompatible upstream file format changes), though, I decided to start my setup from scratch with a recent CRM114 version from unstable. Here's a short HOWTO; I hope it's useful for others.

First you need to install crm114 and set up a few files in your $HOME directory.

  $ sudo apt-get install crm114
  $ mkdir ~/.crm114
  $ cd ~/.crm114
  $ cp /usr/share/doc/crm114/examples/mailfilter.cf.gz .
  $ gunzip mailfilter.cf.gz
  $ cp /usr/share/crm114/mailtrainer.crm .
  $ touch rewrites.mfp priolist.mfp

Edit ~/.crm114/mailfilter.cf and set the following variables (some are optional, but that's what I currently use):

  :spw: /mypassword/
  :add_verbose_stats: /no/
  :add_extra_stuff: /no/
  :rewrites_enabled: /no/
  :spam_flag_subject_string: //
  :unsure_flag_subject_string: //
  :log_to_allmail.txt: /no/

The :log_to_allmail.txt: option should probably stay at "yes" for the first few days, until you have tested your setup and confirmed that everything works OK. While it is enabled, the ~/.crm114/allmail.txt file will contain all your mails, in case something goes wrong.
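
If you would rather script these edits than make them by hand, here is a minimal Python sketch. It assumes every setting sits on one line in the `:name: /value/` form shown above and that the value contains no slash. The helper name `set_option` is made up for this example and is not part of CRM114.

```python
import re


def set_option(config_text, name, value):
    """Return config_text with the first ':name: /.../' line set to value.

    Assumption: one setting per line, written as ':name: /value/',
    and neither the old nor the new value contains a '/'.
    """
    pattern = re.compile(
        r"^([ \t]*:" + re.escape(name) + r":[ \t]*)/[^/\n]*/",
        re.MULTILINE,
    )
    # A function is used as the replacement so that backslashes in
    # 'value' are not interpreted as regex escapes.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + "/" + value + "/", config_text, count=1
    )
    if count == 0:
        raise KeyError(name)
    return new_text
```

To use it, read ~/.crm114/mailfilter.cf into a string, call `set_option` once per variable, and write the result back (after making a backup copy).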

Now set up empty spam and nonspam files like this:

  $ cssutil -b -r spam.css
  $ cssutil -b -r nonspam.css

Test the setup by invoking mailreaver.crm as follows, typing some test text and then pressing CTRL-d:

  $ /usr/share/crm114/mailreaver.crm -u ~/.crm114
  test
  [CTRL-d]
  ** ACCEPT: CRM114 PASS osb unique microgroom Matcher **
  CLASSIFY fails; success probability: 0.5000  pR: 0.0000
  Best match to file #0 (nonspam.css) prob: 0.5000  pR: 0.0000
  Total features in input file: 8
  #0 (nonspam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  #1 (spam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  X-CRM114-Version: 200904023-BlameSteveJobs ( TRE 0.7.6 (BSD) ) MF-35EB8B9A [pR: 0.0000]
  X-CRM114-CacheID: sfid-20090920_151224_574131_D290E589
  X-CRM114-Status: UNSURE (0.0000) This message is 'unsure'; please train it!

The output should look similar to the above. If there are errors instead, you should check your settings in ~/.crm114/mailfilter.cf.
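
To check the result from a script instead of by eye, the status header can be parsed as in the following rough sketch. It assumes the header has the `X-CRM114-Status: WORD (number)` shape seen in the output above; the function name `parse_status` is hypothetical.

```python
import re

# Matches e.g. "X-CRM114-Status: UNSURE (0.0000) This message is ..."
STATUS_RE = re.compile(
    r"^X-CRM114-Status:[ \t]*(\w+)[ \t]*\([ \t]*(-?\d+(?:\.\d+)?)[ \t]*\)",
    re.MULTILINE,
)


def parse_status(output):
    """Return (status, pR) from mailreaver output, or None if no header is found."""
    match = STATUS_RE.search(output)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

With freshly created, empty CSS files the status should come back as UNSURE with a pR of 0, as in the sample output.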

Now you have to set up a procmail rule for crm114 (replace /home/uwe with your own home directory):

  :0fw: crm114.lock
  | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114

  :0:
  * ^X-CRM114-Status: SPAM.*
  IN.spam-crm114
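
For readers who don't use procmail, the second recipe roughly amounts to the following Python sketch: look at the X-CRM114-Status header that mailreaver added, and file the mail into the spam folder if it starts with SPAM. This is only an illustration of the logic, not a replacement for procmail; the function name `pick_folder` is made up, and returning None simply stands for "leave it to the remaining rules".

```python
from email import message_from_string


def pick_folder(raw_message, spam_folder="IN.spam-crm114"):
    """Return the spam folder name if CRM114 flagged the mail, else None."""
    msg = message_from_string(raw_message)
    status = msg.get("X-CRM114-Status", "")
    # procmail matches case-insensitively by default, so compare in upper case.
    if status.strip().upper().startswith("SPAM"):
        return spam_folder
    return None
```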

In my case this rule is also followed by a SpamAssassin rule, so all my mail goes through two different spam filters. (I will probably look into dspam and bogofilter as well; the more the better.)

Finally, I have the following settings in my .muttrc, so I can press SHIFT+x to mark a mail as spam and SHIFT+h to mark it as non-spam (ham):

macro index X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro index H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'
macro pager X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro pager H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'
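
The formail part of each macro removes the three listed X-CRM114 headers before the mail is piped back into mailreaver. Here is a rough Python equivalent of just that header-stripping step, for illustration only; the function name `strip_crm_headers` is hypothetical, and the macros themselves need nothing but formail.

```python
from email import message_from_string

# The same three headers that the macros pass to "formail -I".
CRM_HEADERS = ("X-CRM114-Status", "X-CRM114-Action", "X-CRM114-Version")


def strip_crm_headers(raw_message):
    """Return the message text with the three CRM114 headers removed."""
    msg = message_from_string(raw_message)
    for name in CRM_HEADERS:
        # Deletes every occurrence; does nothing if the header is absent.
        del msg[name]
    return msg.as_string()
```

Note that the X-CRM114-CacheID header is not in the list, so it stays in the mail in both the macros and this sketch.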

Important: crm114 is most effective if you start with empty CSS files (as shown above) and only train it by marking mails as spam/ham when it gets them wrong. The process will take a few hours or maybe a day (depending on how many mails you get per day), after which the misclassification rate gets very low.

Update 2009-09-23: Changed --spam/--nonspam to the correct options for mailreaver/mailtrainer, --spam/--good.

ASCII art spam

Whoa, those spammers are getting really desperate now, aren't they?

Today in my inbox:

apy8       0lyk   b8xvdtfa        glb13            0uurqjl     5xju3p0jb            1uk9z
 yhak     o3vl    tytjx4ui        m6frp           64zx9238iq   128lkxk2wh           fzqpv
 mkqj     g1tn    wgd293sv       s3mnhaq        y1vvng731dsy   f39iqddc65f         5fgnwcx
 t9ba4   wg8j1      ucq8         uviyoz6        4k2g4     fo   wz0i   q7pn         hqblemz
  pu9t   1dwr       mocp         nlihfws       mm3w0           j4zb   0fzh         o6nljyq
  0luy   to8a       ljd0        5bi8 zpfh      93ab            tbpr   hztc        foza p7sf
  5vw7t a4nce       2fjr        oxto 2t0r      37v3   mxvfq0   x6qtw1j6me         ye51 b7pt
   pwtx gg5l        mtfr       h7390 0voxg     btc8   t7vj3n   twn72qv80         92sj8 8qhuc
   4xoq 9m3u        r3i0       4dgf   2k8l     o6u8   eegabt   70vrl5ukj         6bpp   u336
    9p5tqyo         ixkj       7mkcss82ko2     6dgtj    tdei   eayi tnjgi        ujh0x073p63
    jbxotva         alrs      ubvdw9kele9rs     ed7bi   vbjz   0tlb b1svn       15xh90ojyj56u
    zzfla7m       3o1jnrrc    kvlxt74rl46l1     yy5mng2kl7dj   8bmq  793jb      qzqkjf00glzsf
     e6doi        hfcqgi2t    w8bd     vydk      elqfyxtdk7g   upqf   ippbf     ca5l     cgrm
     npnrd        dzsgo4jz   q9zo       co4g       6kabvxc     sqpy    5ds54   qhpb       krpw

Recently on debian-curiosa...

A recent debian-curiosa thread made my day:

# Subject: looking for someone?
# From: "Mitch"

Hi there locvely,
aThis kind aof opportucnity comes ones in a life. I don't want
to miss it. Do you? I am coming to your place in few days
and I thoughc may be we can meet each other. If cyou don't mind
I can send you my pcicturea. I am a girl.
You can bcorrespond with me using my email cpael@popmailme.com

# From: 'Mash
Sorry I prefer a women who isn't so keen on placing random letters
in her words. Apparently they are rubbish in bed.
I mean what the hell is a "pcicturea," something from the
Anne-summers Jurassic collection?

'Mash

# From: Shawn McMahon
I prefer women who aren't named "Mitch".

New form of wiki spam?

Today I had the "pleasure" of experiencing a new form of wiki spam in the Unmaintained Free Software wiki. Or at least it was the first time I had seen it...

The registered wiki user "Drunkers" (yes, the spammer scripts not only spam anonymously, they also create real accounts lately!) spammed several pages in the wiki, adding tons of spam links, hidden with fancy CSS and other tricks. Nothing unusual so far.

Another registered user ("Mootlif3") reverted the spammer's changes with comments like "reverted (spam)", "unrelated links removed", "deleted spam links", and "damn spammers". Or so I thought.

The real "wtf" moment emerged when I checked what "Mootlif3" really had done. He didn't really revert the changes of the spammer. He only removed a few of the links from the page, leaving most of them still in there! So it would look like a nice (human) wiki user had helped out with cleaning spam, but in reality he only created a false sense of "security" for people who really want to clean the spam...
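
One way to catch such a fake revert is to compare the links on the page before the spam edit with the links after the supposed cleanup. The following is a minimal sketch, assuming you have both revisions as plain wiki text; the function name `leftover_links` and the simple URL pattern are assumptions made for this example and are not tied to any particular wiki software.

```python
import re

# Deliberately simple URL pattern; good enough for a quick comparison.
URL_RE = re.compile(r"https?://[^\s\]\[<>\"']+")


def leftover_links(before_spam, after_revert):
    """Return links present after the 'revert' that were not there before the spam."""
    return set(URL_RE.findall(after_revert)) - set(URL_RE.findall(before_spam))
```

A genuine revert yields an empty set; anything left in the set is a link that the "helpful" user kept on the page.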

Damn spammers, indeed.

Better browsing using Privoxy

I've been a happy Privoxy user for quite some time now. I can really recommend it to anybody who wants to get rid of all the nasty stuff floating around on the web these days. From the Privoxy homepage:

Privoxy is a web proxy with advanced filtering capabilities for protecting privacy, modifying web page content, managing cookies, controlling access, and removing ads, banners, pop-ups and other obnoxious Internet junk.

The most useful feature for me is that it automatically removes almost all of those ugly Flash-based ad banners.

My todo list:

  • Fine-tune the filters for my needs. I'm currently using the stock Debian package of Privoxy, without any customizations.
  • Check out Neil van Dyke's privoxy rules which filter even more nasty stuff.
  • Check out the combination of Privoxy and Tor for anonymous browsing.