Re: a concept for spam filter

Cameron Simpson Sat, 03 Nov 2012 15:05:24 -0700

On 03Nov2012 11:03, Russell L. Harris <rlhar...@broadcaster.org> wrote:
| I thank you for taking the trouble to give a detailed reply, Cameron.
| I have printed it out, and I plan to study it carefully in the
| morning.


On 03Nov2012 09:53, Jamie Paul Griffin <ja...@kode5.net> wrote:
| I just have my mail delivered by smtp so use OpenBSD spamd and
| spamassassin as well as clamav and unofficial sigs, with procmail sorting
| as i mentioned. So my set up is different of course.

I should have made it clear that my setup is a bit roundabout.

The natural macro for this would simply pipe the message to
email-add-spam-subject and then delete it or save it to the "known spam"
bucket.

I save it to a special spool folder for two reasons:

  - saving a message to a folder is quicker than piping it to a program
    which does some work, making for a snappier user experience; I don't
    care that the subject line isn't in my rules until a few seconds later
    (mailfiler will pick it up as part of its regular scan)

  - I've already got my system monitoring maildirs as spools with simple
    rules, so folding this in was very easy

The important things are the script that adds a new rule and a way to
tell your filtering software about the rule update.

Procmail rereads (and therefore recompiles, alas) the rules file every
time you fire it up; my mailfiler notices rule file changes and reloads
the rules when they are updated.
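
As a rough illustration of the reload-on-change idea (this is not
mailfiler's actual code; the RuleFile class, the one-rule-per-line format
and the mtime/size check are all assumptions made for the sketch):

```python
import os


class RuleFile:
    """Hypothetical rules holder that reparses only when the file changes."""

    def __init__(self, path):
        self.path = path
        self._stamp = None  # (mtime_ns, size) seen at the last load
        self._rules = []
        self.loads = 0      # how many times the file was actually parsed

    def _parse(self, text):
        # Assumed format: one rule per line; blank lines and '#' comments skipped.
        return [
            line.split()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]

    def rules(self):
        """Return the current rules, reloading if the file looks different."""
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            with open(self.path, encoding="utf-8") as f:
                self._rules = self._parse(f.read())
            self._stamp = stamp
            self.loads += 1
        return self._rules
```

A long-running filer would call rules() before each scan; the cost when
nothing has changed is a single stat() rather than a full reparse.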

I outlined my setup to give background and to show that a small leading
blacklist and an "UNKNOWN" folder for messages matching no filing rule
divert most stuff away from your inbox fairly effectively without
spamassassin et al.

Regarding filing tools:

I used to use procmail. At some point I decided its rule syntax was
too painful, especially if you want to do a few things with _every_
filing, like X-Labels, log lines and so forth, so some years ago I
wrote cats2procmailrc to take a simple rule syntax and transcribe
a procmailrc. And I finally decided to write something that directly
understood my rule syntax, which has several advantages: reads the rules
once (more performant!), doesn't need a wrapper script to watch maildirs,
leaves me free to make the rules say what I want instead of what can be
said to procmail.

My core gripe with procmail, aside from the from-scratch startup per
message thing, is that it works entirely off regexps. This is not a good
way to parse email addresses. These are all equivalent:

  c...@zip.com.au
  Cameron Simpson <c...@zip.com.au>
  (Cameron Simpson) c...@zip.com.au

Matching that while not matching:

  c...@zip.com.au
  foo.cs.zip.com.au

and so forth just does not work reliably. A mailfiler rule like this:

  me to-me c...@zip.com.au

files to the folder "me" with the tag/x-label "to-me" if the to/cc/bcc
contains "c...@zip.com.au" in the address component as extracted by a
proper RFC2822 parser. No regexps, just string equality tests.
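
To make the parse-then-compare idea concrete, here is a minimal sketch
using Python's standard email.utils.getaddresses. It is not how mailfiler
is implemented; the function names and the cs@example.com address are
placeholders invented for the example.

```python
from email.utils import getaddresses


def core_addresses(header_values):
    """Return the set of lower-cased bare addresses found in header values."""
    return {addr.lower() for _name, addr in getaddresses(header_values) if addr}


def to_me(headers, my_address):
    """True if my_address is an address in To/Cc/Bcc, by string equality.

    `headers` is assumed to map lower-case header names to lists of raw
    header values, e.g. {"to": ["Some Name <cs@example.com>"]}.
    """
    values = [v for name in ("to", "cc", "bcc") for v in headers.get(name, [])]
    return my_address.lower() in core_addresses(values)
```

With this, "cs@example.com", "Some Name <cs@example.com>" and
"(Some Name) cs@example.com" all reduce to the same address, while
lookalikes such as "xcs@example.com" or "foo.cs.example.com" do not
match, with no regexp involved.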

It also parses each message header just once, on demand, so to test
hundreds of rules the parsing happens only once. And of course the rules
are parsed when I start mailfiler, not for each message. The other
upside of extracting the core address part is that you can do this:

  friends Friends from:(FRIENDS)

which means match if the address in the From: header is in my "friends"
group, a set of addresses pulled in from a text db. Again, parsed just
at load time. So very fast. When I was using procmail I actually had
code that generated an enormous regexp with tens of addresses in it.
Ghastly!

  :0
  * ^(to|cc):.*\<(cameron\.simpson@gmail\.com|cameron\.simpson@me\.com|cs@zip\.com\.au|...
  * ^from:.*(huge regexp for "family" etc kilobytes long...

My now obsolete .procmailrc for the spool-in folder is 1036401 bytes
long. Nasty!
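
By contrast, the group test and the parse-once-on-demand behaviour
described above can be sketched in a few lines. This is only an
illustration under assumptions: the Message wrapper, in_group and the
example addresses are invented here and are not taken from mailfiler.

```python
from email.utils import getaddresses


class Message:
    """Hypothetical wrapper that parses each header's addresses at most once."""

    def __init__(self, headers):
        self.headers = headers  # e.g. {"from": ["Name <a@example.com>"]}
        self._cache = {}        # header name -> frozenset of bare addresses
        self.parses = 0         # counts real parses, for illustration only

    def addresses(self, name):
        """Return the bare addresses in a header, parsing on first use."""
        name = name.lower()
        if name not in self._cache:
            self.parses += 1
            pairs = getaddresses(self.headers.get(name, []))
            self._cache[name] = frozenset(a.lower() for _n, a in pairs if a)
        return self._cache[name]


def in_group(message, header, group):
    """True if any address in the header belongs to the group (set lookup)."""
    return not group.isdisjoint(message.addresses(header))
```

The group is just a set built once at load time, so each rule costs a
set lookup against the cached addresses instead of a scan of a
kilobytes-long regexp.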

Cheers,
-- 
Cameron Simpson <c...@zip.com.au>

Very few things happen at the right time, and the rest do not happen at all.
The conscientious historian will correct these defects.
- Mark Twain, _A Horse's Tale_
