For the record, here is my approach to my spam filter design, in case anyone 
is interested.

Design goals:

1.  It should Just Work with an absolute minimum amount of user intervention 
and training required.  That said, there are cases where user intervention will 
be necessary.  In particular, emails that are sent in order to verify that the 
user controls their email address (password resets, 2FA auth codes, 
subscription confirmations) tend to be indistinguishable from phishing attacks 
and so some amount of user intervention will be needed.

2.  It should be entirely server-side.  This project was motivated in large 
part because my current spam filter is SpamSieve, which works great, but it 
runs on my laptop, so whenever my laptop is off-line so is my spam filter.  So 
if I check my mail from my phone and my laptop is off-line I get a ton of spam.

3.  I want to use as much off-the-shelf software as possible, but I also want 
to be able to have full control over the system.  That means that whatever 
off-the-shelf software I use has to either Just Work, or be written in a 
language that I am proficient in.  That rules out a lot of stuff because one of 
the languages I am *not* proficient in is Perl, and a lot of off-the-shelf spam 
filter stuff is written in Perl, and doesn’t do what I want out of the box.

Approach:

1.  Seed the process with a source of reliable ham and reliable spam that does 
not require user labeling.  The reliable ham comes from keeping track of 
outgoing messages, which are presumed to be ham.  The reliable spam comes from 
messages inbound to a honeypot address, which are presumed to be spam.  The 
honeypot spam training 
corpus is shared among all users.  The outgoing ham corpus is user-specific 
because who knows, someone may actually want information about Viagra.

2.  The filter consists of two parts, a milter on the MTA side and a set of 
minimal sieve scripts on the LDA side that connect to a more or less 
traditional Bayesian filter.  The milter tracks outgoing mail, and tags “easy” 
spam and ham on the incoming side.  “Easy” ham consists of messages from 
senders to whom the user has previously sent messages, or with subjects that 
are “Re:” a subject about which the user has previously sent a message.  Easy 
spam is things like Chinese text (at least that’s easy spam for me — that would 
obviously not work for a user in China, but this is mainly for my personal use) 
and certain super-spammy TLDs like .ru, .biz, etc.  The milter also does 
greylisting.  It’s written in Python using the pymilter library.  The LDA-side 
Bayesian filter is written in Common Lisp.  State is stored in a shared DB 
(currently SQLite3, but I’m probably going to switch to MariaDB because SQLite3 
doesn’t play well with threads).

3.  What is left over after the milter is a set of messages from addresses with 
which the user has never corresponded.  Those get put into an INCOMING folder, 
where they are processed by the Bayesian filter.  The reason for putting 
the Bayesian filter on the LDA side is that this filter also applies the 
heuristic that if a message is received from two different unknown senders with 
the same subject within a relatively short period of time (like 10-15 minutes) 
then both of those messages are almost certainly spam.  Empirically, an 
MTA-side Bayesian filter misses a lot of easy spam because it cannot easily 
apply this heuristic.  I mean, yes, it’s possible, but it means holding 
messages back, which causes a lot of problems, not least of which is that 
timely things like password resets and 2FA auths get held up along with 
potential spam.  (Greylisting has this problem too.  I’m 
actually still trying to decide what to do about that.)

4.  User training input is provided in the usual way, by having dovecot-sieve 
scripts that intercept messages being moved from INBOX to Junk and vice versa.  
(I have not yet decided what to do about messages that the user moves out of 
INCOMING.)
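
To make the heuristics in steps 2 and 3 concrete, here is a minimal Python 
sketch.  It is an illustration under stated assumptions, not the actual filter 
code: all function and class names are hypothetical, senders are assumed to be 
bare addresses (no display name), state lives in in-memory sets and dicts 
rather than the shared DB, and the TLD list and 15-minute window are just the 
examples mentioned above.

```python
import time
from collections import defaultdict

# Assumed example values taken from the description above.
SPAMMY_TLDS = {"ru", "biz"}
WINDOW_SECONDS = 15 * 60


def normalize_subject(subject):
    """Lower-case, collapse whitespace, and strip any leading 'Re:' prefixes."""
    s = " ".join(subject.split()).lower()
    while s.startswith("re:"):
        s = s[3:].lstrip()
    return s


def is_easy_ham(sender, subject, known_recipients, sent_subjects):
    """True if the user has written to this sender before, or if the subject
    is a 'Re:' of a subject the user has sent (sent_subjects is normalized)."""
    if sender.lower() in known_recipients:
        return True
    is_reply = subject.strip().lower().startswith("re:")
    return is_reply and normalize_subject(subject) in sent_subjects


def is_easy_spam(sender):
    """True if the sender's address ends in one of the listed TLDs."""
    tld = sender.rsplit(".", 1)[-1].lower()
    return tld in SPAMMY_TLDS


class DuplicateSubjectDetector:
    """Flags subjects used by more than one unknown sender within a window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.seen = defaultdict(dict)  # subject -> {sender: last-seen time}

    def observe(self, sender, subject, now=None):
        """Record a message.  Returns the sorted list of senders sharing this
        subject inside the window, or [] if this sender is the only one."""
        now = time.time() if now is None else now
        senders = self.seen[normalize_subject(subject)]
        # Forget senders whose last message is older than the window.
        for old in [s for s, t in senders.items() if now - t > self.window]:
            del senders[old]
        senders[sender.lower()] = now
        return sorted(senders) if len(senders) > 1 else []
```

Because observe() returns every sender that shares the subject, the earlier 
message can be flagged retroactively, which is only practical if that message 
is still sitting in INCOMING rather than already delivered to INBOX.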

That’s where I’m at.  Currently stuck on trying to figure out how to get the 
LDA-side filter to move messages.  My baseline plan was to use external calls 
to doveadm, but it seems like there has to be a better way.  Any and all advice 
and commentary much appreciated.
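
For reference, the baseline plan of calling doveadm externally might look 
roughly like the Python sketch below.  The helper names are hypothetical, and 
it assumes that doveadm is on the PATH, that the caller has permission to act 
for the user, and that a message can be selected by its Message-ID header.  The 
exact argument syntax should be checked against the doveadm-move and 
doveadm-search-query man pages.

```python
import subprocess


def build_move_command(user, source, destination, message_id):
    """Build a doveadm argv that moves one message, selected by Message-ID,
    from the source mailbox to the destination mailbox."""
    return [
        "doveadm", "move", "-u", user, destination,
        "mailbox", source,
        "HEADER", "Message-ID", message_id,
    ]


def move_message(user, source, destination, message_id):
    """Run doveadm; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(
        build_move_command(user, source, destination, message_id),
        check=True,
    )
```

Passing the arguments as a list rather than a shell string avoids quoting 
problems with Message-IDs that contain characters like < and >.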

rg


On Jan 30, 2021, at 12:07 PM, Ron Garret <r...@flownet.com> wrote:

> 
> On Jan 30, 2021, at 11:54 AM, Tom Hendrikx <t...@whyscream.net> wrote:
> 
>> IMHO you're still trying to re-invent the wheel :)
> 
> I don’t deny that.  The goal of this project is as much (maybe more) to be a 
> learning experience as it is to produce something useful.
> 
> FWIW, there are two reasons I don’t want to use a non-user-visible 
> quarantine.  First, there is always the possibility of a false positive, so 
> all email must be made accessible to the user somehow.  And second, there are 
> occasions when you are expecting an email that looks spammy and you need to 
> be able to get to it in a timely manner.  The most common use case here is 
> password reset links or 2FA authorization codes.  It is not possible for a 
> spam filter to distinguish a legitimate email of this type from a phishing 
> attack.  Only the user knows if they recently requested a password reset.  But 
> *most* password reset emails are phishing attacks (at least most of the ones 
> I get are) so I don’t want to see them by default.
> 
> rg
> 
