Ok, it's done.  That was the last thing on the list to get done before a
2.1 release, so now I think I'll go ahead and release in a day or two
(after people have a chance to notice that the new stuff is broken).

Here is a description of how the changed AWL stuff works:

One major thing: the AWL is no longer part of the header tests, as it
used to be.  It's now its own separate checking stage, which happens
after all other tests have run.  Rather than adding a constant (negative)
amount to the score, the AWL now shifts the score based on the long-term
average score observed for a particular sender, so it needs the total
message score to be calculated before it can operate.

If you picture the range of possible scores for a message as a line:


- ---------------------|----------------------- +
   non-spam messages   5    spam messages

The idea is that a particular sender can also be scored, over their
lifetime.  There are generally-spammy senders, and generally-nonspammy
senders.  If we track the total score of all messages for each sender,
and the number of messages observed, we can calculate the mean score for
a particular sender, and place the sender on the line somewhere.  Then,
when we receive a new message from that sender, we calculate a score as
normal for the message, following all the rules.  We come up with a
score somewhere on the line.  Now, instead of using that score as the
final score, we "move" toward the sender's average score along the
line.  The fraction of the distance we move we'll call the shrinkage
factor (settable in the cf files as auto_whitelist_factor).  By default,
shrinkage will be 0.5, so if we have:

 ----------|-----------|-----|-----------
         mean          5   pre-score

Then we'll move the score "half way" toward the mean, and we'll end up
as:

 ------|-----------|----|-----------------
      mean   post-score 5

And so the message will be identified as "non-spam" even though the
rules consider it spam.  We'll then update the sender's mean with the
score for the new message (currently using the post-score, which might
be wrong -- I'll have to think about that) for next time.
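
As a rough sketch of the arithmetic described above (this is not the
actual SpamAssassin code; SenderHistory and adjust_score are names made
up purely for illustration, and the handling of a sender with no history
is an assumption), in Python:

```python
# Illustrative sketch only. SenderHistory and adjust_score are hypothetical
# names, not part of SpamAssassin.
from dataclasses import dataclass


@dataclass
class SenderHistory:
    total_score: float = 0.0  # sum of the scores recorded for this sender
    count: int = 0            # number of messages observed from this sender

    def mean(self):
        """Long-term mean score, or None if nothing has been seen yet."""
        return self.total_score / self.count if self.count else None


def adjust_score(history, pre_score, factor=0.5):
    """Move pre_score toward the sender's mean by `factor`, then record it."""
    mean = history.mean()
    if mean is None:
        # Assumption: with no history there is nothing to move toward.
        post_score = pre_score
    else:
        post_score = pre_score + (mean - pre_score) * factor
    # The post-score is what gets recorded, as described above
    # (which may turn out not to be the right choice).
    history.total_score += post_score
    history.count += 1
    return post_score


history = SenderHistory(total_score=10.0, count=10)  # mean of 1.0
print(adjust_score(history, 8.0))  # 4.5
```

With a mean of 1 and the default factor of 0.5, a message the rules
score at 8 ends up at 4.5, i.e. below the threshold of 5.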


This system has a number of advantages over the simple counting method
of the old AWL implementation:

1. Before, spammers could just send you 3 "clean" messages and thereby
earn themselves a permanent -100 bonus.  Now they would have to keep
sending dummy messages from their spamming addresses to keep their mean
low.  And if their mean were, say, 1 long-term, any message they sent
scoring >=9 would count as spam anyway.
2. Spammers could use a "well-known good" address which they reasonably
guess to be whitelisted (think from: [EMAIL PROTECTED]) and get
the -100 bonus.  Now, they can of course still use some well-known good
address, but the bonus obtained will be far lower.
3. The AWL now operates automatically as an auto-blacklist too!  If you
generally receive spammy mails from a particular address, then the
scores will be pulled toward the spammy mean, *raising* their score if
the spammer happens to send you a less-spammy message.
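
To put numbers on points 1 and 3, here is a small worked example in
Python.  The helper name shifted is made up for illustration, and it
assumes a threshold of 5 with scores at or above the threshold counting
as spam; the mean of 20 for the spammy sender is just an example value.

```python
# Illustrative arithmetic only; `shifted` is a hypothetical helper name.
def shifted(pre_score, mean, factor=0.5):
    """Move pre_score toward the sender's mean by the shrinkage factor."""
    return pre_score + (mean - pre_score) * factor


THRESHOLD = 5  # assumed: a score at or above this counts as spam

# Point 1: a sender who has kept their long-term mean down at 1.
for pre in (8, 9, 12):
    post = shifted(pre, mean=1)
    print(pre, post, post >= THRESHOLD)
# 8 4.5 False
# 9 5.0 True
# 12 6.5 True

# Point 3: a generally spammy sender (example mean of 20) sends a
# less-spammy message that the rules only score at 2.
print(shifted(2, mean=20))  # 11.0
```

So a sender with a mean of 1 still gets caught once the rules score a
message at 9 or more, and a message scoring 2 from a sender with a mean
of 20 is pulled up to 11.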


This is all now checked into CVS, including changes to the rules and
scores files to make the changes effective.  The contents of CVS are now
basically a release candidate for 2.1 -- I'm not going to add any more
features to the tree until after the 2.1 release, and I'm only going to make
show-stopper bugfixes.  Please get the latest stuff from CVS (or wait
till after ~1am PST and get the 2.1 tarball from the website) and try it
out over the next few days.  I've re-instated the "-a" flag in the spamd
startup scripts, but make sure you're using it, and let me know how it's
working.

C

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
