Sidney Markowitz wrote:

SM> The fact that the -S option is reasonable points out that the scoring is
SM> not a linear measure of spamminess. The function P(s) of the probability
SM> that a message with score s is spam stays near 0 until some small
SM> positive s, then asymptotically approaches 1 somewhere around where you
SM> want to set the spam threshold. This means that a message with score 20
SM> and one with score 70 are both certainly spam and should not contribute
SM> different weights to the AWL calculation. What we really want is some
SM> measure of the probability that a message from somewhere is spam based
SM> on our past experience with messages from the same place. That indicates
SM> that rather than a linear average of the score we should be averaging
SM> something that approximates the probability of being spam, i.e., convert
SM> the score into a "spamminess" level that is 0 below some threshold, 1
SM> above some threshold, and a few values in between for spam scores that
SM> are not considered by themselves to be certain spam or non-spam. Of
SM> course the "1" can be something larger so the whole thing can be scaled
SM> to integers if that seems more aesthetic.

I have given some thought to determining the shape and slope of the S-curve
that maps the message score onto the probability of spam.  It's a classical
statistical problem, with well-known math and well-known approaches to solving it.
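
One common way to estimate such a curve is to fit a logistic function to
hand-labelled messages. The sketch below is only an illustration: the scores
and labels are made-up sample data, the names `sigmoid` and `fit_logistic` are
hypothetical, and plain gradient descent stands in for whatever fitting method
one would actually choose. Nothing here is part of SpamAssassin.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(scores, labels, lr=0.05, steps=5000):
    """Fit P(spam | s) = sigmoid(a * s + b) by gradient descent on log loss.

    lr and steps are arbitrary choices that suit the small sample below.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # predicted minus actual
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Made-up training data: message scores and whether each was spam (1) or not (0).
scores = [-3, -1, 0, 1, 2, 3, 4, 5, 6, 8, 12, 20]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

a, b = fit_logistic(scores, labels)
for s in (0, 3, 5, 20):
    print(s, round(sigmoid(a * s + b), 3))
```

The fitted slope `a` says how steep the S is, and `-b / a` is the score at
which a message is as likely to be spam as not.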

But I don't see much point.  Things seem to me to work pretty well the way
they are.  I imagine the S is actually reasonably linear in the 2-6 score range,
and only gets flatter outside that range, so it looks like:

           |      ____                            | /------
           |    -                                 | |
           |  /                                   |/
           | /                                    ||
           |/                                     ||
 ----------+----------       and not    ----------+----------
          /|                                     ||
         / |                                     ||
        /  |                                     /|
       -   |                                    | |
 ------    |                              ------/ |


Of course, the flatter the curve, the less you need to care about it.
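
To make the difference between averaging raw scores and averaging
probabilities concrete, here is a minimal sketch. It assumes a logistic
S-curve with an invented midpoint and steepness (nobody in this thread has
measured them), and the function names are hypothetical.

```python
import math

def spam_probability(score, midpoint=4.0, steepness=1.0):
    """Map a message score onto an assumed logistic S-curve.

    midpoint and steepness are illustrative guesses, not measured values.
    """
    z = steepness * (score - midpoint)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mean(values):
    return sum(values) / len(values)

# Two senders, each with one obvious spam and one clean message on record.
history_a = [20.0, 1.0]
history_b = [70.0, 1.0]

# Averaging raw scores treats the score-70 sender as far worse.
raw_a = mean(history_a)
raw_b = mean(history_b)

# Averaging probabilities treats both spams as equally certain.
prob_a = mean([spam_probability(s) for s in history_a])
prob_b = mean([spam_probability(s) for s in history_b])

print(raw_a, raw_b)
print(round(prob_a, 3), round(prob_b, 3))
```

With a gentler curve (a smaller `steepness`) the two averages rank senders
more alike across the middle of the score range, which is the sense in which
a flatter curve matters less.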

SM> This gives me another idea: If you consider the AWL as being a way of
SM> assigning an a priori probability of spamminess to a message based on
SM> local experience with messages with the same From: header, we can
SM> generalize that to keep track of experience with messages that are
SM> similar based on other criteria. Is there a reason not to track any
SM> other headers, such as the return-path or the first or second received
SM> header? Would it make sense to have a configurable AWL that tracks
SM> criteria that are more useful at a local site? A local spam phrase or
SM> non-spam phrase list?

Yes, this certainly makes sense.  If we're going to go there, it would also
make sense to put the statistical processes on a more rigorous footing, take
better account of Stein's paradox, and probably even do some sensible work on
estimating the a priori spamminess of people who are *not* on the whitelist,
using zero-frequency estimation.  Certainly, over time, as your AWL gets fuller,
simply not being on the list is probably a pretty good indicator of spamminess.
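
As a rough illustration of those two ideas, the sketch below uses Laplace's
add-one rule as a simple stand-in for zero-frequency estimation, and a fixed
pull toward a global mean as a crude stand-in for Stein-style shrinkage. The
function names, the `prior_weight` value, and the sample counts are all
hypothetical; this is not how the AWL works today.

```python
def unseen_sender_prior(spam_first_contacts, total_first_contacts):
    """Estimate P(spam) for a sender with no AWL entry.

    Uses add-one (Laplace) smoothing so that the estimate is defined
    even when no first-contact messages have been seen yet.
    """
    return (spam_first_contacts + 1) / (total_first_contacts + 2)

def shrunk_spamminess(sender_values, global_mean, prior_weight=3.0):
    """Pull a sender's average spamminess toward the global mean.

    sender_values are per-message spamminess values in [0, 1].
    prior_weight (an arbitrary choice) acts like that many imaginary
    messages at the global mean, so short histories move less.
    """
    n = len(sender_values)
    return (sum(sender_values) + prior_weight * global_mean) / (n + prior_weight)

# Suppose 8 of the 10 messages from never-before-seen senders were spam.
prior = unseen_sender_prior(8, 10)

print(prior)                                   # sender not in the AWL
print(shrunk_spamminess([], prior))            # same thing, via shrinkage
print(shrunk_spamminess([0.0], prior))         # one clean message on record
print(shrunk_spamminess([0.0] * 20, prior))    # long clean history
```

A sender with one clean message stays close to the prior, while a sender with
a long clean history ends up near zero.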

The reason for not having spent much time or effort in this area to date is
twofold:

1. SA does a great job already, even without this stuff
2. It would take a reasonable amount of time and expertise to do some of this
stuff.

Point 2 may become less of an issue in the near future.  I am investigating some
possibilities around working full time on SpamAssassin and being able to finance
a substantial investment in taking it leaps and bounds beyond where it is today.
You probably have all noticed that the last two weeks or so I've been spending a
*lot* more time on this list.  This is not purely coincidental.

C


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
