Sidney Markowitz wrote:

SM> The fact that the -S option is reasonable points out that the scoring is
SM> not a linear measure of spamminess. The function P(s) of the probability
SM> that a message with score s is spam stays near 0 until some small
SM> positive s, then asymptotically approaches 1 somewhere around where you
SM> want to set the spam threshold. This means that a message with score 20
SM> and one with score 70 are both certainly spam and should not contribute
SM> different weights to the AWL calculation. What we really want is some
SM> measure of the probability that a message from somewhere is spam based
SM> on our past experience with messages from the same place. That indicates
SM> that rather than a linear average of the score we should be averaging
SM> something that approximates the probability of being spam, i.e., convert
SM> the score into a "spamminess" level that is 0 below some threshold, 1
SM> above some threshold, and a few values in between for spam scores that
SM> are not considered by themselves to be certain spam or non-spam. Of
SM> course the "1" can be something larger so the whole thing can be scaled
SM> to integers if that seems more aesthetic.
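To make the quoted suggestion concrete, here is a minimal sketch in Python of averaging a clamped "spamminess" level instead of the raw score. It is not how SpamAssassin's AWL is implemented. The function names and the `low`/`high` thresholds are assumptions chosen for illustration, and a linear ramp between the thresholds stands in for the "few values in between" that Sidney describes.

```python
def spamminess(score, low=2.0, high=6.0):
    """Map a raw score to a level in [0, 1].

    `low` and `high` are hypothetical thresholds: at or below `low` the
    message counts as certain non-spam (0.0), at or above `high` as
    certain spam (1.0), with a linear ramp in between.
    """
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)


def mean_spamminess(scores, low=2.0, high=6.0):
    """Average the clamped levels of a sender's past message scores."""
    if not scores:
        raise ValueError("need at least one past score")
    return sum(spamminess(s, low, high) for s in scores) / len(scores)


# Scores of 20 and 70 are both far above the upper threshold, so they
# contribute the same weight to the sender's history.
history = [20.0, 70.0]
print(mean_spamminess(history))
```

With a plain average, the message scoring 70 would pull a sender's history much further than the one scoring 20. With the clamped level, both count simply as "certain spam".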
I have thought a bit about trying to determine the shape and slope of the S-curve which maps the message score onto the probability of spam. It's a classical statistical problem, with well-known math and well-known approaches. But I don't see much point. Things seem to me to work pretty well the way they are. I imagine the S is actually reasonably linear in the 2-6 score range, and only gets flatter outside that range, so it looks like:

[ASCII sketch of the expected S-curve; garbled in the archive]

and not

[ASCII sketch of the contrasting curve; garbled in the archive]

Of course, the flatter the curve, the less you need to care about it.

SM> This gives me another idea: If you consider the AWL as being a way of
SM> assigning an a priori probability of spamminess to a message based on
SM> local experience with messages with the same From: header, we can
SM> generalize that to keep track of experience with messages that are
SM> similar based on other criteria. Is there a reason not to track any
SM> other headers, such as the return-path or the first or second received
SM> header? Would it make sense to have a configurable AWL that tracks
SM> criteria that are more useful at a local site? A local spam phrase or
SM> non-spam phrase list?

Yes, this certainly makes sense. If we're going to go there, it would also make sense to make the statistical processes more rigorous, take better account of Stein's paradox, and probably even do some sensible work on estimating the a priori spamminess of people who are *not* on the whitelist, using zero-frequency estimation techniques. Certainly, over time, as your AWL gets fuller, simply not being on the list is probably a pretty good indicator of spamminess.

The reason for not having spent much time or effort in this area to date is twofold:

1. SA does a great job already, even without this stuff.
2. It would take a reasonable amount of time and expertise to do some of this stuff.
Part 2 may become less of an issue in the near future. I am investigating some possibilities around working full time on SpamAssassin and being able to finance a substantial investment in taking it leaps and bounds beyond where it is today. You probably have all noticed that the last two weeks or so I've been spending a *lot* more time on this list. This is not purely coincidental.

C