At 12:03 PM 6/6/2003 -0500, Chris Barnes wrote:
I am more concerned about the false positives.  Looking at the headers
of those messages that are legit, it seems that the biggest score comes
from the BAYES test.

Is there a place that talks about how that score is derived?

Bayes is a learning system which is "trained" against a corpus of spam and nonspam messages. Its behavior is highly dependent on how it is trained, and Bayes is usually best used on the email of a single user, or a group of similar users such as a company, not a group of diverse users such as an ISP. I'd certainly advise that your ISP not use Bayes site-wide without some serious thinking, since its accuracy will likely be too low to justify the disk-space overhead.


As a rough overview of how Bayes works:

When learning, the Bayes engine breaks each message down into substrings called tokens, and the spam vs nonspam statistics for each token are calculated based on how many of the spam and nonspam messages contain that token. The tokens and their spam vs nonspam statistics are stored in a database.

When new mail is analyzed, it is broken into tokens, and those tokens are checked against the database. The statistics for the tokens are totaled up and the message is assigned a "percent chance of spam" based on the totals. SA then assigns points based on the percentage.
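
To make the two steps above concrete, here is a minimal Python sketch of the general idea. It is not SpamAssassin's actual code: the tokenizer, the clamping to 0.01..0.99, and the way the token probabilities are combined follow the simple scheme in the Paul Graham essay rather than anything SA-specific, and the names TinyBayes and points_for, along with the point values, are made up purely for illustration.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy token-based spam classifier (illustrative only, not SA's engine)."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of nonspam messages containing it
        self.n_spam = 0
        self.n_ham = 0

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase runs of letters, digits, $, ' and -.
        # A set is used so each token counts once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def learn(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        s, h = self.spam_counts[token], self.ham_counts[token]
        if s + h == 0:
            return None  # never seen during training, so it says nothing
        spam_rate = s / max(self.n_spam, 1)
        ham_rate = h / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp so a single token can never be absolutely decisive.
        return min(0.99, max(0.01, p))

    def score(self, text):
        probs = [p for t in self.tokenize(text)
                 if (p := self.token_prob(t)) is not None]
        if not probs:
            return 0.5  # no evidence either way
        spam = math.prod(probs)
        ham = math.prod(1.0 - p for p in probs)
        return spam / (spam + ham)


def points_for(prob):
    # Hypothetical mapping from probability to points; real values differ.
    if prob >= 0.99:
        return 3.0
    if prob >= 0.80:
        return 1.5
    if prob <= 0.01:
        return -2.0
    return 0.0
```

After calling learn() on a few known spam and nonspam messages, score() returns a number between 0 and 1 for a new message and points_for() turns that into points. This also shows why training matters so much: a token that only ever appeared in one user's legitimate mail looks strongly nonspam for that user, and may mean nothing for anyone else.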


More information on Bayes and Bayes-like learning mail analysis can be found at:
http://www.paulgraham.com/spam.html
and
http://www.paulgraham.com/better.html


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk