[SAtalk] question about the number of tokens analyzed in Bayes

Jean-Sébastien Guay-Leroux Fri, 10 Oct 2003 09:54:14 -0700

Why does Bayes in SpamAssassin use the 150 most significant tokens in an email when Paul Graham says you should use only the 15 most significant?

Quote from Paul Graham:

“Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.”

From Bayes.pm:

# How many of the most significant tokens should we use for the p(w)
# calculation?
use constant N_SIGNIFICANT_TOKENS => 150;
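
To make the trade-off concrete, here is a minimal Python sketch of the selection step under discussion. It is not taken from Bayes.pm: it assumes the naive combining formula from Graham's "A Plan for Spam", and the function names and per-token probabilities are hypothetical, made up for illustration only. "Most significant" is taken to mean farthest from the neutral probability 0.5.

```python
from math import prod


def most_significant(token_probs, n):
    """Return the n probabilities farthest from the neutral value 0.5."""
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combine(probs):
    """Combine per-token spam probabilities with Graham's naive formula."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def spam_score(token_probs, n):
    """Score a message using only its n most significant tokens."""
    return combine(most_significant(token_probs, n))


# Hypothetical message: 15 strongly spammy tokens padded with 200 mildly
# ham-leaning filler tokens (the "big chunk of random text" case).
tokens = [0.99] * 15 + [0.4] * 200

score_top_15 = spam_score(tokens, 15)        # filler is ignored
score_all = spam_score(tokens, len(tokens))  # filler outweighs the spam terms
```

With these made-up numbers, the top-15 score stays close to 1 while the all-tokens score drops close to 0, which is the spoofing effect Graham describes. A cutoff of 150 falls between those two extremes, which is what the question is about.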
