On Fri, 2013-04-26 at 19:38 -0400, Joe Acquisto-j4 wrote:
> To feed "ham" to bayes, should one only user mis-flagged mail, or may
> one use unflagged (below 5) mail?

The Bayesian classifier is a subsystem mostly independent from SA.

Most SA rules are rather black or white. Match, or don't. And they are scored according to the probability of actually distinguishing ham from spam. The higher the absolute score of a given rule, the higher the probability of the message being ham (negative score) or spam (positive score). Mere hints, as opposed to reliable indicators, have low scores.

For a scoring system like SA, this is generally true. With different, varying scales.

It is correct for single rules. "Dunno" would be a rule's score of zero. The higher the score, the more spammy it is.

It is correct for the overall, resulting score of a message. The "dunno" tipping point is 5 by default. A message scoring 4.5 is more likely ham, though you'd better not bet on it.

And it also is correct for the Bayes subsystem, with a notable scale of its own -- ranging from 0 (ham) to 1 (spam), with 0.5 being a big fat shrug. The BAYES_nn rules and their scores are set accordingly. BAYES_50 really should have no score.

Back to the question, and explaining why I mentioned the above.

"Mis-flagged" mail, false positives and false negatives, exists on multiple levels. The OP mentioned it with respect to the *overall* score. And asked about *Bayes* training.

Training Bayes, first and foremost, helps Bayes only. In the end, it might make a significant difference overall, sure. However, when it comes to the question of whether training Bayes might help...

Look at the Bayesian probability. Not the overall SA score.

Do train those messages which have a Bayesian probability close(r) to 0.5. Or, even worse, have a Bayesian probability contrary to the overall score, or to the actual classification.

Training the plethora of spam hitting BAYES_99 might not be a mistake. But it is pretty likely to *not* improve general SA performance.

You're training Bayes. Not SpamAssassin.

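To make that selection rule a bit more concrete, here is a rough Python sketch. None of it is part of SpamAssassin: the Message class, worth_training() and the margin of 0.3 are made up for illustration. It assumes you have already pulled the Bayes probability out of each message somehow, and that you know what the message actually is.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical record, not a SpamAssassin data structure."""
    msg_id: str
    bayes_prob: float  # Bayes probability: 0 (ham) .. 1 (spam)
    is_spam: bool      # actual, human-verified classification


def worth_training(msg: Message, margin: float = 0.3) -> bool:
    """Return True if training Bayes on this message is likely to help.

    The margin is an arbitrary assumption; pick whatever "close to 0.5"
    means for your own mail.
    """
    # Bayes shrugs: probability is in the neighbourhood of 0.5.
    unsure = abs(msg.bayes_prob - 0.5) < margin
    # Bayes is wrong: it leans the opposite way of the real classification.
    contrary = (msg.bayes_prob >= 0.5) != msg.is_spam
    return unsure or contrary


if __name__ == "__main__":
    samples = [
        Message("a", 0.99, True),   # spam, Bayes already sure
        Message("b", 0.55, True),   # spam, Bayes shrugs
        Message("c", 0.95, False),  # ham, Bayes thinks spam
        Message("d", 0.02, False),  # ham, Bayes already sure
    ]
    for m in samples:
        print(m.msg_id, worth_training(m))
```

Note that the overall SA score does not appear anywhere in there. Only the Bayes probability and the real classification decide what gets trained.
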
-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}