> -----Original Message-----
> From: Tony Earnshaw [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2003 7:58 AM
> To: [EMAIL PROTECTED]
> Subject: Re: [SAtalk] Removing headers etc.. to feed Bayes correctly
>
> ...people on the list were saying about murdering the Bayes database
> before it had even reached maturity made me feel like Gerhard Schröder.
>
Tony,

To my mind, it's not murdering, or anything remotely approaching it. The suggestion to let sa-learn do the initial ham and spam seeding is simply not optimal. Autolearning above (or below) a threshold established by SpamAssassin is an ill-conceived method of establishing an initial Bayes token base.

Pre-selecting a corpus through SpamAssassin directly contradicts the premise a Bayesian token database rests on: the assumption that there are "interesting tokens" that normal heuristics are missing. A Bayes database doesn't reach maturity by having a certain number of SA-filtered spams scoring >15 and SA-filtered hams scoring <-2; it reaches maturity by having a certain number of confirmed hams and spams, period.

Therefore, if one organization obtains its initial Bayes seeding strictly through auto-learning for three weeks and gets 2000 hams and 2000 spams in it, and another does theirs in 15 minutes by manually teaching it 2000 hams from this week and 2000 spams from this week (that SpamAssassin has never touched), the LATTER would be the much, much more accurate Bayesian seeding procedure.

This is discussed in depth in Paul Graham's writing on the topic, specifically the part where he mentions that tokens like "per" and "FL" and "ff0000" are actually very reliable indicators of spammishness.

-tom

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
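
To make the "interesting tokens" point concrete, here is a minimal sketch of the per-token spam probability as I understand it from Paul Graham's "A Plan for Spam". It is not SpamAssassin's actual Bayes code, and the corpus, the whitespace tokenizer, and the function names are made up for illustration. The point is only that a token's score comes straight from the counts in whatever hand-sorted ham and spam you feed in, so a pre-filtered corpus limits what the table can learn.

```python
from collections import Counter

def token_counts(messages):
    """Count token occurrences across a list of message strings.

    Tokenizing on whitespace is a simplification for this sketch.
    """
    counts = Counter()
    for msg in messages:
        counts.update(msg.lower().split())
    return counts

def spam_probability(token, ham_counts, spam_counts, n_ham, n_spam):
    """Graham-style per-token probability (assumed from 'A Plan for Spam').

    Ham occurrences are doubled to bias against false positives, tokens
    seen fewer than 5 times (after doubling) are ignored, and the result
    is clamped to [0.01, 0.99]. n_ham and n_spam must be non-zero.
    """
    g = 2 * ham_counts.get(token, 0)
    b = spam_counts.get(token, 0)
    if g + b < 5:
        return None  # not enough evidence for this token
    good = min(1.0, g / n_ham)
    bad = min(1.0, b / n_spam)
    return max(0.01, min(0.99, bad / (good + bad)))

# Hypothetical hand-sorted corpus: 5 confirmed hams, 5 confirmed spams.
hams = ["the meeting notes"] * 5
spams = ["ff0000 the offer"] * 5

ham_counts = token_counts(hams)
spam_counts = token_counts(spams)

p_ff0000 = spam_probability("ff0000", ham_counts, spam_counts, len(hams), len(spams))
p_the = spam_probability("the", ham_counts, spam_counts, len(hams), len(spams))
p_meeting = spam_probability("meeting", ham_counts, spam_counts, len(hams), len(spams))
p_unseen = spam_probability("viagra", ham_counts, spam_counts, len(hams), len(spams))
```

In this toy corpus "ff0000" lands at the spam end of the scale and "meeting" at the ham end, while "the" stays neutral. No heuristic rule named those tokens in advance; they fall out of the confirmed ham/spam split alone.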