To my mind, it's not murdering, or anything remotely approaching it.
Bless you for bothering, Tom. Your line wrap is so utterly impossible with Mozilla 1.4rc1 (would have been better in Evo 1.2.4, but I fscked that up on my machine by compiling and installing gtk+2.2), that I don't know whether I can cope with it. We'll see :-)
> The suggestion to let sa-learn do the initial ham and spam seeding is simply not optimal.
That's what I keep on saying.
> Autolearning above (or below) a threshold established by SpamAssassin is an ill-conceived method of establishing an initial Bayes token base.
That's what I keep on saying.
> Pre-selecting a corpus through spamassassin directly contradicts the entire basis upon which Bayesian theory relies for a token database:
Agreed.
> the assumption that there are "interesting tokens" that normal heuristics are missing.
Agreed.
> A Bayes database doesn't reach maturity by having a certain number of SA-filtered spams >15 and SA-filtered hams <-2; it reaches maturity by having a certain number of confirmed hams and spams, period.
Disagree. It never reaches maturity. But just like a kid or a kitten, it has to reach maturity. No good teaching it as an adult until then.
> Therefore, if one organization obtains initial Bayes seeding strictly through auto-learning for three weeks and get 2000 hams and 2000 spams in it, and another does theirs in 15 minutes by manually teaching it 2000 hams from this week, and 2000 spams from this week (that SpamAssassin has never touched), the LATTER would be the much, much more accurate Bayesian seeding procedure.
That last line went on for 1,5 kilometers <gasp>. It's the *pattern* of tokens that matters. And until that pattern is established, it's useless to rely on it. It's useless to expect that a kid of 5 should know the difference between play and reality when he's pointing a Colt 45 at someone, unless he either shoots him, or you smack his hand and take the gun away. I choose smacking his hand and taking the gun away.
> This is discussed in-depth in Paul Graham's writing on the topic, specifically the part where he mentions that tokens like "per" and "FL" and "ff0000" are actually very reliable indicators of spammishness.
Probably. But a: I'm pig-headed and b: you can't reach a statistical conclusion with a population of 1. The greater the bias, the greater the accuracy. The kid of 5 will probably thank you in later life for smacking his hand and taking the gun away. Maybe you're a mathematician, I wouldn't know. I hate math. But I've done enough chi squared and other analyses to know that.
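For anyone following along, the per-token scoring that Graham describes can be sketched roughly as below. This is a minimal sketch based on my reading of his "A Plan for Spam" essay, not SpamAssassin's actual Bayes code; the function name, the toy counts and the corpus sizes are all made up for illustration. It does show why a token only becomes useful once enough confirmed mail has been seen.

```python
from collections import Counter

def token_spam_probability(token, ham_counts, spam_counts, n_ham, n_spam):
    """Rough per-token spam probability in the style of Graham's essay.

    ham_counts / spam_counts map token -> occurrences in confirmed mail;
    n_ham / n_spam are the number of confirmed messages of each kind.
    Returns None when the token has been seen too rarely to judge.
    """
    g = 2 * ham_counts.get(token, 0)  # ham counts doubled to bias against false positives
    b = spam_counts.get(token, 0)
    if g + b < 5:
        return None  # not enough evidence yet: the "pattern" is not established
    ham_ratio = min(1.0, g / n_ham)
    spam_ratio = min(1.0, b / n_spam)
    p = spam_ratio / (ham_ratio + spam_ratio)
    return max(0.01, min(0.99, p))  # clamp so no single token is ever absolute


# Toy, hand-confirmed corpus (hypothetical numbers).
ham = Counter({"meeting": 6, "per": 2})
spam = Counter({"ff0000": 8, "per": 4, "FL": 1})

for tok in ("ff0000", "meeting", "per", "FL"):
    print(tok, token_spam_probability(tok, ham, spam, n_ham=10, n_spam=10))
```

With only one sighting, "FL" comes back as None: the token is there, but the database has no business having an opinion about it yet.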
Best - and I really appreciate your involvement,
Tony
-- Tony Earnshaw
Working to get a life
http://j-walk.com/blog/docs/conference.htm http://www.billy.demon.nl Mail: [EMAIL PROTECTED]
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk