On Tue, 12 Aug 2003, Louis LeBlanc wrote:

> > Ideally you'd train [SA's Bayesian] on every message you receive.
>
> That particular message will almost certainly never pass through his
> system again, so why use the content to train bayes?

If that argument were valid, you'd never train Bayes on ham, because
almost by definition _all_ ham consists of messages that will never
pass through the system again.

Remember that sa-learn doesn't look _just_ at body content for tokens.
For example, it also collects header data into tokens in a variety of
clever ways. Teaching the classifier that a particular Received: line
is a ham-sign may have much greater value than does [avoiding] teaching
it that "erectile" and "dysfunction" are not always spam-signs.

> Yeah, and the next time a *real* spammer sends him a carefully worded
> ad for Vigorex, his bayes db will have learned it as ham.

If this were the first message carrying that token that he'd ever fed
to sa-learn, it's true that he _might_ get a false negative the next
time he gets one. But then he'd feed the FN to sa-learn and that token
would no longer be a ham-sign for the third message. The system gets
better the more data it has.

It's also true that in that specific example the token wouldn't be a
spam-sign either; it'd be a neutral token. Without examining all the
mail ever received, there's no way to know whether that's a correct
judgement.

> Unless I'm mistaken, the tokens will be used to reduce or increase
> their tendency to indicate spam. Bayes will not learn from this
> message that it's ok to get erectile dysfunction in a message so long
> as it comes from this sender AND is accompanied by text referring to
> lower interest rates.

This is correct so far as it goes, but then again that's not really
what you'd want it to learn.

> And you HAVE to make value judgements. Keep in mind that the bayes
> classifier is a PROGRAM, and it has no real ability to make fool proof
> judgements. It makes a best guess based on the info it is fed, and no
> matter how good the program gets, until we get true AI checking our
> email for spam, garbage in == garbage out.
> Never give your program
> data that will decrease its accuracy, just make allowances for
> exceptions, like the SA developers did when they added a whitelist
> feature in the first place.

The point is that -- aside from the rule "do not teach spam as ham, nor
teach ham as spam" -- YOU DON'T REALLY KNOW what data will increase or
decrease the classifier's accuracy. As a human, you're good at making
the gestalt (and subjective) judgement "this is spam" (or ham). You're
not good at instantly recognizing every fragment of the message that
the classifier considers to be a token and then determining whether
each such token occurs more frequently (or uniquely) in spam or ham.

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
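
The ham-then-spam case discussed above (a token learned once as ham,
then once as spam, ending up neutral) can be worked through with a toy
model. This is only a rough sketch: the TokenCounts class, the token
strings, and the simple relative-frequency estimate are all assumptions
made for illustration, and this is not the computation SpamAssassin's
Bayes code actually performs.

```python
from collections import Counter


class TokenCounts:
    """Toy per-token counter (hypothetical; not SpamAssassin's real formula)."""

    def __init__(self):
        self.spam = Counter()  # number of spam messages containing each token
        self.ham = Counter()   # number of ham messages containing each token
        self.n_spam = 0        # spam messages learned so far
        self.n_ham = 0         # ham messages learned so far

    def learn(self, tokens, is_spam):
        """Record one message; each distinct token counts once per message."""
        if is_spam:
            self.spam.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham.update(set(tokens))
            self.n_ham += 1

    def spam_probability(self, token):
        """Estimate P(spam | token) from relative frequencies in each corpus.

        Returns 0.5 (neutral) for a token that has never been seen.
        """
        s = self.spam[token] / self.n_spam if self.n_spam else 0.0
        h = self.ham[token] / self.n_ham if self.n_ham else 0.0
        if s + h == 0:
            return 0.5
        return s / (s + h)


db = TokenCounts()

# First message: legitimate mail that happens to contain the word.
# "HReceived:mail.example.org" stands in for a header-derived token.
db.learn(["erectile", "HReceived:mail.example.org"], is_spam=False)
after_ham = db.spam_probability("erectile")

# Second message: the false negative, fed back to the learner as spam.
db.learn(["erectile", "vigorex"], is_spam=True)
after_spam = db.spam_probability("erectile")

print("after ham only:   ", after_ham)
print("after ham + spam: ", after_spam)
print("header token:     ", db.spam_probability("HReceived:mail.example.org"))
```

In this toy model the body token moves from a ham-sign to neutral after
the second message, while the header-derived token stays a ham-sign,
which is the situation the reply describes.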