On Tue, 12 Aug 2003, Louis LeBlanc wrote:

> > Ideally you'd train [SA's Bayesian] on every message you receive.
> 
> That particular message will almost certainly never pass through his
> system again, so why use the content to train bayes?

If that argument were valid, you'd never train Bayes on ham, because almost 
by definition _all_ ham consists of messages that will never pass through 
the system again.

Remember that sa-learn doesn't look _just_ at body content for tokens.
For example, it also collects header data into tokens in a variety of
clever ways.  Teaching the classifier that a particular Received: line is
a ham-sign may be worth much more than avoiding teaching it that
"erectile" and "dysfunction" are not always spam-signs.
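
To make the header point concrete, here is a minimal sketch in Python of a
tokenizer that emits header tokens alongside body tokens.  It is NOT
SpamAssassin's actual tokenizer: the tokenize() function, the "H<name>:"
prefix scheme, and the example.org/example.net hosts are all assumptions
made up for illustration.

```python
import email
import re

def tokenize(raw_message):
    """Return a set of tokens from headers and body (hypothetical scheme)."""
    msg = email.message_from_string(raw_message)
    tokens = set()

    # Header tokens: prefix each word with the header name, so that
    # "mail.example.org" in Received: is distinct from the same text in a body.
    for name, value in msg.items():
        for word in re.findall(r"[A-Za-z0-9.-]+", value):
            tokens.add(f"H{name.lower()}:{word.lower()}")

    # Body tokens: plain lowercased words (single-part messages only).
    body = msg.get_payload()
    if isinstance(body, str):
        for word in re.findall(r"[A-Za-z0-9'-]+", body):
            tokens.add(word.lower())

    return tokens

raw = (
    "Received: from mail.example.org by mx.example.net\n"
    "Subject: Rates\n"
    "\n"
    "erectile dysfunction study\n"
)
tokens = tokenize(raw)
```

With this input, tokens contains both "erectile" and
"Hreceived:mail.example.org", so training the message as ham records the
relay host as a ham-sign at the same time as the body words.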

> Yeah, and the next time a *real* spammer sends him a carefully worded
> ad for Vigorex, his bayes db will have learned it as ham.

If this were the first message carrying that token that he'd ever fed
sa-learn, it's true that he _might_ get a false negative the next time a
spam carrying it arrives.  But then he'd feed the FN to sa-learn and that
token would no longer be a ham-sign for the third message.  The system
gets better the more data it has.

It's also true that in that specific example the token wouldn't become a
spam-sign either; it'd be a neutral token.  Without examining all the mail
ever received, there's no way to know whether that's a correct judgement.
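
The ham-then-spam sequence can be sketched with a toy per-token counter.
This is only an illustration of the arithmetic, not the algorithm
SpamAssassin uses to score or combine tokens; the ToyBayes class, its
spamminess() ratio, and the lowercase token "vigorex" are assumptions
introduced for the example.

```python
from collections import Counter

class ToyBayes:
    """Toy per-token counter; not SpamAssassin's real classifier."""

    def __init__(self):
        self.ham = Counter()    # token -> number of ham messages containing it
        self.spam = Counter()   # token -> number of spam messages containing it
        self.n_ham = 0
        self.n_spam = 0

    def learn(self, tokens, is_spam):
        # Count each token at most once per message.
        if is_spam:
            self.spam.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham.update(set(tokens))
            self.n_ham += 1

    def spamminess(self, token):
        # 0.0 = seen only in ham, 1.0 = seen only in spam, 0.5 = neutral.
        s = self.spam[token] / self.n_spam if self.n_spam else 0.0
        h = self.ham[token] / self.n_ham if self.n_ham else 0.0
        if s + h == 0:
            return 0.5          # never seen: no evidence either way
        return s / (s + h)

db = ToyBayes()

# First message: the ham that happens to mention the product.
db.learn(["vigorex", "interest", "rates"], is_spam=False)
after_ham = db.spamminess("vigorex")    # 0.0, a ham-sign

# Second message: the false negative, fed back as spam.
db.learn(["vigorex", "buy", "now"], is_spam=True)
after_spam = db.spamminess("vigorex")   # 0.5, neutral
```

After one ham and one spam the token sits at 0.5, while "buy" and "rates"
still point firmly one way or the other; which of those readings is right
only emerges as more mail is learned.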

> Unless I'm mistaken, the tokens will be used to reduce or increase
> their tendency to indicate spam.  Bayes will not learn from this
> message that it's ok to get erectile dysfunction in a message so long
> as it comes from this sender AND is accompanied by text referring to
> lower interest rates.

This is correct so far as it goes, but then again that's not really what
you'd want it to learn.

> And you HAVE to make value judgements.  Keep in mind that the bayes
> classifier is a PROGRAM, and it has no real ability to make fool proof
> judgements.  It makes a best guess based on the info it is fed, and no
> matter how good the program gets, until we get true AI checking our
> email for spam, garbage in == garbage out.  Never give your program
> data that will decrease its accuracy, just make allowances for
> exceptions, like the SA developers did when they added a whitelist
> feature in the first place.

The point is that -- aside from the rule "do not teach spam as ham, nor
teach ham as spam" -- YOU DON'T REALLY KNOW what data will increase or
decrease the classifier's accuracy.  As a human, you're good at making the
gestalt (and subjective) judgement "this is spam" (or ham).  You're not
good at instantly recognizing every fragment of the message that the
classifier considers to be a token and then determining whether each such
token occurs more frequently (or uniquely) in spam or ham.



_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
