If I have followed the discussion correctly so far, the explanation for 
manual-learn not being distinguished from auto-learn is this:  no matter what 
mode of learning caused a token to appear in the database, if ongoing mail 
traffic keeps "hitting" on that token, it will not be expired out anyway.

In other words, tokens don't expire because of where or how they came to be 
listed; they expire because no more incoming mail traffic references them.  If 
you manually train a message that is the ONLY instance of that particular spam 
to slip through your other filter, and your Bayes never sees another message 
that matches the tokens it generated, then those tokens become irrelevant 
regardless of learn mode.
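
To put the same point concretely, here is a rough Python sketch of what an 
expiry pass conceptually does (illustrative only; the names and structure are 
made up, not SpamAssassin's actual code).  The only thing the pass looks at is 
when a token was last referenced by scanned mail, never how it was learned:

    from dataclasses import dataclass
    import time

    @dataclass
    class Token:
        spam_count: int
        ham_count: int
        last_seen: float   # refreshed each time a scanned message hits this token
        learned_via: str   # "auto" or "manual" -- recorded, but never consulted below

    def expire_tokens(db: dict, max_idle: float) -> None:
        # Drop tokens that no recent mail traffic has referenced.
        # learned_via plays no part: a manually trained token survives only
        # if ongoing traffic keeps refreshing last_seen, exactly like an
        # auto-learned one.
        now = time.time()
        for key in [k for k, tok in db.items() if now - tok.last_seen > max_idle]:
            del db[key]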

>>> Wes <[EMAIL PROTECTED]> 11/30/07 11:56 AM >>>
> 
> The whole reason Bayes works is the fact that there's a *LOT* of tokens
> that are repeated over and over and over again for any given kind of
> mail. So the set of tokens acted on by one message is 95% the same as
> the ones in another, provided the general type of email is the same (and
> by general type, I'm thinking all email fits into maybe 20 types; I'm
> talking really broad categories like "conversation", "newsletter", "spam",
> "nonspam ad", etc.)

Guess I need to read up on Bayes some more.

I was thinking more along the lines of separate databases for auto and
manual learning that are combined to produce a result, giving more weight to
manual learning.  Maybe that just isn't reasonable, though.  I can't see (at
least here) that manual learning would get any kind of significant volume.
Someone's only going to send in a message for manual learning if it is a
leaked spam or a false positive, and then only if they bother to do it.  I'd
be surprised if the manual learning volume were even 1 in 10,000 of the
messages going through auto-learning.

Wes
