On Thu, 15 Feb 2018 14:32:36 -0600 (CST) sha...@shanew.net wrote:
> I haven't checked the math in the Bayes plugin, but it explicitly
> mentions using the "chi-square probability combiner" which is
> described at http://www.linuxjournal.com/print.php?sid=6467
>
> Maybe I'm misunderstanding what that article describes, but I'm pretty
> sure what it boils down to is that when the occurrence of a token is
> too small (he uses the phrase "rare words") it can lead to
> probabilities at the extremes (like a token that occurs only once and
> is in spam, so its probability is 1). The way to address these
> extremely low or extremely high probabilities is to use the Fisher
> calculation (which is described in the second page of the article).

Tokens with low counts are detuned a bit, but not as much as you might
think. In a database with a 1:1 spam-to-ham ratio you get hapax token
probabilities of 0.016 and 0.987; IIRC Robinson anticipated something
much closer to neutral. Those values are similar to the defaults in
spambayes and bogofilter, and I think at least one of the three
projects would have derived them by optimization. My guess is that
enough low-count tokens are very strong but short-lived indicators
that it's worth putting up with the noise.
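
For anyone curious about the mechanics, here's a rough Python sketch of
the chi-square combining step as Robinson's article describes it. This
is only an illustration, not the actual Bayes.pm code; chi2q is the
usual series for the chi-square survival function with an even number
of degrees of freedom, and the overall shape follows what spambayes
does, as far as I remember.

    import math

    def chi2q(x2, df):
        # P(chisq >= x2) for an even number of degrees of freedom,
        # computed with the standard series expansion.
        m = x2 / 2.0
        term = total = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi_square_combine(probs):
        # probs are per-token spam probabilities in (0, 1).
        # Fisher's method: under the null hypothesis, -2 * sum(ln p)
        # is chi-square distributed with 2n degrees of freedom.
        n = len(probs)
        if n == 0:
            return 0.5
        h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
        # Fold the two one-sided indicators into one 0..1 score:
        # 0.5 means no opinion, values near 1 mean spammy.
        return (s - h + 1.0) / 2.0

The detuning of low-count tokens comes from Robinson's
f(w) = (s*x + n*p(w)) / (s + n), applied before combining. Working
from memory (so treat the constants as approximate), with s around
0.03 and x around 0.538, a token seen once in spam and never in ham
works out to (0.03*0.538 + 1) / 1.03 ~= 0.987, and the mirror case to
0.03*0.538 / 1.03 ~= 0.016, which is where those hapax numbers come
from.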