On Thu, 15 Feb 2018 14:32:36 -0600 (CST) sha...@shanew.net wrote:
> I haven't checked the math in the Bayes plugin, but it explicitly
> mentions using the "chi-square probability combiner" which is
> described at http://www.linuxjournal.com/print.php?sid=6467
>
> Maybe I'm misunderstanding what that article describes, but I'm pretty
> sure what it boils down to is that when the occurrence of a token is
> too small (he uses the phrase "rare words") it can lead to
> probabilities at the extremes (like a token that occurs only once and
> is in spam, so its probability is 1). The way to address these
> extremely low or extremely high probabilities is to use the Fisher
> calculation (which is described in the second page of the article).

Tokens with low counts are detuned a bit, but not as much as you might
think. In a database with a 1:1 spam-to-ham ratio you get hapax token
probabilities of 0.016 and 0.987; IIRC Robinson anticipated something
much closer to neutral. Those values are similar to the defaults in
spambayes and bogofilter, and I think at least one of the three
projects would have derived them by optimization. My guess is that
enough low-count tokens are very strong but short-lived indicators
that it's worth putting up with the noise.
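
For anyone curious about the mechanics, here's a rough Python sketch of
the chi-square combining step as Robinson's article describes it. This
is only an illustration, not the actual Bayes.pm code; chi2q is the
usual series for the chi-square survival function with an even number
of degrees of freedom, and the overall shape follows what spambayes
does, as far as I remember.

    import math

    def chi2q(x2, df):
        # P(chisq >= x2) for an even number of degrees of freedom,
        # computed with the standard series expansion.
        m = x2 / 2.0
        term = total = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi_square_combine(probs):
        # probs are per-token spam probabilities in (0, 1).
        # Fisher's method: under the null hypothesis, -2 * sum(ln p)
        # is chi-square distributed with 2n degrees of freedom.
        n = len(probs)
        if n == 0:
            return 0.5
        h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
        # Fold the two one-sided indicators into one 0..1 score:
        # 0.5 means no opinion, values near 1 mean spammy.
        return (s - h + 1.0) / 2.0

The detuning of low-count tokens comes from Robinson's
f(w) = (s*x + n*p(w)) / (s + n), applied before combining. Working
from memory (so treat the constants as approximate), with s around
0.03 and x around 0.538, a token seen once in spam and never in ham
works out to (0.03*0.538 + 1) / 1.03 ~= 0.987, and the mirror case to
0.03*0.538 / 1.03 ~= 0.016, which is where those hapax numbers come
from.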