On Fri, 30 Apr 2010 11:53:49 +0200 "Giampaolo Tomassoni" <g.tomass...@libero.it> wrote:
> > Correct, but if those counts came from autolearning 90% of spam and
> > 30% of ham, then rescaling may be the correct thing to do.
> >
> > It may also be pragmatic, if a high spam/ham ratio is leading to
> > FPs, to keep the learned ratio closer to 1:1 than the actual ratio.
>
> I was almost thinking your statement of pragmatism was wrong in
> principle, but after having checked the bayesian filtering equation
> and having seen what happens forcing a 1:1 ratio in the number of
> received spam messages over ham ones, I see that it is:

The first case is mathematically correct - note that I wrote 30% *of*
ham, not 30% ham. Where the imbalance is due to unbalanced selective
learning, token rescaling brings the ratio back in line with the actual
ratio in received mail (a rough sketch of what I mean is at the end of
this message). The OP said he was using very conservative autolearning
thresholds, which can lead to unbalanced selective learning.

The second case, the pragmatic reason, doesn't appeal to theory, so it
can't be wrong in principle. In any case, real-world statistical spam
filters are constructed out of a combination of sound statistics used
in a dubious way, empirical equations, and downright kludges.

> > > a) expire old tokens;
> >
> > Token retention is a good thing. The only reason for ageing-out
> > tokens is to limit the database size.
>
> This is not the only reason to ageing-out tokens. Ham and spam tokens
> evolve with time. ...

This is not a good way of making the system more responsive to change.
Giving more weight to recent learning, or using something like DSPAM's
train-until-mature mode, is a better way of doing that. If you want a
simple way of doing it, you might try periodically halving all token
counts where ham+spam > 200 (again, see the end of this message).

> Please note from this point of view the SA
> implementation of the bayesian filtering is less than optimal, since
> it doesn't expire tokens which roll out of a given time window,

Bayesian filters that don't update timestamps are avoiding a
write-lock; they aren't trying to do the right thing.

> > > c) eliminate tokens with very close nham to nspam values;
> >
> > This is only superficially appealing - similar arguments apply to
> > "b". What's the point in deleting a token with counts of 7483:7922
> > when an hour later it might be back at 2:0?
>
> 2:0 means a definitive answer about token spamminess or hamminess.
> Removing tokens where nham ~ nspam means discarding the history of a
> token which actually doesn't play any role, letting it to have a new
> chance in current world.

2:0 is a good spam indicator in a new token; in a token that previously
had counts of 7483:7922, it's almost certainly a fluctuation. By
deleting such tokens you overwhelmingly replace useful information with
noise.
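For what it's worth, here is a rough sketch of the kind of rescaling I
mean. It is not taken from the SA code; it assumes a plain
token -> [nspam, nham] mapping and that you know, from your logs, how
many spam and ham messages were actually received versus learned:

    # Sketch only: scale per-token ham counts (and the global
    # learned-ham message total, which the per-token probabilities are
    # normalised by) so the learned spam:ham ratio matches the ratio
    # actually seen on the wire.
    def rescale_ham(tokens, learned_spam, learned_ham,
                    received_spam, received_ham):
        spam_frac = float(learned_spam) / received_spam   # e.g. 0.9
        ham_frac = float(learned_ham) / received_ham      # e.g. 0.3
        factor = spam_frac / ham_frac       # ham under-represented 3x
        for counts in tokens.values():
            counts[1] = int(round(counts[1] * factor))
        return int(round(learned_ham * factor))  # new learned-ham total

Whether you scale ham up or spam down is a matter of taste; the point
is only to get the ratio right, not to manufacture confidence.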
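And, since it is only a few lines, roughly what I mean by the periodic
halving - run it from cron weekly or so; again just a sketch against
the same token -> [nspam, nham] mapping, not against the real store:

    # Sketch: a decay pass for busy tokens. Halving the counts of
    # tokens with a lot of accumulated history lets recent learning
    # move their probabilities more quickly; quiet tokens are left
    # alone.
    def decay_busy_tokens(tokens, threshold=200):
        for counts in tokens.values():
            nspam, nham = counts
            if nspam + nham > threshold:
                counts[0] = nspam // 2
                counts[1] = nham // 2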