On Thu, 29 Apr 2010 18:32:04 +0200 "Giampaolo Tomassoni" <g.tomass...@libero.it> wrote:
> > what you need to do is write a script that divides the metadata
> > num_spam value and all the token Nspam counts by 3. The updated
> > database can then be loaded back in with --restore.
>
> I don't know if this is going to be effective. After all, this way
> you are basically lowering the effectiveness of all the spam tokens,
> even potentially remarkable ones.

Correct, but if those counts came from autolearning 90% of spam and
30% of ham, then rescaling may be the correct thing to do. It may
also be pragmatic, if a high spam/ham ratio is leading to FPs, to
keep the learned ratio closer to 1:1 than the actual ratio. (A rough
sketch of such a rescaling script is at the end of this message.)

> I would instead, in order of effectiveness:
>
> a) expire old tokens;

Token retention is a good thing. The only reason for ageing out
tokens is to limit the database size.

> b) eliminate tokens with very few ham/spam occurrences.

Some Bayesian filters, such as dspam, allow low-count tokens to be
aged out more quickly, but the point of that is to free up space for
longer retention of high-count tokens. There's no other reason for
deleting them. Either a low-count token is never seen again, in which
case it's just wasting space, or we are still learning its
frequencies, in which case resetting the counters makes no sense.

> c) eliminate tokens with very close nham to nspam values;

This is only superficially appealing - similar arguments apply to
"b". What's the point in deleting a token with counts of 7483:7922
when an hour later it might be back at 2:0?

I don't see anything here that would reduce FPs. "b" and "c" simply
free up some space, but "a" means you are not taking any advantage
of it.
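
For what it's worth, a rough sketch of such a rescaling script, in
Python, is below. It assumes the plain-text dump produced by
sa-learn --backup, i.e. tab-separated "v <value> <name>" header lines
and "t <nspam> <nham> <atime> <token>" token lines - check a few
lines of your own dump before trusting those field positions, and the
divisor of 3 is just the figure from upthread.

#!/usr/bin/env python3
# rescale_bayes.py -- divide the spam side of an sa-learn --backup
# dump by DIVISOR so it can be loaded back with sa-learn --restore.
#
# Assumed dump format (verify against your own dump first):
#   v<TAB>value<TAB>name                  e.g. "v  1161  num_spam"
#   t<TAB>nspam<TAB>nham<TAB>atime<TAB>token
# Other lines (db_version, num_nonspam, "s" seen entries) pass
# through untouched.

import sys

DIVISOR = 3

def rescale(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 3 and fields[0] == "v" and fields[2] == "num_spam":
        # metadata: total number of spam messages learned
        fields[1] = str(max(1, int(fields[1]) // DIVISOR))
    elif len(fields) >= 5 and fields[0] == "t":
        # token line: fields[1] is the Nspam count; note that integer
        # division drops counts of 1 or 2 to zero
        fields[1] = str(int(fields[1]) // DIVISOR)
    return "\t".join(fields) + "\n"

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(rescale(line))

Usage would be along the lines of:

  sa-learn --backup > bayes.dump
  python3 rescale_bayes.py < bayes.dump > bayes.rescaled
  sa-learn --clear
  sa-learn --restore bayes.rescaled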