> On Thu, 29 Apr 2010 08:25:29 -0400
> Frank Bures <lisfr...@chem.toronto.edu> wrote:
> what you need to do write a script that divides the metadata num_spam
> value and all the token Nspam counts by 3. The updated database can
> then be loaded back in with --restore.

I don't know if this is going to be effective. This way you are basically lowering the weight of all the spam tokens, even the potentially significant ones. I would instead do the following, in order of effectiveness:

a) expire old tokens;
b) eliminate tokens with very few ham/spam occurrences;
c) eliminate tokens with very close nham and nspam values.

If he receives a lot more spam than ham (like most of us), option a) would get rid of a lot of spam tokens that are no longer useful. Option b) would eliminate the huge number of 1/0 and 0/1 nham/nspam tokens, which may have been introduced by user mistypes or Bayes poisoning and which will probably be purged anyway by some future expiry run. Finally, option c) would get rid of the less-than-useful tokens.

This way you basically don't touch the important ham/spam indicators, and the overall Bayes effectiveness wouldn't suffer.

Giampaolo
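
P.S. A rough sketch of steps a) to c), in case it helps. It assumes the token records have already been read into memory as (token, nspam, nham, atime) tuples, for example from a database dump; the Token type, the prune() function and all three thresholds are made up for illustration and are not part of SpamAssassin. It compares raw nspam and nham counts, as described above, without normalizing by the total number of spam and ham messages.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One Bayes token record (hypothetical in-memory layout)."""
    token: str
    nspam: int   # times seen in spam
    nham: int    # times seen in ham
    atime: int   # last access time, seconds since the epoch


def prune(tokens: Iterable[Token], now: int,
          max_age_days: int = 90,    # assumed threshold for step a)
          min_hits: int = 2,         # assumed threshold for step b)
          min_skew: float = 0.2      # assumed threshold for step c)
          ) -> List[Token]:
    """Return only the tokens that survive steps a), b) and c)."""
    cutoff = now - max_age_days * 86400
    kept = []
    for t in tokens:
        total = t.nspam + t.nham
        # a) expire tokens that have not been accessed recently
        if t.atime < cutoff:
            continue
        # b) drop tokens with very few occurrences (1/0, 0/1, ...);
        #    max(..., 1) also protects the division below
        if total < max(min_hits, 1):
            continue
        # c) drop tokens whose nham and nspam counts are nearly equal
        if abs(t.nspam - t.nham) / total < min_skew:
            continue
        kept.append(t)
    return kept
```

The surviving records would then have to be written back in whatever format the restore step expects, and the thresholds tuned against the actual database before trusting the result.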