RE: Bayes spam and ham out of proportion

Giampaolo Tomassoni Thu, 29 Apr 2010 09:32:33 -0700

> On Thu, 29 Apr 2010 08:25:29 -0400
> Frank Bures <lisfr...@chem.toronto.edu> wrote:
> what you need to do is write a script that divides the metadata num_spam
> value and all the token Nspam counts by 3. The updated database can
> then be loaded back in with --restore.
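
For reference, the script described in the quote could look roughly like
the sketch below. It assumes the dump is tab-separated, with metadata
lines of the form "v <value> <name>" and token lines of the form
"t <nspam> <nham> <atime> <token>"; check that against your own dump
before relying on it. The function name and the integer rounding are my
own choices, not anything SpamAssassin prescribes.

```python
def scale_spam_counts(lines, divisor=3):
    """Divide num_spam and every token's nspam count by `divisor`.

    Assumed (not verified here) dump layout, tab-separated:
      v <value> <name>                      metadata line
      t <nspam> <nham> <atime> <token>      token line
    Any other line is passed through untouched.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "v" and len(fields) >= 3 and fields[2] == "num_spam":
            # Scale the global spam message counter.
            fields[1] = str(int(fields[1]) // divisor)
        elif fields[0] == "t" and len(fields) >= 5:
            # Scale the per-token spam counter; integer division means
            # a token seen in fewer than `divisor` spams drops to 0.
            fields[1] = str(int(fields[1]) // divisor)
        out.append("\t".join(fields))
    return out
```

You would run this over a dump of the database and then load the result
back in with --restore, as described in the quote.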


I am not sure this would be effective. Dividing every count this way
basically lowers the weight of all the spam tokens, including the
potentially significant ones.

Instead, I would do the following, in order of effectiveness:

        a) expire old tokens;
        b) eliminate tokens with very few ham/spam occurrences;
        c) eliminate tokens whose nham and nspam values are very close.

If he receives a lot more spam than ham (like most of us), option a) would
get rid of a lot of spam tokens that are no longer useful.

Option b) would eliminate the huge number of tokens with nham/nspam counts
of 1/0 or 0/1. These may have been introduced by typos or by Bayes
poisoning, and will probably be purged by some future expiry anyway.

Finally, option c) would then get rid of less-than-useful tokens.

This way you basically don't touch the important ham/spam indicators, and
the overall effectiveness of Bayes wouldn't be hurt.
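
To make the three options concrete, here is a rough sketch of a filter
that decides whether a token should be kept. The Token fields mirror the
nspam/nham/atime values stored for each token; the function name and all
thresholds (90 days, at least 2 occurrences, 20% skew) are made up for
illustration and would need tuning against real data.

```python
from typing import NamedTuple

class Token(NamedTuple):
    nspam: int   # times the token was seen in spam
    nham: int    # times the token was seen in ham
    atime: int   # last access time, seconds since the epoch
    token: str   # token identifier

def keep_token(tok, now, max_age_days=90, min_total=2, min_skew=0.2):
    """Return True if the token survives options a), b) and c).

    All thresholds are hypothetical defaults.
    """
    # a) expire tokens that have not been seen for a long time
    if now - tok.atime > max_age_days * 86400:
        return False
    total = tok.nspam + tok.nham
    # b) drop tokens with very few occurrences (e.g. 1/0 or 0/1)
    if total == 0 or total < min_total:
        return False
    # c) drop tokens whose ham and spam counts are nearly equal
    if abs(tok.nspam - tok.nham) / total < min_skew:
        return False
    return True
```

Tokens for which keep_token() returns False would be left out when
writing the pruned database; strongly skewed, recently seen tokens pass
through unchanged.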

Giampaolo
