On Thu, 29 Apr 2010 18:32:04 +0200
"Giampaolo Tomassoni" <g.tomass...@libero.it> wrote:

> > what you need to do is write a script that divides the metadata
> > num_spam value and all the token Nspam counts by 3. The updated
> > database can then be loaded back in with --restore.
> 
> I don't know if this is going to be effective. After all, this way
> you are basically lowering the effectiveness of all the spam tokens,
> even the potentially significant ones.

Correct, but if those counts came from autolearning 90% of the spam and
only 30% of the ham, then rescaling may be the right thing to do.

It may also be pragmatic, if a high spam/ham ratio is leading to FPs,
to keep the learned ratio closer to 1:1 than the actual ratio.
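
For what it's worth, a minimal sketch of such a rescaling script (in
Python) might look like the one below. It assumes the tab-separated
dump format produced by sa-learn --backup, i.e. "v" metadata rows such
as "v <count> num_spam" and "t <nspam> <nham> <atime> <token>" rows for
the tokens; the divisor of 3 is just the example figure from the quoted
advice, and you should verify the field layout against your own dump
before trusting it:

    #!/usr/bin/env python3
    # Rescale the spam side of an `sa-learn --backup` dump read on stdin.
    # Assumes tab-separated lines: "v <value> <name>" for metadata and
    # "t <nspam> <nham> <atime> <token>" for tokens -- check your dump.
    import sys

    DIVISOR = 3  # example figure from the quoted advice

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "v" and len(fields) > 2 and fields[2] == "num_spam":
            # global spam message count
            fields[1] = str(int(fields[1]) // DIVISOR)
        elif fields[0] == "t":
            # per-token spam count; the ham count is left untouched
            fields[1] = str(int(fields[1]) // DIVISOR)
        sys.stdout.write("\t".join(fields) + "\n")

Something like sa-learn --backup > bayes.txt, this script over that
file, then sa-learn --clear followed by sa-learn --restore on the
rescaled copy would round-trip it - but try it on a copy of the
database first.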

> I would instead, in order of effectiveness:
> 
>       a) expire old tokens;

Token retention is a good thing.  The only reason for ageing out tokens
is to limit the database size.

>       b) eliminate tokens with very few ham/spam occurrences.

Some Bayesian filters, such as dspam, allow low-count tokens to be
aged out more quickly, but the point of that is to free up space for
longer retention of high-count tokens.

There's no other reason for deleting them. Either a low-count token is
never seen again, in which case it's just wasting space, or we are
still learning its frequencies, in which case resetting the counters
makes no sense.

>       c) eliminate tokens with very close nham to nspam values;

This is only superficially appealing - similar arguments apply to "b".
What's the point in deleting a token with counts of 7483:7922 when an
hour later it might be back at 2:0?


I don't see anything here that would reduce FPs: "b" and "c" simply
free up some space, but "a" means you are not taking any advantage of
it.
