> Hi,
>
> > I would instead, in order of effectiveness:
> >
> > a) expire old tokens;
> > b) eliminate tokens with very few ham/spam occurrences;
> > c) eliminate tokens with very close nham to nspam values.
>
> Can you explain how to do this, or point to documentation that would
> explain?
>
> My bayes DB is way too big, but mostly effective. I'd just like to
> trim it to remove the ones infrequently occurring.
To expire old tokens (option a), simply configure the bayes expiration engine and run "sa-learn --force-expire". You may want to look at the sa-learn man page, as there is a lot of useful information about token expiration at the end of that manpage (see the EXPIRATION section).

To accomplish option b) or c), you instead have to parse the token lines from "sa-learn --backup", trim out the rows matching your exclusion criterion, and then recompute the correct num_spam and num_nonspam grand totals. For example, to work on option b) you could trim out rows having nham + nspam <= 1, or only those with nham = 0 and nspam = 1 if you want to bias the bayes db toward ham. To work on c), you could throw out token rows having abs(nham - nspam) / (nham + nspam) < epsilon, where epsilon is a small positive value you choose.

If you were asking whether there is a ready-made tool for b) or c), I don't think there is one yet. You could use perl or awk to extract the token rows from the backup file and format them so they can be loaded into your preferred SQL database: that would make it much easier to test and trim token rows the way you want (and to recompute the num_spam and num_nonspam grand totals). A rough sketch of the filtering idea is appended at the end of this message. You could even consider moving to an SQL-based bayes database, which would also let you test the effectiveness of your trimming on the fly.

Giampaolo

PS: Keep a backup of your bayes database at hand anyway, you never know...
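PPS: To make option a) a little more concrete, expiry is driven by a couple of settings in local.cf. A minimal sketch, with purely illustrative values (the EXPIRATION section of the sa-learn man page explains how the target size is actually used):

    bayes_auto_expire        1        # allow SpamAssassin to expire opportunistically
    bayes_expiry_max_db_size 150000   # rough token count the expiry aims to keep the db near

With something like that in place, "sa-learn --force-expire" runs an expiry pass right away.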
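PPPS: And here is a rough Python sketch of the b)/c) filtering. It assumes the token lines in the "sa-learn --backup" dump are tab-separated as "t <nspam> <nham> <atime> <token>" (check a few lines of your own dump first; the field order is an assumption here), and MIN_TOTAL and EPSILON are thresholds you would tune yourself:

    #!/usr/bin/env python3
    # Rough sketch, not a polished tool: trim token rows from an
    # "sa-learn --backup" dump according to criteria b) and c) above.
    # ASSUMED format of token rows (verify against your own dump):
    #   t <nspam> <nham> <atime> <token>   (tab-separated)
    import sys

    MIN_TOTAL = 2    # b) drop tokens seen fewer than this many times in total
    EPSILON = 0.10   # c) drop tokens whose ham and spam counts are nearly equal

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "t":
            # pass "v" (totals/version) and "s" (seen) lines through untouched
            sys.stdout.write(line)
            continue
        nspam, nham = int(fields[1]), int(fields[2])
        total = nspam + nham
        if total < MIN_TOTAL:                      # criterion b)
            continue
        if abs(nham - nspam) / total < EPSILON:    # criterion c)
            continue
        sys.stdout.write(line)

Feed it the output of "sa-learn --backup" on stdin, then load the trimmed file back with "sa-learn --clear" followed by "sa-learn --restore <file>". Note that the script passes the "v" lines (num_spam, num_nonspam, db version) through unchanged; adjust those grand totals by hand if your trimming calls for it, as mentioned above.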