> Hi,
>
> > I would instead, in order of effectiveness:
> >
> > a) expire old tokens;
> > b) eliminate tokens with very few ham/spam occurrences;
> > c) eliminate tokens with very close nham to nspam values.
>
> Can you explain how to do this, or point to documentation that would
> explain?
>
> My bayes DB is way too big, but mostly effective. I'd just like to
> trim it to remove the ones infrequently occurring.
To expire old tokens (option a), simply configure the bayes expiration engine and run "sa-learn --force-expire". You may want to look at the sa-learn man page, as there is a lot of useful information about token expiration at the end of that manpage (see the EXPIRATION section).

To accomplish option b) or c), you instead have to parse the token lines from "sa-learn --backup", trim out the rows matching your exclusion criterion, and then recompute the correct num_spam and num_nonspam grand totals. For example, to work on option b) you could trim out rows having nham + nspam <= 1, or only those with nham = 0 and nspam = 1 if you want to bias the bayes db toward ham. To work on c), you could throw out token rows having abs(nham - nspam) / (nham + nspam) < epsilon, where epsilon is a small positive value you choose.

If you were asking whether there is a ready-made tool for b) or c), I don't think there is one yet. You could use perl or awk to extract the token rows from the backup file and format them so they can be loaded into your preferred SQL database: that would make it much easier to test and trim token rows the way you want (and to recompute the num_spam and num_nonspam grand totals). A rough sketch of the filtering idea is appended at the end of this message. You could even consider moving to an SQL-based bayes database, which would also let you test the effectiveness of your trimming on the fly.

Giampaolo

PS: Keep a backup of your bayes database at hand anyway, you never know...
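PPS: To make option a) a little more concrete, expiry is driven by a couple of settings in local.cf. A minimal sketch, with purely illustrative values (the EXPIRATION section of the sa-learn man page explains how the target size is actually used):

    bayes_auto_expire        1        # allow SpamAssassin to expire opportunistically
    bayes_expiry_max_db_size 150000   # rough token count the expiry aims to keep the db near

With something like that in place, "sa-learn --force-expire" runs an expiry pass right away.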
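PPPS: And here is a rough Python sketch of the b)/c) filtering. It assumes the token lines in the "sa-learn --backup" dump are tab-separated as "t <nspam> <nham> <atime> <token>" (check a few lines of your own dump first; the field order is an assumption here), and MIN_TOTAL and EPSILON are thresholds you would tune yourself:

    #!/usr/bin/env python3
    # Rough sketch, not a polished tool: trim token rows from an
    # "sa-learn --backup" dump according to criteria b) and c) above.
    # ASSUMED format of token rows (verify against your own dump):
    #   t <nspam> <nham> <atime> <token>   (tab-separated)
    import sys

    MIN_TOTAL = 2    # b) drop tokens seen fewer than this many times in total
    EPSILON = 0.10   # c) drop tokens whose ham and spam counts are nearly equal

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "t":
            # pass "v" (totals/version) and "s" (seen) lines through untouched
            sys.stdout.write(line)
            continue
        nspam, nham = int(fields[1]), int(fields[2])
        total = nspam + nham
        if total < MIN_TOTAL:                      # criterion b)
            continue
        if abs(nham - nspam) / total < EPSILON:    # criterion c)
            continue
        sys.stdout.write(line)

Feed it the output of "sa-learn --backup" on stdin, then load the trimmed file back with "sa-learn --clear" followed by "sa-learn --restore <file>". Note that the script passes the "v" lines (num_spam, num_nonspam, db version) through unchanged; adjust those grand totals by hand if your trimming calls for it, as mentioned above.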