On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
I was wondering if it has any negative effects on my Bayes database if I regularly learn all spam/ham messages via a cron job.

sa-learn skips messages it has already learned. Am I right to assume, then, that apart from the relatively high CPU load there are no drawbacks? Or should I keep a separate folder for "new" spam/ham?

I did this for a while and didn't find any problems.

I'm using Maildir, and I only trained on the cur folders, not the new folders. In theory this would prevent me from training on something that had come in mis-filed (so long as I remembered to quit my mail client at night).
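To illustrate the cur-only idea, here is a minimal Python sketch. It is not the sa-harvest script linked in this message. The folder paths and the helper names (cur_dir, learn_command, train) are hypothetical; the only things assumed about sa-learn are its --spam and --ham options and that it accepts a directory of messages.

```python
import subprocess
from pathlib import Path


def cur_dir(maildir_folder):
    """Return the cur/ subdirectory of a Maildir folder (new/ and tmp/ are ignored)."""
    return Path(maildir_folder) / "cur"


def learn_command(kind, maildir_folder):
    """Build the sa-learn argument list for one folder; kind is 'spam' or 'ham'."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, str(cur_dir(maildir_folder))]


def train(folders):
    """Run sa-learn on each (kind, folder) pair whose cur/ directory exists."""
    for kind, folder in folders:
        if cur_dir(folder).is_dir():
            subprocess.run(learn_command(kind, folder), check=False)
```

A call such as train([("spam", "/home/user/Maildir/.Junk"), ("ham", "/home/user/Maildir")]) would then learn from both folders; those paths are examples only.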

See here for details and a script to do this:

http://www.faisal.com/software/sa-harvest/

Note that this script will also attempt to rebuild your whitelist (all the code after the 'sa-learn --dump magic'). This has some downsides, and turns out to be less useful with modern SpamAssassin, so I'm reworking the script to break out the whitelist code into a separate script.

That said, I keep a rolling one-month corpus of spam, so it's easy to retrain when I need to. I stopped doing full retrains on cron, and at this point I only retrain on messages that were misfiled. See:

http://www.faisal.com/software/sa-harvest/quicktrain.xhtml
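One way to keep a rolling corpus like the one described above is to pick out messages older than a cutoff and expire them. The Python sketch below is only an illustration: it treats "one month" as 30 days, uses file modification time as the message age, and the helper names and corpus directory are hypothetical. It is not taken from the quicktrain page.

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 30  # assumption: "one month" approximated as 30 days


def is_expired(mtime, now, max_age_days=MAX_AGE_DAYS):
    """True if a file with modification time mtime is older than the cutoff."""
    return now - mtime > max_age_days * 86400


def expired_messages(corpus_dir, now=None):
    """List the files in corpus_dir that have aged out of the rolling window."""
    now = time.time() if now is None else now
    return [p for p in Path(corpus_dir).iterdir()
            if p.is_file() and is_expired(p.stat().st_mtime, now)]
```

The returned paths can be reviewed before deleting anything; the sketch deliberately does not remove files itself.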

If you're doing any of this on a shared system, my one bit of advice is to set up the cron job to run the training under 'batch' and 'nice', so it runs at low priority and only when the system load permits.
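A rough Python sketch of the 'batch' plus 'nice' combination follows. The helper names are hypothetical and the niceness value of 19 is an assumption; the sketch relies only on 'nice -n' and on batch(1) reading the commands to run from standard input.

```python
import shlex
import subprocess


def niced(cmd, niceness=19):
    """Prefix an argument list with nice so it runs at low CPU priority."""
    return ["nice", "-n", str(niceness)] + list(cmd)


def batch_line(cmd):
    """Render the command as a single shell line suitable for feeding to batch."""
    return shlex.join(cmd) + "\n"


def submit_to_batch(cmd):
    """Queue the command with batch, which runs it when system load permits."""
    return subprocess.run(["batch"], input=batch_line(cmd), text=True, check=False)
```

For example, submit_to_batch(niced(["sa-learn", "--spam", "/home/user/Maildir/.Junk/cur"])) would queue a low-priority training run; the path is an example only.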

-faisal

