On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
> I was wondering if it has any negative effects on my Bayes database
> if I regularly learn all spam/ham messages via a cron job.
> sa-learn skips already learned messages. Am I thus right to assume
> that apart from the relatively high CPU load there are no
> drawbacks? Or should I keep a separate folder for "new" spam/ham?

I ran full retrains from cron for a while and didn't find any
problems.

I'm using Maildir, and I only trained on the cur folders, not the new
folders. A message stays in new until the mail client has seen it, so
in theory this kept me from training on something that had come in
misfiled before I'd had a chance to look at it (so long as I
remembered to quit my mail client at night).
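
A minimal sketch of the cur-only idea, not the linked script itself.
It assumes ham lives in the top-level Maildir and spam in a
hypothetical ".Junk" subfolder; the variable names, function names,
and the "run" argument are made up for illustration. Only the
documented sa-learn options --ham and --spam are used.

```shell
#!/bin/sh
# Sketch only: the folder layout below is an assumption.
MAILDIR="${MAILDIR:-$HOME/Maildir}"   # ham assumed to live in the inbox
SPAMDIR="${SPAMDIR:-$MAILDIR/.Junk}"  # hypothetical spam folder name
SA_LEARN="${SA_LEARN:-sa-learn}"

# Learn one folder as ham or spam, looking only at cur/ (mail the client
# has already seen) and never at new/.
learn_cur() {
    kind="$1"   # "ham" or "spam"
    dir="$2"    # Maildir folder containing cur/ and new/
    if [ -d "$dir/cur" ]; then
        "$SA_LEARN" "--$kind" "$dir/cur"
    fi
}

train_all() {
    learn_cur ham  "$MAILDIR"
    learn_cur spam "$SPAMDIR"
}

# Only train when asked to, e.g. when cron passes the "run" argument.
if [ "${1:-}" = "run" ]; then
    train_all
fi
```

Since sa-learn skips messages it has already learned, running this
repeatedly over the same folders costs CPU time but should not
double-count anything.
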
See here for details and a script to do this:
http://www.faisal.com/software/sa-harvest/

Note that the linked script will also attempt to rebuild your
whitelist (all the code after the 'sa-learn --dump magic'). This has
some downsides, and turns out to be less useful with modern
SpamAssassin, so I'm reworking the script to break out the whitelist
code into a separate script.

That said, I keep a rolling one-month corpus of spam, so it's easy to
retrain when I need to. I've since stopped doing full retrains from
cron, and at this point I only retrain on messages that were
misfiled. See:
http://www.faisal.com/software/sa-harvest/quicktrain.xhtml

If you're doing any of this on a shared system, my one bit of advice
is to set up the cron job to run under 'batch' and 'nice', so the
training waits until the load is low and then runs at low priority.
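
One way the 'batch' and 'nice' advice might be wired up, as a sketch:
the function name and the script path in the comments are
hypothetical. batch(1) reads the command to run from standard input
and starts it when system load permits; "nice -n 19" asks for the
lowest scheduling priority.

```shell
#!/bin/sh
# Sketch only: queue a command with batch(1) and run it under nice(1).
BATCH="${BATCH:-batch}"

queue_low_priority() {
    # batch reads the command line to run from standard input.
    # Arguments are joined with spaces, so this simple form is not
    # safe for paths that contain whitespace.
    echo "nice -n 19 $*" | "$BATCH"
}

# Hypothetical use from a wrapper script that cron calls nightly:
#   queue_low_priority "$HOME/bin/sa-train.sh" run
```

The cron entry then only has to start the wrapper; the expensive
sa-learn run itself is deferred until the machine is quiet.
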
-faisal