   From: Cecil Westerhof <ce...@decebal.nl>
   Date: Sat, 09 Jan 2010 16:24:56 +0100

   Jeff Mincy <j...@delphioutpost.com> writes:

   >    I upgraded from 3.0.4 to 3.2.5. I have the feeling that sa-learn takes
   >    more time with 3.2.5 than it took with 3.0.4. Can this be true?
   >
   >    It is not a problem, because it is done by cron-tab, but I am just
   >    curious.
   >
   > You can use spamc -L spam/ham to learn messages. Spamc -L is faster
   > than sa-learn. The spamd daemon needs to be started with
   > --allow-tell.

   That is not really an answer to my question. ;-)

I doubt that bayes learning has slowed down significantly. I would
expect that the choice of bayes_store_module, learning to journal,
whether auto expiration runs, and lock contention matter more than the
version.

   But it does not seem to be interesting in my situation. First my code
   has to grow from:
       sa-learn --${typeStr} "${HOME}/Maildir/.SpamDir.${dirStr}/cur/"
   to:
       for i in "${HOME}/Maildir/.SpamDir.${dirStr}/cur/"*; do
           spamc -L ${typeStr} <"${i}"
       done

   Which is not even enough, because I need to take care of the situation
   that the directory is empty and I need to implement code to show the
   messages delivered by sa-learn.

Oh. You're learning all of the messages in a directory. spamc -L is
faster than sa-learn for learning single messages because sa-learn is a
perl script that has to load Mail::SpamAssassin each time. For a large
directory the slower startup of sa-learn is less of an issue. sa-learn
is fine for doing directories.

   With a low level of spam it works, but if it becomes bigger, it does
   not work:
       date
       echo ${echoStr}
       sa-learn --${typeStr} "${HOME}/Maildir/.SpamDir.${dirStr}/cur/"
       date
       for i in "${HOME}/Maildir/.SpamDir.${dirStr}/cur/"*; do
           spamc -L ${typeStr} <"${i}"
       done
       echo learned in the new way
       date

   gives:
       za jan 9 16:09:25 CET 2010
       Increase
       Learned tokens from 0 message(s) (45 message(s) examined)
       za jan 9 16:09:40 CET 2010
       learned in the new way
       za jan 9 16:10:00 CET 2010

   So sa-learn takes 15 seconds and spamc -L 20 seconds. (And I need more
   code. Besides taking care of an empty directory, I also need to
   implement the feedback given by sa-learn.)

You learned tokens from 0 messages and looked at 45 messages. You've
already previously learned from those 45 messages, so this is just
timing how fast it can do nothing.

   > You can try using bayes_learn_to_journal - and do a separate sa-learn
   > --sync job in cron. Learning to the journal is faster.

   I'll look into that.

   > Also, What is the size of your database?
   > Maybe you are spending lots
   > of time doing expires or something.

   sa-learn --dump magic gives:
       0.000          0          3          0  non-token data: bayes db version
       0.000          0      57538          0  non-token data: nspam
       0.000          0      74876          0  non-token data: nham
       0.000          0     166338          0  non-token data: ntokens
       0.000          0 1257478501          0  non-token data: oldest atime
       0.000          0 1263049426          0  non-token data: newest atime
       0.000          0 1263049538          0  non-token data: last journal sync atime
       0.000          0 1263044805          0  non-token data: last expiry atime
       0.000          0    5529600          0  non-token data: last expire atime delta
       0.000          0       1868          0  non-token data: last expire reduction count

Your database has 166338 tokens which is larger than the default
bayes_expiry_max_db_size 150000. The last expiration ran this morning
at 8:46. You could try letting the bayes database get larger and turn
off bayes_auto_expire. If you turn off bayes_auto_expire you'll have to
add something to cron to periodically expire tokens. bayes_auto_expire
is fine for lower volumes of email, but can get in the way with higher
volumes.

-jeff
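
For reference, the cron side of the suggestions above (learn to the journal, turn off bayes_auto_expire, expire from cron) might look roughly like the sketch below. It assumes bayes_learn_to_journal 1 and bayes_auto_expire 0 have been set in the SpamAssassin configuration. The function name bayes_maintenance, the SA_LEARN override variable, and the "run" argument are made up for this illustration and are not part of SpamAssassin; only the sa-learn options --sync and --force-expire are real.

```shell
#!/bin/sh
# Hypothetical cron maintenance script (a sketch, not a tested setup).
# Assumes the SpamAssassin configuration contains:
#   bayes_learn_to_journal 1
#   bayes_auto_expire 0

# SA_LEARN is an illustrative override so the command can be swapped out;
# by default the real sa-learn is used.
SA_LEARN="${SA_LEARN:-sa-learn}"

bayes_maintenance() {
    # Merge the learning journal into the bayes database first.
    "$SA_LEARN" --sync || return 1
    # Then run token expiry explicitly, since auto expire is off.
    "$SA_LEARN" --force-expire
}

# Only do the work when called as: script run
if [ "${1:-}" = "run" ]; then
    bayes_maintenance
fi
```

Calling the script with the argument run from a nightly crontab entry would sync the journal and then expire tokens at a quiet time instead of in the middle of a learning run. How often to run it depends on mail volume.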