For the past several months I have been trying to find a way to make maintaining the SpamAssassin bayes database more effective on our SA servers. We have several SA servers, all running bayes globally on the server, not per user.
Bayes generally does a good job, but on a fairly busy server it can be less effective depending on how you set the database up to learn, expire, etc. So far I've followed just about every suggestion for maintaining bayes effectively. Bayes still works, but not without some major problems, the main one being that when large syncs happen, the bayes token database can be locked out from other SA children for up to 10 minutes per sync.

Basically we have set up our servers for "learn to journal" and we sync the journal to the main bayes database about once an hour. We've found that this process can take 8 to 10 minutes, give or take. We recently moved the bayes database onto a RAM disk to see if that would help, and while reads/seeks have sped up considerably, sync has not. Expire does not seem to be a problem.

Correct me if I'm wrong, but when you have bayes_learn_to_journal enabled and then run a sync, sa-learn basically moves bayes_journal to bayes_journal.old and then starts merging/adding tokens into bayes_toks. While this happens, bayes_toks is locked until the sync completes. For us, that means the bayes database is locked for about 10 minutes an hour. Expires do not run that long; in fact, they finish in about a minute, which is acceptable.

Would it make more sense, when you do a learn_to_journal and a sync, to make a copy of the bayes_toks database, say "bayes_toks.new", and merge/add tokens from the journal into that? Then, once the sync is complete, you can lock and copy the .new to the current and continue. This should lock the database out from updates for only seconds (if that), rather than for the entire learn/add process. I assume an expire could use the same logic for those of us running expire/sync manually from cron periodically rather than via the auto methods.

Thoughts?
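To make the idea concrete, here is a minimal Python sketch of the copy, merge, and swap sequence. It is not how sa-learn works today and it does not touch the real bayes_toks format: a JSON file stands in for the token database, the journal line format ("token spam_delta ham_delta") is made up for illustration, and sync_with_swap is a hypothetical name. The only part meant to carry over is the ordering: rotate the journal, merge into a private copy, then rename the copy over the live file.

```python
import json
import os
import shutil
import tempfile


def sync_with_swap(toks_path, journal_path):
    """Merge a journal into a copy of the token store, then swap the copy in.

    Hypothetical stand-in: the store is a JSON dict of
    token -> [spam_count, ham_count], not SpamAssassin's real database.
    """
    new_path = toks_path + ".new"
    old_journal = journal_path + ".old"

    # Rotate the journal so learners can start a fresh one right away.
    os.replace(journal_path, old_journal)

    # Work on a private copy; readers keep using toks_path the whole time.
    shutil.copyfile(toks_path, new_path)
    with open(new_path) as fh:
        tokens = json.load(fh)

    # Assumed journal format: "<token> <spam_delta> <ham_delta>" per line.
    with open(old_journal) as fh:
        for line in fh:
            if not line.strip():
                continue
            token, spam, ham = line.split()
            counts = tokens.setdefault(token, [0, 0])
            counts[0] += int(spam)
            counts[1] += int(ham)

    # Write the merged copy fully to disk before exposing it.
    with open(new_path, "w") as fh:
        json.dump(tokens, fh)
        fh.flush()
        os.fsync(fh.fileno())

    # The swap itself: a single rename, so the "locked" window is tiny.
    os.replace(new_path, toks_path)
    os.remove(old_journal)
    return tokens


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    toks = os.path.join(workdir, "bayes_toks.json")
    journal = os.path.join(workdir, "bayes_journal")
    with open(toks, "w") as fh:
        json.dump({"viagra": [5, 0]}, fh)
    with open(journal, "w") as fh:
        fh.write("viagra 2 0\nhello 0 1\n")
    print(sync_with_swap(toks, journal))
```

Because the rename is atomic on a POSIX filesystem when both paths are on the same volume, a reader opening the file sees either the old store or the new one, never a half-merged one. Anything that writes to the live store while the merge is running would be lost by the swap, so a real implementation would still need to hold off other writers for that window.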
I guess my thought is to keep a read-only version of bayes_toks available almost the whole time, avoiding any lock contention from the database being synced/expired.

Our current bayes config:

use_bayes 1
bayes_auto_learn 1
bayes_auto_expire 0
bayes_learn_to_journal 1
bayes_journal_max_size 0
bayes_expiry_max_db_size 1000000
lock_method flock

SA 3.3.1 on FreeBSD 6.4
Perl 5.10

--
Robert Blayzor
INOC, LLC
rblay...@inoc.net
http://www.inoc.net/~rblayzor/