For the past several months I have been trying to find a way to make 
maintaining the SpamAssassin bayes database more effective on our SA servers.
We have several SA servers, all running bayes globally on the server, not per 
user.

Bayes generally does a good job, but on a fairly busy server it can be less 
effective depending on how the database is set up to learn, expire, etc.  So 
far I've followed just about every suggestion for maintaining bayes 
effectively.  While bayes still works, it's not without some major problems.  
The main one is that when a large sync happens, the bayes token database can be 
locked out from other SA children for up to 10 minutes per sync.

Basically we have set up our servers for "learn to journal" and we sync the 
journal to the main bayes database about once an hour.  We've found that this 
process can take 8 to 10 minutes, give or take.
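
For anyone who wants to measure the same thing, a minimal sketch of a timing 
wrapper follows.  The helper name time_command is made up for illustration; 
the only SpamAssassin piece is the "sa-learn --sync" command shown in the 
usage comment, and it assumes sa-learn is on the PATH of the user that owns 
the bayes files.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds).

    time_command is a hypothetical helper, not part of SpamAssassin.
    """
    start = time.monotonic()
    result = subprocess.run(argv, check=False)
    elapsed = time.monotonic() - start
    return result.returncode, elapsed


# Usage from an hourly cron job (assumes sa-learn is on PATH):
#   code, seconds = time_command(["sa-learn", "--sync"])
#   print(f"sync exited {code} after {seconds:.1f}s")
```

Logging the elapsed time on every run makes it easier to see whether the sync 
duration tracks journal size or time of day.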

We recently moved the bayes database into a RAM disk to see if that would help, 
and while reads/seeks have sped up considerably, sync has not.  Expire does not 
seem to be a problem.

Correct me if I'm wrong, but when you have bayes_learn_to_journal enabled and 
then run a sync, sa-learn basically moves bayes_journal to 
bayes_journal.old and then starts merging/adding tokens into bayes_toks.  While 
this happens, bayes_toks is locked until the sync completes.  For us, that 
means the bayes database is locked for about 10 minutes an hour.  Expires do 
not run nearly that long.  In fact, expires finish in about a minute, which is 
acceptable.

Would it make more sense, when you use learn_to_journal and run a sync, to make 
a copy of the bayes_toks database, say "bayes_toks.new", and merge/add tokens 
from the journal into that?  Then, once the sync is complete, you can lock and 
copy the .new over the current database and continue.  This should lock the 
database out from updates for only seconds (if that) rather than locking it out 
during the entire learn/add process.  I assume an expire could use the same 
logic for those of us who run expire/sync manually from cron on a schedule 
rather than via the auto methods.
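
To make the idea concrete, here is a rough Python sketch of the 
copy/merge/swap sequence.  It is not SpamAssassin code and does not touch the 
real on-disk formats: the token store is a stand-in JSON file, the journal is 
a stand-in text file with one token per line, and the function name 
merge_journal_with_swap is invented for illustration.  It also assumes that 
nothing else writes to the store while the merge runs, and that the .new file 
lives on the same filesystem as the store so the final rename is atomic.

```python
import json
import os
import shutil


def merge_journal_with_swap(db_path, journal_path):
    """Merge journal entries into a copy of the store, then swap it in.

    Stand-in formats (assumptions, not SpamAssassin's real layout):
    db_path is a JSON object {token: count}; journal_path holds one
    token per line.
    """
    new_path = db_path + ".new"
    old_journal = journal_path + ".old"

    # Rotate the journal so new learns land in a fresh file during the merge.
    os.replace(journal_path, old_journal)

    # Work on a private copy; readers keep using db_path untouched.
    shutil.copy2(db_path, new_path)
    with open(new_path) as fh:
        tokens = json.load(fh)

    # The slow part: fold every journal entry into the copy.
    with open(old_journal) as fh:
        for line in fh:
            token = line.strip()
            if token:
                tokens[token] = tokens.get(token, 0) + 1

    with open(new_path, "w") as fh:
        json.dump(tokens, fh)
        fh.flush()
        os.fsync(fh.fileno())

    # The fast part: a rename within one filesystem is atomic on POSIX.
    os.replace(new_path, db_path)
    os.remove(old_journal)
    return tokens
```

The point of the sketch is that the slow merge loop never holds the live file; 
only the final rename touches it.  One caveat: a process that already has the 
old file open keeps reading the old copy until it reopens the path.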

Thoughts?  I guess my thought is to keep a read-only version of bayes_toks 
available almost the whole time, avoiding any lock contention while the 
database is being synced/expired.


Our current bayes config:

use_bayes                    1
bayes_auto_learn             1
bayes_auto_expire            0
bayes_learn_to_journal       1
bayes_journal_max_size       0
bayes_expiry_max_db_size     1000000
lock_method                  flock


SA 3.3.1 on FreeBSD 6.4
Perl 5.10

-- 
Robert Blayzor
INOC, LLC
rblay...@inoc.net
http://www.inoc.net/~rblayzor/



