I've searched and searched the archives, but found no answers.  Sorry for the
lengthy email, but...


SpamAssassin 3.2.3-1
Smf-spamd 1.3.1 with spamd
Dual quad-core Xeon 5355 (Clovertown) systems with 8 GB memory.

Configuration:

    bayes_auto_learn 1
    bayes_expiry_max_db_size 150000
    lock_method flock
    rules compiled with sa-compile
    Auto-whitelist module is loaded
    Number of spamd children: 5
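
For completeness, the pieces outside of local.cf look roughly like this on
our boxes (the file names and exact spamd flags below are from memory, so
treat them as approximate):

    # /etc/mail/spamassassin/*.pre
    loadplugin Mail::SpamAssassin::Plugin::AWL            # auto-whitelist
    loadplugin Mail::SpamAssassin::Plugin::Rule2XSBody    # needed to use the sa-compile output

    # spamd invocation
    spamd -d --max-children=5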

We are only using the spam/not spam verdict, not any of the message
rewriting features (this is handled by the MTA).

Per-user preferences are not feasible (per-user policy based on the verdict is
applied at the MTA level).  Since this is not an end-user server, each message
has many recipients, and it is not reasonable to scan the message once per
recipient just to get per-user Bayes data, scores, etc.

We are processing a large volume of mail.  SpamAssassin runs after a
commercial scanner to minimize the volume and system load.

In 12 hours, the bayes_toks file grows to 160-320 MB, somewhere in the
ballpark of 7 million tokens or more.  Some time before this, performance
drops off a cliff and the queue starts backing up big time.  When this
happens, mail takes 15-20 seconds per message to process, one spamd child is
using 100% of a CPU, and none of the other spamd children are using any CPU -
I assume because they can't get a lock on the DB while the other process is
taking so long to get what it needs.

Auto-expire doesn't work at this volume, so I turned that off and am
doing a manual expire.  Of course, since bayes_expiry_period is 12 hours,
the minimum token age is 12 hours, so the number of tokens is never going to
drop below about 7 million, regardless of how often I expire it.
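
(For clarity, "turned that off" just means this one directive in local.cf:)

    bayes_auto_expire 0    # no opportunistic expiry during message scans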

As an immediate solution, I modified

    /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/Conf.pm

and set bayes_expiry_period to 21600 (6 hours), and I now run an expire every
3 hours (why isn't this a configuration file parameter??).
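
The manual expire itself is just a cron job, something along these lines
(the user and log path here are made up; the point is only the 3-hour
schedule and the --force-expire run):

    # /etc/cron.d/sa-bayes-expire (hypothetical)
    0 */3 * * *  root  sa-learn --force-expire >> /var/log/sa-bayes-expire.log 2>&1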

This seems to be enough to keep it away from the edge of the performance
cliff - the number of tokens varies from about 3.5-5 million and the DBM
file gets reorganized every 3 hours.  It's too early to tell for sure if
this will hold, but I may need to drop bayes_expiry_period down to 3 hours.

Tomorrow I'm going to set up a test on one of the servers using PostgreSQL
to hold the Bayes tokens and see if it scales better than the DBM file.
That would also allow our multiple servers to share information instead of
acting independently.
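
If I'm reading the docs right, that should mostly be a matter of pointing the
Bayes store at Pg in local.cf and loading the bayes_pg.sql schema that ships
with SpamAssassin; the DSN, host, and credentials below are obviously
placeholders:

    bayes_store_module           Mail::SpamAssassin::BayesStore::PgSQL
    bayes_sql_dsn                DBI:Pg:dbname=bayes;host=db.example.com
    bayes_sql_username           sa_bayes
    bayes_sql_password           secret
    # one shared "user" so all servers hit the same site-wide token set
    bayes_sql_override_username  sitewide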


On to the questions...

1. Setting the expiry period down that low doesn't seem to be an optimal
thing to do from an effectiveness standpoint.  Comments on this?  Am I
missing something?  Due to the type of user base, all-manual learning isn't
likely to work well.  Is auto-learning just a waste of resources in this
case?

2. If I set up manual learning, where false positives and false negatives can
be submitted by users and fed into the site-wide Bayes database, won't those
tokens also be subject to the (short) expiration period, or is manually
learned data kept permanently?
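
(The feed I have in mind is nothing fancier than piping the submitted
messages into sa-learn against the site-wide database, roughly as below;
the mailbox paths are made up:)

    # messages users report as missed spam (false negatives)
    sa-learn --spam --mbox /var/spool/reported/spam.mbox
    # messages users report as wrongly tagged (false positives)
    sa-learn --ham  --mbox /var/spool/reported/ham.mbox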

Thanks

Wes

