Wes wrote:
On 11/29/07 1:00 PM, "John D. Hardin" <[EMAIL PROTECTED]> wrote:

Have you considered pushing your autolearn thresholds a bit further
out, to reduce the number of messages that are eligible for autolearn
and thus reduce the growth of the token database?

I hadn't thought about that, but I'm not sure it would make sense here.  The
man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold shows:

       bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
           The score threshold below which a mail has to score, to be fed
           into SpamAssassin's learning systems automatically as a
           non-spam message.

       bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
           The score threshold above which a mail has to score, to be fed
           into SpamAssassin's learning systems automatically as a spam
           message.
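
In local.cf those would look something like this (the values below are purely illustrative of pushing the window further out, not recommendations):

```
# local.cf -- illustrative values only
# Tighten the autolearn window so fewer messages qualify:
bayes_auto_learn_threshold_nonspam -0.5    # default 0.1
bayes_auto_learn_threshold_spam    15.0    # default 12.0
```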

Since the mail has already been processed by a commercial scanner, the
majority of the mail is now good - we're trying to catch leakage.  That
means most of the auto-learning is good mail.  I'm thinking increasing
bayes_auto_learn_threshold_nonspam would be a bad thing, no?

It'd decrease your token count, but it'd decrease the usefulness of bayes by a larger factor.

Do not waste any more time trying to get more performance out of DBM.
Just about any SQL based database will perform a lot better than DBM
will when your bayes database is large.

That's good feedback.  I was hoping that, but don't quite have the DB up and
running yet - gotta get it working in the test environment before putting it
somewhere with a load.  Then the question becomes how much network latency
can be tolerated before there's a performance problem (e.g. between physical
locations).

IIRC (it's been about three years since I looked at the code for this) tokens are pulled in a loop 100 at a time per message. So each message is probably going to have to poll the SQL server 5 times (+/- another 5/3?) just for tokens. Add in a couple of other queries (especially if it's decided to autolearn the message) and the latency starts to add up.
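
As a back-of-the-envelope sketch of that math (the token count, batch size, and RTT here are assumptions for illustration, not measurements):

```python
import math

def token_query_rtts(unique_tokens, batch_size=100):
    """Round trips needed to fetch all tokens, batch_size at a time."""
    return math.ceil(unique_tokens / batch_size)

def per_message_latency_ms(unique_tokens, rtt_ms, extra_queries=2):
    """Rough network stall per message: token batches plus a couple of
    bookkeeping queries (more if the message gets autolearned)."""
    return (token_query_rtts(unique_tokens) + extra_queries) * rtt_ms

# e.g. a message with ~500 unique tokens over a 50 ms link:
# token_query_rtts(500)            -> 5 round trips
# per_message_latency_ms(500, 50)  -> (5 + 2) * 50 = 350 ms
```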

The last time I tried sharing a bayes database over the internet, it didn't go well at all past a few thousand messages a day (so not useful at all). However, there was a cable modem in use without any traffic shaping in place to defeat the cable modem's huge buffer, so that could have had an insane impact on it.

Even still though, 5 queries times, say, 50ms each is a quarter of a second that the spamd child process sits idle. That leaves you trying to make up for it by running more child processes (those idle children free up some CPU time, so you'll have the headroom to run a few more), but you'll never get it all back, and you'll be lucky to recover even half of the lost throughput.
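
A rough sketch of why you can't win that back just by adding children (the numbers are illustrative):

```python
def per_child_throughput_factor(cpu_ms, idle_ms):
    """If each message needs cpu_ms of actual work but the child also
    sits idle for idle_ms waiting on the network, each child's
    throughput drops by this factor; you'd need roughly its inverse
    in extra children just to stand still."""
    return cpu_ms / (cpu_ms + idle_ms)

# e.g. 250 ms of CPU work plus 250 ms of network idle time halves
# per-child throughput:
# per_child_throughput_factor(250, 250) -> 0.5
```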

If you'd like to share a database between distributed MXes/spamd machines, you're best off using replication and limiting autolearning to the machines that connect to the master database server.
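
A hedged sketch of what that split might look like in local.cf, assuming the MySQL Bayes backend (hostnames are placeholders):

```
# On the machine(s) allowed to write (connected to the master):
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:bayes:master.example.com
bayes_auto_learn    1

# On the remote MXes (read from a nearby replica, never autolearn):
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:bayes:localhost
bayes_auto_learn    0
```

The replicas still pick up new tokens through replication from the master; they just never generate writes of their own over the WAN.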

If you process a lot of mail and are using autolearn you are going to
have a large bayes database, period.  If the database isn't large enough
it is going to churn so fast that it'll defeat the purpose of even
having a bayes database.

I had pretty much come to that conclusion, but all the posts I found were
talking about token databases in the low hundreds of thousands, and I've
been seeing millions...  Wasn't sure I wasn't overlooking something big.

For a comparison, I've got a $10/month VPS with 128 MB of RAM serving a MySQL-backed SA bayes database with 2.5 million tokens. It runs fine.

Daryl
