Wes wrote:
On 11/29/07 1:00 PM, "John D. Hardin" <[EMAIL PROTECTED]> wrote:

Have you considered pushing your autolearn thresholds a bit further
out, to reduce the number of messages that are eligible for autolearn
and thus reduce the growth of the token database?

I hadn't thought about that, but I'm not sure it would make sense here.  The
man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold shows:

       bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
           The score threshold below which a mail has to score, to be fed
           into SpamAssassin's learning systems automatically as a
           non-spam message.

       bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
           The score threshold above which a mail has to score, to be fed
           into SpamAssassin's learning systems automatically as a spam
           message.
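
In local.cf those would look something like this (the values below are purely illustrative of pushing the window further out, not recommendations):

```
# local.cf -- illustrative values only
# Tighten the autolearn window so fewer messages qualify:
bayes_auto_learn_threshold_nonspam -0.5    # default 0.1
bayes_auto_learn_threshold_spam    15.0    # default 12.0
```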

Since the mail has already been processed by a commercial scanner, the
majority of the mail is now good - we're trying to catch leakage.  That
means most of the auto-learning is good mail.  I'm thinking increasing
bayes_auto_learn_threshold_nonspam would be a bad thing, no?

It'd decrease your token count, but it'd decrease the usefulness of bayes by a larger factor.

Do not waste any more time trying to get more performance out of DBM.
Just about any SQL based database will perform a lot better than DBM
will when your bayes database is large.

That's good feedback.  I was hoping that, but don't quite have the DB up and
running yet - gotta get it working in the test environment before putting it
somewhere with a load.  Then the question becomes how much network latency
can be tolerated before there's a performance problem (e.g. between physical
locations).

IIRC (it's been about three years since I looked at the code for this) tokens are pulled in a loop 100 at a time per message. So each message is probably going to have to poll the SQL server 5 times (+/- another 5/3?) just for tokens. Add in a couple of other queries (especially if it's decided to autolearn the message) and the latency starts to add up.
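
As a back-of-the-envelope sketch of that math (the token count, batch size, and RTT here are assumptions for illustration, not measurements):

```python
import math

def token_query_rtts(unique_tokens, batch_size=100):
    """Round trips needed to fetch all tokens, batch_size at a time."""
    return math.ceil(unique_tokens / batch_size)

def per_message_latency_ms(unique_tokens, rtt_ms, extra_queries=2):
    """Rough network stall per message: token batches plus a couple of
    bookkeeping queries (more if the message gets autolearned)."""
    return (token_query_rtts(unique_tokens) + extra_queries) * rtt_ms

# e.g. a message with ~500 unique tokens over a 50 ms link:
# token_query_rtts(500)            -> 5 round trips
# per_message_latency_ms(500, 50)  -> (5 + 2) * 50 = 350 ms
```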

The last time I tried sharing a bayes database over the internet, it didn't go well at all past a few thousand messages a day (so not useful at all). However, there was a cable modem in use without any traffic shaping in place to defeat the cable modem's huge buffer, so that could have had an insane impact on it.

Even still though, 5 queries times, say, 50ms each is a quarter of a second that the spamd child process sits idle. That leaves you trying to make up for it by running more child processes (those idle children free up some CPU time, so you'll have the headroom to run a few more), but you'll never get it all back, and you'll be lucky to recover even half of the lost throughput.
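
A rough sketch of why you can't win that back just by adding children (the numbers are illustrative):

```python
def per_child_throughput_factor(cpu_ms, idle_ms):
    """If each message needs cpu_ms of actual work but the child also
    sits idle for idle_ms waiting on the network, each child's
    throughput drops by this factor; you'd need roughly its inverse
    in extra children just to stand still."""
    return cpu_ms / (cpu_ms + idle_ms)

# e.g. 250 ms of CPU work plus 250 ms of network idle time halves
# per-child throughput:
# per_child_throughput_factor(250, 250) -> 0.5
```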

If you'd like to share a database between distributed MXes/spamd machines, you're best off using replication and limiting autolearning to the machines that connect to the master database server.
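
A hedged sketch of what that split might look like in local.cf, assuming the MySQL Bayes backend (hostnames are placeholders):

```
# On the machine(s) allowed to write (connected to the master):
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:bayes:master.example.com
bayes_auto_learn    1

# On the remote MXes (read from a nearby replica, never autolearn):
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:bayes:localhost
bayes_auto_learn    0
```

The replicas still pick up new tokens through replication from the master; they just never generate writes of their own over the WAN.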

If you process a lot of mail and are using autolearn you are going to
have a large bayes database, period.  If the database isn't large enough
it is going to churn so fast that it'll defeat the purpose of even
having a bayes database.

I had pretty much come to that conclusion, but all the posts I found were
talking about token databases in the low hundreds of thousands, and I've
been seeing millions...  Wasn't sure I wasn't overlooking something big.

For a comparison, I've got a $10/month VPS with 128 MB of RAM serving a MySQL-backed SA bayes database with 2.5 million tokens. It runs fine.

Daryl
