On 11/29/07 1:00 PM, "John D. Hardin" <[EMAIL PROTECTED]> wrote:
> Have you considered pushing your autolearn thresholds a bit further
> out, to reduce the number of messages that are eligible for autolearn
> and thus reduce the growth of the token database?

I hadn't thought about that, but I'm not sure it would make sense here.
"man Mail::SpamAssassin::Plugin::AutoLearnThreshold" shows:

    bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)
        The score threshold below which a mail has to score to be fed
        into SpamAssassin's learning systems automatically as a
        non-spam message.

    bayes_auto_learn_threshold_spam n.nn (default: 12.0)
        The score threshold above which a mail has to score to be fed
        into SpamAssassin's learning systems automatically as a spam
        message.

Since the mail has already been processed by a commercial scanner, the
majority of what we see is now good - we're trying to catch leakage.
That means most of the auto-learning is of good mail. I'm thinking
increasing bayes_auto_learn_threshold_nonspam would be a bad thing, no?

> Do not waste any more time trying to get more performance out of DBM.
> Just about any SQL based database will perform a lot better than DBM
> will when your bayes database is large.

That's good feedback. I was hoping that would be the case, but I don't
quite have the database up and running yet - I've got to get it working
in the test environment before putting it somewhere with a load. Then
the question becomes how much network latency can be tolerated before
there's a performance problem (e.g. between physical locations).

> If you process a lot of mail and are using autolearn you are going to
> have a large bayes database, period. If the database isn't large enough
> it is going to churn so fast that it'll defeat the purpose of even
> having a bayes database.

I had pretty much come to that conclusion, but all the posts I found
were talking about token databases in the low hundreds of thousands of
tokens, and I've been seeing millions... I wasn't sure I wasn't
overlooking something big.

Wes
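
[Editor's note: for reference, a minimal sketch of the local.cf settings
discussed above - the autolearn thresholds and an SQL-backed bayes store.
The database name, host, and credentials below are placeholders, not
values from this thread; adjust threshold values to taste.]

```
# Autolearn thresholds (defaults shown; tightening the nonspam
# threshold, e.g. lowering it toward 0.0, reduces how much ham
# is auto-learned and slows token database growth)
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  0.1
bayes_auto_learn_threshold_spam     12.0

# SQL-backed bayes store instead of DBM (placeholder DSN/credentials;
# see SpamAssassin's sql/README.bayes for schema setup)
bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
bayes_sql_username  sa_user
bayes_sql_password  CHANGEME
```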