On 11/29/07 1:00 PM, "John D. Hardin" <[EMAIL PROTECTED]> wrote:
> Have you considered pushing your autolearn thresholds a bit further
> out, to reduce the number of messages that are eligible for autolearn
> and thus reduce the growth of the token database?

I hadn't thought about that, but I'm not sure it would make sense here.
"man Mail::SpamAssassin::Plugin::AutoLearnThreshold" shows:

    bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)
        The score threshold below which a mail has to score to be fed
        into SpamAssassin's learning systems automatically as a
        non-spam message.

    bayes_auto_learn_threshold_spam n.nn (default: 12.0)
        The score threshold above which a mail has to score to be fed
        into SpamAssassin's learning systems automatically as a spam
        message.

Since the mail has already been processed by a commercial scanner, the
majority of what we see is now good - we're trying to catch leakage.
That means most of the auto-learning is of good mail. I'm thinking
increasing bayes_auto_learn_threshold_nonspam would be a bad thing, no?

> Do not waste any more time trying to get more performance out of DBM.
> Just about any SQL based database will perform a lot better than DBM
> will when your bayes database is large.

That's good feedback. I was hoping that would be the case, but I don't
quite have the database up and running yet - I've got to get it working
in the test environment before putting it somewhere with a load. Then
the question becomes how much network latency can be tolerated before
there's a performance problem (e.g. between physical locations).

> If you process a lot of mail and are using autolearn you are going to
> have a large bayes database, period. If the database isn't large enough
> it is going to churn so fast that it'll defeat the purpose of even
> having a bayes database.

I had pretty much come to that conclusion, but all the posts I found
were talking about token databases in the low hundreds of thousands of
tokens, and I've been seeing millions... I wasn't sure I wasn't
overlooking something big.

Wes
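
[Editor's note: for reference, a minimal sketch of the local.cf settings
discussed above - the autolearn thresholds and an SQL-backed bayes store.
The database name, host, and credentials below are placeholders, not
values from this thread; adjust threshold values to taste.]

```
# Autolearn thresholds (defaults shown; tightening the nonspam
# threshold, e.g. lowering it toward 0.0, reduces how much ham
# is auto-learned and slows token database growth)
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  0.1
bayes_auto_learn_threshold_spam     12.0

# SQL-backed bayes store instead of DBM (placeholder DSN/credentials;
# see SpamAssassin's sql/README.bayes for schema setup)
bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
bayes_sql_username  sa_user
bayes_sql_password  CHANGEME
```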