Wes wrote:
> On 11/29/07 7:45 PM, "Matt Kettler" <[EMAIL PROTECTED]> wrote:
>
>> As a starting point I'd suggest:
>> either disable your force-expire calls or disable bayes_auto_expire.
>
> I am doing only force-expires. I disabled auto-expire when I started
> doing force-expires.
>
>> Doesn't matter to me which, but you really want to be expiring at the
>> bayes_expiry_period interval. Drop bayes_expiry_period to 3 hours; if
>> you're still using force-expires, make it just a tad under 3 hours.
>> Expand bayes_expiry_max_db_size to at least 300,000, maybe 600,000.
>
> Thanks. I'll give this a try.
>
> If bayes_expiry_period is set to 3 hours, shouldn't the force-expire be
> just *over* 3 hours, not just under?

Well, I was suggesting making the expiry period just under, not the
force-expire. Really you can do it either way, as long as
expiry_period < force-expire.
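Concretely, and just as a sketch (the option names and values here are from
memory, bayes_expiry_period isn't documented in every SA version, and the
cron line is a placeholder you'd adapt to whatever user/DB you run sa-learn
against), the setup I'm suggesting would look something like:

    # local.cf
    bayes_auto_expire        0        # you're doing manual force-expires instead
    bayes_expiry_max_db_size 600000   # 300k at minimum, 600k is safer
    bayes_expiry_period      10500    # just under 3 hours, in seconds
                                      # (only if your version honors it)

    # crontab: run the forced expiry every 3 hours
    0 */3 * * * sa-learn --force-expire

That keeps the expiry period strictly shorter than the interval between
force-expire runs, which is the relationship that matters.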
> Otherwise wouldn't the "can't use estimation method for expiry" always
> be triggered as it is now?

Aye.

> I planned to have the PostgreSQL DB enabled on one live system tonight,
> but have to wait on a couple of missing RPM's to be installed. I have
> great hopes for it... I am running a nearly 2 billion record database
> under PostgreSQL with great performance. A few million records should
> be nothing... Guess it depends on what the update vs. read load is.

Should be mostly read, except that the atimes for the tokens in a message
get updated for every message that's scanned.

> I would think it would be extremely useful to be able to treat
> manually-learned rules separately from auto-learned rules. In a
> high-volume environment, you'd want to keep manually learned rules far
> longer than you could possibly keep auto-learned ones. Manually learned
> rules should be more important.

The problem is that that doesn't make any physical sense. The tokens are
the same. It's not as if there are 6 tokens generated for one message and
5 completely different ones for the next; odds are you'd have only 6
tokens total. SA just tracks them as counters. It doesn't record "this
instance of 'hello' was learned on xyz date and came from message-id
1234"; it just records "'hello' was in 150 nonspams and 120 spams, and
was last present in an email on 11/29/2007".

Besides, let's say you added some kind of flag onto the end of the record
that made manually learned tokens be retained longer. In very short order
your entire database would carry that flag if you do any regular manual
training. Any token that got autolearned is likely to get flagged by a
manual training soon afterward, because even if the emails aren't the
same, the tokens generally are.

The whole reason bayes works is that there's a *LOT* of tokens repeated
over and over and over again for any given kind of mail. So the set of
tokens acted on by one message is 95% the same as the set in another,
provided the general type of email is the same (and by general type, I
mean all email fits into maybe 20 types; I'm talking really broad
categories like "conversation", "newsletter", "spam", "nonspam ad",
etc.).
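To make the "counters, not per-message records" point concrete: the SQL
bayes backends keep roughly one row per token per user, something like the
following (a sketch from memory of the stock bayes_token layout, not the
exact DDL shipped with SA, so treat column types as illustrative):

    -- illustrative sketch of the bayes_token table used by the SQL backends
    CREATE TABLE bayes_token (
      id         INTEGER NOT NULL,            -- which bayes database (user) the row belongs to
      token      BYTEA   NOT NULL,            -- the (hashed) token itself
      spam_count INTEGER NOT NULL DEFAULT 0,  -- times seen in learned spam
      ham_count  INTEGER NOT NULL DEFAULT 0,  -- times seen in learned ham
      atime      INTEGER NOT NULL DEFAULT 0,  -- last time the token was seen (epoch seconds)
      PRIMARY KEY (id, token)
    );

There's no per-message provenance in there: learning a message just bumps
spam_count or ham_count and refreshes atime, which is exactly why a
"manually learned" flag would spread across the whole table almost
immediately.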