> Well, I was suggesting making the expiry period just under, not the
> force-expire.. Really you can do it either way as long as expiry_period
> < force-expire.

Ok, I misunderstood what you were saying.  I set bayes_expiry_period to 3
hours, and ran expires every 4 hours overnight.

I still get the same results - which makes sense after thinking about it.
It calculates newdelta - the point in time it wants to expire back to.  If
newdelta is still below bayes_expiry_period, which it is here, it falls back
to the "can't use estimation method for expiry" mode.  I think the only way
to make this work as intended would be to set bayes_expiry_period much
shorter - short enough that fewer than bayes_expiry_max_db_size tokens are
created (accessed?) in that period - or to increase bayes_expiry_max_db_size
above the number created in bayes_expiry_period.

To make 600,000 work, I'd need to set bayes_expiry_period to less than an
hour.  Or, for bayes_expiry_period of 3 hours, set bayes_expiry_max_db_size
to something like 2 million.  Which of course is why I originally commented
that bayes_expiry_period should be a config parameter instead of hard-coded.
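
The back-of-the-envelope math, using my ballpark of roughly 2 million tokens
touched per 3 hours here:

    tokens_per_hour = 2_000_000 / 3   # approximate token-touch rate at this site

    max_db_size = 600_000
    # longest expiry period that still leaves fewer than max_db_size tokens
    # touched inside the window:
    print(max_db_size / tokens_per_hour)      # ~0.9 hours

    period_hours = 3
    # or, keeping a 3-hour period, the db size it would take:
    print(tokens_per_hour * period_hours)     # ~2,000,000 tokens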

> The problem is that doesn't make any physical sense. The tokens are the
> same.
> 
> It's not like there's 6 tokens generated for one message, and 5
> completely different ones for the next. Odds are you'd only have 6
> tokens total. SA  just tracks them as counters. So, it's not like it
> tracks "this instance of "hello" was learned on xyz date, and came from
> message-id 1234".. SA just tracks "hello was in 150 nonspams 120 spams,
> and was last present in an email on 11/29/2007"
> 
> Besides, let's say you've got some kind of flag that makes manually
> learned tokens be retained longer, and added it onto the end of the
> record. In very short order your entire database would have this flag if
> you have any regular manual training. Any token that got autolearned is
> likely to get flagged by a manual training in very short order, because
> even if the emails aren't the same, the tokens generally are.
> 
> The whole reason bayes works is the fact that there's a *LOT* of tokens
> that are repeated over and over and over again for any given kind of
> mail. So the set of tokens acted on by one message are 95% the same as
> the ones in another, provided the general type of email is the same (and
> by general type, I'm thinking all email fits into maybe 20 types, I'm
> talking really broad categories like "conversation" "newsletter" "spam"
> "nonspam ad", etc..)

Guess I need to read up on Bayes some more.
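
If I follow the counter model right, though, it's basically per-token counts
plus a last-seen time - something like this (illustrative field names, not
SA's actual schema):

    bayes_db = {
        "hello": {"nspam": 120, "nham": 150, "last_atime": "2007-11-29"},
    }

    def learn(db, tokens, is_spam, when):
        for tok in tokens:
            rec = db.setdefault(tok, {"nspam": 0, "nham": 0, "last_atime": when})
            rec["nspam" if is_spam else "nham"] += 1
            rec["last_atime"] = when   # only the most recent sighting is kept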

I was thinking more along the lines of separate databases for auto and
manual learning that are combined for a result, giving more weight to manual
learning.  Maybe that just isn't reasonable, though.  I can't see (at least
here) that manual learning would get any kind of significant volume.
Someone's only going to send in a message for manual learning if it is a
leaked spam or a false positive, and then only if they bother to do it.  I'd
be surprised if the manual learning volume was 1 in 10,000 of the messages
going through the auto-learning.
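
For what it's worth, what I had in mind was roughly this - purely
hypothetical, nothing SA actually supports, and the weight is made up:

    MANUAL_WEIGHT = 10   # count each manually-trained observation 10x

    def merged_counts(token, auto_db, manual_db):
        a = auto_db.get(token, {"nspam": 0, "nham": 0})
        m = manual_db.get(token, {"nspam": 0, "nham": 0})
        return {"nspam": a["nspam"] + MANUAL_WEIGHT * m["nspam"],
                "nham":  a["nham"]  + MANUAL_WEIGHT * m["nham"]}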

Wes

