> Well, I was suggesting making the expiry period just under, not the
> force-expire..  Really you can do it either way as long as
> expiry_period < force-expire.
Ok, I misunderstood what you were saying.  I set bayes_expiry_period to
3 hours, and ran expires every 4 hours overnight.  I still get the same
results - which makes sense after thinking about it.  It calculates
newdelta - the point in time it wants to expire back to.  If newdelta is
below bayes_expiry_period, which it still is, it reverts to the "can't
use estimation method for expiry" mode.

I think the only way to make this work as intended would be to set
bayes_expiry_period much shorter - short enough that fewer than
bayes_expiry_max_db_size tokens are created (accessed?) in that period -
or to increase bayes_expiry_max_db_size above the number created in
bayes_expiry_period.  To make 600,000 work, I'd need to set
bayes_expiry_period to less than an hour.  Or, for a bayes_expiry_period
of 3 hours, set bayes_expiry_max_db_size to something like 2 million
(see the back-of-envelope sketch at the end of this message).  Which, of
course, is why I originally commented that bayes_expiry_period should be
a config parameter instead of hard-coded.

> The problem is that doesn't make any physical sense.  The tokens are
> the same.
>
> It's not like there's 6 tokens generated for one message, and 5
> completely different ones for the next.  Odds are you'd only have 6
> tokens total.  SA just tracks them as counters.  So, it's not like it
> tracks "this instance of "hello" was learned on xyz date, and came
> from message-id 1234"..  SA just tracks "hello was in 150 nonspams 120
> spams, and was last present in an email on 11/29/2007"
>
> Besides, let's say you've got some kind of flag that makes manually
> learned tokens be retained longer, and added it onto the end of the
> record.  In very short order your entire database would have this flag
> if you have any regular manual training.  Any token that got
> autolearned is likely to get flagged by a manual training in very
> short order, because even if the emails aren't the same, the tokens
> generally are.
>
> The whole reason bayes works is the fact that there's a *LOT* of
> tokens that are repeated over and over and over again for any given
> kind of mail.  So the set of tokens acted on by one message are 95%
> the same as the ones in another, provided the general type of email is
> the same (and by general type, I'm thinking all email fits into maybe
> 20 types, I'm talking really broad categories like "conversation"
> "newsletter" "spam" "nonspam ad", etc..)

Guess I need to read up on Bayes some more.  I was thinking more along
the lines of separate databases for auto and manual learning that are
combined for a result, giving more weight to manual learning.  Maybe
that just isn't reasonable, though.

I can't see (at least here) that manual learning would get any kind of
significant volume.  Someone's only going to send in a message for
manual learning if it is a leaked spam or a false positive, and then
only if they bother to do it.  I'd be surprised if the manual learning
volume was 1 in 10,000 of the messages going through the auto-learning.

Wes
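
P.S. Here's a rough sketch of the arithmetic above.  This is NOT
SpamAssassin's actual expiry code; it only captures the rule of thumb
that the estimation method can succeed when fewer tokens are touched
during one bayes_expiry_period than bayes_expiry_max_db_size allows.
The ~650,000 tokens/hour rate is an assumption back-derived from the
600,000 / "under an hour" / "2 million" figures above, not a measured
value.

    # Rough model only - not SpamAssassin's expiry code.  The fast
    # "estimation method" is treated as usable only when the tokens
    # touched during one bayes_expiry_period fit under
    # bayes_expiry_max_db_size; otherwise newdelta lands inside the
    # period and SA falls back to the slow "can't use estimation
    # method" pass.

    def estimation_method_usable(tokens_touched_per_hour,
                                 expiry_period_hours,
                                 max_db_size):
        touched_in_period = tokens_touched_per_hour * expiry_period_hours
        return touched_in_period < max_db_size

    # Assumed token-touch rate (tokens/hour), see note above.
    rate = 650000

    print(estimation_method_usable(rate, 3.0, 600000))   # False: 3 h period, 600k cap
    print(estimation_method_usable(rate, 0.9, 600000))   # True: period under an hour
    print(estimation_method_usable(rate, 3.0, 2000000))  # True: 3 h period, ~2M cap

Plugging in a measured token-touch rate for your own site would show
which of the two knobs (period or max_db_size) needs to move.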