Hi,

Quick background: I've run SA (using amavisd-new and postfix) for a few years, on a server hosting a few domains. I'd say I have some understanding of SA and a rudimentary understanding of statistics. I've read the docs and searched the mailing list for previous discussions on this topic, and having found nothing, I'm posting here. My apologies if this is considered OT.
When we first started using bayes, we used auto-learning. Two years ago we switched completely to manual training, harvesting ham and spam from different users and domains. I'll try to summarize my reasons for doing so, hoping for input from people with more precise knowledge of SA's bayes implementation.

On our 60-some domains, SA has caught 70 000 spam out of a total of 150 000 messages since 2006-03-01. Now, say we were to use auto-learn. Over this short time span alone, the bayes filter would have learned from tens of thousands of messages above the threshold. Sounds good? I'm not so sure.

In plain terms, the bayes principle is simple: new messages are compared against a database of tokens gathered from previous spam and ham, and the degree of similarity to those corpora is expressed as a "spam probability" between 0 and 1. Something like that (I've put a toy illustration in the P.S. below, for the curious).

Now, I've seen many people advise using auto-learn with a fairly high threshold, and also feeding in the occasional false positives/negatives as learning material. My layman's thinking then goes like this: if bayes learns *only* by auto-learning, new messages will get a high spam probability from the bayes filter when they look very much like messages that scored 12+ or 20+ on the other rules.

So, point #1: wouldn't that new message most probably get a truckload of points from the other rules *anyway*? What use is bayes then, if it's just saying "haha, you scored 24.5, but I'm *reallyreally* sure you're spam - so now you're at 28.3! Eat that!"? Of course, I can see the chance of a spam message that "looks like" spam to bayes while ducking all or most of the other rules - I just don't think that's a likely scenario with this kind of training.

Then, point #2: say you use auto-learning as above, and then drop in a few false positives/negatives. Wouldn't that material get drowned out by all the other material in the bayes db? So mail in the "twilight zone" would get little help from bayes in being pulled to the right side of the threshold - which is what matters, anyway (in your first week of running SA it's fun to shout out new high scores, then suddenly that doesn't matter so much anymore).

So instead I started off with an empty database. I gathered ham from users on the different domains, taking only mail that had scored 2.0 or higher, and for spam I pulled up about 2000 mails that had scored below 9.0 - in other words, the mail closest to the threshold, where the other rules were least sure. I've kept adding a little of both every now and then. The database is obviously beginning to become a bit outdated, though, so I'm thinking about doing the same round again, even if it's a few hours' work (a rough sketch of the harvesting script is in the P.P.S. below).

So, I was hoping to get some other opinions here. What do people think about my reasoning?

Med vennleg helsing / Best regards

Gaute Lund
IT consultant
iDrift AS
Phone: (+47) 53 47 22 00
Fax: (+47) 53 47 22 01
Mobile: (+47) 97 00 82 00
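P.S. For the curious, a toy illustration of the bayes principle I tried to describe above. This is *not* how SA actually computes its BAYES_* scores (as far as I understand, it uses a chi-square combination over the most significant tokens), and the per-token probabilities below are invented - it's only meant to show how per-token statistics get combined into one 0-to-1 spam probability.

  # Toy naive-bayes combination of per-token spam probabilities.
  # Not SA's real algorithm; the token probabilities are made up.

  def combine(token_probs):
      """Combine per-token P(spam|token) values into one probability."""
      p_spam, p_ham = 1.0, 1.0
      for p in token_probs:
          p_spam *= p          # likelihood of seeing these tokens in spam
          p_ham *= 1.0 - p     # likelihood of seeing them in ham
      return p_spam / (p_spam + p_ham)

  # Tokens mostly seen in spam before -> result close to 1.0:
  print(combine([0.95, 0.80, 0.99, 0.60]))
  # Tokens mostly seen in ham before -> result close to 0.0:
  print(combine([0.10, 0.20, 0.05, 0.40]))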
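P.P.S. And roughly the kind of harvesting script I mentioned, much simplified. It assumes mail sits in Maildir folders and that amavisd-new/SA has stamped an "X-Spam-Status: Yes/No, score=n.n ..." header on each message (header name and format vary per setup); the paths and thresholds are just examples from my setup, not recommendations. After a sanity check of the two output folders, sa-learn --ham and sa-learn --spam on the message files in them do the actual training.

  # Sort delivered mail into two Maildirs of training candidates:
  # "spammy-looking" ham (not tagged as spam, but scored >= 2.0) and
  # "borderline" spam (tagged as spam, but scored < 9.0).
  import mailbox
  import re

  SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")

  def harvest(src_path, ham_out_path, spam_out_path):
      src = mailbox.Maildir(src_path)
      ham_out = mailbox.Maildir(ham_out_path)    # created if missing
      spam_out = mailbox.Maildir(spam_out_path)  # created if missing

      for msg in src:
          status = msg.get("X-Spam-Status", "")
          m = SCORE_RE.search(status)
          if not m:
              continue
          score = float(m.group(1))

          if status.startswith("Yes"):
              if score < 9.0:           # spam the other rules barely caught
                  spam_out.add(msg)
          elif score >= 2.0:            # ham that looked a bit spammy
              ham_out.add(msg)

  harvest("/var/mail/example/Maildir",
          "/tmp/bayes-train/ham", "/tmp/bayes-train/spam")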