Hi,
 
Quick background: I've run SA (using amavisd-new and postfix) for a few years,
on a server hosting a few domains. I'd say I have some understanding of SA and a
rudimentary understanding of statistics. I've read the docs and searched the
mailing list archives for previous discussions on this topic, and having found
nothing, I'm posting here. My apologies if this is considered OT.

When we first started using Bayes, we used auto-learning. About two years ago,
we switched completely to manual training, harvesting ham and spam from
different users and domains. I'll try to summarize my grounds for doing so,
hoping for input from people with more precise knowledge of SA's Bayes
implementation.

On our 60-some domains SA has, since 2006-03-01, caught 70,000 spam out of a
total of 150,000 messages. Now, say we were to use auto-learn. Over this fairly
short time span alone, the Bayes filter would have learned from tens of
thousands of messages above the threshold. Sounds good? I'm not so sure...

In plain terms, the Bayes principle is simple: new messages are compared against
a database of tokens (patterns, if you like) gathered from previous spam and
ham, and the degree of similarity to those corpora is expressed as a "spam
probability" between 0 and 1. Something like that. Now, I've seen many people
advise using auto-learn with a fairly high threshold, and also feeding in the
occasional false positives/negatives as learning material.
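
(For reference, by "auto-learn with a fairly high threshold" I mean the usual
knobs in local.cf; the values below are only an illustration, not necessarily
what anyone should run:

    use_bayes 1
    bayes_auto_learn 1
    bayes_auto_learn_threshold_spam 12.0
    bayes_auto_learn_threshold_nonspam 0.1
)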

My layman's thinking then works like this: if Bayes learns *only* by auto-learn,
then a new message will get a high spam probability from the Bayes filter if it
looks very much like the messages that scored 12+ or 20+ on the other rules. So,
point #1: wouldn't that new message most probably get a truckload of points from
the other rules *anyway*? What use is Bayes then, if it's just saying "haha, you
scored 24.5, but I'm *reallyreally* sure you're spam - so now you're at 28.3!
Eat that!"? Of course, I see the chance of a spam message which "looks like"
spam to Bayes while ducking all or most other rules - I just don't think that's
a likely scenario with this kind of training.
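
To spell out my own numbers: Bayes only contributes a fixed rule score on top of
everything else (BAYES_99 or similar), so for an obvious spam the picture is
roughly:

    score from other rules:            24.5
    BAYES_99 (probability near 1.0):   +3.8   (whatever that rule is scored at)
    total:                             28.3

which changes nothing about whether the message gets tagged.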

Then, point #2: say you use auto-learning as above, and then drop in a few false
positives/negatives by hand. Would that material get drowned out by all the
other material in the Bayes DB, so that mail in the "twilight zone" gets little
help from Bayes in being pulled to the right side of the threshold? Which is
what matters, anyway - in your first week of running SA it's fun to shout out
new high scores, but that stops mattering pretty quickly.
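
(One way to get a feel for those proportions is to look at the message counters
in the Bayes DB, e.g.

    sa-learn --dump magic | grep -E 'nspam|nham'

- with amavisd-new you may need --dbpath pointing at the amavis user's Bayes
files; the exact path depends on the setup.)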

So, I started off with an empty database. I gathered ham from users on the
different domains, taking only mail that had scored 2.0 or higher. For spam, I
pulled up about 2000 mails which had scored below 9.0. I've kept adding a little
of both every now and then. But the database is obviously beginning to become a
bit outdated, so I'm thinking about doing the same round again, even if it's a
few hours' work.
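
(The training itself is nothing fancy - roughly along these lines, where the
mbox files and the --dbpath are of course specific to our amavisd-new setup and
only meant as an illustration:

    sa-learn --dbpath /var/amavis/.spamassassin --ham  --mbox harvested-ham.mbox
    sa-learn --dbpath /var/amavis/.spamassassin --spam --mbox harvested-spam.mbox
)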

So, I was hoping to get some other opinions here. What do people think about my
reasoning?
 
Med vennleg helsing / Best regards
Gaute Lund
IT consultant
iDrift AS
Phone: (+47) 53 47 22 00
Fax: (+47) 53 47 22 01
Mobile: (+47) 97 00 82 00
