Re: False positives and Bayes

Anthony Peacock Fri, 25 Aug 2006 01:26:46 -0700

Hi,

Justin Lloyd wrote:

Hello, all.


A couple of months ago I built new mail servers to replace our existing
ones that had aging mail configurations (and disparate OS
configurations), running sendmail 8.12.6 and SA 3.0.2. Our configuration
now consists of 2 RHEL 4 ES servers that share the load using DNS
round-robin, running sendmail 8.13.7 and SpamAssassin 3.1.3, and we are
running sa-update and rulesdujour nightly (though actual updates are
rare). We use spamass-milter 0.31, which we have configured to drop
spams with scores >= 10, thereby dropping about 75% of the incoming
email before it gets to our Exchange servers. Speaking of which, these
servers do not deliver mail locally, rather all received mail either
goes to internal MS Exchange servers or Linux helpdesk and mailing list
servers. Also, our company is about 350 people and we receive a good
deal of legitimate international email.

Here is our SA configuration from /etc/mail/spamassassin/local.cf:

required_score 5
rewrite_header Subject *** SPAM [_SCORE_] ***
report_safe 0
dcc_path /usr/local/bin/dccproc
razor_config /etc/mail/spamassassin/.razor/razor-agent.conf
dns_available yes
bayes_path /localhost/home/spamd/bayes
bayes_auto_learn_threshold_spam      30
bayes_auto_learn_threshold_nonspam   -0.1
bayes_min_ham_num  100000
bayes_min_spam_num 100000
auto_whitelist_path /localhost/home/spamd/auto-whitelist
include /etc/mail/spamassassin/whitelist
include /etc/mail/spamassassin/blacklist

Here are the statistics from both mail servers for the past 31 days:
        
Email:  1303815  Autolearn: 608540  AvgScore:  12.23  AvgScanTime:  1.38 sec
Spam:    745609  Autolearn: 139632  AvgScore:  23.36  AvgScanTime:  1.52 sec
Ham:     558206  Autolearn: 468908  AvgScore:  -2.63  AvgScanTime:  1.20 sec

Email:   945103  Autolearn: 284139  AvgScore:  15.33  AvgScanTime:  1.46 sec
Spam:    701327  Autolearn: 131994  AvgScore:  22.30  AvgScanTime:  1.46 sec
Ham:     243776  Autolearn: 152145  AvgScore:  -4.74  AvgScanTime:  1.44 sec

(We think the disparity in mail counts between the two is due to some
senders having cached or hard-coded the first one's IP address and using
it rather than MX lookups like normal people do.)

The major problem we are seeing is a number of false positives in the
6-8 point range due to 3.5 points from BAYES_99 on messages that should
not be hitting that rule from what we can see. One thing we've noticed
is that many such messages are from mailing lists and newsletters and
from ISPs that shall remain nameless, though many of these also score
high due to several rfc-ignorant rules being hit.

We have turned off Bayes in the past (before the upgrade) and are
debating doing so again, but first we decided to see what constructive
criticism and advice the SA community may have regarding our
configuration. Please let me know if any additional information would be
useful.


How do you train your Bayes database?

You should be feeding the false positives back using sa-learn as ham, so
that the Bayes scorer learns that these are not spam. I manually train
Bayes with false positives and false negatives on a regular basis.
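
As a rough sketch of that feedback loop: the script below assumes you
collect misclassified messages, one per file, in two directories. The
paths and the HAM_DIR/SPAM_DIR variable names are hypothetical, not
anything from your setup. It should probably be run as whichever user
owns the Bayes database named in bayes_path, so that sa-learn can write
to it.

```shell
#!/bin/sh
# Hypothetical locations for hand-sorted mail; adjust to your own setup.
HAM_DIR="${HAM_DIR:-/var/spool/sa-train/ham}"      # false positives (really ham)
SPAM_DIR="${SPAM_DIR:-/var/spool/sa-train/spam}"   # false negatives (really spam)

# Feed one directory of messages to sa-learn as the given class (ham or spam).
train() {
    class="$1"
    dir="$2"
    if [ ! -d "$dir" ]; then
        echo "skip $class: $dir not found"
        return 0
    fi
    sa-learn "--$class" "$dir"
}

train ham "$HAM_DIR"
train spam "$SPAM_DIR"

# Show token and message counts so you can confirm the training registered.
if command -v sa-learn >/dev/null 2>&1; then
    sa-learn --dump magic
fi
```

If the messages are kept in mbox files rather than one file per message,
sa-learn also accepts --mbox together with --ham or --spam.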

You probably should also be looking at whitelisting some of the mailing
lists. When the manual training really doesn't convince Bayes that the
spammy-looking mailing list messages are ham, I add those lists to one
of the whitelists.
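
For illustration only, entries along these lines could go in the
/etc/mail/spamassassin/whitelist file that your local.cf already
includes. The addresses and domains are placeholders, not real lists.
whitelist_from_rcvd is generally the safer choice where it works,
because it also checks the relay that handed over the message rather
than trusting the From address alone.

```
# Placeholder list address; the second argument is the domain the
# relaying host's reverse DNS is expected to end in.
whitelist_from_rcvd  announce@lists.example.com  example.com

# Fallback when the relay varies: matches on the sender address only,
# so it can be abused by anyone who forges that address.
whitelist_from       *@newsletter.example.org
```

Running spamassassin --lint after editing the file should catch syntax
mistakes before spamd is restarted.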


--
Anthony Peacock
CHIME, Royal Free & University College Medical School
WWW:    http://www.chime.ucl.ac.uk/~rmhiajp/
"If you have an apple and I have  an apple and we  exchange apples
then you and I will still each have  one apple. But  if you have an
idea and I have an idea and we exchange these ideas, then each of us
will have two ideas." -- George Bernard Shaw
