Hi Bob,

Many thanks for taking the time to send such a detailed reply.

> Your situation is similar to mine, but I'm still at SA 2.63. Last week's
> performance stunk at 0 false positives and 20 false negatives (a rotten
> 99.5% accuracy record; I'm not satisfied unless I hit 99.8%).

That's awesome; I'd be happy with 97%, as long as there are almost no false positives on unicast mail.

> Adding custom rules is among the last things you want to do. I do them,
> and I can help you with the process (provided you can run bash scripts
> under cron), but there are things you want to do first.

I had considered running SpamAssassin from a background job, but there seemed to be a bad interaction with IMAP (see below).

> Step 1: If False Positives are your major problem,
> a) identify which rules are causing the false positives and lower their
> scores, or
> b) raise your required_hits, or
> c) both.  I use required_hits of 9.0, and have modified the scores of
> several dozen rules.
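For concreteness, step 1 in user_prefs might look something like this (the particular rules and scores here are only illustrative, not a recommendation -- check which rules actually fire on your false positives first):

```
# Raise the spam threshold from the default 5.0
required_hits 9.0

# Tame rules that keep hitting legitimate mail
# (rule names and values are examples only)
score HTML_MESSAGE   0.1
score MIME_HTML_ONLY 0.5
```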

We don't have an FP problem at all. Mail sent by individuals almost always gets a negative score, and our users know that they need to make a whitelist entry if they don't want to miss "Sex News Daily" ;) It's the dozen spams per user per day that leak through that is our problem.

> Step 2: Having done step 1, you'll increase the amount of spam that comes
> through. Identify which distribution rules hit that spam, and raise their
> scores enough to score the spam, without causing false positives.
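A quick way to do that identification is to tally the tests= field of the X-Spam-Status headers on your saved false negatives. This is only a sketch: "missed-spam.mbox" is a hypothetical filename (a two-line sample is created here so the pipeline has input), and it assumes each X-Spam-Status header has been folded onto a single line first.

```shell
# Build a tiny sample of uncaught spam headers (stand-in for a real mbox)
cat > missed-spam.mbox <<'EOF'
X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_MESSAGE
X-Spam-Status: No, hits=4.1 required=5.0 tests=BAYES_50
EOF

# Count how often each rule fired, most frequent first
grep -h '^X-Spam-Status:' missed-spam.mbox |
  sed -e 's/.*tests=//' -e 's/ .*//' |
  tr ',' '\n' |
  sort | uniq -c | sort -rn
```

The rules at the top of that list are the candidates for a score bump.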

Well, a typical false negative shows:

X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_90_100,
	HTML_IMAGE_ONLY_02,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_SBL

The only difference, unfortunately, between this and much commercial
ham is the SBL, but that gets too polluted with ham sources to assign
it a much bigger score.


> Step 3: Bayes is your friend. Identify all email as guaranteed spam,
> guaranteed not-spam, spam discussions, and uncertain. Feed the first two
> into the Bayes system consistently and accurately, and that will help
> enormously. So enormously that some people will recommend doing step 3
> before steps 1 and 2.

Yes. I made a big mistake here, naively thinking that the autolearn feature would do an adequate job. I now suspect that the bayes_* files on my server are garbage. Should I save and delete them before feeding the spam and ham corpora to sa-learn? Is it necessary to run sa-learn on mail that SpamAssassin has already correctly classified?
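For reference, wiping and retraining would look roughly like this (a command sketch; sa-learn's --clear, --spam, --ham, and --mbox options are real, but the corpus filenames and backup path are made up, and you'd want the backup before the --clear):

```
cp ~/.spamassassin/bayes_* /some/backup/dir/   # save the old database first
sa-learn --clear                               # drop the suspect database
sa-learn --spam --mbox spam-corpus.mbox        # feed known spam
sa-learn --ham  --mbox ham-corpus.mbox         # feed known ham
```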

> Step 4: Your system does allow for whitelist and blacklist entries. Maybe
> this should be in front of step 1 also: identify from your false
> positives those sites that can be reliably whitelisted with
> whitelist_from_rcvd (use the _rcvd version rather than just
> whitelist_from whenever possible). Copy William Stearns' blacklist file
> from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
> your user_prefs.
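The whitelist syntax, for reference (the addresses here are made up):

```
# Whitelist only when the mail really came through the sender's servers
whitelist_from_rcvd  newsletter@example.com  example.com

# Plain whitelist_from matches the From: address alone, which is forgeable
whitelist_from  aunt.tillie@example.org
```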

Many thanks for this link. I manually checked some uncaught spam against it, and found hits on about 75%! I'll be installing this right away. However, it is IMO unfortunate that we are forced to blacklist by name. Bill Waggoner alone accounts for about 1000 domains on Mr. Stearns' list. If we could say blacklist_from_rcvd 69.42.96.0/19, one line would do the job of 1000. More importantly, it would last a lot longer, because this A-hole got his IPs directly from ARIN and they are unlikely to change any time soon. OTOH, he registers a dozen new domains every day!
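There's no CIDR form of blacklist_from, but a custom header rule can approximate it. An untested sketch: the regex spells out 69.42.96.0/19 (third octets 96 through 127), and the rule name and score are arbitrary:

```
# Match mail relayed from the 69.42.96.0/19 netblock (third octet 96-127)
header   RCVD_WAGGONER_NET  Received =~ /\[69\.42\.(9[6-9]|1[01][0-9]|12[0-7])\./
score    RCVD_WAGGONER_NET  4.0
describe RCVD_WAGGONER_NET  Relayed from a known spammer netblock
```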

> Bayes:  Do your people retrieve their email using POP3 (in which case
> they probably get the inbox mail only), or do they use webmail? If the
> latter, have them create two more folders: spam and notspam. Have them
> move all spam into the spam folder. Have them copy (not move) all
> non-spam into the notspam folder. Have a cron job which runs sa-learn
> against these mbox files on a regular basis (mine runs hourly), deleting
> the mbox files when done.
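That cron job might look something like this (a sketch with made-up paths; the mbox locations depend entirely on your IMAP server's layout, and truncating a folder out from under a running IMAP server has its own hazards):

```
#!/bin/sh
# Hypothetical hourly training script, e.g. /etc/cron.hourly/sa-train
for user in /home/*; do
    spam="$user/mail/spam"
    ham="$user/mail/notspam"
    # Learn from each folder, then empty it so nothing is fed twice
    [ -s "$spam" ] && sa-learn --spam --mbox "$spam" && : > "$spam"
    [ -s "$ham"  ] && sa-learn --ham  --mbox "$ham"  && : > "$ham"
done
```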

We don't use POP3 at all; it's mostly IMAP and occasionally webmail. The good news is that the folders you describe are easily accessible; I'll try that in the next couple of days and let you know how it works. The bad news (I think) is that when users leave their Outlook open, new mail appears on the desktop within seconds of being delivered to the server. This would prevent a cron-based task from re-sorting the mail properly.

> No, under your setup there's no way for each mailbox to have its own
> user_prefs; there's one user_prefs for each master domain and that's it.
> There's also no way for each mailbox to have its own bayes database --
> there's one bayes database for the entire master domain.

I realize that this is true for my present setup. However, I hope that the new setup won't have those restrictions. If it's possible to run SpamAssassin via cron or whatever, it should also be possible to run a private copy that is installed in my home directory. I hope that by determining the recipient and setting up an appropriate environment prior to invoking SpamAssassin, independent bayes and prefs will work. If not, hey, SpamAssassin is made of this amazing stuff called open source -- you can change the code and make it do what you want. Of course, it may take more effort than the improvement in performance would justify, so I'll first see how much improvement sa-learn gives.
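The per-recipient idea might be sketched like this (a hypothetical wrapper: spamassassin's -p/--prefspath option and the bayes_path setting are real, but the delivery hook and paths are made up):

```
#!/bin/sh
# Hypothetical delivery wrapper: $1 is the recipient's local part.
# Each user gets a private prefs file, and that prefs file can point
# bayes_path at a private Bayes database, e.g.:
#   bayes_path /home/jane/.spamassassin/bayes
prefs="/home/$1/.spamassassin/user_prefs"
exec spamassassin -p "$prefs"
```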

> Once you've done the above three steps, then we can explore whether the
> method I use for implementing my own custom rules will work for you.

Thanks again,

Stewart



