RE: Best Practices: SpamAssassin

Bowie Bailey Fri, 31 Mar 2006 06:34:28 -0800

Ryan Kather wrote:
> 
> SpamAssassin-
> Now here is where I need the help (assuming my postfix section was
> sound).  I want to make sure this is as optimized as possible to
> provide a fair performance picture versus SpamAssassin and Barracuda.
> 
> It appears many seem to be using the Amavisd-new + Postfix +
> SpamAssassin configuration.  Is there a reason not to use this
> design?  I have had good luck with this in the past.


There are several ways to call SpamAssassin.  If you have used Amavis
and are familiar with its configuration, there is no reason not to use
it.  Just keep in mind that when you use SA through Amavis, Amavis
controls the spam threshold and message markup.

> I also have read a lot where people are improving accuracy by
> increasing the scoring of the Bayesian database (which needs
> training).  What would the optimal training method be, given my
> environment?  I could create a shared GroupWise IMAP folder for
> unclassified spam with a cron job to read this into sa-learn.  I
> cannot have a central IMAP folder for false positives, however, as
> other users must not be able to view the email for other users.  How
> can I ensure user false positives are easily reportable?  What do
> others do to train the Bayesian database?  Maia-Mailguard?

You need to be able to train on both spam and ham.  Is it possible to
create an imap folder that your users would be able to put messages
into, but not view?
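
As a rough sketch of the cron-job idea, assuming the contents of the
spam and ham folders have already been copied to two local directories
with one message per file (the paths and function names below are
hypothetical), a small Python script could hand each directory to
sa-learn using its --spam and --ham options:

```python
import subprocess
from pathlib import Path


def build_sa_learn_commands(spam_dir, ham_dir, sa_learn="sa-learn"):
    """Return one sa-learn argv list per folder that exists and holds messages."""
    commands = []
    for flag, folder in (("--spam", Path(spam_dir)), ("--ham", Path(ham_dir))):
        # Skip missing or empty folders so the cron job stays quiet.
        if folder.is_dir() and any(folder.iterdir()):
            commands.append([sa_learn, flag, str(folder)])
    return commands


def train(spam_dir, ham_dir):
    """Run sa-learn on each folder; raises CalledProcessError on failure."""
    for argv in build_sa_learn_commands(spam_dir, ham_dir):
        subprocess.run(argv, check=True)


# Hypothetical usage from cron (paths are placeholders):
# train("/var/spool/sa-train/spam", "/var/spool/sa-train/ham")
```

Run it as the user that owns the Bayes database, and clear the
directories only after sa-learn exits successfully.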

I would also agree with another poster who said that you should widen
the Bayes auto-learn thresholds (bayes_auto_learn_threshold_nonspam and
bayes_auto_learn_threshold_spam).  This is especially true if you are
not going to start with manual learning up front.
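
To illustrate what the thresholds do, here is a simplified Python model
of the auto-learn decision.  It is a sketch, not SpamAssassin's actual
code: the real auto-learner applies extra conditions and computes its
score differently from the final message score.  The defaults shown
(0.1 and 12.0) are the stock values of
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam;
the relaxed values in the usage lines are hypothetical and reflect only
one reading of "widen".

```python
def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Classify a message for automatic Bayes training from its rule score.

    Returns "ham", "spam", or None when the score falls in the band
    between the two thresholds and the message is not learned at all.
    """
    if score < ham_threshold:
        return "ham"
    if score > spam_threshold:
        return "spam"
    return None


# With the defaults, a message scoring 8.0 is never learned:
print(autolearn_decision(8.0))                       # None
# Hypothetical relaxed thresholds feed more mail to Bayes:
print(autolearn_decision(8.0, spam_threshold=7.0))   # spam
print(autolearn_decision(0.5, ham_threshold=1.0))    # ham
```

The trade-off is that looser thresholds train the database faster but
raise the chance of learning a misclassified message.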

> I could pretty much trust a small subset of users to be fairly
> regular in their training.  There is a somewhat larger portion of
> users who would train here and there.  Lastly, the largest portion of
> users may never train.  We also do not know which user belongs to
> which group (yet).  With this scenario it seems that I will have to
> use some kind of common database.  In the default configuration SA
> uses one Bayesian database for all users.  Is there a reason to
> change this?  What is the consensus on a shared ruleset versus
> individual rulesets?        

Actually, the default SpamAssassin configuration uses per-user
databases.  It is Amavis that forces you to use a common database.

A common database is easier to manage, but a per-user database will be
more accurate (especially if the user trains it manually).

> It also seems that there is a falling out between Pyzor, DCC, Razor,
> and the community.  Is it simply a licensing issue (with legal
> implications), or are these systems flawed otherwise?  What
> alternatives are there?  Do I even need this functionality?  Has
> anyone seen a detriment to SpamAssassin's performance without DCC,
> Pyzor, or Razor?

I think it is mainly the licensing issue.  Razor2 has recently changed
its licensing so that it is available for everyone.

I use all three on my server and get good results from them.

> What about an initial corpus to train the Bayesian database?  Will
> this hurt my accuracy in the long term?  What corpuses are being
> used?  Am I better off letting the Bayesian autolearn gradually
> perform this function?   

Since everyone's spam and ham are different, a generic corpus will not
get you very far.  The main advantage of Bayes is that it learns about
YOUR spam and ham and classifies messages accordingly.  If you train
it from a generic corpus, your results will not be nearly as good.

> SpamAssassin is typically represented as a magic dance of tweaking
> rules.  Are the default rule thresholds good values to start at?  How
> can I adequately decide which rules to tweak and how much to tweak
> them by?  In other words, how do you manage your adjustments without
> users noticing wide spam classifying variations?    

I have not done any score tweaking at my site.  I find that the
default rules do very well.  The only one that you might want to tweak
is the BAYES_99 rule once your Bayes database is performing well.

> Also, in regard to rules: what is the preferred method for updating?
> Official rule releases, rules_du_jour, custom?  All of the above?

All of the above.  Rules_du_jour is extremely useful for keeping the
SARE rules up to date.  I would suggest that you visit the SARE rule
site www.rulesemporium.com and grab any of the rulesets there that
make sense for you.  They do a good job of describing the rulesets and
several of them have different versions depending on your tolerance
for false positives.  Then configure rules_du_jour to keep them up to
date for you.

sa-update can keep you up to date on the official releases.  And it is
always possible to create your own rules if there is a specific spam
that keeps getting by.

> How have people fared with MySQL replication of the DB?  I will need
> this solution to present the same data for backup MX which is not
> local to the primary MX.  

Haven't tried this.  Everything is on one machine for me.

> Thanks for any assistance and recommendations you can make.  It is
> probably impossible to make a balanced and unbiased comparison of SA
> to DSPAM, but I can try I suppose.  

Difficult maybe, but not impossible.  You are on the right road.  The
main thing is to take the time to tweak both of them so that they are
running at their best when you make the comparison.

-- 
Bowie
