Ryan,

> Configuration:  Spam Filter Store and Forward Gateway (non authenticated)

You may want to add clamd to the mix.

> I want to make sure this is as optimized as possible to provide a fair
> performance picture versus SpamAssassin and Barracuda.
>...
> I also have read a lot where people are improving accuracy by increasing
> the scoring of the Bayesian database (which needs training).
>...
> SpamAssassin is typically represented as a magic dance of tweaking rules. 
> Are the default rule thresholds good values to start at?  How can I
> adequately decide which rules to tweak and how much to tweak them by?  In
> other words, how do you manage your adjustments without users noticing wide
> spam classifying variations?

For a fair comparison to other products, it is best to do as little
tweaking to SA as is reasonable. The default SA 3.1.1 scores are
quite well adjusted. The BAYES_99 score used to be too low, but I
believe that with SA 3.1 it now defaults to 3.5, which is just fine.
Add a default set of SARE rules (or a hand-picked set, if you prefer),
and enable most if not all of the network tests.
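
As a minimal sketch of such a baseline (the path is an assumption,
adjust to your install; the BAYES_99 line merely restates the 3.1
default for visibility):

  # /etc/mail/spamassassin/local.cf -- baseline, little tweaking
  required_score   5.0    # stock threshold, leave it for the test
  use_bayes        1
  bayes_auto_learn 1
  score BAYES_99   3.5    # the SA 3.1 default, per the note above
  # network tests (Razor/DCC/Pyzor/DNSBLs) run by default, provided
  # the plugins are loaded and the client programs are installed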

After the evaluation period is over, you might want to tweak a score
or two, but that will come from your experience.

> I could pretty much trust a small subset of users to be fairly regular in
> their training.  There is a somewhat larger portion of users who would
> train here and there.  Lastly, the largest portion of users may never
> train.  We also do not know which user belongs to which group (yet).  With
> this scenario it seems that I will have to use some kind of common
> database.  In the default configuration SA uses one Bayesian database for
> all users.  Is there a reason to change this?

For a company (a rather homogeneous user group), a common Bayesian database
works fairly well, even when left to automatic learning alone: the
provided set of static (+SARE) and network tests does a remarkable job
of training Bayes, which after a while starts to contribute back what it
has learned. For an ISP with a diverse user population, that wouldn't work
so well. Manual training doesn't hurt, but there is no need to rely heavily
on it; uneducated users may even do more damage than good when reporting
what they consider spam.
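
A sketch of pointing everyone at one shared database (path and mode
are assumptions; pick a directory writable by the scanning user):

  # local.cf -- one common Bayes database for all users
  bayes_path       /var/spamassassin/bayes/bayes  # shared prefix, not per-user
  bayes_file_mode  0770   # group-writable, so the scanner can update it
  bayes_auto_learn 1      # let the static/network tests do the training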


> It also seems that there is a falling out between pyzor, dcc, razor, and
> the community.  Is it simply a licensing issue (with legal implications),
> or are these systems flawed otherwise.

If licensing permits, use them all.
It is worth noting that of all of these, Pyzor is the largest CPU
consumer. Use it if you can afford to, but it can be the first to
drop if host resources are scarce. Don't let network latencies
bother you (e.g. Razor or DNS checks can be slow at times); the
wait time can easily be utilized by other parallel processes.
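
In SA 3.1 these are plugins, so a sketch of enabling all three
(stock file names assumed) with Pyzor kept easy to drop later:

  # /etc/mail/spamassassin/v310.pre -- load the collaborative filters
  loadplugin Mail::SpamAssassin::Plugin::Razor2
  loadplugin Mail::SpamAssassin::Plugin::DCC
  loadplugin Mail::SpamAssassin::Plugin::Pyzor

  # local.cf -- flip use_pyzor to 0 first if CPU gets scarce
  use_razor2 1
  use_dcc    1
  use_pyzor  1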

> Do I even need this functionality?  Has anyone seen a detriment to
> SpamAssassin's performance without DCC, Pyzor, or Razor.

Yes, you need them; they often contribute the final points that tip
the scale. Without network tests (and static rules) you end up
in a dspam league.

> What about an initial corpus to train the Bayesian database?  Will this
> hurt my accuracy in the long term?  What corpuses are being used?  Am I
> better off letting the Bayesian autolearn gradually perform this function?

Let it auto-learn for a week, then start the evaluation test.

If you do hand training, and you are comparing to dspam,
try to train both systems approximately the same.
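
For the hand-training side, a sketch with sa-learn (mailbox names
are placeholders):

  # feed the same hand-sorted corpora to SA's Bayes ...
  sa-learn --spam --mbox hand-sorted-spam.mbox
  sa-learn --ham  --mbox hand-sorted-ham.mbox

  # ... and check what auto-learning has accumulated so far
  sa-learn --dump magic    # shows the nspam/nham counters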

  Mark
