Thanks everyone. Great responses. I think I have a good idea of where to go from here. I will build up the solution and post my decided-upon configuration. I would appreciate any constructive feedback anyone has at that point, and you can be sure I will come back to the list with any questions I have.
I'm really looking forward to seeing the results.

Ryan

>>> "Ryan Kather" <[EMAIL PROTECTED]> 03/31/06 12:14PM >>>

>I'll answer some parts...

>> Ideas:
>> --------
>> Postfix - I would prefer to use SpamAssassin as a
>> store and forward mail filtering relay appliance. It seems if I
>> place a Postfix Linux MTA in front of my existing spam solution I
>> could set up test groups. 100 users could be forwarded to the
>> SpamAssassin test box and passed internally to GroupWise. 100 users
>> could be forwarded to the DSPAM test box and passed internally to
>> GroupWise. The rest of the users would be forwarded to the Symantec
>> Mail Security Gateway and passed internally to GroupWise (until such

>Wouldn't it make more sense to pass the same message through each system
>under test?

Yes, from a purely testing perspective. I don't have the liberty of this since I am live production testing. I suppose I could move all received messages for all users through all filters and then only deliver to those users who have opted into the various test breakdowns. I'll look into this, as I think it would give a better picture of the accuracy than a subset of my users would.

>> I think I could accomplish this scenario with Postfix Transports,
>> though I may need to run multiple instances of Postfix. Does anyone
>> see a flaw in this?

>You should be able to lookup on LDAP a custom attribute that means
>next-hop hostname. You need some LDAP work, but very basic, and you're set!

Interesting. I hadn't thought to choose delivery with LDAP instead of Transports. I suppose this would get rid of my need for multiple Postfix instances. Good suggestion, I'll have to look into this.
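If I understand the suggestion, the front-end relay would just point transport_maps at an LDAP table, roughly like this (going from the ldap_table(5) docs as I read them; the custom attribute name and hostnames are placeholders I made up, not anything in our tree yet):

    # main.cf on the front-end Postfix relay
    transport_maps = ldap:/etc/postfix/ldap-transport.cf

    # /etc/postfix/ldap-transport.cf
    server_host      = ldap.example.internal
    search_base      = o=example
    query_filter     = (mail=%s)
    result_attribute = mailRoutingHost
    result_format    = smtp:[%s]

Users with the custom attribute set to the SpamAssassin or DSPAM test box would get relayed there, and everyone else would fall through to the default transport toward the Symantec gateway. Does that sound about right?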
>> possible to provide a fair performance picture versus SpamAssassin

>Performance... are you hunting for speed or accuracy?
>(perhaps you wrote it before and I missed it)

Accuracy is most important; speed is only as important as ensuring that messages don't back up in the processing queue or overload the servers.

>> It appears many seem to be using the Amavisd-new + Postfix +
>> SpamAssassin configuration. Is there a reason not to use this
>> design? I have had good luck with this in the past.

>This is a very good combination. Amavisd-new allows per-user (!) LDAP
>profiling and SQL quarantine management.
>I'm running both Postfix+SA and postfix+amavis+SA+clamav+mailzu+LDAP on
>two different MX for different domains. Although the latter setup
>requires more powerful hardware (not necessarily if your 4000 users have
>a steady traffic and won't grow), it is much more manageable.

Well, AV will definitely need to be incorporated. I believe the hardware is sufficient to run the latter scenario you have detailed.

>Your review should take into account also these frills!

Oh, I intend to. In addition to accuracy I need to factor in administrative overhead, since the majority of proprietary offerings slam SpamAssassin for requiring a full-time administrator to manage, which I suspect isn't the case.

>> I also have read a lot where people are improving accuracy by
>> increasing the scoring of the Bayesian database (which needs [...]
>> can I ensure user false positives are easily reportable? What do
>> others do to train the Bayesian database? Maia-Mailguard?

>After the initial setup, Bayes can live more or less its own life with
>broad enough autolearn thresholds. We do not let users submit stuff for
>training (80k users!) but rather submit meaningful samples occasionally.

Interesting. It would be nice to require zero user involvement in training. Are there any caveats to autolearning I should be aware of?

>We've also found that spammers are targeting common addresses such as
>info@, software@, john@, ... which were not used on some domains. So we
>transformed those into spamtraps (with LDAP's mailAcceptingGeneralId or
>mailAlternateAddress it is pretty straightforward!), manually review and
>feed to an IMAP folder for autospamlearn. HAM learning is unfortunately
>underestimated and more rarely done, out of our own HAM messages.

Trap accounts are great, but I always worry they get different spam than real accounts and pollute the Bayesian database. Has anyone experienced this? Also, how does SpamAssassin deal with the Bayesian pollution attempts seen recently (spam emails with garbage text in them)?

>> I could pretty much trust a small subset of users to be fairly
>> regular in their training. There is a somewhat larger portion of

>They might be telling less trusty users how to take part in the training
>process, and then break-up your Bayes DB. Those less-smart users should
>be managed with amavisd-new LDAP profiling.

LDAP profiling... I haven't seen examples of that yet. I will definitely research it.

>> use some kind of common database. In the default configuration SA
>> uses one Bayesian database for all users. Is there a reason to
>> change this? What is the consensus on a shared ruleset versus
>> individual rulesets?

>If your users share common-type messages, I'd go for a common Bayes
>DB.
>We do have a common one for all our domains (actually one for old and
>another for new SA servers). Individual Bayes DBs get large and if they
>break you've got to troubleshoot each individually...

The individual DBs sound painful, but I'm not sure how consistent our users are. I guess I will have to watch the Bayesian accuracy as it's built and make a decision later.

>> What about an initial corpus to train the Bayesian database? Will
>> this hurt my accuracy in the long term? What corpuses are being
>> used? Am I better off letting the Bayesian autolearn gradually
>> perform this function?

>You don't keep your spam, do you? :-) Train the DB with your *own*
>(company's) spam and ham corpus. It will not hurt. Don't use public
>corpuses.

It seems as if most of the recommendations advise against trying to force-feed your Bayesian database. I suppose there are no shortcuts if you want it to be accurate.
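For my own notes, I assume the actual feeding is just sa-learn runs against our own archived mail, something along these lines (the corpus paths are made up):

    # train from hand-sorted company mailboxes
    sa-learn --spam --mbox /var/corpus/company-spam.mbox
    sa-learn --ham  --mbox /var/corpus/company-ham.mbox

    # sanity-check the learned message counts afterwards
    sa-learn --dump magic

If anyone sees a problem with seeding Bayes that way before switching on autolearn, I'd like to hear it.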
>> SpamAssassin is typically represented as a magic dance of tweaking
>> rules. Are the default rule thresholds good values to start at? How
>> can I adequately decide which rules to tweak and how much to tweak
>> them by? In other words, how do you manage your adjustments without
>> users noticing wide spam classifying variations?

>We do not adjust rules scoring. Not with SA 3.1, while we did it on SA
>2.6 Bayes scores. Since most of our traffic is non-English, this helped
>a bit.
>Default values are the most suitable for each rule.

A number of people have confirmed that SA 3.1 needs little rule-weight adjustment.

>> Also, in regards to rules, what is the preferred method for updates?
>> Official rule releases, rulesdujour, custom? All of the above?

>Test them and decide which apply to your case. Dunno how independent
>your current antispam solution is, with SA you need to invest some time
>to review false negatives/positives (if any) and review extra rulesets.

I am beginning to think I won't be able to select new rulesets until the system is online and I have some metrics from it to go by.

>> How have people fared with MySQL replication of the DB? I will need
>> this solution to present the same data for a backup MX which is not
>> local to the primary MX.

>First of all: we dropped the secondary MX record because it received
>more spam than the primary. We use a load balancer for HA.

Yes, this is true. I wonder if I could use that to my advantage by adjusting the Bayesian learning thresholds on a backup MX.

>What do you want to store on MySQL? Bayes, AWL, quarantine are your
>non-mutually exclusive options.

Bayes and AWL, not quarantine.

>Bayes and AWL can be regenerated in a matter of minutes, and you can start
>(I mean "power up") a backup MX without them.
>Replicating quarantine is like replicating your trash between two bins.
>If you provide delegated quarantine, how likely is it that a HW failure
>will destroy a false positive? You're probably better off without the MySQL
>master-slave replication hassle.

Interesting. I'll have to think about this a bit. I don't see how I could avoid SQL replication with DSPAM, but it looks like it may not be necessary for SA. That could be a win for reduced complexity.
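If I do end up wanting Bayes and the AWL in SQL anyway (if only for parity with the DSPAM test), my understanding is that the SA 3.1 side is just a few local.cf directives along these lines (the DSN, username, and password are placeholders):

    # site-wide Bayes stored in SQL
    bayes_store_module     Mail::SpamAssassin::BayesStore::SQL
    bayes_sql_dsn          DBI:mysql:sa_bayes:dbhost.example.internal
    bayes_sql_username     sa_user
    bayes_sql_password     secret

    # auto-whitelist stored in SQL
    auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
    user_awl_dsn           DBI:mysql:sa_awl:dbhost.example.internal
    user_awl_sql_username  sa_user
    user_awl_sql_password  secret

I'll hold off on the replication question until I see whether rebuilding the Bayes DB on the backup MX is really a problem in practice.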