Properly integrating clamAV into SpamAssassin

Adam Katz Sun, 03 May 2009 15:47:55 -0700

This lengthy email (sorry) contains three sections:
  1. Filtering order (spam, virus  vs  virus, spam  vs  spam+virus)
  2. SA's use of ClamAV to retain the benefits in #1
  3. SA's use of short-circuiting to reduce frivolous scans



The filtering order that I see recommended all the time is virus
detection before spam detection.  However, the vast majority of incoming
mail is spam, and even the majority of virus-laden mail is caught by
spam filters without any hooks into "real" virus scanners.

On a mail server that rejects mail at the door ("SMTP-time") for both
anti-virus and anti-spam, a rejection in the first step means that
the second check never runs.  Since the volume of spam (and viruses!)
blocked by SpamAssassin vastly exceeds the number of viruses that
would have been blocked before running SA, the only way to justify
running virus detection in front of SA would be if it were cheaper
than SA by a factor larger than the spam-to-virus ratio.
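
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch in Python.  Every figure in it (the relative costs and the
rejection rates) is a made-up assumption for illustration, not a
measurement, and expected_cost() is just a helper name I picked.  The
model assumes the second scanner only sees mail the first one did not
reject.

```python
def expected_cost(first_cost, first_reject_rate, second_cost):
    """Average per-message cost when the second scanner only runs on
    mail that the first scanner did not reject."""
    return first_cost + (1.0 - first_reject_rate) * second_cost


# Hypothetical figures, chosen only for illustration.
SA_COST = 1.0      # relative cost of a full SpamAssassin scan
AV_COST = 0.5      # relative cost of a virus scan
SA_REJECT = 0.90   # fraction of mail SA alone would reject
AV_REJECT = 0.02   # fraction of mail the virus scanner alone would reject

spam_first = expected_cost(SA_COST, SA_REJECT, AV_COST)
virus_first = expected_cost(AV_COST, AV_REJECT, SA_COST)

# Virus-first only wins when AV_COST / SA_COST < AV_REJECT / SA_REJECT,
# so this is the virus-scan cost at which both orders cost the same.
break_even_av_cost = SA_COST * AV_REJECT / SA_REJECT

print(f"spam first:  {spam_first:.2f}")
print(f"virus first: {virus_first:.2f}")
print(f"virus scan must cost under {break_even_av_cost:.3f} to go first")
```

With these assumed numbers the virus scanner would have to be roughly
45 times cheaper than SA before it pays to run it first.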

I am under the impression that virus checking is *not* that much cheaper
than a fully-loaded SA implementation, so spam detection should run
first.  Counter-point:  online lookups cost bandwidth and latency, while
virus detection doesn't (yet) require any.

Pause.  Constructive comments and criticisms?

Don't get too caught up in the above part; it is only there to set up
my question below.


Mail that passes SpamAssassin but gets caught by ClamAV would add value
to SA's Bayesian and AWL databases, and thus similar mail stands a chance
of getting caught in the future regardless of its viral content.

To best take advantage of that system while not compromising the
short-circuiting, SA's ClamAV plugin should be configured to run at the
very end of the scan and should be skipped for any message scoring high
enough to hit autolearn (which should be higher than the SMTP rejection
threshold).  As I can't figure out how to do this, I run it separately.

How do I configure the ClamAV plugin to be run by SpamAssassin, but only
on mail otherwise scoring under bayes_auto_learn_threshold_spam?  The
ShortCircuit and priority mechanisms do not seem to be capable of this.
 The closest I can get is:

########
loadplugin CompareScores comparescores.pm
loadplugin ClamAV clamav.pm

ifplugin Mail::SpamAssassin::Plugin::Shortcircuit

  ifplugin Mail::SpamAssassin::Plugin::AutoLearnThreshold
    full  __STOP_IF_SPAM  eval:check_if_autolearn_spam()
  else
    full  __STOP_IF_SPAM  eval:check_score_is_over(12)
  endif

  # note, this is after AWL (1000)
  priority      __STOP_IF_SPAM  10000
  shortcircuit  __STOP_IF_SPAM  on

endif

full      CLAMAV eval:check_clamav()
describe  CLAMAV Clam AntiVirus detected a virus
priority  CLAMAV 10001
score     CLAMAV 15
########

Of course, CompareScores and its two functions do not yet exist (or is
there already something I can use to that effect?).

This runs after autowhitelist (AWL) because it has to; though it would
be nice to recalculate AWL after running CLAMAV, the __STOP_IF_SPAM
check would prevent AWL from running on any message that isn't already
surefire spam.  The workaround solution (requiring yet more new code)
would be to recalculate it (ignoring the first AWL results) after
priority 10001, and the "real" solution is described below.

Am I splitting hairs?  Is this so trivial that it doesn't matter? ...


I'm sure Justin or Theo or some other developer will chime in and state
that the whole short-circuiting system needs revisiting for the larger
picture: to handle points of diminishing returns.

Consider the following order of scanning within SA (with each step
containing specific short-circuits as currently implemented):

  1. local ham checks (only quick & efficient checks here)
  2. local spam checks (only quick & efficient checks here)
  3. network + slow ham checks
  4. network + slow spam checks
  5. autowhitelist

Step four would be able to have a default short-circuit every step of
the way (and it would only short-circuit the remainder of its tasks,
thus still enabling AWL); once you hit the autolearn=spam threshold (or
perhaps something higher if you really care about AWL), there's no
reason to run more checks.  This means that mail nailed by step 2 and
not rescued by step 3 would bypass step 4 altogether.  No DNS lookups,
no Razor2, no ClamAV.
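
A minimal Python sketch of that staged order, just to make the control
flow concrete.  It is not how SA is implemented; scan(), the check
names, the fixed scores and the 12.0 threshold are all placeholders of
mine.

```python
AUTOLEARN_SPAM = 12.0  # placeholder for the autolearn=spam threshold


def scan(message, always_run_checks, slow_spam_checks, awl):
    """Run the proposed stages; return (score, names of checks that ran).

    always_run_checks: steps 1-3, list of (name, check) pairs
    slow_spam_checks:  step 4, abandoned once the threshold is reached
    awl:               step 5, always run, called as awl(message, score)
    """
    score = 0.0
    ran = []
    for name, check in always_run_checks:
        score += check(message)
        ran.append(name)
    for name, check in slow_spam_checks:
        if score >= AUTOLEARN_SPAM:
            break  # skip only the remainder of step 4
        score += check(message)
        ran.append(name)
    score += awl(message, score)  # AWL still sees every message
    ran.append("AWL")
    return score, ran


# Toy checks returning fixed, hypothetical scores.
steps_1_to_3 = [("LOCAL_HAM", lambda m: 0.0),
                ("LOCAL_SPAM", lambda m: 15.0),
                ("SLOW_HAM", lambda m: 0.0)]
step_4 = [("DNSBL", lambda m: 3.0),
          ("RAZOR2", lambda m: 3.0),
          ("CLAMAV", lambda m: 15.0)]

score, ran = scan("raw message", steps_1_to_3, step_4, lambda m, s: 0.0)
print(ran)  # ['LOCAL_HAM', 'LOCAL_SPAM', 'SLOW_HAM', 'AWL']
```

The message is nailed by the local spam check, nothing in step 3
rescues it, and so none of the step 4 checks run while AWL still does.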

It is my current understanding that SA doesn't do this.


-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-spam
