Re: [SAtalk] Extremely expensive SA calls

Justin Shore Fri, 26 Sep 2003 10:42:17 -0700

On Fri, 26 Sep 2003, Simon Byrnand wrote:

> >If I eliminate the SA -d call then that leaves me with only one other
> >CPU-draining call:  SA -r
> >
> ># Report to Pyzor
> >:0 Wc
> >| /usr/bin/pyzor report
> >
> ># Report to Razor
> >:0 Wc
> >| spamassassin -r
> >
> >Now one thing I never thought about till just now is if my Pyzor call is
> >redundant.  See I don't call Pyzor from SA at all in normal mail
> >processing.  I only call Razor.  I assume that that SA -r only calls what
> >I have configured SA to use normally, correct?  I need to look into that.
> >
> >That call is very CPU intensive for some reason.  You wouldn't think that
> >simply reporting a message to Razor would be that intensive but it is.


I actually read the man page this morning and was unpleasantly surprised 
with what I found.  

       -r, --report
        Report this message as verified spam.  This will submit the mail
        message read from STDIN to various spam-blocker databases.  
        Currently, these are Vipul's Razor ( http://razor.sourceforge.net/),
        the Distributed Checksum Clearinghouse
        (http://www.rhyolite.com/anti-spam/dcc/ ), and Pyzor.

        If the message contains SpamAssassin markup, this will be
        stripped out automatically before submission. The support modules
        for DCC, Razor and/or Pyzor must be installed for spam to be
        reported to each service.

        The message will also be submitted to SpamAssassin's learning
        systems; currently this is the internal Bayesian
        statistical-filtering system (the BAYES rules).  (Note that if you
        only want to perform statistical learning, and do not want to
        report mail to a third-party server, you should use the "sa-learn"
        command directly instead.)

I'm sure I'd read it before but apparently it never sank in.

So essentially --report does more than report: it also feeds the spam
into the Bayes database.  Since I use MIMEDefang and can't use the Bayes
DB, that accounts for a fair amount of wasted CPU time.

It also says that it strips the SA markup.  It appears to do this 
regardless of whether or not it's already been stripped.  I suspect this 
CPU time adds up fast.  I understand the need to strip this stuff out 
before reporting the spam but there should be some way to say that this 
has already been done.  That would at least prevent the waste of CPU time 
on looking for markup to strip.

It also says it will submit to Razor, Pyzor, and DCC if their modules are
installed.  It doesn't, however, say whether it honors the config file
options for these services:

use_razor2 1
use_pyzor 1
use_dcc 1

I propose some changes to how --report works.  First, there should be some
way to tell SA which services the spam gets reported to.  For example:

--report=razor2,pyzor,dcc,bayes

or

--report-to=razor2,pyzor,dcc,bayes

At the least it should honor the SA config file (if it doesn't already).  
This would fix two of the problems I noted above (stuffing spam in Bayes 
when reporting and not being able to define where reports go).  It would 
also prevent Bayes from learning a spam twice, once for sa-learn and once 
for spamassassin --report.
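
To make the idea concrete, here is a rough sketch in Python (not SA's own
Perl) of how a target list could be combined with the config file.  The
--report-to option does not exist in SA; it is only my proposal, and the
function names below are made up for illustration.  I'm also assuming that
an explicit list should still be limited to services the config enables,
which is a design choice, not existing behavior.

```python
# Hypothetical sketch: none of these names exist in SpamAssassin.
KNOWN_SERVICES = ("razor2", "pyzor", "dcc", "bayes")


def parse_config(text):
    """Collect use_<service> flags from 'key value' config lines."""
    enabled = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("use_"):
            enabled[parts[0][len("use_"):]] = parts[1] == "1"
    return enabled


def resolve_report_targets(report_to, config):
    """Return the services to report to, in the order requested.

    report_to is the comma-separated value of the proposed option, or
    None when the option is absent (then the config alone decides).
    """
    if report_to is None:
        requested = list(KNOWN_SERVICES)
    else:
        requested = [s.strip() for s in report_to.split(",") if s.strip()]
    unknown = [s for s in requested if s not in KNOWN_SERVICES]
    if unknown:
        raise ValueError("unknown report target(s): " + ", ".join(unknown))
    # Assumption: a service must also be enabled in the config file.
    return [s for s in requested if config.get(s, False)]
```

With "use_razor2 1", "use_pyzor 0" and "use_dcc 1" in the config, a plain
--report would go to razor2 and dcc only, and --report-to=dcc,pyzor would
go to dcc only.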


I also propose adding an option that tells SA not to strip the markup when
reporting.  For example:

--report --nostrip

or

--report --already-stripped

This would prevent the waste of cycles on parsing an already-stripped 
message for markup and take care of the third problem I noted.
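
As an illustration of the kind of cheap test I have in mind, here is a
small Python sketch that only looks for X-Spam-* headers to guess whether
a message still carries SA markup.  The function name is made up, and this
is only an approximation: SA can also rewrite the Subject line or wrap the
body in a report, neither of which this checks.  Whether such a pre-check
would actually be cheaper than the stripping itself is an assumption I
haven't measured.

```python
from email import message_from_string


def has_sa_markup(raw_message):
    """Guess whether a message still has SpamAssassin headers.

    Hypothetical helper: only checks for header names that start with
    'X-Spam-' and ignores subject rewriting and body reports.
    """
    msg = message_from_string(raw_message)
    return any(name.lower().startswith("x-spam-") for name in msg.keys())
```

A reporting path could skip the strip step when this returns False, which
is roughly what --already-stripped would let the caller assert up front.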

Does anyone have any comments on this?  Is there somewhere I should submit 
it?

Justin


