On 1/15/2013 5:22 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
>
>> On 1/15/2013 1:55 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>
>>>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>>>
>>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in
>>>>> place? Or are they all performed by SA?
>>>>
>>>> In postfix's main.cf:
>>>>
>>>> smtpd_recipient_restrictions = permit_mynetworks,
>>>>     permit_sasl_authenticated, check_recipient_access
>>>>     mysql:/etc/postfix/mysql-virtual_recipient.cf,
>>>>     reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>>>
>>>> Do you recommend something more?
>>>
>>> Unfortunately I have no experience administering Postfix. Perhaps
>>> one of the other listies can help.
>>
>> Wow! Adding several more reject_rbl_client entries to the
>> smtpd_recipient_restrictions directive in the Postfix configuration
>> seems to be having a tremendous impact. The amount of spam coming
>> through has dropped by 90% or more. This was a HUGELY helpful
>> suggestion, John!
>
> Which ones are you using now? There are DNSBLs that are good, but not
> quite good enough to trust as hard-reject SMTP-time filters. That's
> why SA does scored DNSBL checks.

smtpd_recipient_restrictions = reject_rbl_client bl.spamcop.net,
    reject_rbl_client list.dsbl.org,
    reject_rbl_client sbl-xbl.spamhaus.org,
    reject_rbl_client cbl.abuseat.org,
    reject_rbl_client dul.dnsbl.sorbs.net

I acquired this list from the article that I cited a few responses
back. It is quite possible that some of these are obsolete, as the
article is from 2009. I seem to recall reading that
sbl-xbl.spamhaus.org is obsolete, but now I can't find the source.
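(For the archives: if I'm reading the Spamhaus and DSBL pages
correctly, sbl-xbl.spamhaus.org has indeed been superseded by
zen.spamhaus.org, which also carries the CBL data behind
cbl.abuseat.org plus a dynamic-IP list that covers much the same
ground as dul.dnsbl.sorbs.net, and list.dsbl.org has shut down
entirely. If that's right, a trimmed version of the full restriction
list might look something like the following; untested here, so treat
it as a sketch:

smtpd_recipient_restrictions = permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client zen.spamhaus.org,
    reject_rbl_client bl.spamcop.net

Keeping the reject_rbl_client entries after permit_sasl_authenticated
should also mean that roaming users who authenticate from dynamic IPs
won't be rejected by the dynamic-IP listings.)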
These are "hard rejects", right? So if this change has reduced spam,
said spam would not be accepted for delivery at all; it would be
rejected outright. Correct? (And if I understand you, this is part of
your concern.)

The reason I ask, and a point that I should have clarified in my last
post, is that the *volume* of spam didn't drop by 90% (although it may
have dropped by some measure); rather, the accuracy with which SA
tagged spam improved by 90%. Ultimately, I'm wondering whether the
observed change was simply a product of these message "campaigns"
being blacklisted after a few days of circulation, and not of the
Postfix configuration change.

At this point, the vast majority of X-Spam-Status headers include
Razor2 and Pyzor tests that contribute significantly to the score. I
should have mentioned earlier that I installed Razor2 and Pyzor after
making my initial post. The only reasons I didn't are that a) they
didn't seem to be making a significant difference for the first day or
so after I installed them (this could be for the snowshoe reasons
we've already discussed), and b) the low Bayes scores seemed to be the
real problem anyway.

That said, the Bayes scores seem to be much more accurate now, too. I
was hardly ever seeing BAYES_99 before, but now almost all spam
messages hit BAYES_99. Is it possible that the training I've been
doing over the last week or so wasn't *effective* until recently, say,
until after I restarted some component of the mail stack? My
understanding is that calling SA via Amavis, which does not need/use
the spamd daemon, forces all Bayes data to be up-to-date on each call
to spamassassin.

It bears mention that I haven't yet dumped the Bayes DB and retrained
using my corpus. I'll do that next and see where we land once the DB
is repopulated.
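(For completeness, the retraining plan is roughly as follows. The
corpus paths are placeholders for wherever the reviewed mailboxes end
up, and --clear wipes the existing database, so there is no halfway
point once it starts. Everything runs as the same user Amavis invokes
SA under, so the correct Bayes DB is touched.

    sa-learn --clear
    sa-learn --spam /path/to/corpus/spam
    sa-learn --ham /path/to/corpus/ham
    sa-learn --dump magic   # sanity-check the token/message counts

The --dump magic output should also answer my question above about
whether the last week's training actually registered.)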
>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam
>>>> plug-in. They do so unsupervised. Why this could be a problem is
>>>> obvious. And no, I don't retain their submissions. I probably
>>>> should. I wonder if I can make a few slight modifications to the
>>>> shell script that Antispam calls, such that it simply sends a copy
>>>> of the message to an administrator rather than calling sa-learn on
>>>> the message.
>>>
>>> That would be a very good idea if the number of users doing
>>> training is small. At the very least, the messages should be
>>> captured to a permanent corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.
>>
>>> Do your users also train ham? Are the procedures similar enough
>>> that your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam
>> plug-in. Basically, moving mail from Inbox to Junk trains it as
>> spam, and moving mail from Junk to Inbox trains it as ham. I really
>> like this setup (Antispam + calling SA through Amavis [i.e. not
>> using spamd]) because the results are effective immediately, which
>> seems to be crucial for combating this snowshoe spam (performance
>> and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are
>> different, I suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding
> they don't want to receive it any longer and just junking the
> messages rather than unsubscribing.

Good point. I hadn't thought of that. All the more reason to "screen"
the messages that are submitted for training.

> However, your problem is FN Bayes scores...
>
>>> The extremely odd thing is that you say you sometimes train a
>>> message as spam, and its Bayes score goes *down*. Are you training
>>> a message and then running it through spamc to see if the score
>>> changed, or is this about _similar_ messages rather than _that_
>>> message?
>>
>> Sorry for the ambiguity. This is about *similar* messages. Identical
>> messages, at least visually speaking (I realize that there is a lot
>> more to it than the visual component). For example, yesterday, I saw
>> several Canadian Pharmacy emails, all of which were identical with
>> respect to appearance. I classified each as spam, yet the Bayes
>> score didn't budge more than a few percent for the first three, and
>> went *down* for the fourth.
>>
>> I have to assume that while the messages (HTML-formatted) *appear*
>> to be identical, the underlying code has some pseudo-random element
>> that is designed very specifically to throw Bayes classifiers.
>>
>> Out of curiosity, does the Bayes engine (or some other element of
>> SpamAssassin) have the ability to "see" rendered HTML messages, by
>> appearance, and not by source code? If it could, it would be far
>> more effective, it seems.
>
> That I don't know.
>
>>> That, and configure the user-based training to at the very least
>>> capture what they submit to a corpus so you can review it. Whether
>>> you do that review pre-training or post-bayes-is-insane is up to
>>> you.
>>
>> Right, right, that makes sense. I hope I can modify the Antispam
>> plug-in to accommodate this requirement.
>>
>> Well, I can't thank you enough here, John and everyone else. I seem
>> to be on the right track; all is not lost.
>>
>> That said, it seems clear that SA is nowhere near as effective as it
>> can be when an off-the-shelf configuration is used (and without
>> configuring the MTA to do some of the blocking).
>>
>> I'll keep the list posted (pardon the pun) with regard to
>> configuring Antispam to fire off a copy of any message that is
>> submitted for training. Ideally, whether the message is reviewed
>> before or after sa-learn is called will be configurable.
>
> Great! Thanks!

Thanks again for all the insight here, John.

-Ben
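P.S. Here is a first, untested sketch of the training-capture wrapper
I have in mind. It assumes the Antispam plug-in pipes the message on
stdin and passes "spam" or "ham" as the first argument; I still need
to confirm exactly how the plug-in actually invokes its script, and
the paths and admin address below are placeholders.

#!/bin/sh
# Untested sketch: capture each user-submitted training message for
# later review, then (optionally) train on it immediately as before.
CLASS="$1"                                # expected: "spam" or "ham"
CORPUS_DIR="/var/spool/sa-corpus/$CLASS"  # placeholder path
mkdir -p "$CORPUS_DIR"

# Save a copy of the message (read from stdin) before anything else.
TMP=$(mktemp "$CORPUS_DIR/msg.XXXXXX") || exit 75   # EX_TEMPFAIL
cat > "$TMP"

# Optionally send a copy to an administrator for screening.
# mail -s "SA training submission ($CLASS)" admin@example.com < "$TMP"

# Train immediately (current behavior); comment out for a
# review-before-training workflow.
exec sa-learn --"$CLASS" "$TMP"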