Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

Ben Johnson Tue, 15 Jan 2013 08:27:23 -0800


On 1/14/2013 8:16 PM, John Hardin wrote:
> On Mon, 14 Jan 2013, Ben Johnson wrote:
> 
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
> 
>   http://www.spamhaus.org/faq/section/Glossary
> 
> Basically, a large number of spambots sending the message so that no one
> sending IP can be easily tagged as evil.
> 
> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
> are they all performed by SA?


In postfix's main.cf:

smtpd_recipient_restrictions = permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client bl.spamcop.net

Do you recommend something more?

> Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
> SMTP-time DNS check in your MTA. It is well-respected and very reliable.
> One thing it includes is ranges of IP addresses that should not ever be
> sending email, so it may help reduce snowshoe spam.
> 
>   http://www.spamhaus.org/zen/

This article looks to be pretty thorough:

http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/

I'll add Spamhaus ZEN and a few others to the list.
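
For reference, here is a rough sketch of how that change might be applied with postconf instead of editing main.cf by hand. The function names are hypothetical, and putting ZEN ahead of SpamCop is only an assumption on my part; the one thing I'm sure of is that the reject_rbl_client entries belong after reject_unauth_destination.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, not part of Postfix.

# Print the current restriction list with Spamhaus ZEN added as a
# hard-reject check (ordering of the two DNSBLs is an assumption).
restrictions_with_zen() {
    printf '%s' \
        "permit_mynetworks, " \
        "permit_sasl_authenticated, " \
        "check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf, " \
        "reject_unauth_destination, " \
        "reject_rbl_client zen.spamhaus.org, " \
        "reject_rbl_client bl.spamcop.net"
}

# Write the list into main.cf and reload Postfix (needs root on the mail host).
apply_restrictions() {
    postconf -e "smtpd_recipient_restrictions=$(restrictions_with_zen)" &&
        postfix reload
}
```

Running restrictions_with_zen on its own only prints the value, so the list can be checked before apply_restrictions touches anything.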

> Another tactic that many report good results from is Greylisting. Do you
> have greylisting in place? Does your userbase demand no delays in mail
> delivery? In addition to blocking spam from spambots that do not retry,
> it can delay mail enough for the BLs to get a chance to list new
> IPs/domains, which can reduce the leakage if you happen to be at the
> leading edge of a new delivery campaign.
> 
>   http://www.greylisting.org/

Hmm, very interesting. No, I have no greylisting in place as yet, and
no, my userbase doesn't demand immediate delivery. I will look into
greylisting further.

>> Are most/all of the BL services hash-based?
> 
> Generally:
> 
>     DNSBL: Blacklist of IP addresses
>     URIBL: Blacklist of domain and host names appearing in URIs
>     EMAILBL: (not widely used) Blacklist of email addresses (e.g.
>         phishing response addresses)
>     Razor, Pyzor: Blacklist of message content checksums/hashes

Perfect; that answers my question.
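
To make the DNSBL entry concrete for myself: as far as I understand it, an IP-based list is queried by reversing the IPv4 octets and prepending them to the list's zone. A small sketch of building that query name follows; dnsbl_query_name is a hypothetical helper of my own, not something shipped with SA or Postfix.

```shell
#!/bin/sh
# Sketch only: dnsbl_query_name is a hypothetical helper.

# Print the DNS name that a DNSBL lookup for an IPv4 address would query,
# e.g. 192.0.2.1 in zone zen.spamhaus.org -> 1.2.0.192.zen.spamhaus.org
dnsbl_query_name() {
    ip=$1
    zone=$2
    old_ifs=$IFS
    IFS=.
    # Split the address on dots into $1..$4 (intentionally unquoted).
    set -- $ip
    IFS=$old_ifs
    printf '%s.%s.%s.%s.%s\n' "$4" "$3" "$2" "$1" "$zone"
}
```

Something like dig +short "$(dnsbl_query_name 127.0.0.2 zen.spamhaus.org)" should then return one or more 127.0.0.x answers, since 127.0.0.2 is the conventional test entry; an empty answer means the address is not listed.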

>> In other words, if a known spam message was added yesterday, will it
>> be considered "snowshoe" spam if the spammer sends the same message
>> today and changes only one character within the body?
> 
> No, the diverse IP addresses are the hallmark of "snowshoe", not so much
> the specific message content. If you see identical or generally-similar
> (e.g.) pharma spam coming from a wide range of different IP addresses,
> that's snowshoe.

I see. Given this information, it concerns me that Bayes scores hardly
seem to budge when I feed sa-learn nearly identical messages 3+ times.
We'll get into that below.

>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
> 
> Agreed.
> 
>> It must be a configuration issue, because I've sa-learn-ed messages
>> that are incredibly similar for two days now and not only do their
>> Bayes scores not change significantly, but sometimes they decrease.
>> And I have a hard time believing that one of my users is sa-train-ing
>> these messages as ham and negating my efforts.
> 
> This is why you retain your Bayes training corpora: so that if Bayes
> goes off the rails you can review your corpora for misclassifications,
> wipe and retrain. Do you have your training corpora? Or do you discard
> messages once you've trained them?

I had the good sense to retain the corpora.

> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
> do you review their submissions? And if the process is automated, do you
> retain what they have provided for training so that you can go back
> later and do a troubleshooting review?

Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
They do so unsupervised. Why this could be a problem is obvious. And no,
I don't retain their submissions. I probably should. I wonder if I can
make a few slight modifications to the shell script that Antispam calls,
such that it simply sends a copy of the message to an administrator
rather than calling sa-learn on the message.
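
Something along these lines might do it. This is only a sketch, assuming the plug-in pipes the message to the script on standard input and passes "spam" or "ham" as the first argument; it files a copy in a review directory instead of mailing it. The save_for_review name and the /var/spool/bayes-review path are made up for illustration.

```shell
#!/bin/sh
# Sketch only: save_for_review and the default directory are hypothetical.
REVIEW_DIR=${REVIEW_DIR:-/var/spool/bayes-review}

# Read one message from stdin and store it under $REVIEW_DIR/<kind>/
# for later manual review, instead of passing it to sa-learn.
save_for_review() {
    kind=$1
    case "$kind" in
        spam|ham) ;;
        *) return 2 ;;   # refuse anything other than spam or ham
    esac
    dir="${REVIEW_DIR}/${kind}"
    mkdir -p "$dir" || return 1
    out=$(mktemp "$dir/msg.XXXXXX") || return 1
    cat > "$out" || return 1
    # Print the path so the caller can log where the copy went.
    printf '%s\n' "$out"
}
```

An administrator could then review the two directories and run sa-learn over whatever looks correctly classified.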

> Do you have autolearn turned on? My opinion is that autolearn is only
> appropriate for a large and very diverse userbase where a sufficiently
> "common" corpus of ham can't be manually collected. But then, I don't
> admin a Really Large Install, so YMMV.

No, I was sure to disable autolearn after the last Bayes fiasco. :)

> Do you use per-user or sitewide Bayes? If per-user, then you need to
> make sure that you're training Bayes as the same user that the MTA is
> running SA as.

Site-wide. And I have hard-coded the username in the SA configuration to
prevent confusion in this regard:

bayes_sql_override_username amavis

> What user does your MTA run SA as? What user do you train Bayes as?

The MTA should pass scanning off to "amavis". I train the DB in two
ways: via Dovecot Antispam and by calling sa-learn on my training
mailbox. Given that I have hard-coded the username, the output of
"sa-learn --dump magic" is the same whether I issue the command under my
own account or "su" to the "amavis" user.
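
For anyone wanting to repeat that comparison, a small sketch that pulls the message counts out of the dump. It assumes the usual column layout of "sa-learn --dump magic", with the value in the third column and the label at the end of the line; bayes_counts is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: bayes_counts is a hypothetical helper. It assumes the
# value sits in the third column of each "non-token data" line.

# Read "sa-learn --dump magic" output on stdin and print the number of
# spam and ham messages the database has learned from.
bayes_counts() {
    awk '/non-token data: nspam$/ { spam = $3 }
         /non-token data: nham$/  { ham = $3 }
         END { printf "nspam=%s nham=%s\n", spam, ham }'
}
```

Running sa-learn --dump magic | bayes_counts under each account and comparing the two lines is a quick way to confirm both are looking at the same database.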

> One possibility is that the MTA is running SA as a different user than
> you are training Bayes as, and you have autolearn turned on, and Bayes
> has been running in its own little world since day one regardless of
> what you think you're telling it to do.

That is what happened last year. I hope to have eliminated those issues
this time around. (I dumped the old DB and started over after that
debacle.) The X-Spam-Status header always displays "autolearn=disabled".

>> I have ensured that the spam token count increases when I train these
>> messages. That said, I do notice that the token count does not *always*
>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>> message(s) examined)". Does this mean that all tokens from these
>> messages have already been learned, thereby making it pointless to
>> continue feeding them to sa-learn?
> 
> No, it means that Message-ID has been learned from before.

I see. So, when this happens, it means that one of my users has already
dragged the message from Inbox to Junk (which triggers the Antispam
plug-in and feeds the message to sa-learn).

When this scenario occurs, my efforts in feeding the same message to
sa-learn are wasted, right? Bayes doesn't "learn more" from the message
the second time, or increase its tokens' "weight", right? It would be
nice if I could eliminate this duplicate effort.

>> Finally, I added the test you supplied to my SA configuration, restarted
>> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.
> 
> So this proves DNS lookups are indeed working for all messages.
> 

Okay, good to know. I think we're "all clear" in the DNS/network test
department.

Based on my responses, what's the next move? Back up the Bayes DB, wipe
it, and feed my corpus through the ol' chipper?
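
If so, I imagine the sequence would look roughly like the sketch below. It assumes the ham and spam corpora are directories of individual message files (mbox files would need --mbox added), and rebuild_bayes and its arguments are names I made up.

```shell
#!/bin/sh
# Sketch only: rebuild_bayes is a hypothetical wrapper around sa-learn.
SA_LEARN=${SA_LEARN:-sa-learn}

# Usage: rebuild_bayes BACKUP_FILE HAM_DIR SPAM_DIR
rebuild_bayes() {
    backup=$1
    ham_dir=$2
    spam_dir=$3

    # 1. Dump the current database so it can be restored with --restore.
    "$SA_LEARN" --backup > "$backup" || return 1
    # Do not wipe anything if the backup came out empty.
    [ -s "$backup" ] || return 1

    # 2. Wipe the existing Bayes data.
    "$SA_LEARN" --clear || return 1

    # 3. Retrain from the retained corpora.
    "$SA_LEARN" --ham "$ham_dir" || return 1
    "$SA_LEARN" --spam "$spam_dir" || return 1

    # 4. Show the resulting counters for a sanity check.
    "$SA_LEARN" --dump magic
}
```

As I understand it, Bayes will not score anything until it has learned from at least 200 ham and 200 spam messages (the defaults), so the nspam and nham counters in the final dump are worth checking.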

Thanks again!

-Ben
