On 1/15/2013 1:55 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
> 
>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>
>>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>>> are they all performed by SA?
>>
>> In postfix's main.cf:
>>
>> smtpd_recipient_restrictions = permit_mynetworks,
>> permit_sasl_authenticated, check_recipient_access
>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>
>> Do you recommend something more?
> 
> Unfortunately I have no experience administering Postfix. Perhaps one of
> the other listies can help.

Wow! Adding several more reject_rbl_client entries to the
smtpd_recipient_restrictions directive in the Postfix configuration
seems to be having a tremendous impact. The amount of spam coming
through has dropped by 90% or more. This was a HUGELY helpful
suggestion, John!

>>>   http://www.greylisting.org/
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
> 
> One other thing you might try is publishing an SPF record for your
> domain. There is anecdotal evidence that this reduces the raw spam
> volume to that domain a bit.

We do publish SPF records for the domains within our control. The need
to do this arose when senderbase.org, et al., began blacklisting
domains without SPF records. So, we're good there.

>> Given this information, it concerns me that Bayes scores hardly seem
>> to budge when I feed sa-learn nearly identical messages 3+ times.
>> We'll get into that below.
>>
>>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>>> to perform so miserably.
>>>
>>> Agreed.
>>>
>>>> It must be a configuration issue, because I've sa-learn-ed messages
>>>> that are incredibly similar for two days now and not only do their
>>>> Bayes scores not change significantly, but sometimes they decrease.
>>>> And I have a hard time believing that one of my users is sa-train-ing
>>>> these messages as ham and negating my efforts.
>>>
>>> This is why you retain your Bayes training corpora: so that if Bayes
>>> goes off the rails you can review your corpora for misclassifications,
>>> wipe and retrain. Do you have your training corpora? Or do you discard
>>> messages once you've trained them?
>>
>> I had the good sense to retain the corpora.
> 
> Yay!
> 
>>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>>> do you review their submissions? And if the process is automated, do you
>>> retain what they have provided for training so that you can go back
>>> later and do a troubleshooting review?
>>
>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>> They do so unsupervised. Why this could be a problem is obvious. And no,
>> I don't retain their submissions. I probably should. I wonder if I can
>> make a few slight modifications to the shell script that Antispam calls,
>> such that it simply sends a copy of the message to an administrator
>> rather than calling sa-learn on the message.
> 
> That would be a very good idea if the number of users doing training is
> small. At the very least, the messages should be captured to a permanent
> corpus mailbox.

Good idea! I'll see if I can set this up.
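
A minimal sketch of what such a capture step could look like, assuming the Antispam plug-in hands the message and its spam/ham classification to a script. The function names, the corpus layout (one directory each for spam and ham), and the `train_now` switch are all hypothetical; only the `sa-learn --spam` / `sa-learn --ham` invocations are real. Every submission is written to the corpus first, and training is optional, so review can happen before or after sa-learn runs.

```python
import subprocess
import time
import uuid
from pathlib import Path


def archive_message(raw: bytes, kind: str, corpus_root: Path) -> Path:
    """Save one submitted message under corpus_root/<kind>/ and return its path."""
    if kind not in ("spam", "ham"):
        raise ValueError(f"kind must be 'spam' or 'ham', got {kind!r}")
    target_dir = corpus_root / kind
    target_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp plus a random suffix keeps file names unique and roughly sortable.
    path = target_dir / f"{int(time.time())}.{uuid.uuid4().hex}.eml"
    path.write_bytes(raw)
    return path


def sa_learn_command(kind: str, path: Path) -> list[str]:
    """Build the sa-learn call for one archived message."""
    return ["sa-learn", f"--{kind}", str(path)]


def handle_submission(raw: bytes, kind: str, corpus_root: Path,
                      train_now: bool = False) -> Path:
    """Always archive the message; train immediately only if asked to."""
    path = archive_message(raw, kind, corpus_root)
    if train_now:
        # Assumption: sa-learn is on PATH and can reach the Bayes database.
        subprocess.run(sa_learn_command(kind, path), check=True)
    return path
```

With `train_now=False` the script only collects submissions for later review; with `train_now=True` it keeps the immediate effect of the current setup while still retaining a copy.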

> Do your users also train ham? Are the procedures similar enough that
> your users could become easily confused?

They do. The procedure is implemented via Dovecot's Antispam plug-in.
Basically, moving mail from Inbox to Junk trains it as spam, and moving
mail from Junk to Inbox trains it as ham. I really like this setup
(Antispam + calling SA through Amavis [i.e. not using spamd]) because
the results are effective immediately, which seems to be crucial for
combating this snowshoe spam (performance and scalability aside).

I don't find that procedure to be confusing, but people are different, I
suppose.

>>> Do you have autolearn turned on? My opinion is that autolearn is only
>>> appropriate for a large and very diverse userbase where a sufficiently
>>> "common" corpus of ham can't be manually collected. But then, I don't
>>> admin a Really Large Install, so YMMV.
>>
>> No, I was sure to disable autolearn after the last Bayes fiasco. :)
> 
> OK.
> 
>>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>>> make sure that you're training Bayes as the same user that the MTA is
>>> running SA as.
>>
>> Site-wide. And I have hard-coded the username in the SA configuration to
>> prevent confusion in this regard:
>>
>> bayes_sql_override_username amavis
>>
>>> What user does your MTA run SA as? What user do you train Bayes as?
>>
>> The MTA should pass scanning off to "amavis". I train the DB in two
>> ways: via Dovecot Antispam and by calling sa-learn on my training
>> mailbox. Given that I have hard-coded the username, the output of
>> "sa-learn --dump magic" is the same whether I issue the command under my
>> own account or "su" to the "amavis" user.
> 
> OK, good.
> 
>>>> I have ensured that the spam token count increases when I train these
>>>> messages. That said, I do notice that the token count does not *always*
>>>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s)
>>>> (1 message(s) examined)". Does this mean that all tokens from these
>>>> messages have already been learned, thereby making it pointless to
>>>> continue feeding them to sa-learn?
>>>
>>> No, it means that Message-ID has been learned from before.
>>
>> I see. So, when this happens, it means that one of my users has already
>> dragged the message from Inbox to Junk (which triggers the Antispam
>> plug-in and feeds the message to sa-learn).
> 
> Very likely.
> 
> The extremely odd thing is that you say you sometimes train a message as
> spam, and its Bayes score goes *down*. Are you training a message and
> then running it through spamc to see if the score changed, or is this
> about _similar_ messages rather than _that_ message?

Sorry for the ambiguity. This is about *similar* messages. Identical
messages, at least visually speaking (I realize that there is a lot more
to it than the visual component). For example, yesterday, I saw several
Canadian Pharmacy emails, all of which were identical with respect to
appearance. I classified each as spam, yet the Bayes score didn't budge
more than a few percent for the first three, and went *down* for the 4th.

I have to assume that while the messages (HTML-formatted) *appear* to be
identical, the underlying code has some pseudo-random element that is
designed very specifically to throw Bayes classifiers.
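
To illustrate that assumption, here is a small sketch comparing two made-up HTML snippets that would render identically but carry different random attribute values. It is not SpamAssassin's tokenizer; the splitting rule, the sample messages, and the function names are all invented for the illustration. It only shows how tokens taken from the raw source can diverge while the visible text stays the same.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect only the text nodes, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def raw_tokens(html: str) -> set[str]:
    """Tokens taken from the HTML source, markup and attributes included."""
    return set(re.findall(r"[a-z0-9]+", html.lower()))


def visible_tokens(html: str) -> set[str]:
    """Tokens taken from the text a reader would see."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens the two sets have in common (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical messages: same visible text, different random attributes.
msg_a = '<p class="x7qk2">Canadian <span id="r93">Pharmacy</span> deals</p>'
msg_b = '<p class="m4zt9">Canadian <span id="k21">Pharmacy</span> deals</p>'

print(jaccard(visible_tokens(msg_a), visible_tokens(msg_b)))
print(jaccard(raw_tokens(msg_a), raw_tokens(msg_b)))
```

The first figure shows full overlap on the visible text, the second a lower overlap on the source. Text hidden with styling would still count as "visible" in this sketch, so it understates what a spammer can do.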

Out of curiosity, does the Bayes engine (or some other element of
SpamAssassin) have the ability to "see" rendered HTML messages, by
appearance, and not by source code? If it could, it would be far more
effective, it seems.

>> When this scenario occurs, my efforts in feeding the same message to
>> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
>> the second time, or increase its tokens' "weight", right? It would be
>> nice if I could eliminate this duplicate effort.
> 
> Correct, no new information is learned.
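
One way to avoid that duplicate effort might be to drop repeated messages from the training mailbox before it is fed to sa-learn. The sketch below keeps the first message seen for each Message-ID header. The function name is hypothetical, and treating the Message-ID header as the sole duplicate key is an assumption made for illustration; sa-learn's own bookkeeping of what it has already learned may differ.

```python
from email.message import Message
from typing import Iterable


def unique_by_message_id(messages: Iterable[Message]) -> list[Message]:
    """Keep the first message for each Message-ID, preserving order.

    Messages without a Message-ID are always kept, since there is
    nothing to compare them on.
    """
    seen: set[str] = set()
    unique: list[Message] = []
    for msg in messages:
        msg_id = (msg.get("Message-ID") or "").strip()
        if msg_id:
            if msg_id in seen:
                continue  # already queued for training, skip the repeat
            seen.add(msg_id)
        unique.append(msg)
    return unique
```

The standard library's `mailbox.mbox(path)` yields message objects when iterated, so a training mailbox can be passed straight to this function.
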
> 
>> Based on my responses, what's the next move? Back up the Bayes DB, wipe
>> it, and feed my corpus through the ol' chipper?
> 
> That, and configure the user-based training to at the very least capture
> what they submit to a corpus so you can review it. Whether you do that
> review pre-training or post-bayes-is-insane is up to you.
> 

Right, right, that makes sense. I hope I can modify the Antispam plug-in
to accommodate this requirement.
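
For reference, the backup, wipe, and retrain sequence discussed above could be scripted roughly as follows. This is a sketch under assumptions: the function names and all file names are hypothetical, and the corpus is assumed to be kept as two mbox files. The sa-learn options used (`--backup`, `--clear`, `--spam`, `--ham`, `--mbox`, `--dump magic`) are standard ones. The default is a dry run that only prints the commands.

```python
import subprocess


def plan_rebuild(backup_file: str, spam_mbox: str, ham_mbox: str):
    """Return (argv, stdout_file) pairs in the order they should run."""
    return [
        (["sa-learn", "--backup"], backup_file),  # backup is written to stdout
        (["sa-learn", "--clear"], None),          # wipe the Bayes database
        (["sa-learn", "--spam", "--mbox", spam_mbox], None),
        (["sa-learn", "--ham", "--mbox", ham_mbox], None),
        (["sa-learn", "--dump", "magic"], None),  # show the resulting counts
    ]


def rebuild_bayes(backup_file: str, spam_mbox: str, ham_mbox: str,
                  dry_run: bool = True):
    """Print the plan, or execute it when dry_run is False."""
    steps = plan_rebuild(backup_file, spam_mbox, ham_mbox)
    for argv, stdout_file in steps:
        if dry_run:
            suffix = f" > {stdout_file}" if stdout_file else ""
            print(" ".join(argv) + suffix)
        elif stdout_file:
            with open(stdout_file, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)
    return steps
```

Because each step runs with `check=True`, a failed backup raises an exception and stops the run before the database is cleared.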

Well, I can't thank you enough here, John and everyone else. I seem to
be on the right track; all is not lost.

That said, it seems clear that SA is nowhere near as effective as it can
be when an off-the-shelf configuration is used (and without configuring
the MTA to do some of the blocking).

I'll keep the list posted (pardon the pun) with regard to configuring
Antispam to fire off a copy of any message that is submitted for
training. Ideally, whether the message is reviewed before or after
sa-learn is called will be configurable.

-Ben
