On 8/15/2012 4:05 PM, John Hardin wrote:
> On Wed, 15 Aug 2012, Ben Johnson wrote:
> 
>> On 8/15/2012 2:24 PM, John Hardin wrote:
>>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>>
>>>> Some 99% of the spam that I receive, which is grossly spammy (we're
>>>> talking auto loans, cash advances, dink pills, the whole lot) contains
>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>>>
>>>> Might anyone know why?
>>>
>>> Poor training.
>>
>> John, I can't thank you enough for the thoroughness of your response.
> 
> I like to show off. :)
> 
>>> Apart from the Bayes score, what kind of scores are those spams getting?
>>
>> Here are a few examples (the first two of which are two of VERY few in
>> which the BAYES_* value is over 00):
>>
>> -----------------
>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>>
>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
>> SPF_PASS=-0.001] autolearn=no
>>
>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>>
>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
>> URIBL_RHS_DOB=1.514] autolearn=no
>> -----------------
> 
> It might be interesting to see some log entries where autolearn=yes...

Here you go:

No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham

No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham
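
For anyone who wants to check the arithmetic on these, here is a minimal Python sketch that pulls the per-rule scores out of a status line like the ones above and re-adds them with a given rule left out. The helper names (parse_tests, total_without) are my own and not part of SpamAssassin or Amavis, and it assumes the "tests=[NAME=score, ...]" layout shown in these log lines. It only re-adds the logged scores; it does not try to reproduce how autolearn itself makes its decision.

```python
import re

def parse_tests(status):
    """Return {rule_name: score} from the tests=[...] part of a status line."""
    match = re.search(r"tests=\[(.*?)\]", status, re.DOTALL)
    if match is None:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        # Each item looks like "RULE_NAME=1.234", possibly with a leading newline.
        name, _, value = item.strip().partition("=")
        if name and value:
            scores[name] = float(value)
    return scores

def total_without(scores, excluded=()):
    """Sum the rule scores, skipping any rule named in `excluded`."""
    return round(sum(v for k, v in scores.items() if k not in excluded), 3)

# Third autolearn=ham example from this message.
status = """No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=ham"""

scores = parse_tests(status)
print(total_without(scores))                                  # all rules
print(total_without(scores, excluded={"RCVD_IN_DNSWL_MED"}))  # DNSWL left out
```

With all rules included, the sum matches the logged score of -2.5; with RCVD_IN_DNSWL_MED left out, the same message comes to about -0.2.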

>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more
>> of a negative impact (pardon the pun) than BAYES_*. I am already
>> working with the dnswl.org folks (off-list, for privacy reasons) to
>> get to the bottom of that issue.
> 
> This might be a major contributing factor. If your system was taught
> from scratch by autolearn, and DNSWL (which is fairly well trusted) has
> been pushing a lot of spams to low scores...

It looks as though this is exactly what happened. I'll post back once
I've done some more troubleshooting with the folks at dnswl.org.

> You might want to set:
>     bayes_auto_learn_threshold_nonspam -3

Done.

> That won't _fix_ the problem (at least not quickly) or avoid the need to
> wipe and retrain, but it might keep things from getting worse.

I disabled auto-learn and executed "sa-learn --clear", too. So, I should
be starting with a "clean slate", right?

I have also disabled the DNSWL rules, until the issue can be resolved,
and will begin manual training immediately.
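
A rough sketch of what I mean by manual training, in Python: it only builds the sa-learn command lines for one directory of hand-sorted spam and one of hand-sorted ham, and prints them without running anything. The directory paths are hypothetical placeholders and the function name is my own; --spam and --ham are the standard sa-learn options.

```python
import shlex

def training_commands(spam_dir, ham_dir):
    """Build (but do not run) sa-learn commands for hand-sorted mail."""
    return [
        ["sa-learn", "--spam", spam_dir],  # learn every message here as spam
        ["sa-learn", "--ham", ham_dir],    # learn every message here as ham
    ]

# Hypothetical paths; substitute wherever the sorted mail actually lives.
for cmd in training_commands("/path/to/sorted/spam", "/path/to/sorted/ham"):
    print(shlex.join(cmd))
```

To actually train, each command list could be handed to subprocess.run(cmd, check=True) once the paths are real.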

> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
> 
>> Most of the list is probably laughing, but given the complexity of
>> SpamAssassin, this crucial requirement was lost on me, amidst the sea of
>> information and instructions. For example, there is no mention of the
>> fact that SA is essentially useless without Bayesian training on
>> http://wiki.apache.org/spamassassin/StartUsing .
> 
> That's because that shouldn't be the case. The base ruleset + URIBL
> should be very effective pretty much out-of-the-box.
> 
>>> What version of SA is this?
>>
>> # spamassassin --version
>> SpamAssassin version 3.3.1
>>  running on Perl version 5.10.1
> 
> A little stale, but not bad.

'Tis the major drawback with using LTS Linux distros and managing
software via packages, I suppose.

>>> You may also want to set up some mechanism for users to submit
>>> misclassified messages for training. Depending on how much you trust
>>> their judgement the learning from these can be automatic or can go
>>> through you as a reviewer.
>>
>> That sounds like a good idea. Is there a particular HOW TO or tutorial
>> that you recommend? If it depends on the environment/configuration, this
>> server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and SpamAssassin.
> 
> I'm not sure; I don't lurk the Wiki much. About the best I can suggest
> is to search the SA users mailing list archives for "training dovecot".
> 

Thanks, I'll look into setting up IMAP folders for individual users in
some programmatic way.

-Ben
