On 8/16/2012 10:14 AM, Ben Johnson wrote:
> 
> 
> On 8/15/2012 4:05 PM, John Hardin wrote:
>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>
>>> On 8/15/2012 2:24 PM, John Hardin wrote:
>>>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>>>
>>>>> Some 99% of the spam that I receive, which is grossly spammy (we're
>>>>> talking auto loans, cash advances, dink pills, the whole lot) contains
>>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>>>>
>>>>> Might anyone know why?
>>>>
>>>> Poor training.
>>>
>>> John, I can't thank you enough for the thoroughness of your response.
>>
>> I like to show off. :)
>>
>>>> Apart from the Bayes score, what kind of scores are those spams getting?
>>>
>>> Here are a few examples (the first two of which are two of VERY few in
>>> which the BAYES_* value is over 00):
>>>
>>> -----------------
>>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
>>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>>>
>>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
>>> SPF_PASS=-0.001] autolearn=no
>>>
>>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>>>
>>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
>>> URIBL_RHS_DOB=1.514] autolearn=no
>>> -----------------
>>
>> It might be interesting to see some log entries where autolearn=yes...
> 
> Here you go:
> 
> No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham
> 
> No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
> SPF_PASS=-0.001] autolearn=ham
> 
> No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
> URIBL_DBL_SPAM=1.7] autolearn=ham
> 
> No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
> SPF_PASS=-0.001] autolearn=ham
> 
>>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more
>>> of a negative impact (pardon the pun) than BAYES_*. I am already
>>> working with the dnswl.org folks (off-list, for privacy reasons) to
>>> get to the bottom of that issue.
>>
>> This might be a major contributing factor. If your system was taught
>> from scratch by autolearn, and DNSWL (which is fairly well trusted) has
>> been pushing a lot of spams to low scores...
> 
> It looks as though this is exactly what happened. I'll post back once
> I've done some more troubleshooting with the folks at dnswl.org.
> 
>> You might want to set:
>>     bayes_auto_learn_threshold_nonspam -3
> 
> Done.
> 
>> That won't _fix_ the problem (at least not quickly) or avoid the need to
>> wipe and retrain, but it might keep things from getting worse.
> 
> I disabled auto-learn and executed "sa-learn --clear", too. So, I should
> be starting with a "clean slate", right?
> 
> I have also disabled the DNSWL rules, until the issue can be resolved,
> and will begin manual training immediately.
> 
>> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
>>
>>> Most of the list is probably laughing, but given the complexity of Spam
>>> Assassin, this crucial requirement was lost on me, amidst the sea of
>>> information and instructions. For example, there is no mention of the
>>> fact that SA is essentially useless without Bayesian training on
>>> http://wiki.apache.org/spamassassin/StartUsing .
>>
>> That's because that shouldn't be the case. The base ruleset + URIBL
>> should be very effective pretty much out-of-the-box.
>>
>>>> What version of SA is this?
>>>
>>> # spamassassin --version
>>> SpamAssassin version 3.3.1
>>>  running on Perl version 5.10.1
>>
>> A little stale, but not bad.
> 
> 'Tis the major drawback with using LTS Linux distros and managing
> software via packages, I suppose.
> 
>>>> You may also want to set up some mechanism for users to submit
>>>> misclassified messages for training. Depending on how much you trust
>>>> their judgement the learning from these can be automatic or can go
>>>> through you as a reviewer.
>>>
>>> That sounds like a good idea. Is there a particular HOW TO or tutorial
>>> that you recommend? If it depends on the environment/configuration, this
>>> server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.
>>
>> I'm not sure, I don't lurk the Wiki much. About the best I can suggest
>> is search the SA users mailing list archives for "training dovecot".
>>
> 
> Thanks, I'll look into setting up IMAP folders for individual users in
> some programmatic way.
> 
> -Ben
> 

So, after disabling auto-learn (for now), executing "sa-learn --clear",
and restarting Amavis, I'm still seeing this:

No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=disabled

Why is BAYES_00 still firing? Am I running the wrong command to clear the database?

Or will this happen until I begin the manual training?
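
For reference, here is a minimal Python sketch that checks the arithmetic in
the status line above. It only parses the text; it does not talk to
SpamAssassin or Amavis. The helper name parse_spam_status is made up for
illustration. It sums the listed test scores and shows what the total would
be with the BAYES_* hit left out.

```python
import re


def parse_spam_status(status):
    """Return (reported_score, {test_name: score}) from a status string.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """
    # Header values are folded across lines, so collapse whitespace first.
    flat = " ".join(status.split())
    score_match = re.search(r"score=(-?\d+(?:\.\d+)?)", flat)
    tests_match = re.search(r"tests=\[([^\]]*)\]", flat)
    reported = float(score_match.group(1))
    tests = {}
    for item in tests_match.group(1).split(","):
        name, _, value = item.strip().partition("=")
        if name:
            tests[name] = float(value)
    return reported, tests


status = """No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=disabled"""

reported, tests = parse_spam_status(status)

# Sum of every listed test; should match the reported score.
total = round(sum(tests.values()), 3)

# Same sum with any BAYES_* rule excluded.
without_bayes = round(
    sum(v for k, v in tests.items() if not k.startswith("BAYES_")), 3
)

print(reported, total, without_bayes)
```

The listed tests do add up to the reported 0.593. Without BAYES_00 the same
message would come to 2.493, which is still under the tag2=3 level shown in
the log.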

Thanks,

-Ben
