On 8/16/2012 10:14 AM, Ben Johnson wrote:
>
>
> On 8/15/2012 4:05 PM, John Hardin wrote:
>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>
>>> On 8/15/2012 2:24 PM, John Hardin wrote:
>>>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>>>
>>>>> Some 99% of the spam that I receive, which is grossly spammy (we're
>>>>> talking auto loans, cash advances, dink pills, the whole lot) contains
>>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>>>>
>>>>> Might anyone know why?
>>>>
>>>> Poor training.
>>>
>>> John, I can't thank you enough for the thoroughness of your response.
>>
>> I like to show off. :)
>>
>>>> Apart from the Bayes score, what kind of scores are those spams getting?
>>>
>>> Here are a few examples (the first two of which are two of VERY few in
>>> which the BAYES_* value is over 00):
>>>
>>> -----------------
>>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
>>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>>>
>>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
>>> SPF_PASS=-0.001] autolearn=no
>>>
>>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>>>
>>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
>>> URIBL_RHS_DOB=1.514] autolearn=no
>>> -----------------
>>
>> It might be interesting to see some log entries where autolearn=yes...
>
> Here you go:
>
> No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham
>
> No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
> SPF_PASS=-0.001] autolearn=ham
>
> No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
> URIBL_DBL_SPAM=1.7] autolearn=ham
>
> No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
> SPF_PASS=-0.001] autolearn=ham
>
>>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more
>>> of a negative impact (pardon the pun) than BAYES_*. I am already
>>> working with the dnswl.org folks (off-list, for privacy reasons) to
>>> get to the bottom of that issue.
>>
>> This might be a major contributing factor. If your system was taught
>> from scratch by autolearn, and DNSWL (which is fairly well trusted) has
>> been pushing a lot of spams to low scores...
>
> It looks as though this is exactly what happened. I'll post back once
> I've done some more troubleshooting with the folks at dnswl.org.
>
>> You might want to set:
>> bayes_auto_learn_threshold_nonspam -3
>
> Done.
>
>> That won't _fix_ the problem (at least not quickly) or avoid the need to
>> wipe and retrain, but it might keep things from getting worse.
>
> I disabled auto-learn and executed "sa-learn --clear", too. So, I should
> be starting with a "clean slate", right?
>
> I have also disabled the DNSWL rules, until the issue can be resolved,
> and will begin manual training immediately.
>
>> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
>>
>>> Most of the list is probably laughing, but given the complexity of Spam
>>> Assassin, this crucial requirement was lost on me, amidst the sea of
>>> information and instructions. For example, there is no mention of the
>>> fact that SA is essentially useless without Bayesian training on
>>> http://wiki.apache.org/spamassassin/StartUsing .
>>
>> That's because that shouldn't be the case. The base ruleset + URIBL
>> should be very effective pretty much out-of-the-box.
>>
>>>> What version of SA is this?
>>>
>>> # spamassassin --version
>>> SpamAssassin version 3.3.1
>>> running on Perl version 5.10.1
>>
>> A little stale, but not bad.
>
> 'Tis the major drawback with using LTS Linux distros and managing
> software via packages, I suppose.
>
>>>> You may also want to set up some mechanism for users to submit
>>>> misclassified messages for training. Depending on how much you trust
>>>> their judgement the learning from these can be automatic or can go
>>>> through you as a reviewer.
>>>
>>> That sounds like a good idea. Is there a particular HOW TO or tutorial
>>> that you recommend? If it depends on the environment/configuration, this
>>> server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.
>>
>> I'm not sure, I don't lurk the Wiki much. About the best I can suggest
>> is search the SA users mailing list archives for "training dovecot".
>>
>
> Thanks, I'll look into setting-up IMAP folders for individual users in
> some programmatic way.
>
> -Ben
>

So, after disabling auto-learn (for now), executing "sa-learn --clear",
and restarting Amavis, I'm still seeing this:

No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=disabled

Why is BAYES_00 still firing? Am I running the wrong command to clear the
database? Or will this happen until I begin the manual training?

Thanks,

-Ben
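
P.S. For anyone who wants to check the arithmetic on that last header, here is a rough Python sketch (not part of SpamAssassin or Amavis; the helper name parse_tests is just something made up for illustration). It assumes the tests=[...] list has the NAME=score format shown in the headers in this thread, and it simply sums the per-rule scores with and without the BAYES_* entry.

```python
import re


def parse_tests(status: str) -> dict:
    """Return {rule_name: score} from the tests=[...] part of a status line.

    Hypothetical helper; assumes comma-separated NAME=score pairs.
    """
    match = re.search(r"tests=\[(.*?)\]", status, re.DOTALL)
    if match is None:
        return {}
    tests = {}
    for item in match.group(1).split(","):
        name, _, value = item.strip().partition("=")
        if name and value:
            tests[name] = float(value)
    return tests


# The header value quoted in this message, copied verbatim.
status = (
    "No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, "
    "URIBL_DBL_SPAM=1.7] autolearn=disabled"
)

tests = parse_tests(status)

# Sum of all rule scores, rounded to the three decimals the header uses.
total = round(sum(tests.values()), 3)

# The same sum with any BAYES_* rule left out.
without_bayes = round(
    sum(score for name, score in tests.items() if not name.startswith("BAYES_")),
    3,
)

print(total)          # 0.593
print(without_bayes)  # 2.493
```

So BAYES_00 alone accounts for the difference between 0.593 and 2.493 on this message, which is still under the tag2=3 level either way.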