On 8/15/2012 4:05 PM, John Hardin wrote:
> On Wed, 15 Aug 2012, Ben Johnson wrote:
>
>> On 8/15/2012 2:24 PM, John Hardin wrote:
>>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>>
>>>> Some 99% of the spam that I receive, which is grossly spammy (we're
>>>> talking auto loans, cash advances, dink pills, the whole lot) contains
>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>>>
>>>> Might anyone know why?
>>>
>>> Poor training.
>>
>> John, I can't thank you enough for the thoroughness of your response.
>
> I like to show off. :)
>
>>> Apart from the Bayes score, what kind of scores are those spams getting?
>>
>> Here are a few examples (the first two of which are two of VERY few in
>> which the BAYES_* value is over 00):
>>
>> -----------------
>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>>
>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
>> SPF_PASS=-0.001] autolearn=no
>>
>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>>
>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
>> URIBL_RHS_DOB=1.514] autolearn=no
>> -----------------
>
> It might be interesting to see some log entries where autolearn=yes...

Here you go:

No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham

No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham

>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more
>> of a negative impact (pardon the pun) than BAYES_*. I am already
>> working with the dnswl.org folks (off-list, for privacy reasons) to
>> get to the bottom of that issue.
>
> This might be a major contributing factor. If your system was taught
> from scratch by autolearn, and DNSWL (which is fairly well trusted) has
> been pushing a lot of spams to low scores...

It looks as though this is exactly what happened. I'll post back once
I've done some more troubleshooting with the folks at dnswl.org.

> You might want to set:
>   bayes_auto_learn_threshold_nonspam -3

Done.

> That won't _fix_ the problem (at least not quickly) or avoid the need to
> wipe and retrain, but it might keep things from getting worse.

I disabled auto-learn and executed "sa-learn --clear", too. So, I should
be starting with a "clean slate", right?

I have also disabled the DNSWL rules, until the issue can be resolved,
and will begin manual training immediately.

> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
>
>> Most of the list is probably laughing, but given the complexity of Spam
>> Assassin, this crucial requirement was lost on me, amidst the sea of
>> information and instructions.
>> For example, there is no mention of the
>> fact that SA is essentially useless without Bayesian training on
>> http://wiki.apache.org/spamassassin/StartUsing .
>
> That's because that shouldn't be the case. The base ruleset + URIBL
> should be very effective pretty much out-of-the-box.
>
>>> What version of SA is this?
>>
>> # spamassassin --version
>> SpamAssassin version 3.3.1
>> running on Perl version 5.10.1
>
> A little stale, but not bad.

'Tis the major drawback with using LTS Linux distros and managing
software via packages, I suppose.

>>> You may also want to set up some mechanism for users to submit
>>> misclassified messages for training. Depending on how much you trust
>>> their judgement the learning from these can be automatic or can go
>>> through you as a reviewer.
>>
>> That sounds like a good idea. Is there a particular HOW TO or tutorial
>> that you recommend? If it depends on the environment/configuration, this
>> server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.
>
> I'm not sure, I don't lurk the Wiki much. About the best I can suggest
> is search the SA users mailing list archives for "training dovecot".
>

Thanks, I'll look into setting-up IMAP folders for individual users in
some programmatic way.

-Ben
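
For anyone following the score lines in this thread, here is a minimal Python sketch of the arithmetic behind them. It assumes only the line layout shown above (score=... followed by tests=[NAME=value, ...]). The helper names parse_status and score_without are made up for illustration and are not part of SpamAssassin or Amavis. It checks that the reported score is the sum of the listed test scores, then shows what one of the quoted spams would have totalled without the RCVD_IN_DNSWL_MED and BAYES_* contributions.

```python
import re

def parse_status(line):
    """Return (reported_score, {test_name: score}) from one status line.

    Hypothetical helper: assumes the 'score=... tests=[A=1, B=2]' layout
    seen in the log lines quoted in this thread.
    """
    reported = float(re.search(r"score=(-?\d+(?:\.\d+)?)", line).group(1))
    body = re.search(r"tests=\[(.*?)\]", line, re.S).group(1)
    tests = {}
    for item in body.split(","):
        name, _, value = item.strip().partition("=")
        tests[name] = float(value)
    return reported, tests

def score_without(tests, *prefixes):
    """Sum the test scores, skipping tests whose names start with a given prefix."""
    kept = (v for k, v in tests.items() if not k.startswith(prefixes))
    return round(sum(kept), 3)

# One of the spam examples quoted earlier in the thread.
line = ("No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, "
        "HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, "
        "RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, "
        "URIBL_RHS_DOB=1.514] autolearn=no")

reported, tests = parse_status(line)
print(reported, score_without(tests))                      # 1.256 1.256
print(score_without(tests, "RCVD_IN_DNSWL"))               # 3.556
print(score_without(tests, "RCVD_IN_DNSWL", "BAYES_"))     # 5.456
```

With the DNSWL contribution removed, this message lands at 3.556, above the tag2=3 level shown in the same line. With the mistrained BAYES_00 score removed as well, it reaches 5.456.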