On 10/14/2013 2:47 PM, Adam Katz wrote:
> On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
>> These two rules are adding 4.0 pts [...]
>> Content analysis details:   (4.8 points, 4.2 required)
>>  pts rule name              description
>> ---- ---------------------- ------------------------------------------
>>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>>                             [score: 0.5314]
>
> The others have addressed the "two rules" you mentioned, so I'll leave
> that alone in this email.
>
> There's more here than that:  If you're using Bayes, you have to train
> it.  Right now, it's hurting you:  Those 0.8 points should be some
> negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
> and BAYES_05), which would then have made that message score 2.1 or 3.5,
> both of which are below your 4.2 threshold (which is already too low!).
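The arithmetic in the quoted paragraph can be checked with a short sketch. It uses only the three rule scores from the report above and the two default Bayes scores Adam cites; the names `REPORT`, `BAYES_DEFAULTS` and `rescored` are illustrative and not part of SpamAssassin.

```python
# Rule scores copied from the quoted "Content analysis details" report.
REPORT = {
    "FSL_HELO_BARE_IP_2": 2.8,
    "RCVD_NUMERIC_HELO": 1.2,
    "BAYES_50": 0.8,
}
THRESHOLD = 4.2  # the "required" value shown in the report

# Default scores cited in the quoted text for the two lowest Bayes buckets.
BAYES_DEFAULTS = {"BAYES_00": -1.9, "BAYES_05": -0.5}


def rescored(report, bayes_rule):
    """Total score if the BAYES_* hit were replaced by `bayes_rule`."""
    base = sum(v for k, v in report.items() if not k.startswith("BAYES_"))
    # Round to one decimal, as the report does, to hide float noise.
    return round(base + BAYES_DEFAULTS[bayes_rule], 1)


if __name__ == "__main__":
    print("as reported:", round(sum(REPORT.values()), 1))
    for rule in BAYES_DEFAULTS:
        total = rescored(REPORT, rule)
        # A message is tagged when its score reaches the required value.
        verdict = "tagged" if total >= THRESHOLD else "not tagged"
        print(f"with {rule}: {total} ({verdict} at {THRESHOLD})")
```

Either replacement leaves the message under the 4.2 threshold, which is the point being made: the two HELO rules alone contribute 4.0, so the Bayes result decides the outcome.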
There's no doubt my Bayes isn't working.  I ran a few hundred each of
ham and spam through sa-learn just after installing SA a year or more
ago.  I haven't fed it regularly since, though I have run through maybe
a few dozen spam that weren't scored high enough.  And I think I may
have inadvertently run through one or two msgs that had anti-Bayesian
text blocks in them: Bible verses, Wikipedia content, etc.

I just ran 120 hams through; about half were msgs previously tagged
with BAYES_60 through BAYES_95.

~$ sa-learn --ham --mbox --progress /home/stan/mail/ham
Learned tokens from 0 message(s) (0 message(s) examined)

Obviously there's a problem, with no tokens learned.  A few questions:

1.  Is the database the problem?  If so...
2.  Is there a way to flush the Bayes database and restart training?
3.  Should I even be using Bayes on a mail stream that is over 99.9%
    technical list mail, replete with lots of C code?  For some reason
    my Bayes likes to add +3 to +3.5 to most messages from the XFS list
    that contain code.

> On that threshold: there are better ways to nail more spam than
> lowering the threshold.  SpamAssassin is highly tuned for 5.0 and while
> it's safe to bump that threshold up (more conservative, e.g. I block at
> 8.0 and flag at 5.0), it is not as safe to pull it down.

I'd guess your mail stream is quite different.  FYI, of all non-list
mail entering my MX, I block about 98% of spam at SMTP before it enters
the queue.  SA does pretty well at catching the last 2% and AFAIK it has
never tagged any non-list mail.  WRT the list mail, it does just 'ok'
tagging spam.  But it FPs about twice as many ham.  I haven't kept track
of hard numbers.  It's just not worth the time, frankly.

I installed SA with only one goal in mind: stop the "last 2%" -- the
list-borne spam which I can't touch with SMTP restrictions, and the very
few non-list spam that make it into the queue.  This is a tall order,
which is why I had low hopes from the start.
I subbed to the list not because the spam catch rate was too low to
tolerate, but because ham tagging suddenly went through the roof due to
one rule.

> Better way #1: plugins.  Razor2, Pyzor, DCC.  Decently drop-in (though
> DCC isn't as easy as it once was).
>
> Better way #2: Bayes.  Set it up to facilitate better training.  Create
> "learn-spam" and "learn-nonspam" folders for each user and run cron jobs
> that run sa-learn (or better, spamassassin -r so you can learn and
> report them) and then empty the folders.  Once you can trust Bayes, you
> can increase the magnitude of its scores.  Do this slowly and carefully.

Given that the overall content of my list mail doesn't change much day
to day, or over time, I would think that manually training ham once with
a few hundred msgs would be sufficient.  Then train spam on occasion.
Which is what I've done.

My guess is that my use case is unique enough that I'll need to do a lot
of manual tuning, creating custom rules, etc, to get SA working somewhat
well, without the occasional FP spike that brought me here.  Which is
exactly what I do NOT want to do.  I've spent years tuning Postfix to
nail 98% of wire-bound spam.  I'd rather not spend many more years
tweaking SA to catch the last 2% sneaking in through a few mailing
lists...

> Better way #3: AWL.  This is now disabled by default, in part due to
> misunderstandings (it is horribly named; it's as much a black list as it
> is a white list, and it's not as "persistent" as its storage model
> purports).
<snip>

Which is exactly why I didn't enable it.

>> Received: from bendel.debian.org (bendel.debian.org [82.195.75.100])
>>  by greer.hardwarefreak.com (Postfix) with ESMTP id C95BD6C0CE
>>  for <s...@hardwarefreak.com>; Sat, 12 Oct 2013 10:23:37 -0500 (CDT)
>> [...]
>> X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on bendel.debian.org
>> X-Spam-Level:
>> X-Spam-Status: No, score=-9.6 required=4.0 tests=FOURLA,FREEMAIL_FROM,
>>  LDOSUBSCRIBER,LDO_WHITELIST,RCVD_NUMERIC_HELO,T_RP_MATCHES_RCVD,
>>  T_TO_NO_BRKTS_FREEMAIL autolearn=unavailable version=3.3.2
>> [...]
>> X-Amavis-Spam-Status: No, score=-5.735 tagged_above=-10000 required=5.3
>>  tests=[BAYES_00=-2, FOURLA=0.1, FREEMAIL_FROM=0.001, LDO_WHITELIST=-5,
>>  RCVD_IN_DNSWL_NONE=-0.0001, RCVD_NUMERIC_HELO=1.164,
>>  T_RP_MATCHES_RCVD=-0.01, T_TO_NO_BRKTS_FREEMAIL=0.01] autolearn=ham
>
> Another option is to trust Debian's SA instance.  You can add
> 82.195.75.100 to trusted_networks in your local.cf.  Be careful, this
> would mean inheriting some of Debian's false negatives.

That makes little sense, given my stated reasons for using SA.

-- 
Stan