On 10/14/2013 2:47 PM, Adam Katz wrote:
> On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
>> These two rules are adding 4.0 pts [...]
>> Content analysis details:   (4.8 points, 4.2 required)
>>  pts rule name              description
>> ---- ---------------------------------------------------------------------
>>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>>                             [score: 0.5314]
> 
> The others have addressed the "two rules" you mentioned, so I'll leave
> that alone in this email.
> 
> There's more here than that:  If you're using Bayes, you have to train
> it.  Right now, it's hurting you:  Those 0.8 points should be some
> negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
> and BAYES_05), which would then have made that message score 2.1 or 3.5,
> both of which are below your 4.2 threshold (which is already too low!).
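
The arithmetic in the quoted paragraph can be checked with a short sketch.
The rule scores are copied from the report above, and -1.9 and -0.5 are the
BAYES_00 and BAYES_05 defaults as stated in the quote. The helper name
rescored is hypothetical and used only for illustration.

```python
# Rule scores copied from the quoted content analysis report.
hits = {"FSL_HELO_BARE_IP_2": 2.8, "RCVD_NUMERIC_HELO": 1.2, "BAYES_50": 0.8}
THRESHOLD = 4.2  # the "required" value shown in the report


def rescored(hits, bayes_score):
    """Return the total with the BAYES_50 score swapped for bayes_score."""
    others = sum(v for k, v in hits.items() if k != "BAYES_50")
    return round(others + bayes_score, 1)


# Scores for BAYES_00 and BAYES_05 as given in the quoted text (assumed defaults).
for name, score in (("BAYES_00", -1.9), ("BAYES_05", -0.5)):
    total = rescored(hits, score)
    status = "below threshold" if total < THRESHOLD else "not below threshold"
    print(name, total, status)
```

Both totals come out under 4.2, which matches the 2.1 and 3.5 figures in the
quote.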

There's no doubt my Bayes isn't working.  I ran a few hundred each of
ham and spam through sa-learn just after installing SA some year+ ago.
I haven't regularly fed it since, though I have run through maybe a few
dozen spam that weren't scored high enough.  And I think I may have
inadvertently run through one or two msgs that had anti-Bayesian text
blocks in them: Bible verses, Wikipedia content, etc.

I just ran 120 hams through; about half were msgs previously tagged with
BAYES_60 through BAYES_95.

~$ sa-learn --ham --mbox --progress /home/stan/mail/ham
Learned tokens from 0 message(s) (0 message(s) examined)

Obviously there's a problem:  no messages were examined, so no tokens
were learned.  A few questions:

1.  Is the database the problem?  If so...

2.  Is there a way to flush the Bayes database and restart training?

3.  Should I even be using Bayes on a mail stream that is over
    99.9% technical list mail, replete with lots of C code?  For
    some reason my Bayes likes to add 3 to 3.5 points to most
    messages from the XFS list that contain code.
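
One way to sanity-check the "0 message(s) examined" result is to see whether
the file parses as an mbox at all. The sketch below is not sa-learn's own
logic: it uses Python's standard mailbox module, and count_mbox_messages is a
hypothetical helper name. sa-learn has its own parser and may count
differently, so a result of 0 here only suggests that the path is missing, is
a directory (such as a Maildir), or lacks "From " separator lines.

```python
import mailbox
import os


def count_mbox_messages(path):
    """Return how many messages Python's mbox parser finds in path.

    Returns 0 when path is missing or is a directory (e.g. a Maildir),
    since neither can be read as a single mbox file.
    """
    if not os.path.isfile(path):
        return 0
    # create=False: never create an empty mailbox as a side effect.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

For example, count_mbox_messages("/home/stan/mail/ham") returning 0 would
point at the input file rather than at the Bayes database.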

> On that threshold:  there are better ways to nail more spam than
> lowering the threshold.  SpamAssassin is highly tuned for 5.0 and while
> it's safe to bump that threshold up (more conservative, e.g. I block at
> 8.0 and flag at 5.0), it is not as safe to pull it down.

I'd guess your mail stream is quite different.  FYI, of all non-list
mail entering my MX, I block about 98% of spam at SMTP before it enters
the queue.  SA does pretty well at catching the last 2%, and AFAIK it has
never falsely tagged any non-list mail.  WRT the list mail it does just
'ok' tagging spam, but it FPs on about twice as many ham as the spam it
catches.  I haven't kept track of hard numbers.  It's just not worth the
time, frankly.

I installed SA with only one goal in mind:  stop the "last 2%" -- the
list-borne spam which I can't touch with SMTP restrictions, and the very
few non-list spam that make it into the queue.  This is a tall order,
which is why I had low hopes from the start.  I subbed to this list not
because the spam catch rate was too low to tolerate, but because ham
tagging suddenly went through the roof due to one rule.

> Better way #1: plugins.  Razor2, Pyzor, DCC.  Decently drop-in (though
> DCC isn't as easy as it once was).
> 
> Better way #2: Bayes.  Set it up to facilitate better training.  Create
> "learn-spam" and "learn-nonspam" folders for each user and run cron jobs
> that run sa-learn (or better, spamassassin -r so you can learn and
> report them) and then empty the folders.  Once you can trust Bayes, you
> can increase the magnitude of its scores.  Do this slowly and carefully.

Given that the overall content of my list mail doesn't change much day
to day, or over time, I would think that manually training ham once with
a few hundred msgs would be sufficient.  Then train spam on occasion.
Which is what I've done.

My guess is that my use case is unique enough that I'll need to do a lot
of manual tuning, creating custom rules, etc, to get SA working somewhat
well, without the occasional FP spike that brought me here.  Which is
exactly what I do NOT want to do.  I've spent years tuning Postfix to
nail 98% of wire bound spam.  I'd rather not spend many more years
tweaking SA to catch the last 2% sneaking in through a few mailing lists...

> Better way #3: AWL.  This is now disabled by default, in part due to
> misunderstandings (it is horribly named; it's as much a black list as it
> is a white list, and it's not as "persistent" as its storage model
> purports). <snip>

Which is exactly why I didn't enable it.

>> Received: from bendel.debian.org (bendel.debian.org [82.195.75.100])
>>      by greer.hardwarefreak.com (Postfix) with ESMTP id C95BD6C0CE
>>      for <s...@hardwarefreak.com>; Sat, 12 Oct 2013 10:23:37 -0500 (CDT)
>> [...]
>> X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on bendel.debian.org
>> X-Spam-Level:
>> X-Spam-Status: No, score=-9.6 required=4.0 tests=FOURLA,FREEMAIL_FROM,
>>      LDOSUBSCRIBER,LDO_WHITELIST,RCVD_NUMERIC_HELO,T_RP_MATCHES_RCVD,
>>      T_TO_NO_BRKTS_FREEMAIL autolearn=unavailable version=3.3.2
>> [...]
>> X-Amavis-Spam-Status: No, score=-5.735 tagged_above=-10000 required=5.3
>>      tests=[BAYES_00=-2, FOURLA=0.1, FREEMAIL_FROM=0.001, LDO_WHITELIST=-5,
>>      RCVD_IN_DNSWL_NONE=-0.0001, RCVD_NUMERIC_HELO=1.164,
>>      T_RP_MATCHES_RCVD=-0.01, T_TO_NO_BRKTS_FREEMAIL=0.01] autolearn=ham
> 
> Another option is to trust Debian's SA instance.  You can add
> 82.195.75.100 to trusted_networks in your local.cf.  Be careful, this
> would mean inheriting some of Debian's false negatives.

That makes little sense, given my stated reasons for using SA.

-- 
Stan
