RE: [SAtalk] Default Bayes scoring, and default cutoff value - too many false positives

Gary Funck Thu, 14 Aug 2003 21:23:42 -0700


> -----Original Message-----
> From: Robert Menschel
> Sent: Tuesday, August 05, 2003 8:29 PM
[...]
>
> Of those 1100 messages, how many were spam, and how many were ham? I
> don't think I've seen more than a half dozen FPs in any *month*, much
> less a day.
>
> GF> Generally, I'm using SA in local mode, and backing out to network
> GF> mode only when local says no ham was found.
>
> So you're running SA against your rule set and Bayes without DNSBL
> checks, and then if these do not scream SPAM (high score) or HAM
> (negative score), you then check DNSBL to see if they will give a spam
> score?
>
> GF> Given my ham to spam ratio (roughly 1 to 5) that's been okay, but it
> GF> probably leads to a surprising result where spam is over-aggressively
> GF> mis-classified. I'm using 2.60 cvs (6/30) at the moment, but I think
> GF> the same problem would come up on version 2.55.
>
> Very possibly not -- 2.60 doesn't yet have statistically determined
> rules; the rule set is more advanced than 2.55, and to my knowledge
> hasn't yet been run against the giant SA corpus available to the
> developers. After that process the rule score defaults are adjusted to
> minimize FPs. Again to my knowledge, that FP minimization step hasn't yet
> taken place for 2.60.
>
> GF> The problem is that I'm seeing these misclassified spams as having
> GF> only, or nearly only, BAYES_99 asserted. ...
>
> I don't remember ever seeing BAYES_99 on anything that wasn't spam,
> and I've only seen BAYES_90 on non-spam once in three months. That leads
> me to question the accuracy of your original corpus.  How was it built
> and classified?  What are the chances that persons A and B classified
> emails as spam, and Bayes learned it as spam, while persons C and D claim
> these are not spam?
>
> GF> Using BAYES_99 as an example, it will be scored 5.2 with Bayes
> GF> enabled, while running in local (non-network) mode, and only 3.008
> GF> when networking is enabled. Trouble is, that 5.2 exceeds the default
> GF> cut off of 5. ...
>
> GF> What I'm working up to here: For those of you using Bayes, did you
> GF> also move your threshold value up (to say, 7 or above), or do you
> GF> simply tolerate more false positives? (I'd have to say that the
> GF> four/five false positives I'm now seeing per day, and didn't see
> GF> before is too high a number for my tastes).
>
> I rely heavily on Bayes. I run with a required hits of 9.0, and I run
> with BAYES_99 set at 9.0, and with BAYES_90 set at 7.5 (83% of
> threshold). I think I got one FP in all of July, and it had a low Bayes
> score.
>
> So in summary, no, I don't think your Bayes *scores* are the problem. I
> think the main problem is that Bayes learned ham as spam. I would suggest
> checking through your spam corpus and relearning any misclassified emails
> as ham.
>


Hello Bob,

I ran sa-learn last night on two large spam and ham mboxes that I'd been
collecting. To reduce interactions between auto-whitelisting and Bayes,
I removed my auto-whitelists, so I'm now depending strongly on SA's regular
scoring and Bayes scoring. I saw no false positives out of 200 or so spam
messages, so, as you say, the issue must have been that auto-learn wasn't
feeding Bayes a balanced diet of ham and spam, and it was therefore having
trouble telling ham from spam. I also just call SA with full network checks
now, and don't try local checks first. This results in BAYES_99 being scored
in the 3 range rather than over 5.
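To make the arithmetic behind that last point concrete, here is a small
Python sketch. It is not SpamAssassin code: the function name is made up for
illustration, the "score reaches the threshold" comparison is an assumption
about how the cutoff is applied, and the only numbers used are the BAYES_99
figures quoted earlier in this thread (5.2 in local mode, 3.008 with network
tests, default cutoff of 5).

```python
# Illustrative sketch only; not SpamAssassin's implementation.
# BAYES_99 scores are the ones quoted in this thread for 2.60 cvs.
BAYES_99_SCORE = {"local": 5.2, "network": 3.008}
DEFAULT_THRESHOLD = 5.0


def is_flagged(rule_scores, threshold=DEFAULT_THRESHOLD):
    """Return True when the summed rule scores reach the threshold.

    Assumption: a message is tagged when its total is >= the threshold.
    """
    return sum(rule_scores) >= threshold


# A message that hits BAYES_99 and no other rule.
local_only = is_flagged([BAYES_99_SCORE["local"]])
with_network = is_flagged([BAYES_99_SCORE["network"]])
print("local mode flagged:", local_only)
print("network mode flagged:", with_network)
```

In other words, with the local-mode score a single BAYES_99 hit is enough on
its own to cross the default cutoff, while with the network-mode score at
least one other rule has to fire as well.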

> A second and less critical problem may be your use of 2.60 and its not
> yet statistically validated scores. This will remain less important as
> long as you have ham with Bayes scores 90% and over.

This could be the case, but I haven't noticed many (if any) ill effects. I
tweak a few of SA's scores in my local.cf file anyway.

Thank you for your support, <g>
  - Gary



