Joe Flowers wrote to users@spamassassin.apache.org:

Help please!

If the average spam score of all of my ham messages is 1.0 and the average spam score of all of my spam messages is 3.0, then what is the best way to move the average of these two averages (2.0) back up to 5.0?

The result being that I need my current average score for ham messages to be "4" and my current average score for spam messages to be "6". And, I need to do this without screwing up the relative statistics of spamassassin.

Hmm... After reading this thread, I think you *do* have a good question here, and you have already had some good answers, but I'd like to add a bit.

You make a valid point in that, if graphed separately, ham and spam
should show up as two separate curves. However, there *is* overlap, and
the scores (ham and spam separately, or both together) are *not*
normally distributed. They don't have to be for you to calculate the
mean of the means, but if you use that figure as your cut-off, you're
going to get a great many false positives.

What you really should do is decide how many false positives you (and
your users) can live with. For us, it's 1/2000 (0.05%, one twentieth of
a percent). For this, you don't even need a spam corpus. Just collect a
good ham corpus (to get 0.05%, you need at least 2000 ham) and look at
the SA scores. Choose your threshold (or your constant modifier) so that
it hits fewer than 1 in 2000 ham messages, and re-check regularly.
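
As a rough illustration of the procedure above, here is a minimal Python
sketch. It assumes you have already extracted one SA score per message
from your ham corpus into a list. The function names and the one_in
parameter are made up for this example, and it assumes a message is
flagged when its score is greater than or equal to the threshold.

```python
def false_positives(ham_scores, threshold):
    """Count ham messages that would be flagged at this threshold."""
    # Assumption: a message is flagged when score >= threshold.
    return sum(1 for score in ham_scores if score >= threshold)


def score_to_exceed(ham_scores, one_in=2000):
    """Return the ham score your threshold must sit strictly above.

    With the threshold set just above the returned value, fewer than
    1 in `one_in` ham messages in this corpus are flagged.
    """
    n = len(ham_scores)
    if n < one_in:
        # A ratio of 1 in `one_in` cannot be measured on a smaller corpus.
        raise ValueError(f"need at least {one_in} ham messages, got {n}")
    # Largest count k of false positives with k / n < 1 / one_in.
    allowed = (n - 1) // one_in
    ranked = sorted(ham_scores, reverse=True)
    # Only the `allowed` highest-scoring ham may reach the threshold,
    # so the threshold has to be above the next score down.
    return ranked[allowed]


if __name__ == "__main__":
    # Made-up scores: 1997 low-scoring ham plus three awkward ones.
    ham = [0.1] * 1997 + [4.2, 5.5, 6.1]
    floor = score_to_exceed(ham)
    print("set the threshold above", floor)
    print("false positives at 5.0:", false_positives(ham, 5.0))
```

Re-running this against a fresh ham corpus from time to time is one way
to do the regular re-check.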

You can cross-check this with a spam corpus, if you want to balance FPs
against FNs (if you're well below your maximum FP ratio, you have some
room to play).
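
The cross-check could be sketched along these lines, again in Python and
again with made-up names (error_rates, lowest_acceptable_threshold,
candidates, one_in). It assumes a message is flagged when its score is
greater than or equal to the threshold, and it simply picks the lowest
candidate threshold that stays under the maximum FP ratio, since a lower
threshold can only catch more spam.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) for one threshold."""
    # Assumption: a message is flagged when score >= threshold.
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp / len(ham_scores), fn / len(spam_scores)


def lowest_acceptable_threshold(ham_scores, spam_scores, candidates,
                                one_in=2000):
    """Return (threshold, fp_rate, fn_rate) for the lowest candidate
    that flags fewer than 1 in `one_in` ham, or None if none qualifies.
    """
    n_ham = len(ham_scores)
    for threshold in sorted(candidates):
        fp = sum(1 for s in ham_scores if s >= threshold)
        # Integer form of fp / n_ham < 1 / one_in, avoiding float error.
        if fp * one_in < n_ham:
            fp_rate, fn_rate = error_rates(ham_scores, spam_scores,
                                           threshold)
            return threshold, fp_rate, fn_rate
    return None


if __name__ == "__main__":
    # Made-up corpora, for illustration only.
    ham = [0.5] * 1998 + [4.5, 5.2]
    spam = [3.0, 4.8, 6.0, 9.0]
    print(lowest_acceptable_threshold(ham, spam, [4.0, 5.0, 5.5, 6.0]))
```

If the FN rate at that threshold is higher than you like, the remedy is
better rules or training rather than a lower threshold, since lowering
it would break the FP limit.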

We get a lot less than 1/2000 FPs (usually 0), but 1/2000 is the maximum
ratio we'd allow before increasing the threshold.

- Ryan

--
  Ryan Thompson <[EMAIL PROTECTED]>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America