Re: update on floating dividing score between spam and ham messages

Joe Flowers Mon, 11 Jul 2005 10:25:36 -0700

Thanks Jason!

That's good, new info for me. That'll help me *at the very least* visualize what I am trying to do a little better. I've been very curious to know what the rough shapes of those graphs look like.


Joe



Justin Mason wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.

If you draw a graph of hams and spams, plotting the number of mails in
each category on the vertical axis and the score they get on the
horizontal axis, you don't get a simple pair of intersecting straight
lines.

Instead, since we have many more mark-as-spam rules than mark-as-ham,
and due to how the perceptron attempts to optimise for the 5.0
threshold, you get two lines with quite different shapes.

The ham line is a sigmoid curve that starts high in the negative area,
and curves down to almost 0 at the 5.0 threshold mark.  The spam line, by
contrast, is a straight line.
http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
this, or take a look at
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
for real-world graphs of this data from 2002 -- although graphing
the inverse.
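
For anyone who wants to see the rough shapes from their own mail, here is
a minimal sketch of how the counts behind such a graph could be gathered,
assuming you already have a list of scores for each category. The name
score_histogram, the bin width, and the sample scores are made up for
illustration; none of this is SpamAssassin output or part of any
SpamAssassin tool.

```python
import math
from collections import Counter


def score_histogram(scores, bin_width=1.0):
    """Count messages per score bin; each key is the lower edge of a bin."""
    counts = Counter()
    for s in scores:
        # Floor so that negative scores fall into the correct bin too.
        counts[math.floor(s / bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Made-up scores, for illustration only (not real SpamAssassin data).
ham_scores = [-2.3, -1.1, -0.4, 0.2, 0.9, 1.5, 4.1]
spam_scores = [3.2, 5.5, 7.8, 9.1, 12.4, 15.0]

ham_hist = score_histogram(ham_scores)
spam_hist = score_histogram(spam_scores)

# One row per score bin: horizontal axis is the score, the counts are
# what would go on the vertical axis.
for edge in sorted(set(ham_hist) | set(spam_hist)):
    print(f"{edge:6.1f}  ham={ham_hist.get(edge, 0)}  spam={spam_hist.get(edge, 0)}")
```

Feeding the two columns into any plotting tool gives the pair of lines
described above; a narrower bin_width shows more detail around the 5.0
threshold, provided there are enough messages to fill the bins.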

Very interesting approach though!

- --j.
