Thanks Jason!
That's good, new info for me. That'll help me *at the very least*
visualize what I am trying to do a little better. I've been very curious
to know what the rough shapes of those graphs look like.
Joe
Justin Mason wrote:
There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.
If you draw a graph of hams and spams, plotting the number of mails in
each category on the vertical axis and the score they get on the
horizontal axis, you don't get a simple pair of intersecting straight
lines.
Instead, since we have many more mark-as-spam rules than mark-as-ham,
and due to how the perceptron attempts to optimise for the 5.0
threshold, what happens is that you get two differently shaped lines.
The ham line is a sigmoid curve that starts high in the negative area
and curves down to almost 0 at the 5.0 threshold mark. The spam line, by
contrast, is a straight line.
http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
this, or take a look at
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
for real-world graphs of this data from 2002 -- although graphing
the inverse.
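
For anyone who wants to play with the shapes described above, here is a
rough Python sketch. It is only an illustration, not real SpamAssassin
data: the function names, the peak of 1000 mails, the steepness, and the
slope are all made-up assumptions, and I have assumed the spam line rises
with score, which the description above does not actually state.

```python
import math

THRESHOLD = 5.0  # the spam threshold mentioned above


def ham_count(score, peak=1000.0, midpoint=0.0, steepness=1.5):
    """Hypothetical ham curve: a falling sigmoid.

    High for negative scores, close to 0 by the 5.0 threshold.
    peak, midpoint and steepness are invented for illustration.
    """
    return peak / (1.0 + math.exp(steepness * (score - midpoint)))


def spam_count(score, slope=40.0, start=0.0):
    """Hypothetical spam line: a straight line (assumed to rise with score).

    slope and start are invented for illustration.
    """
    return max(0.0, slope * (score - start))


def ascii_plot(lo=-10, hi=20, step=2, scale=20.0):
    """Print a crude text plot: 'h' bars for ham, 's' bars for spam."""
    for score in range(lo, hi + 1, step):
        h = "h" * int(ham_count(score) / scale)
        s = "s" * int(spam_count(score) / scale)
        marker = " <- threshold region" if abs(score - THRESHOLD) <= step / 2 else ""
        print(f"{score:>4}  {h}{s}{marker}")


if __name__ == "__main__":
    ascii_plot()
```

Changing steepness or slope shows how much the overlap near the threshold
depends on the assumed shapes, so treat the output as a doodle only.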
Very interesting approach though!
--j.