-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Joe Flowers writes:
>  > You make a valid point in that, if graphed separately, ham and spam
>  > should show up as two separate curves on a graph.
> 
>  > However, there *is* overlap,
> 
> Yes, I expect overlap or SA would be perfect with no FPs or FNs.
> 
>  > and spam and ham (separately, or together) scores are *not* normally
>  > distributed.
> 
> I was thinking about and deferring to the Central Limit Theorem:
> "The conclusion of the theorem about the sampling distribution being 
> approximately normal in shape applies no matter what the shape of the 
> population distribution. For large sample sizes, the sampling 
> distribution is approximately normal even if the population distribution 
> is highly skewed or U-shaped." "The CLT can be proved theoretically 
> using advanced mathematical arguments."

This (logarithmic) graph from SpamAssassin 2.50 might be interesting:
http://spamassassin.apache.org/presentations/SAGE_IE_2002/mgp00015.html
The curves are by no means symmetrical....

You can estimate score distributions from the stats in
rules/STATISTICS*.txt, which is how that graph was generated.
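
A rough sketch of that kind of estimate, in Python. The rule names, scores
and hit rates below are made up for illustration and are not taken from
any real STATISTICS file. The sketch also assumes that rules fire
independently of each other, which real rules do not, so treat the result
as a first approximation only.

```python
from collections import defaultdict

# Hypothetical (rule name, score, fraction of spam hit, fraction of ham hit).
# All values are invented; real ones would come from the per-rule hit rates
# listed in rules/STATISTICS*.txt.
RULES = [
    ("RULE_A", 2.5, 0.40, 0.01),
    ("RULE_B", 1.0, 0.60, 0.05),
    ("RULE_C", 3.0, 0.25, 0.001),
    ("RULE_D", -1.5, 0.02, 0.30),
]

SPAM_COL, HAM_COL = 2, 3


def score_distribution(rules, column):
    """Exact distribution of the total score as {score: probability}.

    Assumes each rule fires independently with the hit rate in `column`.
    """
    dist = {0.0: 1.0}
    for rule in rules:
        score, p = rule[1], rule[column]
        nxt = defaultdict(float)
        for total, prob in dist.items():
            nxt[total] += prob * (1.0 - p)               # rule does not fire
            nxt[round(total + score, 3)] += prob * p     # rule fires
        dist = dict(nxt)
    return dist


def mean(dist):
    return sum(s * p for s, p in dist.items())


def skewness(dist):
    """Third standardized moment; zero for a symmetrical distribution."""
    m = mean(dist)
    var = sum(p * (s - m) ** 2 for s, p in dist.items())
    third = sum(p * (s - m) ** 3 for s, p in dist.items())
    return third / var ** 1.5


spam = score_distribution(RULES, SPAM_COL)
ham = score_distribution(RULES, HAM_COL)

for label, dist in (("spam", spam), ("ham", ham)):
    print(label, "mean=%.3f" % mean(dist), "skew=%.3f" % skewness(dist))
    for s in sorted(dist):
        print("  %6.2f  %.5f" % (s, dist[s]))
```

Even with these toy numbers the two curves overlap (both put some
probability on a score of 0) and the spam curve is skewed rather than
symmetrical.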

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBOq/hQTcbUG5Y7woRAuY4AKDKqYCrnm9OfUjlqiTC4Ma3o7fTUQCdHjh3
i/eOWxuen4QmCU9OwpRkmzs=
=Nauk
-----END PGP SIGNATURE-----
