Joe Flowers writes:
> > You make a valid point in that, if graphed separately, ham and spam
> > should show up as two separate curves on a graph.
>
> > However, there *is* overlap,
>
> Yes, I expect overlap or SA would be perfect with no FPs or FNs.
>
> > and spam and ham (separately, or together) scores are *not* normally
> > distributed.
>
> I was thinking about and deferring to the Central Limit Theorem:
> "The conclusion of the theorem about the sampling distribution being
> approximately normal in shape applies no matter what the shape of the
> population distribution. For large sample sizes, the sampling
> distribution is approximately normal even if the population distribution
> is highly skewed or U-shaped." "The CLT can be proved theoretically
> using advanced mathematical arguments."

This (logarithmic) graph from SpamAssassin 2.50 might be interesting:

http://spamassassin.apache.org/presentations/SAGE_IE_2002/mgp00015.html

The curves are by no means symmetrical....

You can estimate score distributions from the stats in
rules/STATISTICS*.txt, which is how that was generated.

--j.
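
For anyone who wants to see the distinction in the quoted CLT passage, here is a minimal Python sketch. It is not how SpamAssassin computes anything: the exponential distribution is only a hypothetical stand-in for skewed per-message scores, and the names `skewness` and `simulate` are made up for illustration. The point it tries to show is that the theorem describes the distribution of sample *means*, while the individual scores keep whatever shape they had.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third standardized moment (about 0 for a symmetric shape)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(seed=0, population_size=20000, sample_size=100, num_samples=2000):
    """Return (skewness of individual scores, skewness of sample means)."""
    rng = random.Random(seed)
    # ASSUMPTION: a right-skewed exponential stands in for message scores.
    scores = [rng.expovariate(1.0) for _ in range(population_size)]
    # Sampling distribution: means of repeated random samples of the scores.
    means = [
        statistics.fmean(rng.choices(scores, k=sample_size))
        for _ in range(num_samples)
    ]
    return skewness(scores), skewness(means)


if __name__ == "__main__":
    raw_skew, mean_skew = simulate()
    print(f"skewness of individual scores: {raw_skew:.2f}")
    print(f"skewness of sample means:      {mean_skew:.2f}")
```

The individual scores should stay strongly skewed while the sample means come out much closer to symmetric, so the CLT by itself says nothing about the shape of the per-message score curves.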