> The biggest problem with a score-based system with an abrupt cutoff is
> the uncertainty around the threshold. If the GA currently thinks it's
> OK for a ham to score 4.9 and still be called ham, and for a spam to
> score 5.1 and still be called spam, it's not going to make as much
> effort to get a clean separation of scores as it would if you told it,
> "OK, make sure ham is below 4 and spam is above 6 as much as possible".
> Or am I missing something?
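As a rough illustration of the split-threshold idea quoted above, here is a sketch in Python. It is not SpamAssassin's actual GA code: the function names, the hinge-style penalty, and the sample scores are all made up for this example, and it assumes a message is called spam when its score is at or above the threshold.

```python
def count_errors(ham_scores, spam_scores, threshold=5.0):
    """Count (false positives, false negatives) at a single cutoff.

    Assumption: a score at or above the threshold is called spam.
    """
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp, fn


def margin_penalty(ham_scores, spam_scores, ham_max=4.0, spam_min=6.0):
    """Hypothetical GA cost: how far messages sit inside the 4..6 gap.

    Ham above ham_max and spam below spam_min are penalised in
    proportion to how far they miss their target.
    """
    ham_cost = sum(max(0.0, s - ham_max) for s in ham_scores)
    spam_cost = sum(max(0.0, spam_min - s) for s in spam_scores)
    return ham_cost + spam_cost


# Made-up scores, including the 4.9 ham and 5.1 spam from the example.
ham = [1.2, 3.0, 4.9]
spam = [5.1, 6.5, 8.0]

errors_at_5 = count_errors(ham, spam)             # single 5/5 cutoff
cost_5_5 = margin_penalty(ham, spam, 5.0, 5.0)    # 5/5 used as the target
cost_6_4 = margin_penalty(ham, spam)              # 6/4 used as the target
```

With these scores the single cutoff reports no errors and the 5/5 target gives a cost of zero, so an optimiser has no reason to move anything. The 6/4 target still charges for the 4.9 ham and the 5.1 spam, which is the extra pressure towards separation described above.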
Another point I forgot to mention is that a split threshold like this may not show up directly in the statistics generated during the GA run. The GA is optimizing the scores to match the specific corpus (or corpora) it is running against, so you would probably get much the same statistics; only the score distribution of the individual messages would be slightly different. (In other words, the statistics are determined largely by the ruleset, Bayes performance and so on, rather than by the specific threshold chosen.)

However, if you then did a statistics run on an *independent* corpus of similar but (mostly) non-overlapping messages, ones the GA hasn't had a chance to optimize the scores for, I think there would be a definite improvement in the FN/FP rate on that corpus with scores generated using the 6/4 threshold instead of 5/5.

Is anybody able to blow holes in my theory, or suggest a way of proving it?

Regards,
Simon

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk