I was just thinking about the GA (genetic algorithm) process, and although I haven't looked at it to see exactly how it works, I was wondering the following.
Presumably it starts with a certain scoreset, runs the spam through and sees what percentage scores above 5, then runs the ham through and sees what percentage scores below 5, and then adjusts the scores with some iterative algorithm to improve the results.

Is there any special reason why the same threshold of 5 is used for testing both ham and spam during the GA process? (Note: I'm assuming it currently is.) What would happen if it used a threshold of 6 when evaluating the hit rate for spam and a threshold of 4 when evaluating ham, so that instead of trying to get spam over 5 and ham under 5, the iterative process tried to get spam over 6 and ham below 4?

Could this reduce the number of messages falling into the "uncertainty" area immediately around the threshold? Or will some other factor cancel it out, so that you just end up with *different* messages in the uncertain area?

The biggest problem with a score-based system with an abrupt cutoff is the uncertainty around the threshold. If the GA currently thinks it's OK for a ham to score 4.9 and still be called ham, and for a spam to score 5.1 and still be called spam, it's not going to make as much effort to get a clean separation of scores as it would if you told it, "OK, make sure ham is below 4 and spam is above 6 as much as possible."

Or am I missing something?

Just an idea :)

Regards,
Simon

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
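
To make the idea concrete, here is a minimal Python sketch of the two ways of counting misses. It is not taken from the actual GA code, which I haven't read. The function name `misses`, the parameters `ham_target` and `spam_target`, and the sample scores are all made up for illustration, and I'm assuming a message counts as spam when its score is at or above the threshold.

```python
def misses(ham_scores, spam_scores, ham_target=5.0, spam_target=5.0):
    """Count messages that fail their target (hypothetical fitness measure).

    A ham misses if it scores at or above ham_target.
    A spam misses if it scores below spam_target.
    With both targets equal, this is the single-threshold case.
    """
    ham_misses = sum(1 for s in ham_scores if s >= ham_target)
    spam_misses = sum(1 for s in spam_scores if s < spam_target)
    return ham_misses + spam_misses


# Made-up scores, purely illustrative.
HAM = [0.5, 2.1, 3.8, 4.2, 4.9]
SPAM = [5.1, 5.6, 6.4, 7.3, 12.0]

# Single threshold of 5: every message is on the "right" side,
# so an optimizer sees nothing left to improve.
print(misses(HAM, SPAM))            # 0

# Targets of 4 and 6: the four messages sitting close to 5
# (ham at 4.2 and 4.9, spam at 5.1 and 5.6) now count as misses.
print(misses(HAM, SPAM, 4.0, 6.0))  # 4
```

With the split targets, the optimizer would still have a reason to push those borderline messages further apart. Whether it can do that without moving other messages into the gap is exactly the open question.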