Simon Byrnand writes:
> I was just thinking about the GA process, and although I haven't looked at
> it to see exactly how it works, I was wondering the following...
>
> Presumably it starts with a certain scoreset, runs the spam through, sees
> what percentage scores above 5, then runs the ham through and sees what
> percentage scores below 5, and then fiddles with the scores to improve the
> results using some iterative algorithm.
>
> I was wondering, is there any special reason why the same threshold of 5
> is used for testing both ham and spam during the GA process? (Note: I'm
> assuming it currently is.)
>
> What would happen if it used a threshold of 6 when evaluating the hit rate
> for spam and a threshold of 4 when evaluating ham, and tried to optimize
> the scores so that instead of spam being over 5 and ham under 5, spam would
> try to be over 6 and ham below 4 during the iterative process?
>
> Could this possibly reduce the number of messages falling into the
> "uncertainty" area immediately around the threshold? Or will some other
> factor cancel it out, so you just end up with *different* messages in the
> uncertain area?
>
> The biggest problem with a score-based system with an abrupt cutoff is the
> uncertainty around the threshold. If the GA currently thinks it's OK for a
> ham to score 4.9 and still be called ham, and a spam to score 5.1 and
> still be called spam, it's not going to make as much effort to get a
> cleaner separation of scores as it would if you told it, "OK, make sure ham
> is below 4 and spam is above 6 as much as possible". Or am I missing
> something?
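To make the proposal concrete, here is a minimal Python sketch of the two objectives being compared. It is not taken from the actual GA code: the function names, rule names, scores, and the tiny corpus are all made up for illustration, and it assumes the objective is a plain count of misclassified messages.

```python
# Hypothetical sketch: none of these names come from the SpamAssassin GA.

def message_score(hits, scores):
    """Sum the scores of the rules that fired on one message."""
    return sum(scores[rule] for rule in hits)


def count_errors(spam, ham, scores, spam_threshold=5.0, ham_threshold=5.0):
    """Count messages on the wrong side of their class threshold.

    With both thresholds at 5.0 this is the single-cutoff objective.
    With spam_threshold=6.0 and ham_threshold=4.0, any message that
    lands in the band between 4 and 6 also counts as an error, so an
    optimizer minimizing this count is pushed toward wider separation.
    """
    missed_spam = sum(1 for hits in spam
                      if message_score(hits, scores) < spam_threshold)
    flagged_ham = sum(1 for hits in ham
                      if message_score(hits, scores) >= ham_threshold)
    return missed_spam + flagged_ham


# Made-up rule names, scores, and corpus, for illustration only.
scores = {"RULE_A": 3.0, "RULE_B": 2.5, "RULE_C": 1.0}
spam = [["RULE_A", "RULE_B"],                # scores 5.5
        ["RULE_A", "RULE_B", "RULE_C"]]      # scores 6.5
ham = [["RULE_C"],                           # scores 1.0
       ["RULE_A", "RULE_C"],                 # scores 4.0
       ["RULE_B", "RULE_C", "RULE_C"]]       # scores 4.5

single_cutoff_errors = count_errors(spam, ham, scores)
margin_errors = count_errors(spam, ham, scores, 6.0, 4.0)
print(single_cutoff_errors, margin_errors)
```

On this toy corpus the single cutoff at 5 sees nothing to fix, while the 6/4 version counts the three messages sitting between 4 and 6 as errors. Whether optimizing against the wider band really thins out the uncertain region, or just moves different messages into it, is exactly the open question; the sketch only shows how the objective would change.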
Interesting idea... it would be worthwhile trying this out.

--j.

Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk