> Simon Byrnand writes:
>> I was just thinking about the GA process, and although I haven't looked
>> at it to see exactly how it works, I was wondering the following....
>>
>> Presumably it starts with a certain scoreset, runs the spam through,
>> sees what percentage scores above 5, then runs the ham through and sees
>> what percentage scores below 5, and then fiddles with the scores to
>> improve the results using some iterative algorithm.
>>
>> I was wondering, is there any special reason why the same threshold of 5
>> is used for testing both ham and spam during the GA process? (Note: I'm
>> assuming it currently is.)
>>
>> What would happen if it used a threshold of 6 when evaluating the
>> hit rate for spam, and a threshold of 4 for evaluating ham, and tried to
>> optimize scores so that instead of spam being over 5 and ham under 5,
>> spam would try to be over 6 and ham below 4 during the iterative process?
>>
>> Could this possibly reduce the number of messages falling into the
>> "uncertainty" area immediately around the threshold? Or will some other
>> factor cancel it out so you just end up with *different* messages in the
>> uncertain area?
>>
>> The biggest problem with a score-based system with an abrupt cutoff is
>> the uncertainty around the threshold. If the GA currently thinks it's ok
>> for a ham to score 4.9 and still be called ham, and a spam to score 5.1
>> and still be called spam, it's not going to make as much effort to get a
>> clean separation of scores as it would if you told it, "ok, make sure ham
>> is below 4 and spam is above 6 as much as possible". Or am I missing
>> something?
>
> Interesting idea... it would be worthwhile trying this out.

Thanks. I guess another way of describing it would be to say that the GA
should be considering three score regions instead of two: ham, uncertain,
and spam. In my example a deadband exists between 4 and 6, and the GA
should be trying its best to keep *all* messages out of that range.

That way, as the scoreset gradually gets out of date and/or variations of
messages appear, you're theoretically less likely to get crossover between
spam and ham with marginal scores, because you have a bit of a safety zone
in the scoring.

That's my theory anyway... I'll leave it to those that actually know what
they're doing to see if it does actually work ;)

Regards,
Simon
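
P.S. To make the deadband idea a bit more concrete, here is a rough sketch
in Python of the kind of cost function I have in mind. It is only an
illustration: I haven't looked at how the real GA evaluates a scoreset, so
the function names, the rule names (RULE_A etc.) and the scores are all
made up, and I'm assuming a message's score is simply the sum of the scores
of the rules it hits.

```python
def message_score(hits, scoreset):
    """Score of one message: sum of the scores of the rules it hit (assumed)."""
    return sum(scoreset[rule] for rule in hits)


def deadband_cost(ham, spam, scoreset, ham_max=4.0, spam_min=6.0):
    """Count messages that miss the stricter targets.

    A ham counts as a miss if it scores ham_max or more; a spam counts as a
    miss if it scores spam_min or less. Setting both to 5.0 gives the
    ordinary single-threshold count.
    """
    ham_misses = sum(1 for hits in ham
                     if message_score(hits, scoreset) >= ham_max)
    spam_misses = sum(1 for hits in spam
                      if message_score(hits, scoreset) <= spam_min)
    return ham_misses + spam_misses


# Hypothetical scoreset and corpus, for illustration only.
scoreset = {"RULE_A": 2.5, "RULE_B": 2.0, "RULE_C": 1.0, "RULE_D": 3.0}
ham = [{"RULE_C"}, {"RULE_A", "RULE_B"}]               # scores 1.0 and 4.5
spam = [{"RULE_A", "RULE_B", "RULE_C"},                # score 5.5
        {"RULE_A", "RULE_B", "RULE_D"}]                # score 7.5

single = deadband_cost(ham, spam, scoreset, ham_max=5.0, spam_min=5.0)
banded = deadband_cost(ham, spam, scoreset)
print(single, banded)
```

With a single threshold of 5 this scoreset looks perfect (no misses), but
with the 4/6 deadband the 4.5 ham and the 5.5 spam both count against it,
so an optimizer minimizing this cost would have a reason to push them
further apart.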