Simon Byrnand writes:
> I was just thinking about the GA process and although I haven't looked at
> it to see exactly how it works, I was wondering the following...
> 
> Presumably it starts with a certain scoreset, runs the spam through, sees
> what percentage scores above 5, then runs the ham through and sees what
> percentage scores below 5, and then fiddles with the scores to improve the
> results using some iterative algorithm.
> 
> I was wondering, is there any special reason why the same threshold of 5
> is used for testing both ham and spam during the GA process? (Note: I'm
> assuming it currently is.)
> 
> What would happen if it used a threshold of 6 when evaluating the hit rate
> for spam, and a threshold of 4 when evaluating ham, and tried to optimize
> scores so that instead of spam being over 5 and ham under 5, spam would
> try to be over 6 and ham below 4 during the iterative process?
> 
> Could this possibly reduce the number of messages falling into the
> "uncertainty" area immediately around the threshold? Or will some other
> factor cancel it out so you just end up with *different* messages in the
> uncertain area?
> 
> The biggest problem with a score-based system with an abrupt cutoff is the
> uncertainty around the threshold. If the GA currently thinks it's OK for a
> ham to score 4.9 and still be called ham, and a spam to score 5.1 and
> still be called spam, it's not going to make as much effort to get a
> clean separation of scores as it would if you told it, "OK, make sure ham
> is below 4 and spam is above 6 as much as possible". Or am I missing
> something?

Interesting idea... it would be worthwhile trying this out.
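
To make the proposal concrete, here is a rough sketch in Python of how the
error count could change when the two classes are judged against separate
thresholds. It is not the actual GA code: the function name, the sample
totals, and the convention that a message counts as spam when its total is
at or above the threshold are all assumptions made for illustration.

```python
def misclassified(spam_totals, ham_totals,
                  spam_threshold=5.0, ham_threshold=5.0):
    """Count messages on the wrong side of their own class's threshold.

    Hypothetical helper for illustration only. A spam is counted as missed
    if its total is below spam_threshold; a ham is counted as flagged if
    its total is at or above ham_threshold.
    """
    missed_spam = sum(1 for total in spam_totals if total < spam_threshold)
    flagged_ham = sum(1 for total in ham_totals if total >= ham_threshold)
    return missed_spam, flagged_ham


# Made-up per-message totals, not real corpus results.
spam_totals = [7.2, 5.1, 6.4, 4.8]
ham_totals = [0.3, 4.9, 2.1, 3.5]

# Single threshold of 5 for both classes.
single = misclassified(spam_totals, ham_totals)

# Stricter targets during optimization: spam over 6, ham under 4.
margin = misclassified(spam_totals, ham_totals,
                       spam_threshold=6.0, ham_threshold=4.0)

print("single threshold:", single)
print("6/4 margin:      ", margin)
```

With the single threshold, the spam at 5.1 and the ham at 4.9 both count as
correct, so an optimizer has no reason to move them. With the 6/4 targets
they count as errors, which is the extra pressure towards a cleaner
separation that the proposal is after. Whether that helps on a real corpus
is exactly what would need testing.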

--j.


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk