I was just thinking about the GA process, and although I haven't looked at
it to see exactly how it works, I was wondering about the following.

Presumably it starts with a certain scoreset, runs the spam through, sees
what percentage scores above 5, then runs the ham through and sees what
percentage scores below 5, and then fiddles with the scores to improve the
results using some iterative algorithm.
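
To make that guess concrete, here is a rough Python sketch of the
evaluation step as I imagine it. I haven't read the real GA code, so the
function names, the data layout (each message as a list of the rule names
it hit), and the treatment of a score of exactly 5 as spam are all my own
assumptions.

```python
def message_score(rules_hit, scores):
    """Total score of one message: the sum of the scores of the rules it hit."""
    return sum(scores[rule] for rule in rules_hit)


def evaluate(scores, spam, ham, threshold=5.0):
    """Return (spam hit rate, ham pass rate) for one scoreset.

    Assumption: a message at or above the threshold counts as spam,
    and a message below it counts as ham.
    """
    spam_caught = sum(1 for m in spam if message_score(m, scores) >= threshold)
    ham_passed = sum(1 for m in ham if message_score(m, scores) < threshold)
    return spam_caught / len(spam), ham_passed / len(ham)
```

The iterative part would then adjust the values in `scores` and call
`evaluate` again, keeping whichever changes improve both rates.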

I was wondering, is there any special reason why the same threshold of 5
is used for testing both ham and spam during the GA process? (Note: I'm
assuming it currently is.)

What would happen if it used a threshold of 6 when evaluating the hit rate
for spam and a threshold of 4 when evaluating ham? The iterative process
would then try to optimize the scores so that spam lands above 6 and ham
below 4, instead of spam above 5 and ham below 5.

Could this possibly reduce the number of messages falling into the
"uncertainty" area immediately around the threshold? Or will some other
factor cancel it out, so you just end up with *different* messages in the
uncertain area?
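
As a sketch of what I mean, the evaluation could use two thresholds and
also report how many messages end up in the band between them. This is
only an illustration of the idea, not how the GA actually works: the
function names are made up, and it takes the already-computed total score
of each message as input.

```python
def evaluate_with_margin(spam_totals, ham_totals,
                         spam_threshold=6.0, ham_threshold=4.0):
    """Return (spam hit rate, ham pass rate) using separate thresholds.

    Spam only counts as a hit at or above spam_threshold; ham only
    counts as passed below ham_threshold.
    """
    spam_hits = sum(1 for s in spam_totals if s >= spam_threshold)
    ham_passes = sum(1 for s in ham_totals if s < ham_threshold)
    return spam_hits / len(spam_totals), ham_passes / len(ham_totals)


def uncertain_fraction(totals, low=4.0, high=6.0):
    """Fraction of messages whose total score falls in the band [low, high)."""
    in_band = sum(1 for s in totals if low <= s < high)
    return in_band / len(totals)
```

Comparing `uncertain_fraction` for a scoreset optimized against the single
threshold of 5 with one optimized against 4 and 6 would show whether the
band really empties out or just fills with different messages.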

The biggest problem with a score-based system with an abrupt cutoff is the
uncertainty around the threshold. If the GA currently thinks it's OK for a
ham to score 4.9 and still be called ham, and a spam to score 5.1 and
still be called spam, it's not going to make as much effort to get a
clean separation of scores as it would if you told it, "OK, make sure ham
is below 4 and spam is above 6 as much as possible". Or am I missing
something?

Just an idea :)

Regards,
Simon



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
