Re: [SAtalk] Question about the GA...

Simon Byrnand Wed, 27 Aug 2003 01:47:48 +0000

>
> Simon Byrnand writes:
>> I was just thinking about the GA process and although I haven't looked at
>> it to see exactly how it works, I was wondering the following....
>>
>> Presumably it starts with a certain scoreset, runs the spam through, sees
>> what percentage scores above 5, then runs the ham through and sees what
>> percentage scores below 5, and then fiddles with the scores to improve the
>> results using some iterative algorithm.
>>
>> I was wondering, is there any special reason why the same threshold of 5
>> is used for testing both ham and spam during the GA process? (Note: I'm
>> assuming it currently is)
>>
>> What would happen if it used a threshold of 6 when evaluating the hitrate
>> for spam, and a threshold of 4 for evaluating ham, and tried to optimize
>> scores so that instead of spam being over 5 and ham under 5, spam would
>> try to be over 6 and ham below 4 during the iterative process?
>>
>> Could this possibly reduce the number of messages falling into the
>> "uncertainty" area immediately around the threshold? Or will some other
>> factor cancel it out so you just end up with *different* messages in the
>> uncertain area?
>>
>> The biggest problem with a score-based system with an abrupt cutoff is the
>> uncertainty around the threshold. If the GA currently thinks it's ok for a
>> ham to score 4.9 and still be called ham, and a spam to score 5.1 and
>> still be called spam, it's not going to make as much effort to get a
>> cleaner separation of scores as it would if you told it, "ok make sure ham
>> is below 4 and spam is above 6 as much as possible". Or am I missing
>> something?
>
> Interesting idea... it would be worthwhile trying this out.


Thanks..

I guess another way of describing it would be to say that the GA should be
considering three score regions instead of two - ham, uncertain, and spam.

In my example a deadband exists between 4 and 6, and the GA should be trying
its best to keep *all* messages out of that range....
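
A minimal sketch of the three-region idea, assuming the GA can be handed a
fitness function over per-message scores. This is not SpamAssassin's actual
GA code; the names region, deadband_fitness, HAM_MAX and SPAM_MIN are
hypothetical and exist only for illustration.

```python
# Hypothetical sketch: not taken from SpamAssassin's GA implementation.
HAM_MAX = 4.0   # ham should score below this
SPAM_MIN = 6.0  # spam should score above this


def region(score, ham_max=HAM_MAX, spam_min=SPAM_MIN):
    """Place a message score in one of three regions."""
    if score < ham_max:
        return "ham"
    if score > spam_min:
        return "spam"
    return "uncertain"  # the deadband, boundaries included


def deadband_fitness(ham_scores, spam_scores,
                     ham_max=HAM_MAX, spam_min=SPAM_MIN):
    """Count messages that miss their target region (lower is better).

    A ham at 4.9 or a spam at 5.1 counts as a miss here, although both
    would be classified correctly with a single threshold of 5.
    """
    ham_misses = sum(1 for s in ham_scores if s >= ham_max)
    spam_misses = sum(1 for s in spam_scores if s <= spam_min)
    return ham_misses + spam_misses
```

With these assumed thresholds, a corpus of ham scoring [1.0, 4.9] and spam
scoring [5.1, 9.0] has two misses, both of them messages sitting in the
deadband.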

That way as the scoreset gradually gets out of date and/or variations of
messages appear, you're theoretically less likely to get crossover between
spam/ham of marginal scores, because you have a bit of a safety zone in
the scoring.
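
The safety-zone argument can be illustrated with a toy model, assuming a
stale scoreset shows up as a uniform worst-case drift: every ham score rises
and every spam score falls by the same amount. The function name crossovers
and the drift model are assumptions made for this sketch, not measurements.

```python
# Hypothetical toy model of score drift; not based on real corpus data.
def crossovers(ham_scores, spam_scores, drift, threshold=5.0):
    """Count messages misclassified at `threshold` after worst-case drift.

    Ham is assumed to drift up by `drift` and spam down by `drift`.
    A message scoring at or above the threshold is treated as spam.
    """
    ham_flipped = sum(1 for s in ham_scores if s + drift >= threshold)
    spam_flipped = sum(1 for s in spam_scores if s - drift < threshold)
    return ham_flipped + spam_flipped
```

Under this model, a ham at 4.9 and a spam at 5.1 both cross over with a
drift of 0.5, while a ham at 3.9 and a spam at 6.1 stay on the correct side
of 5.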

That's my theory anyway... I'll leave it to those that actually know what
they're doing to see if it does actually work ;)

Regards,
Simon



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk