mouss wrote:
> Clay Davis wrote:
> 
>> Can someone give a quick explanation for the reason for having 4
>> different scores on some of the SA rules and which column is used for
>> what?
> 
> 
> different scores are used depending on whether you enable network tests
> and/or bayes.
> 
> I find this annoying (bad for usability). the same genetic algo that
> could find the 4 scores could find one score, unless someone could
> convince me of the theoretical foundations behind having 4 scores (which
> I doubt).

Ask and your doubts shall be dispelled.

Of course it can't find one score. You have 4 completely different scenarios:
network tests on or off, crossed with Bayes on or off.

The first thing is to discard any notion that score assignment in SA is in any
way simple, or based on some linear equation. It isn't.


The key thing to know is that the rules are not GAed individually.
They are all simultaneously GAed as a combined set. The score of one rule
impacts the score of every other rule in the entire set. Add a rule, or take
away a rule, and *every* other score changes.

This is because the GA isn't assigning a rule a score based on that rule's
individual performance. It is trying to balance all the scores to maximize the
correct categorization of mail.

Probably the best way to think of it is as a balancing act. If you remove a
rule, other rules need their scores increased to take up the slack and keep the
average score in balance. If you add a rule, the points assigned to it will wind
up slightly lessening the scores of all the other rules that match the same mail
as it. And of course, those changes cascade to all the rules *they* overlap
with, and so on.
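That balancing act can be sketched with a toy fit. This is emphatically not
SpamAssassin's real GA: plain gradient descent toward a target total score
(5.0 per spam, a number I picked for illustration) stands in for the genetic
search, and the two-message "corpus" is invented. Remove a rule and refit, and
the surviving rule's score rises to cover the slack:

```python
def fit_scores(hits, targets, n_rules, steps=2000, lr=0.05):
    """Jointly fit rule scores so each message's summed score nears its target.

    hits[i] is the set of rule indices that match message i. Toy stand-in
    for SA's GA: gradient descent on squared error of the whole-set total.
    """
    scores = [1.0] * n_rules
    for _ in range(steps):
        for h, t in zip(hits, targets):
            err = sum(scores[r] for r in h) - t   # whole-set error for this message
            for r in h:
                scores[r] -= lr * err             # every matching rule shares the correction
    return scores

# Two overlapping rules; both hit every spam (target total score: 5.0).
with_both = fit_scores([{0, 1}, {0, 1}], [5.0, 5.0], n_rules=2)

# Remove rule 1 and refit: rule 0 alone must now reach the target.
rule0_only = fit_scores([{0}, {0}], [5.0, 5.0], n_rules=1)

print(with_both)    # roughly [2.5, 2.5]
print(rule0_only)   # roughly [5.0]
```

The point of the sketch: rule 0's score isn't a property of rule 0 alone; it
depends on which other rules are in the set.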

This whole-set balancing winds up generating a far better set of scores than a
naive single-rule performance metric would. It also tends to handle redundancy
among the rules automatically. If you inserted two identical rules with exactly
the same hits, a single-rule metric would effectively double the score;
whole-set tuning winds up splitting the score between the two.
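A minimal sketch of that doubling-versus-splitting, under the same assumptions
as before (a toy gradient-descent fit toward a made-up target of 5.0 per spam
stands in for the GA; the corpus is invented):

```python
def fit_scores(hits, targets, n_rules, steps=2000, lr=0.05):
    """Toy whole-set fit: hits[i] is the set of rule indices matching message i."""
    scores = [1.0] * n_rules
    for _ in range(steps):
        for h, t in zip(hits, targets):
            err = sum(scores[r] for r in h) - t
            for r in h:
                scores[r] -= lr * err
    return scores

# Rule 1 is an exact duplicate of rule 0: identical hits on every message.
joint = fit_scores([{0, 1}, {0, 1}], [5.0, 5.0], n_rules=2)

# Naive per-rule metric: score each copy alone against the same targets,
# ignoring the other rule entirely.
naive = [fit_scores([{0}, {0}], [5.0, 5.0], n_rules=1)[0] for _ in range(2)]

print(joint)        # roughly [2.5, 2.5] -- total stays near 5.0
print(naive)        # roughly [5.0, 5.0] -- total doubles to 10.0
```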

In practice, overlapping rules are usually not exactly the same and the GA winds
up heavily favoring the better-performing of the two and largely ignoring the
mostly-redundant but not quite as good rule. This is also very useful.
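The favoring effect shows up in the same toy fit (again: invented corpus,
gradient descent standing in for SA's actual GA). Give the weaker rule a strict
subset of the stronger rule's hits, and its fitted score collapses; in this
idealized sketch it goes to essentially zero, where the real GA is less
absolute:

```python
def fit_scores(hits, targets, n_rules, steps=2000, lr=0.05):
    """Toy whole-set fit: hits[i] is the set of rule indices matching message i."""
    scores = [1.0] * n_rules
    for _ in range(steps):
        for h, t in zip(hits, targets):
            err = sum(scores[r] for r in h) - t
            for r in h:
                scores[r] -= lr * err
    return scores

# Rule 0 hits all three spams; rule 1 is mostly redundant, hitting only
# two of the messages rule 0 already covers.
favored = fit_scores([{0, 1}, {0, 1}, {0}], [5.0, 5.0, 5.0], n_rules=2)

print(favored)   # roughly [5.0, 0.0] -- the better rule takes the weight
```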
