Matt Kettler wrote:
mouss wrote:

Clay Davis wrote:


Can someone give a quick explanation for the reason for having 4
different scores on some of the SA rules and which column is used for
what?


Different scores are used depending on whether you enable network tests
and/or Bayes. SA keeps one score set for each of the four combinations
(neither, network only, Bayes only, both) and uses the column that
matches your configuration.

I find this annoying (bad for usability). The same genetic algorithm
that could find the 4 scores could find one score, unless someone can
convince me of the theoretical foundations behind having 4 scores (which
I doubt).


Ask and your doubts shall be dispelled.

Of course it can't find one score. You have 4 completely different scenarios.

The first thing to do is discard any preconceptions you might have about score
assignments in SA being in any way simple or based on some linear equation. They
aren't.


The key thing to know is that the rules are not GAed individually.
They are all simultaneously GAed as a combined set. The score of one rule
impacts the score of every other rule in the entire set. Add a rule, or take
away a rule, and *every* other score changes.

This is because the GA isn't assigning a rule a score based on that rule's
individual performance. It is trying to balance all the scores to maximize the
correct categorization of mail.

Probably the best way to think of it is as a balancing act. If you remove a
rule, other rules need their scores increased to take up the slack and keep the
average score in balance. If you add a rule, the points assigned to it will wind
up slightly lessening the scores of all the other rules that match the same mail
as it. And of course, those changes cascade to all the rules they share matches
with, etc, etc, etc.
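
To make the whole-set idea concrete, here's a toy sketch in Python (this
is not the actual SA rescoring code; the corpus format and the crude
mutate-and-keep loop are invented purely for illustration):

    import random

    # Toy whole-set fitness: score every message with the ENTIRE score
    # vector and count misclassifications. "corpus" is a list of
    # (hit_rule_indices, is_spam) pairs -- an invented format.
    def fitness(scores, corpus, threshold=5.0):
        errors = 0
        for hits, is_spam in corpus:
            total = sum(scores[i] for i in hits)
            if (total >= threshold) != is_spam:
                errors += 1
        return errors

    # Toy mutate-and-keep loop (a hill climber, far cruder than a real
    # GA): perturb one rule's score, then re-evaluate the WHOLE corpus,
    # so every score is coupled to every other score.
    def evolve(scores, corpus, generations=10000):
        best = fitness(scores, corpus)
        for _ in range(generations):
            i = random.randrange(len(scores))
            old = scores[i]
            scores[i] += random.uniform(-0.5, 0.5)
            new = fitness(scores, corpus)
            if new <= best:
                best = new       # keep the mutation
            else:
                scores[i] = old  # revert it
        return scores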

This whole-set balancing winds up generating a far better set of scores than a
naive single-rule performance metric would. It tends to deal with redundancy in
the rules automatically. If you inserted two identical rules with exactly the
same hits, you would effectively double the score if you were basing it on
single-rule performance; whole-set tuning winds up spreading the score out
between the two.

In practice, overlapping rules are usually not exactly the same and the GA winds
up heavily favoring the better-performing of the two and largely ignoring the
mostly-redundant but not quite as good rule. This is also very useful.
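
Here's a toy demonstration of that redundancy point, reusing the
fitness() sketch above (the data is invented):

    # Two perfectly redundant rules: every spam trips both rule 0 and
    # rule 1; ham trips neither.
    corpus = [([0, 1], True)] * 100 + [([], False)] * 100

    # Only scores[0] + scores[1] matters to the whole-set fitness, so
    # the optimizer is free to spread the 5 points across the pair:
    print(fitness([5.0, 0.0], corpus))  # 0 errors
    print(fitness([2.5, 2.5], corpus))  # 0 errors
    # A naive per-rule metric would score each rule at 5.0 on its own
    # merits, effectively awarding 10 points whenever the pair fires.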


Thanks Matt for this explanation. But my question is: why can't we get a
satisfactory solution by adding constraints that make the net and Bayes scores
depend linearly on the "basic" scores (at least for some scores)? In short, why
not make the GA compute the optimum of
  alpha*(w1*F1 + w2*F2 + ...) + (wk*Fk + ...)
instead of computing multiple optima?

While adding a constraint will lower the optimum, it also makes it easier to
manage rules (no need to compute 4 scores when adding a rule). And after all, we
are not trying to maximize scores; it is enough if spam gets a score >=
threshold, so a suboptimal solution may be good enough.
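
Here is roughly how I'd parameterize that constraint in Python (the
split into a "static" dict and a "net/bayes" dict, with one alpha per
configuration, is just my reading of the formula above, not a
worked-out design):

    # Constrained scoring: one shared base vector for the static rules,
    # one scalar alpha per configuration, free scores only for the
    # net/bayes rules. Adding a static rule means fitting ONE number.
    def constrained_score(hits, base_scores, net_scores, alpha):
        static = sum(base_scores[r] for r in hits if r in base_scores)
        net = sum(net_scores[r] for r in hits if r in net_scores)
        return alpha * static + net

    # The optimizer then searches over base_scores, the per-config
    # alphas, and net_scores -- instead of 4 independent full vectors.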

I have no proof of this, but I feel it should work. That said, I won't test it
with the GA itself; I'm more tempted to try other methods (I would like to find
one that gets rid of redundant rules easily, which neither the GA nor Bayes
does).
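
FWIW, one family of methods that does prune redundant rules
automatically is L1-regularized ("lasso") logistic regression over a
message-by-rule hit matrix: the L1 penalty drives the weight of a
mostly-redundant rule to exactly zero. A quick sketch with scikit-learn
(my own toy data; just an idea I might try, not anything SA uses):

    # L1 ("lasso") regularization zeroes out redundant rule weights.
    # X is a message-by-rule 0/1 hit matrix; y is 1 for spam.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[1, 1, 0],   # rules 0 and 1 are perfectly redundant
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0],
                  [0, 0, 0]])
    y = np.array([1, 1, 1, 0, 0])

    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X, y)
    print(clf.coef_)  # one of the redundant pair should be (near) zero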
